Sampling Operations on Big Data

Vijay Gadepally, Taylor Herr, Luke Johnson, Lauren Milechin, Maja Milosavljevic, Benjamin A. Miller
Lincoln Laboratory, Massachusetts Institute of Technology, Lexington, MA
{vijayg, taylor.herr, luke.johnson, lauren.milechin, maja.milosavljevic,

Abstract

The 3Vs - Volume, Velocity and Variety - of Big Data continue to be a large challenge for systems and algorithms designed to store, process and disseminate information for discovery and exploration under real-time constraints. Common signal processing operations such as sampling and filtering, which have been used for decades to compress signals, are often undefined for data that is characterized by heterogeneity, high dimensionality, and lack of known structure. In this article, we describe and demonstrate an approach to sample large datasets such as social media data. We evaluate the effect of sampling on a common predictive analytic, link prediction. Our results indicate that even a heavily sampled dataset can still yield meaningful link prediction results.

I. INTRODUCTION

The volume of information passing through today's data systems often outpaces our ability to compute meaningful results. Storing all of this data quickly stresses the limits of databases and storage systems. In the case of predictive analytics, where data is stored in memory, this increase in data volume affects the ability to perform meaningful computations. Up-front data reduction, such as sampling or filtering data, can help alleviate some of these issues. However, these data reduction techniques may impact the quality of results. The focus of this study is to explore the effects of a variety of sampling techniques on a specific predictive analytic and type of data: link prediction on large graphs.

Sampling techniques on large graphs have been investigated in numerous studies. However, the performance of these techniques has not been measured in the context of predictive analytics.
For an introduction to the variety of sampling techniques and how they affect graph features such as degree distributions and singular value distributions, we refer the reader to [1]. In [2], the authors present a series of sampling techniques to perform frequent subgraph mining. Sampling techniques such as random walk and jump methods are discussed in [3] and have been applied to subgraph mining. Vertex-based sampling methods, such as random node sampling [1], [2] or wedge sampling [4], have also been shown to preserve various graph features including the graph clustering coefficient.

Distribution A: Public Release. This work is sponsored by the Assistant Secretary of Defense for Research & Engineering, under Air Force Contract FA C Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Of growing interest in the graph community is the application of anticipatory analytics to large graphs such as those that describe social or computer networks. One particularly interesting anticipatory analytic is link prediction [5], where we predict whether two entities in a graph will form a relationship in the near future. Link prediction can allow analysts to anticipate new connections and act proactively rather than reactively. In [6], the authors describe link prediction in the context of web pages. In [7], the authors describe link prediction on social networks. Link prediction is a computationally heavy task and a motivating example of the need for up-front data reduction in large graphs and datasets. In this study, we evaluate the application of link prediction techniques to a large sampled social media graph derived from Twitter.

The article is structured as follows. We first describe the parallels we have noticed between signal processing and big data in Section II, followed by a description of the analytic environment D4M in Section III.
We then describe the types of sampling methods and link prediction methodology in Section IV. Section V describes the evaluation framework and the results obtained by applying link prediction techniques to sampled datasets. We conclude the article in Section VI.

II. SIGNAL PROCESSING AND BIG DATA

Traditional signal processing deals with operations on multidimensional analog or digital signals, often represented as arrays of floating point or integer values. Traditional signal sources include radars, sensors such as GPS or LIDAR, imagery and video. Very often, a signal processing analysis pipeline attempts to perform signal acquisition and reconstruction for signals that are sampled or in which there is missing data. To support detection algorithms, one may use techniques to improve the signal-to-noise ratio (SNR) or signal quality. A signal processing pipeline may also perform operations to extract features or patterns that can be used to characterize new signals or extract signals of interest. Often, a combination of techniques such as signal compression, domain transformations and signal reconstruction is used to perform these operations.

Big Data analytics, often characterized as analytics applied to datasets that strain available resources, also support similar operations. Unlike traditional signal processing algorithms, however, these operations are performed on large quantities of structured and/or unstructured data. Some of the common operations include error and anomaly detection to determine corrupted portions of a dataset, data compression, clustering or community detection, pattern recognition and model construction. Fundamentally, these operations are similar to signal processing operations but require the ability to find a suitable representation for data that can be used across a wide variety of datasets as general as numeric arrays. Correlating heterogeneous sources of information requires a set of tools, data structures and mathematical foundations.

III. D4M AND ASSOCIATIVE ARRAYS

D4M, or the Dynamic Distributed Dimensional Data Model, is a software package developed at MIT Lincoln Laboratory that enables researchers to prototype solutions and analytics for large datasets. The D4M package contains support for a mathematical data type called associative arrays, a schema to convert structured and unstructured datasets to associative arrays, and a set of software tools to support mathematics on associative arrays and database operations.

A. Associative Arrays

Associative arrays are used to describe the relationship between multidimensional entities using numeric/string keys and numeric/string values. Associative arrays provide a generalization of sparse matrices. Formally, an associative array $A$ is a map from $d$ sets of keys $K_1 \times K_2 \times \dots \times K_d$ to a value set $V$ with a semi-ring structure,

$A : K_1 \times K_2 \times \dots \times K_d \to V$

where $(V, \oplus, \otimes, 0, 1)$ is a semi-ring with addition operator $\oplus$, multiplication operator $\otimes$, additive-identity/multiplicative-annihilator 0, and multiplicative-identity 1. Furthermore, associative arrays have a finite number of non-zero values, which means their support $\mathrm{supp}(A) = A^{-1}(V \setminus \{0\})$ is finite. While associative arrays can have any number of dimensions, a common technique for using associative arrays in databases is to project the $d$-dimensional set into two dimensions as in

$A : K_1 \times \{K_2 \cup K_3 \cup \dots \cup K_d\} \to V$

where the $\cup$ operation indicates a union operation.
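As a rough illustration of the definition above, the following Python sketch models a two-dimensional associative array as a sparse map from (row key, column key) pairs to values, with the semi-ring operators passed in as plain functions. The class name `Assoc`, its methods, and the `field|value` key strings are hypothetical names chosen for this example; this is not the D4M API, only a toy model of the mathematics.

```python
from collections import defaultdict


class Assoc:
    """Toy 2D associative array: (row_key, col_key) -> value over a semi-ring.

    Hypothetical illustration only; not the D4M implementation.
    """

    def __init__(self, entries=None, add=lambda a, b: a + b,
                 mul=lambda a, b: a * b, zero=0):
        self.add, self.mul, self.zero = add, mul, zero
        self.data = {}
        for (r, c), v in (entries or {}).items():
            self._accumulate(r, c, v)

    def _accumulate(self, r, c, v):
        # Combine with any existing value using the semi-ring addition.
        new = self.add(self.data.get((r, c), self.zero), v)
        if new == self.zero:
            # Zeros are never stored, so the support stays finite and sparse.
            self.data.pop((r, c), None)
        else:
            self.data[(r, c)] = new

    def support(self):
        """Set of key pairs that map to a non-zero value."""
        return set(self.data)

    def transpose(self):
        swapped = {(c, r): v for (r, c), v in self.data.items()}
        return Assoc(swapped, self.add, self.mul, self.zero)

    def matmul(self, other):
        """Semi-ring array product: out[r, c] = add over k of mul(self[r, k], other[k, c])."""
        out = Assoc(add=self.add, mul=self.mul, zero=self.zero)
        by_row = defaultdict(list)
        for (k, c), v in other.data.items():
            by_row[k].append((c, v))
        for (r, k), v in self.data.items():
            for c, w in by_row.get(k, []):
                out._accumulate(r, c, self.mul(v, w))
        return out


# Rows are record identifiers; column keys join a field name and a value.
E = Assoc({
    ("t1", "user|alice"): 1, ("t1", "word|data"): 1,
    ("t2", "user|bob"): 1, ("t2", "word|data"): 1,
})
A = E.transpose().matmul(E)  # co-occurrence counts between column keys
```

Here `A` counts how often two column keys appear in the same row, and swapping `add` and `mul` for another semi-ring (for example max and min) changes the algebra without changing the data structure.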
In this 2D representation, $K_1$ is often referred to as the row key and $\{K_2 \cup K_3 \cup \dots \cup K_d\}$ is referred to as the column key. A more thorough description of associative arrays can be found in [8] and examples of associative array operations in [9].

B. D4M Schema

The D4M schema [10] provides a four-table solution that can be used to represent many datasets of interest. Each table in this schema is represented by an associative array as described in the previous section. The schema allows arbitrary datasets, which may be structured or unstructured, to be represented in a common format so that linear algebraic operations can be applied to the data. The D4M schema converts a dense dataset into a sparse associative array by exploding each data entry into an element of an associative array in which each unique column-value pair is a column. This schema has been applied to a wide variety of datasets such as HPC scheduler logs [11], social media, and medical data [12].

C. D4M Software Tools

Associative arrays have a one-to-one relationship with many database data models such as key-value stores. The schema described in the previous section can convert an arbitrary dataset into an associative array that resembles key-value pairs. Since a majority of large datasets are indexed and stored in databases, D4M supports connectivity with a variety of back-end database technologies. Operations on database tables (which are represented as remote associative arrays) are similar to operations on local associative arrays. Currently, the D4M environment is supported in MATLAB and GNU Octave. We are currently developing a binding for the Julia language [13]. A more thorough description of the D4M software environment is provided in [12].

IV. SAMPLING AND LINK PREDICTION

In this section, we describe the sampling methods applied to large graphs and the techniques used for link prediction.

A. Sampling Methods

To best cover the different styles of sampling, we have selected sampling methods that fall into a number of categories. These include edge sampling methods, where edges are selected by a predetermined criterion; snowball sampling methods, where algorithms start with a small set of vertices and expand until a target graph size is reached; exploration algorithms, where an initial set of vertices and edges is selected and random walk techniques are applied to determine the sampled graph; and vertex sampling methods, where a set of vertices is selected and the sample graph contains the edges incident on these vertices. Specifically, we used the following sampling methods for a sampling factor $S$ applied to a graph $G$:

Random Edge Sampling: In this method, any edge in the graph is kept with probability $p$, where $p = 1/S$. Thus the overall sampled graph size will be approximately $|G| \cdot p$.

Popularity Based Sampling: In this method, an edge is kept based on the inverse of the popularity, or degree, of the vertices it connects. Specifically, the probability of an edge being kept after sampling is proportional to $1/d_i d_j$, where $d_i$ and $d_j$ correspond to the degrees of vertices $v_i$ and $v_j$ respectively.

Random Area Sampling: In this method, we begin with a random set of seed vertices. A vertex is chosen with probability $p/d_{avg}$, where $d_{avg}$ is the average degree of a vertex in the graph and $p$ is the inverse of the sampling factor $S$. The sampled graph consists of all edges and vertices connected to the selected vertices.

Random Jump Sampling: In this method, we randomly select a vertex $v$ in our graph. Until we reach the desired number of edges (approximately the number of edges in $G$ divided by $S$), we either randomly select an edge connected to $v$ and continue from there or, with probability $p_{jump}$, jump to another vertex selected from a uniform distribution. Commonly, the value of $p_{jump}$ is selected to be [...]

Wedge Sampling: In this method, we randomly select a number of vertices from a uniform distribution. To create a sampled wedge, we uniformly select two edges that are incident on each selected vertex.

B. Link Prediction

We perform link prediction using a number of similarity metrics applied to graphs. In such methods, we compute statistics of a graph up to a certain time point and score all possible non-observed edges in the graph. Predicting new edges can be done by selecting those new edges that satisfy a similarity threshold. In this study, we selected the top-$N$ scored edges instead. The value of $N$ is determined by computing the increase in the size of the graph from time step $t$ to time step $t + 1$. Thus, $N = |G(t+1)| - |G(t)|$, where $|G(t)|$ is the number of edges in $G$ at time $t$.

Suppose that at time $t$ vertices $v_i$ and $v_j$ are not connected. We can compute the similarity, $Sim(v_i, v_j)$, between $v_i$ and $v_j$ as proportional to the number of common features they share. These features can be categorized as local similarity metrics, global similarity metrics, or quasi-local metrics [5]. In our studies, we have found that local similarity metrics perform well [14].

The first local similarity metric is the common neighbors statistic. Common neighbors is a statistic of the number of neighbors two vertices have in common and can be found by computing the inner product of a graph. The insight behind this method is that two vertices with many common neighbors have a high likelihood of being connected in the future. Using this measure, we can compute the similarity between $v_i$ and $v_j$ as

$Sim(v_i, v_j) = |\Gamma(v_i) \cap \Gamma(v_j)|$

where $\Gamma(v_i)$ and $\Gamma(v_j)$ indicate the neighbors of $v_i$ and $v_j$ respectively.

The second similarity metric is the Jaccard coefficient.
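To make the pieces introduced so far concrete, the sketch below applies random edge sampling with $p = 1/S$ and then scores every non-observed vertex pair by common neighbors and by the Jaccard coefficient, returning the top-$N$ pairs. It is a minimal sketch under stated assumptions: the function names are hypothetical, the Jaccard score uses the standard definition $|\Gamma(v_i) \cap \Gamma(v_j)| / |\Gamma(v_i) \cup \Gamma(v_j)|$, and pairs are enumerated directly rather than through the matrix inner product, so it is only suitable for small graphs.

```python
import random
from itertools import combinations


def random_edge_sample(edges, S, rng):
    """Keep each undirected edge independently with probability p = 1/S."""
    p = 1.0 / S
    return [e for e in edges if rng.random() < p]


def neighbors(edges):
    """Map each vertex to the set of its neighbors."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs


def score_pairs(edges):
    """Score each non-observed pair as (common neighbors, Jaccard coefficient)."""
    nbrs = neighbors(edges)
    scores = {}
    for u, v in combinations(sorted(nbrs), 2):
        if v in nbrs[u]:
            continue  # edge already observed, so it is not a candidate
        common = len(nbrs[u] & nbrs[v])
        union = len(nbrs[u] | nbrs[v])
        scores[(u, v)] = (common, common / union)
    return scores


def top_n(scores, n, metric=0):
    """Return the n best pairs; metric 0 = common neighbors, 1 = Jaccard."""
    ranked = sorted(scores, key=lambda pair: (-scores[pair][metric], pair))
    return ranked[:n]


edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
scores = score_pairs(edges)

# Sample with factor S = 2, then score the sampled graph the same way.
sampled = random_edge_sample(edges, 2, random.Random(0))
sampled_scores = score_pairs(sampled)
```

On the full toy graph, `top_n(scores, 2)` ranks the pairs `("a", "d")` and `("b", "c")` first, since each shares two neighbors. Ties are broken alphabetically here, which is an arbitrary choice made only to keep the example deterministic.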
Jaccard is an extension of the common neighbors method and normalizes the common neighbors score by taking into account the degrees of each vertex.

The third similarity metric is the low-rank approximation. In this method, we compute a rank-$k$ matrix approximation of the original graph adjacency matrix through a singular value decomposition of the original matrix. Potential new edges are then scored on the inner product of the low-rank approximation. This method is akin to the common neighbors method being applied to a low-rank approximation of the original adjacency matrix.

The final similarity metric used in this study is the Adamic-Adar measure [15]. The Adamic-Adar similarity metric increases the likelihood that an edge is predicted when a non-observed edge connects two vertices that are connected to a number of unique vertices. For example, in a social media graph we would rate two users closer together if they were both connected to a unique user or set of users. The similarity of $v_i$ and $v_j$ using this measure is given as

$Sim(v_i, v_j) = \sum_{z \in \Gamma(v_i) \cap \Gamma(v_j)} \frac{1}{\log_{10}(|\Gamma(z)|)}$

where $\Gamma(v_i)$ and $\Gamma(v_j)$ refer to the vertices that are neighbors of $v_i$ and $v_j$ respectively. In the following section, we describe the evaluation framework and results.

V. EVALUATION AND RESULTS

In this section, we describe the evaluation framework and results obtained when performing link prediction on sampled data. Twitter data was collected from the public API and converted to associative arrays using the D4M tools described in Section III.

A. Evaluation Framework

Each collected graph is temporally separated into 10 training steps and 5 testing steps. We apply the different sampling techniques to the training graphs and evaluate the quality of links predicted when using the different similarity metrics discussed in Section IV-B. Details of the process are discussed below:

1) We first sample the original graph using the techniques described in Section IV-A.
Sampling is performed on the integrated graph over 10 time periods. For each of the five sampling methods, we sample the graphs at rates where the number of edges is reduced by a power of two from 1/2 to 1/64. In the random jump algorithm, we set the number of seed vertices to be approximately $0.15\,|E|$, where $|E|$ is the number of edges in the sampled graph.

2) We perform link prediction using the similarity metrics discussed in the previous section on the sampled graphs. This process first takes the sampled graphs and evaluates the different prediction metrics for new potential non-observed links, and then predicts the top-$N$ scored links. In the low-rank approximation, we use a rank-10 approximation of the original adjacency matrix of the graph, which is computed via a singular value decomposition.

3) We evaluate the predictive performance by comparing the predicted links against the ground truth determined from the testing data set. Predictive performance is calculated after controlling for edges that are removed by sampling and is limited to non-observed edges.

4) We evaluate the probability of detection and false alarm for predicted links at varying future time steps.

Each sampling and link prediction method runs on a square, symmetric adjacency matrix representation of the graph. The schema of Section III-B stores Twitter data in an incidence matrix, $E$. In order to construct the adjacency matrix for each time step, we first extract all edges that occur during that time step (an edge is a row in the incidence matrix). The incidence matrix contains many vertex types in its columns, such as username, location, time, and words used in the Tweet. Construction of the adjacency matrix from the incidence matrix can be done by the relation $A = W^T L$, where $A$ is the adjacency matrix and $W$ and $L$ are incidence matrix representations of a subset of the columns in $E$. In order to make the adjacency matrix square, we simply add the transpose of $A$ to itself such that $A = A + A^T$. This process is described in Figure 1.

Fig. 1. Generating the adjacency matrix from the incidence matrix representation of the dataset. First, a subset of columns we wish to sample is extracted from $E$, denoted by $L$ and $W$. Next, the adjacency matrix is created as $A = W^T L$. Finally, the adjacency matrix is made square by adding the transpose.

Most sampling methods on an adjacency matrix can be realized by randomly selecting row or column labels, and then selecting entries in those rows or columns. For example, a random area sample is performed by selecting random row/column indices, and then including the entire row and column with those indices.

Graph generation, sampling and prediction were done at the Massachusetts Institute of Technology Lincoln Laboratory using the MIT SuperCloud architecture [16]. The system consists of approximately 300 nodes with dual-socket 16-core AMD processors and 128 GB of RAM per node.

B. Results and Discussion

The probability of detection and false alarm rates of predicted links allow us to quickly determine the quality of predicted edges. In presenting the results, we use the common area under the receiver operating characteristic curve (AUC). In this measurement, an AUC of 1 indicates perfect detection and an AUC of 0.5 indicates no discrimination power. For each of the 10 instantiations of the graph, performance is averaged over all samples. In addition, the error bars in the plots show performance at the 25th and 75th percentiles.

The results from performing link prediction are shown in Figure 2. From the results, prediction performance tends to decrease at a higher sampling rate, as expected. However, there are some interesting results observed. Random edge sampling and random walk sampling consistently perform better than other sampling methods.
We believe that this may be due to the fact that both of these methods favor the selection of edges connected to vertices with high degree. This notion is reinforced by the relatively low performance of the popularity based sampling approach, which essentially normalizes the degree distribution of the vertices. For all sampling methods, the power of link prediction greatly degrades at the 1/64 rate of sampling, to nearly an AUC of 0.5. This is likely due to the high number of isolated vertices. For both random sampling and random walk sampling, the AUC remains above 0.9 until the sampling rate drops to approximately 25%. This result indicates that nearly 75% of a dataset can be removed at a cost of roughly a 10% reduction in overall link prediction performance. Such a reduction may be very helpful for big data systems where approximate prediction results are sufficient.

The results for link prediction through common neighbors, Jaccard coefficient and Adamic-Adar are very similar at all sampling levels. This is likely due to the fact that in all three methods potential new edges are scored by a similarity metric that is based on the number of common neighbors. Link prediction through the low-rank approximation method, however, has vastly different results. This is likely due to the rank-10 approximation. Changing the rank should improve these results.

There are also interesting aspects of the running time results. For all the neighbor-based prediction methods, the running time under random edge sampling decreases more slowly than under the other sampling techniques. This may be due to the smaller number of isolated vertices left behind by the random sampling method. The low-rank approximation has very irregular timings, which we believe may be due to the different sampling methods' effect on the distribution of eigenvalues. This change in eigenvalue distribution may cause the eigensolver to converge at very different rates under different sampling methods.

Fig. 2. Performance and time measurements for various link prediction methods applied to sampled Twitter data. The rows show the results for the common neighbors, Jaccard coefficient, low-rank, and Adamic-Adar methods. The first column shows the area under the ROC curve (AUC) for predicting links one time step out. The second column shows the AUC for predicting links five time steps out. The third column shows the overall time taken to perform link prediction under various sampling techniques.

VI. CONCLUSIONS

In this article, we applied a range of sampling methods to a real social media graph derived from Twitter and evaluated their effect on link prediction. From our experiments, we believe that sampling large datasets has significant potential to greatly reduce the run time and computational footprint of link prediction algorithms while still maintaining adequate prediction performance. As future work, we are interested in expanding the signal processing and big data parallels by incorporating filtering operations. Further, we would like to evaluate the effect of sampling on other types of analytics and look at whether it is possible to merge faster, less accurate samples such as those at higher sampling rates.

ACKNOWLEDGMENT

The authors wish to thank Jeremy Kepner and Albert Reuther for their assistance in developing the sampling and link prediction concepts. The authors also wish to thank the LLGrid team at MIT Lincoln Laboratory for their support in performing the experiments.

REFERENCES

[1] J. Leskovec and C. Faloutsos, "Sampling from large graphs," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006.
[2] R. Zou and L. B. Holder, "Frequent subgraph mining on a single large graph using sampling techniques," in Proceedings of the Eighth Workshop on Mining and Learning with Graphs. ACM, 2010.
[3] M. Al Hasan and M. J. Zaki, "Output space sampling for graph patterns," Proceedings of the VLDB Endowment, vol. 2, no. 1.
[4] C. Seshadhri, A. Pinar, and T. G. Kolda, "Triadic measures on graphs: The power of wedge sampling," in Proceedings of the SIAM Conference on Data Mining (SDM). SIAM.
[5] L. Lü and T. Zhou, "Link prediction in complex networks: A survey," Physica A: Statistical Mechanics and its Applications, vol. 390, no. 6.
[6] B. Taskar, M.-F. Wong, P. Abbeel, and D. Koller, "Link prediction in relational data," in Advances in Neural Information Processing Systems 16, S. Thrun, L. Saul, and B. Schölkopf, Eds. MIT Press, 2004.
[7] D. Wang, D. Pedreschi, C. Song, F. Giannotti, and A.-L. Barabási, "Human mobility, social ties, and link prediction," in Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2011.
[8] J. Kepner, J. Chaidez, V. Gadepally, and H. Jansen, "Associative arrays: Unified mathematics for spreadsheets, databases, matrices, and graphs," New England Database Day.
[9] V. Gadepally, J. Bolewski, D. Hook, D. Hutchison, B. Miller, and J. Kepner, "Graphulo: Linear algebra graph kernels for NoSQL databases," in Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE International. IEEE, 2015.
[10] J. Kepner, W. Arcand, W. Bergeron, N. Bliss, R. Bond, C. Byun, G. Condon, K. Gregson, M. Hubbell, J. Kurz et al., "Dynamic distributed dimensional data model (D4M) database and computation system," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012.
[11] V. Gadepally and J. Kepner, "Big data dimensional analysis," in Proc. IEEE High Performance Extreme Computing.
[12] V. Gadepally, J. Kepner, W. Arcand, D. Bestor, B. Bergeron, C. Byun, L. Edwards, M. Hubbell, P. Michaleas, J. Mullen et al., "D4M: Bringing associative arrays to database engines," IEEE High Performance Extreme Computing.
[13] J. Bezanson, S. Karpinski, V. B. Shah, and A. Edelman, "Julia: A fast dynamic language for technical computing," arXiv preprint.
[14] L. Edwards, L. Johnson, M. Milosavljevic, V. Gadepally, and B. A. Miller, "Sampling large graphs for anticipatory analytics," IEEE High Performance Extreme Computing.
[15] L. A. Adamic and E. Adar, "Friends and neighbors on the web," Social Networks, vol. 25, no. 3.
[16] A. Reuther, J. Kepner, W. Arcand, D. Bestor, B. Bergeron, C. Byun, M. Hubbell, P. Michaleas, J. Mullen, A. Prout et al., "LLSuperCloud: Sharing HPC systems for diverse rapid prototyping," in High Performance Extreme Computing Conference (HPEC), 2013 IEEE. IEEE, 2013, pp. 1-6.
