Hadoop Based Link Prediction Performance Analysis


Yuxiao Dong, Casey Robinson, Jian Xu
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA

Abstract

Link prediction is an important problem in social network analysis and has been applied in a variety of fields. Link prediction aims to estimate the likelihood that links exist between nodes, given the known network structure. The running time of link prediction algorithms on huge-scale networks remains largely unexplored, especially for sparse networks. In this project, we explore how parallel computing speeds up link prediction in huge-scale networks. We implemented similarity-based link prediction algorithms on MapReduce, which have a time complexity of O(n) in sparse networks. We analyzed the performance of our algorithms on the Data Intensive Science Cluster at the University of Notre Dame. We evaluate the performance with different configurations, monitor the resource utilization of the distributed computation, and optimize accordingly. After analyzing the efficiency of different configurations, we present the fastest approach we found for performing parallelized link prediction, which is particularly suited to real-world big data.

Index Terms: Social network analysis, Link prediction, MapReduce, Parallelization

I. INTRODUCTION

Social networks are an important part of our society. These networks are in constant flux, and understanding how nodes relate is of great interest. Barabási and Albert demonstrated that networks expand continuously by the addition of new vertices, which preferentially attach to existing, well-connected vertices [1]. Many researchers have studied network evolution and modeled dynamic network structure [2], [3], [4]. Link prediction is used to understand and identify the mechanisms of network growth and evolution.
Link prediction aims to estimate the likelihood that links exist between nodes, based on the known network structure. The classical link prediction problem is the prediction of existing yet unknown links, called missing links. Most previous work on link prediction employs cross-validation by splitting the data into two sets: training and testing.

Consider this motivating example. People in the real world meet new friends. Each new relationship is represented by the appearance of a new connection in each person's social graph. Through the new relationship, both people's social circles enlarge. Predicting these relationships before they are formed is vital to the success of a social networking service. Link prediction attempts to discover such future connections. We can experience the results of link prediction through the friend recommendation engine on Facebook. However, there are now more than 1 billion users on Facebook, and this massive scale is an impediment to successful prediction. A scalable and efficient solution is needed to recommend friends accurately.

Challenge: The major challenge of link prediction stems from the sparse, yet gigantic, nature of social networks. A sparse network implies that the existing links between nodes represent a small fraction of the total possible links. At this data size, the computation requires either years on a single machine or many computers. To handle the strong imbalance between nonexistent links and existing links, we can undersample the holdout test set [5] or sample only negative instances in the test set [6]. However, modifying the sampling method changes the data distribution so that it no longer presents the same challenges as the real-world distribution. Since the evaluation then no longer reflects the capabilities and limitations of the prediction model, the results are uninterpretable [7].
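To make the imbalance between existing and nonexistent links concrete, the following minimal Python sketch computes the fraction of possible links that are actually present. The function name link_density is illustrative only; the node and edge counts are those given for the Arxiv HEP-PH citation graph in the appendix.

```python
def link_density(num_nodes: int, num_edges: int, directed: bool = False) -> float:
    """Fraction of possible links (self-loops excluded) that actually exist."""
    possible = num_nodes * (num_nodes - 1)  # ordered pairs of distinct nodes
    if not directed:
        possible //= 2  # unordered pairs for an undirected graph
    return num_edges / possible


# Arxiv HEP-PH citation graph (appendix): 34,546 papers, 421,578 directed edges.
density = link_density(34_546, 421_578, directed=True)
print(f"existing links: {density:.4%} of all possible links")
print(f"nonexistent links per existing link: {(1 - density) / density:.0f}")
```

Even for this comparatively small graph, well under one percent of the possible links exist, which is why a test set drawn from the real distribution is dominated by negative instances.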
Thus, parallelization is the only feasible and meaningful method for studying link formation at this scale, which provides the motivation for our work. Three design patterns, based on MapReduce, have been proposed to speed up network analysis algorithms [8]. PEGASUS is a MapReduce-based framework for graph mining which implements most of the classical graph mining algorithms [9]. Although a large amount of research has been conducted on MapReduce-based graph mining, no MapReduce framework exists for link prediction.

Our goal in this project is to design, implement, and analyze the performance of similarity-based link prediction algorithms on the Data Intensive Science Cluster at the University of Notre Dame. The experimental results on several large-scale data sets of various network types show that the MapReduce-based link prediction algorithm is more effective and scalable than traditional ones, and that its running time decreases with more compute units for appropriately sized data chunks. More work will be done to improve performance and gain further insight into these findings.

II. RELATED WORK

Our work is related to link prediction and graph mining in huge-scale networks. Link prediction has attracted considerable attention in recent years from both the computer science and physics communities. Existing work can be classified into two categories: unsupervised methods [10] and supervised methods [7], [11], [12], [13]. Most unsupervised

link prediction algorithms are based on a similarity measure between nodes of the graph. A seminal work on unsupervised methods by Liben-Nowell and Kleinberg addresses the problem from an algorithmic point of view. The authors investigate how different proximity features can be exploited to predict the occurrence of new links in social networks [10]. For the supervised methods, Lichtenwalter et al. motivated the use of a binary classification framework and vertex collocation profiles [7], [11]. Place features can be incorporated into a supervised model for link prediction on location-based social networks [12]. To recommend friends on Facebook, a supervised random walk was designed for link prediction and recommendation [13]. Existing work focuses on link prediction in a particular network, without considering a general parallelized design for large-scale networks.

Recently, the focus of graph mining has shifted to huge-scale networks. In 2004, Google presented its MapReduce framework for large-scale data indexing and mining [14], which set the direction for analyzing big data. Three design patterns, based on MapReduce, have been proposed to speed up network analysis algorithms [8]. Moreover, Kang et al. propose two frameworks for huge-scale graph management and analysis: GBASE [15], a scalable and general graph management and mining system based on MapReduce, and a MapReduce-based spectral analysis system for billion-scale graphs [16]. Yang et al. propose a Self Evolving Distributed Graph Management Environment for partition management of large graphs [17]. This existing work provides general frameworks for graph mining on large-scale graphs. Here we use MapReduce-based methods to predict links in huge-scale networks.

III. OVERVIEW

In this project, we demonstrate how link prediction algorithms benefit from parallel computing.
We evaluate the performance of our Hadoop implementation on the Data Intensive Science Cluster (DISC) at the University of Notre Dame. We design a parallelization strategy for link prediction algorithms using the MapReduce model, test the validity and performance of our MapReduce implementations on DISC, and demonstrate how the numbers of Mapper tasks and Reducer tasks affect overall performance. In addition, we explore how graph properties influence the performance of link prediction by testing graphs of different size, type, and clustering coefficient. Finally, we seek to analyze the microscopic details of our implementations. We will monitor the resource utilization of the program, find computation bottlenecks, and attempt to improve our implementation. We will monitor the load balance by comparing completion times on different nodes. Communication time between nodes is an important factor which will also be explored. Big data is divided into small parts for distributed computation, and parallel computation can be utilized to reduce the link prediction time. We discover issues that affect performance and propose the best approach for performing parallelized link prediction on big data using DISC. The core parallelized solution for our framework is shown in Figure 1.

Fig. 1: The core MapReduce process for our algorithms.

DISC contains 26 nodes, each with 32 GB RAM, 12 x 2 TB SATA disks, 2 x 8-core Intel Xeon E GHz, and Gigabit Ethernet, which is sufficient for big-data manipulation. The software required is Hadoop and the Java runtime environment, both of which are already installed.

IV. EVALUATION SETUP

The main thrust of our research is to investigate the performance of link prediction algorithms. Our evaluation approach is divided into three parts. Each part focuses on reducing the number of variables to explore at the next level down. We start with a macroscopic analysis that treats our link prediction implementation as a black box.
The lowest level looks at individual machines and attempts to find performance bottlenecks. We use five data sets as the basis for our evaluation. The data sets range from small (12,000 nodes, 237,000 connections, and 3 MB) to large (4.8M nodes, 68M connections, and 1 GB). The data sets represent different types of graphs: citation networks, collaboration networks, social networks, and web networks. See Appendix A for a detailed description of each data set.

We analyze the performance at a variety of levels, each providing a unique perspective of the system. To evaluate the macroscopic behavior of our link predictor, we wrapped the entire system with a timer, allowing us to measure the completion time of each run. We chose this metric for performance because it is the most relevant to an end user. At the next level down, we measure the running time of each Hadoop submission. Our implementation consists of ten consecutive Hadoop jobs, as shown in Figure 2. The breakdown of time provided at this level allows us to focus our detailed analysis at the next level. Focused testing is important because each evaluation run lasts many hours. The lowest level we are currently exploring is the performance of the most time-consuming Hadoop submissions. At this level we analyze the disk I/O, network traffic, and CPU usage. The information gained from these tests will allow us to tune the performance of our implementation to the specific machines we are using.
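The two coarser measurement levels described above (total completion time and per-job running time) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the timing harness used in the experiments: time_pipeline is a hypothetical helper, and the stages are placeholders that sleep instead of submitting Hadoop jobs. The stage names are borrowed from the job names mentioned in the results section.

```python
import time
from typing import Callable, Dict, List, Tuple


def time_pipeline(stages: List[Tuple[str, Callable[[], None]]]) -> Dict[str, float]:
    """Run stages in order; return wall-clock seconds per stage plus a 'total' entry."""
    timings: Dict[str, float] = {}
    run_start = time.perf_counter()
    for name, run_stage in stages:
        stage_start = time.perf_counter()
        run_stage()  # blocks until this stage has finished
        timings[name] = time.perf_counter() - stage_start
    timings["total"] = time.perf_counter() - run_start
    return timings


# Placeholder stages (assumption): in a real run each callable would submit one
# Hadoop job, for example through subprocess.run, and wait for it to complete.
demo_stages = [
    ("splitdata", lambda: time.sleep(0.01)),
    ("getlpscore", lambda: time.sleep(0.02)),
    ("getauc", lambda: time.sleep(0.01)),
]
demo_timings = time_pipeline(demo_stages)
for stage, seconds in demo_timings.items():
    print(f"{stage}: {seconds:.3f} s")
```

Recording both the per-stage times and the total in one pass makes it easy to see which jobs deserve the more expensive low-level monitoring.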

Fig. 2: Steps involved in link prediction.

V. RESULTS

In this section, we examine the scalability and efficiency of our MapReduce-based link prediction framework from three aspects: overall performance, graph influence, and breakdown performance.

A. Overall performance

The highest level analyzes the total running time while varying the number of Reducers as well as the data set under observation. Before discussing the performance of our framework, we first analyze the tradeoff in the number of Reducers.

Fig. 3: Total running time with different numbers of Reducers, for the ND Web data set.

Tradeoff. Varying the number of Reducers from 1 to 50 for the ND Web data set, we see in Figure 3 that the average completion time follows three distinct trends. The first is a rapid decrease between 1 and 7 Reducers. Here we witness the benefits of parallelization: with few Reducers the work is under-parallelized. In other words, each Reducer is operating at maximum throughput. The second trend is a steady state in which the average time neither increases nor decreases. At this stage the benefit of additional Reducers is approximately equal to the additional overhead. On the right portion of the graph, from 25 to 50 Reducers, we observe an increase in completion time. Here, the amount of data in each chunk is sufficiently small that the setup time dominates the total run time.

The performance for the ND Web data set has a large variance. We attribute this unpredictable behavior to the overhead of disk access and network I/O on this distributed computation platform. This overhead matters more for the smaller data sets, since the data processing portion of our framework is much shorter than for the large data sets. Although the overhead of Hadoop-based link prediction still exists for a larger data set, Live Journal, the variance of its performance is much smaller, as shown in Figure 4(a).
Because the Live Journal data set is orders of magnitude larger than the ND Web one, more time is spent on actually computing the scores for potential links, so the variation caused by the overhead of distributed computing is a small percentage of the total run time. As a consequence, our implementation of Hadoop-based link prediction is shown to be suitable for big data.

Scalability. We use the Live Journal data set as a representative of big data to see how much our approach to parallel computing speeds up the computation. Ideally, if the overhead of distributed computation can be ignored and the job is evenly distributed to the computing nodes, the job will complete in $t = T/N = T N^{-1}$, where $T$ is the time to complete the job on a single machine and $N$ is the number of Reducers, which is no more than the number of computers. However, because the overhead of distributed computation, such as distributing the job to different computers and collecting the results through the network, does exist, the completion time cannot be as good as an inversely proportional function. In other words, the power of $N$ cannot be the ideal $-1$, but should lie between $-1$ and $0$. With this observation, we fit the points with Reducer count smaller than 25 using a power law $t = T N^{\alpha}$, and the fitted power is $\alpha =$ , which agrees with our analysis. Note that $-1 < \alpha < 0$ gives a concave speed-up curve: the speed-up increases rapidly at first and then more slowly as the number of Reducers increases. This indicates that if we have multiple jobs running on the same distributed computing cluster, we can set a relatively small number of Reducers for each job to obtain the best overall performance. For example, if we want to run link prediction simultaneously on four data sets of similar size to the Live Journal data set, we can set the number of Reducers to $R = 6$ for every job, so that every job sees the benefits of parallelization, while $4R = 24 < 25$ minimizes the overhead of running multiple reducing procedures on a single machine.
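A power-law fit of the form $t = T N^{\alpha}$ can be obtained by ordinary least squares on the log-transformed data, since $\log t = \log T + \alpha \log N$. The Python sketch below shows one way to do this. The helper fit_power_law is hypothetical, and the timings are synthetic values generated with an arbitrarily chosen exponent of -0.5; they are not the measurements or the fitted exponent from the experiments.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(reducers: Sequence[int], seconds: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of log t = log T + alpha * log N; returns (T, alpha)."""
    xs = [math.log(n) for n in reducers]
    ys = [math.log(t) for t in seconds]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    alpha = sxy / sxx                       # slope in log-log space
    T = math.exp(y_mean - alpha * x_mean)   # intercept gives the single-Reducer time
    return T, alpha


# Synthetic timings from t = 20000 * N**-0.5 (illustrative only, NOT measured data).
synthetic_n = [1, 2, 4, 8, 16, 24]
synthetic_t = [20000.0 * n ** -0.5 for n in synthetic_n]

T_fit, alpha_fit = fit_power_law(synthetic_n, synthetic_t)
speedup_at_6 = 6 ** -alpha_fit  # speed-up T / t = N**(-alpha) at N = 6 Reducers
print(f"T = {T_fit:.1f} s, alpha = {alpha_fit:.3f}, speed-up at N=6: {speedup_at_6:.2f}")
```

Under this model the speed-up at $N$ Reducers is $N^{-\alpha}$, so any exponent between $-1$ and $0$ yields diminishing returns from each additional Reducer.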
We also examine the speedup of our algorithm. Theoretically, we expect the number of Reducers to directly indicate the degree of concurrency. Thus the theoretical overall throughput with $n$ Reducers is $nT$, where $T$ is the throughput of one Reducer. In Figure 4(b), we plot the speedup of our algorithms. We find that our algorithm shows a much better speedup than linear speedup. The experiments on the Live Journal network indicate that our MapReduce-based link prediction framework has excellent scalability.

B. The impact of graph properties

Besides the number of Reducers, we also investigate the impact of graph properties on computation time at the highest level. We control the variables by fixing the number of Reducers at 25 and analyze the performance on small, medium, and large networks, shown in Table I.

Fig. 4: Running time and speedup. (a) Total running time for the Live Journal data set. (b) Speedup on the Live Journal data set.

TABLE I: RUNNING TIME ON DIFFERENT DATA SETS WITH 25 REDUCERS

Data set               Time (s)    Nodes    Edges
HepPh Collaboration    94.9 ±
ND Web                 1089 ±
Live Journal           6818 ±

Graph Size. From Table I we see that the time consumed for different graphs is proportional to the number of nodes in the graph; in other words, our approach has a time complexity of $O(N)$. The reason is that the link prediction algorithm we use is based on common neighbors. If two nodes do not have a common neighbor, the score of the connection between them must be 0. Put differently, if nodes a and b have a common neighbor c, then a and b must both appear in Adj[c]; otherwise a and b never coexist in the adjacency list of any node. Given this fact, we only have to calculate scores for pairs that appear together in some adjacency list, rather than for every pair of nodes.

Time complexity analysis. The observation that we do not have to compute scores for every potential connection significantly reduces the time complexity of our approach. If the average degree of nodes in the network is $k = 2E/N$, then for the adjacency list of each node the number of pairs to process is $k(k-1)/2$, and the total number of scores to compute for the whole network is $k(k-1)N/2$. As a result, the time complexity of link prediction based on common neighbors is $O(Nk^2)$, and the space complexity is $O(Nk)$. Barabási and Albert propose a power-law distribution of degrees in real-life networks [1], i.e., the probability that a node has degree $k$ decays as a power of $k$, $P(k) \sim k^{-\gamma}$. As a consequence, the degrees of most nodes in empirical networks are small, giving a small average degree. Therefore, for real-life networks, which are big but sparse, the time complexity of link prediction based on common neighbors is $O(N)$.
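The idea of scoring only those pairs that coexist in some adjacency list can be sketched as a map step and a reduce step. The Python code below is a single-machine illustration under stated assumptions, not the Hadoop jobs used in the experiments: the function names build_adjacency, map_pairs, and reduce_scores and the toy edge list are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Iterator, Set, Tuple

Pair = Tuple[int, int]


def build_adjacency(edges: Iterable[Pair]) -> Dict[int, Set[int]]:
    """Undirected adjacency lists Adj[c] built from an edge list."""
    adj: Dict[int, Set[int]] = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def map_pairs(adj: Dict[int, Set[int]]) -> Iterator[Tuple[Pair, int]]:
    """Map step: for each node c, emit ((a, b), 1) for every pair a < b in Adj[c]."""
    for neighbors in adj.values():
        for a, b in combinations(sorted(neighbors), 2):
            yield (a, b), 1


def reduce_scores(emitted: Iterable[Tuple[Pair, int]],
                  adj: Dict[int, Set[int]]) -> Dict[Pair, int]:
    """Reduce step: sum the counts per pair and keep only pairs not already linked."""
    counts: Dict[Pair, int] = defaultdict(int)
    for pair, one in emitted:
        counts[pair] += one
    return {(a, b): s for (a, b), s in counts.items() if b not in adj[a]}


# Toy graph (illustrative only).
toy_edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)]
toy_adj = build_adjacency(toy_edges)
toy_scores = reduce_scores(map_pairs(toy_adj), toy_adj)
print(toy_scores)
```

Each node of degree $d$ contributes $d(d-1)/2$ emitted pairs, which matches the $k(k-1)N/2$ total in the analysis above when degrees are close to the average $k$.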
In our Hadoop-based link prediction implementation, since the data is distributed to multiple processing units, the time complexity is $O(N/U)$, where $U$ is the number of computing nodes.

C. Breakdown of jobs

To analyze which procedures take the most time in our implementation, we break down the jobs for a more insightful performance analysis. Our implementation consists of ten consecutive Hadoop jobs, which are shown in Figure 2. Again we control the variables by fixing the number of Reducers at 25 and analyze the performance of the ten consecutive procedures for small, medium, and large networks, as shown in Figures 5(a), 5(b), and 5(c).

Heavyweight jobs. In all three of these breakdown graphs, the seventh procedure (light blue), getlpscore, and the tenth procedure (light red), getauc, occupy the majority of the time. The prominence of these two procedures is expected: getlpscore is the procedure that actually computes the scores for potential connections with the algorithm based on common neighbors. As analyzed above, the time complexity of getlpscore is $O(N)$. The reason that getauc takes another big share of the total time is different. In this last step, the scores of potential links stored on the 25 machines need to be transferred via the network to the controlling machine, during which the overhead of disk operations and network communication is heavy and non-negligible. Also, after collecting and merging the results from those 25 machines, the calculation of the AUC score must be completed on the controlling machine. This calculation makes getauc even more under-parallelized, so the time for this step grows superlinearly with the network size. In short, the larger the network, the larger the share of the total time taken by the last step, getauc.
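For readers unfamiliar with the metric computed in getauc, the Python sketch below shows one common definition of AUC in link prediction: the probability that a randomly chosen held-out link receives a higher score than a randomly chosen nonexistent link, with ties counted as one half. This is an assumption about the metric, since the text does not spell out the exact formula, and the function auc and the example scores are hypothetical.

```python
from typing import Sequence


def auc(positive_scores: Sequence[float], negative_scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one score in each class")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Illustrative common-neighbor scores (not experimental data).
held_out_links = [3, 2, 1]   # scores of links hidden from the training graph
non_links = [0, 1, 0, 2]     # scores of node pairs that are truly unconnected
example_auc = auc(held_out_links, non_links)
print(f"AUC = {example_auc:.3f}")
```

The comparison needs the scores of both classes in one place, which is consistent with the observation above that this step runs on the controlling machine after the scores are collected.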
Procedures 1 to 6 can be completed in constant time, such as splitdata, or in sublinear time, so getlpscore takes a larger proportion of the time than procedures 1 to 6 when dealing with bigger data sets.

Lightweight jobs. Based on the analysis above, procedures 1 to 6 can be completed in constant time or in sublinear

time, which coincides with the actual running time in the experiments. We can read more from the breakdown of jobs. For example, in the HepPh Collaboration and Live Journal data sets, getadjlist (Procedure 6, the grey bar) is slightly higher than getdegreestats (Procedure 6, the purple bar), while in ND Web, getadjlist is noticeably lower than getdegreestats. This arises because the time complexity of getadjlist is $O(E)$, while the complexity of getdegreestats is $O(N)$. As the average degree is proportional to $E/N$, we can infer from the breakdown that the average degree in the ND Web data set is smaller than that of the other two data sets. This inference can be verified: the average degree of nodes in the ND Web data set (about 4.6) is indeed less than that of HepPh Collaboration (about 19.7) and that of Live Journal (about 14.2). As we can see, the breakdown graphs reflect the nature of the algorithms we used as well as the intrinsic properties of the networks.

Last but not least, the breakdown graphs confirm that our approach works better on big data sets than on small ones. The error bars for the medium-sized network ND Web and the large network Live Journal are small, while for the small network HepPh Collaboration the variation is unacceptably large. Also, the two steps that actually perform link prediction (Procedures 7 and 10) account for less than half (44%) of the total time on the small data set, indicating that the efficiency of Hadoop-based link prediction on small data sets is unsatisfactory, while on large networks such as Live Journal the efficiency is much higher (85%).

VI. CONCLUSION AND FUTURE WORK

In order to improve the performance we need to dive deeper into the implementation. We will add instrumentation designed to study specific elements of the system, including the I/O and network bandwidth of the Hadoop Distributed File System (HDFS) and the computing performance of the CPUs in different nodes. We will use the data to create a breakdown of time spent and find bottlenecks.
Once the bottlenecks are found, we will determine possible methods for reducing their effect and thereby improving performance. To evaluate our hypothesis that I/O and communication among distributed computing nodes are the main bottleneck, we will reduce the CPU availability and measure the effect on total run time. Though we have been running jobs on DISC day and night for more than one month and have collected a considerable amount of data, we do not have statistically significant data for the Live Journal data set at most configurations. Performing link prediction on big data is not easy by nature, especially for the Live Journal data set, which is larger than 1 GB. We will keep running jobs on DISC continuously and collect as much data as possible.

REFERENCES

[1] A.-L. Barabási and R. Albert, "Emergence of scaling in random networks," Science.
[2] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan, "Group formation in large social networks: membership, growth, and evolution," in KDD '06, 2006.
[3] J. Leskovec, L. Backstrom, R. Kumar, and A. Tomkins, "Microscopic evolution of social networks," in KDD '08, 2008.
[4] J. E. Hopcroft, T. Lou, and J. Tang, "Who will follow you back? Reciprocal relationship prediction," in CIKM '11.
[5] M. Al Hasan, V. Chaoji, S. Salem, and M. Zaki, "Link prediction using supervised learning," in Workshop on LACS of SDM '06, 2006.
[6] C. Wang, V. Satuluri, and S. Parthasarathy, "Local probabilistic models for link prediction," in ICDM '07, 2007.
[7] R. N. Lichtenwalter, J. T. Lussier, and N. V. Chawla, "New perspectives and methods in link prediction," in KDD '10. ACM.
[8] J. Lin and M. Schatz, "Design patterns for efficient graph algorithms in MapReduce," in MLG '10.
[9] U. Kang, C. E. Tsourakakis, and C. Faloutsos, "PEGASUS: A peta-scale graph mining system implementation and observations," in ICDM '09.
[10] D. Liben-Nowell and J. Kleinberg, "The link prediction problem for social networks," in CIKM '03. ACM.
[11] R. N. Lichtenwalter and N. V. Chawla, "Vertex collocation profiles: subgraph counting for link analysis and prediction," in WWW '12.
[12] S. Scellato, A. Noulas, and C. Mascolo, "Exploiting place features in link prediction on location-based social networks," in KDD '11. ACM.
[13] L. Backstrom and J. Leskovec, "Supervised random walks: predicting and recommending links in social networks," in WSDM '11, 2011.
[14] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in OSDI '04, 2004.
[15] U. Kang, H. Tong, J. Sun, C.-Y. Lin, and C. Faloutsos, "GBASE: a scalable and general graph management system," in KDD '11.
[16] U. Kang, B. Meeder, and C. Faloutsos, "Spectral analysis for billion-scale graphs: discoveries and implementation," in PAKDD '11.
[17] S. Yang, X. Yan, B. Zong, and A. Khan, "Towards effective partition management for large graphs," in SIGMOD '12.

APPENDIX

The data used for evaluation are publicly available from the Stanford Network Analysis Project (SNAP).

In the ND Web data set, nodes represent pages from the University of Notre Dame (domain nd.edu) and directed edges represent hyperlinks between them. The data was collected in 1999 by Albert, Jeong, and Barabási.

Live Journal is a free on-line community with almost 10 million members; a significant fraction of these members are highly active. (For example, roughly 300,000 update their content in any given 24-hour period.) Live Journal allows members to maintain journals and individual and group blogs, and it allows people to declare which other members are their friends.

The Arxiv HEP-PH collaboration (High Energy Physics - Phenomenology) network is from the e-print arXiv and covers scientific collaborations between authors of papers submitted to the High Energy Physics - Phenomenology category. If an author i co-authored a paper with author j, the graph contains an undirected edge from i to j. If the paper is co-authored by k authors, this generates a completely connected (sub)graph on k nodes.
The data covers papers in the period from January 1993 to April 2003 (124 months). It begins within a few months of the inception of the arXiv, and thus represents essentially the complete history of its HEP-PH section.

The Arxiv HEP-PH citation (High Energy Physics - Phenomenology) graph is from the e-print arXiv and covers all the citations within a data set of 34,546 papers with 421,578 edges. If a paper i cites paper j, the graph contains a directed edge from i to j. If a paper cites, or is cited by, a paper outside the data set, the graph does not contain any information about this. The data covers papers in the period from January 1993 to April 2003 (124 months). It begins within a few months of the inception of the arXiv, and thus represents essentially the complete history of its HEP-PH section. The data was originally released as a part of the 2003 KDD Cup.

The Epinions social network is a who-trusts-whom online social network of the general consumer review site Epinions.com. Members of the site can decide whether to trust each other. All the trust relationships interact and form the Web of Trust, which is then combined with review ratings to determine which reviews are shown to the user.

Fig. 5: Running time with 25 Reducers. (a) HepPh Collaboration data set. (b) ND Web data set. (c) Live Journal data set.
