Improvement of path analysis algorithm in social networks based on HBase

J Comb Optim (2014) 28: DOI /s z

Yan Qiang · Bo Pei · Weili Wu · Juanjuan Zhao · Xiaolong Zhang · Yue Li · Lidong Wu

Published online: 17 November 2013
© Springer Science+Business Media New York 2013

Abstract When a social network reaches hundreds of millions of users, analyzing the data in social network services becomes very important. Understanding how nodes interconnect in large graphs is an essential problem in many fields. To find the nodes that connect two nodes, or two groups of source nodes, in huge graphs, we propose a parallelized data-mining algorithm that computes the shortest path between nodes in a social network, based on the HBase distributed key/value store. Our algorithm computes the shortest paths among different nodes of a network in a parallel environment. We first analyze the social network model with this algorithm, and then optimize the output from the cloud platform using betweenness (intermediary degree) and degree centrality. Finally, we validate the efficiency of the proposed algorithm on a simulated social network. The experimental results indicate that our algorithm improves the efficiency of parallel breadth-first search (BFS).

Keywords Social networks · HBase · Parallel BFS · The K-shortest paths · Intermediary degrees

Y. Qiang · B. Pei · J. Zhao · W. Wu · X. Zhang (✉) · Y. Li
College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan, China
e-mail: xiaolong.zhang@gmail.com
Y. Qiang e-mail: qiangyan@tyut.edu.cn
B. Pei e-mail: peibo0332@tyut.edu.cn
J. Zhao e-mail: zhaojuanjuan@tyut.edu.cn
Y. Li e-mail: ly0158@tyut.edu.cn

W. Wu · L. Wu
Department of Computer Science, University of Texas at Dallas, Richardson, TX, USA
e-mail: weiliwu@utdallas.edu
L. Wu e-mail: Lidong.wu@utdallas.edu

X. Zhang
College of Information Sciences and Technology, Pennsylvania State University, State College, PA, USA

1 Introduction

Social network services (SNS) make it possible to connect people who share interests and activities across political, economic, and geographic borders. We can interact with each other through virtual connections. However, SNS data are not easy to analyze. In particular, user growth rapidly increases the data scale of popular SNS such as Facebook, and this poses challenges for effective data analysis. Traditional data analysis methods often target small-scale datasets and are inefficient on large-scale datasets such as SNS data. For instance, to know how any two nodes in a social network are connected and how close they are, we have to compute the shortest path between these two nodes. Because of the complex internal dependencies among network nodes, traditional methods based on the relational database model are not suitable for large social networks. To address this problem, researchers have proposed new methods. We advocate NoSQL (Lakshman and Malik 2010) and graph databases (GDB) to solve this problem. Unlike traditional relational databases, these systems can store the topological structure of a network, which is what the shortest path computation needs. Recently, many researchers have begun to pay attention to cloud database techniques. Google's Bigtable (Chang et al. 2006) and Apache HBase (Taylor 2010) are representative cloud databases.
HBase is an open source, non-relational, column-oriented, distributed database. It is developed as part of the Apache Software Foundation's Apache Hadoop (Škrabálek et al. 2013) project and runs on top of the Hadoop Distributed File System (HDFS). Google's Bigtable is likewise a structured distributed data storage system, built on the Google File System. Both can handle large datasets. Based on HBase, we propose an algorithm that combines ranking methods and data clustering. First, we construct a set of paths between any two nodes in a social network, and then we rank the paths based on the importance of the nodes and on user requirements.

The rest of this paper is structured as follows. Section 2 reviews previous work on shortest paths. Section 3 describes the path algorithms based on HBase in detail, including HBase itself, the parallel BFS used to find a single-source shortest path tree, the K-shortest paths, and optimization sorting. We describe the experiments and discuss the experimental results in Sects. 4 and 5. Section 6 presents our conclusions.

2 Previous works

The small world network, proposed by Watts and Strogatz (1998), has been studied extensively. They argued that small world networks mathematically follow the power law distribution. This theory is also called the six degrees of separation theory (Missen 2008). According to this theory, although most nodes in a network are not connected directly, almost all nodes can be reached within a small number of hops. In a small world network, a person is treated as a node, and a connection between two nodes means that the two people know each other. Thus, the small world network reflects the small world phenomenon that strangers can be connected through people they mutually know. Barabási and Albert (1999), as well as Barabási et al. (1999), showed that the power law can be used to simulate small world networks. Gu et al. (2013) proposed a generalized small world network, which contains several small world network models, and proved mathematically that a generalized small-world network with a given expected number of edges has a large clustering coefficient and a small diameter. Lu et al. (2014) analyzed efficient influence spread estimation for influence maximization under the linear threshold model.

Graph databases (GDB) are now a viable alternative to relational database management systems (RDBMS). Social networking and recommendation engines are examples of applications whose data can be represented in a much more natural form as a graph. The social graph leverages information across a range of networks in order to quantify the relationships between individuals. To compute the shortest path in social networks, it is necessary to find an effective way to store network-oriented data. Unlike a relational database, in which data is organized as tables, a graph database has a storage style that allows more agile data queries. Neo4j (Hoque and Gupta 2012) is an open-source, high-performance, enterprise-grade NoSQL graph database.
With the Neo4j network storage model, algorithms such as Dijkstra's (Dijkstra 1959) can be used to solve the shortest path problem. However, methods of this kind cannot easily be scaled up to large networks and depend strongly on hardware.

With the rapid rise of Web 2.0 sites, the traditional relational database has proved inadequate for them, especially for ultra-large-scale, highly concurrent, purely dynamic SNS sites, and has exposed many problems that are hard to overcome. Non-relational databases, known as NoSQL, have developed rapidly to address this challenge. NoSQL denotes a class of database management systems identified by their non-adherence to the relational database model. Major internet companies, such as Google, Amazon, and Facebook, use these database systems to deal with huge quantities of data. To manage large volumes of data, NoSQL databases do not necessarily follow a fixed schema. Apache HBase and Google Bigtable are examples of NoSQL databases. While Bigtable is a proprietary technology, HBase is an open source, non-relational, distributed database that runs on top of HDFS. Compared with general relational databases, HBase is better suited to unstructured data.

With the rapid development of cloud computing, many corresponding formulations based on cloud platforms have appeared. Lin and Dyer (2010) focused on MapReduce algorithm design. They designed a breadth-first search (BFS) algorithm for the MapReduce distributed platform, called parallel BFS. This parallel algorithm makes full use of the platform's distributed computing resources to compute the shortest path distance from a particular node in the network to all other nodes. McCubbin and Perozzi (2011) proposed a K-shortest path algorithm that computes the K shortest paths by using the sort phase of the MapReduce framework.

3 Path algorithm based on HBase

To run a traditional shortest path algorithm on the Hadoop cloud platform, we first need to transform it into MapReduce form. The parallel BFS algorithm of Lin and Dyer (2010) cannot make full use of the efficient distributed storage architecture of Hadoop. Our proposed algorithm uses the HBase database and the HBase coprocessor to improve the efficiency of parallel BFS.

3.1 HBase

HBase is a column-oriented database management system that runs on top of HDFS. This allows HBase to read large volumes of data easily. Unlike a traditional database, such a non-relational database does not support a structured query language like SQL. Facebook uses HBase as its storage infrastructure because of its particular characteristics: high write throughput, low-latency random reads, elasticity, low cost, fault tolerance, and strong consistency within a data center.

An HBase system consists of a set of tables, each of which contains rows and columns. Data stored in a table is accessed through the table's primary key. In HBase, multiple attributes can be combined into column families, and the data elements of a column family are stored together. Thus, when column families are used, the table schema and the column families must be defined in advance.

Like other graph data, social network data can be stored in HBase. Every edge is associated with a weight, and the weight of a path is the sum of the weights of its edges. The database uses a 3-tuple of row, column family, and column qualifier as the key that indexes stored values.
Using the following schema, a graph can be represented as adjacency lists:

row := node name
column family := name of the edge type
column qualifier := name of the incident node
value := edge weight (floating point)

With this schema it is straightforward to store hypergraphs by simply using different column family identifiers. To indicate which graph elements should be considered, we pass a list of edge types as a parameter to our algorithm. For any particular source, we add a column to the HBase table to store the shortest path tree:

row := node name
column family := the string "pointer"
column qualifier := the shortest path source
value := a pointer string

The pointer string holds two comma-separated values: the total cost of the shortest path from the node to the source, which we refer to as cost(src, n), and the next hop on the shortest path to the source.
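The two schemas above can be illustrated with a small in-memory sketch. It is not HBase client code: a plain Python dictionary keyed by (row, column family, column qualifier) stands in for the table, and the names `add_edge`, `neighbors`, `set_pointer`, `get_pointer` and `POINTER_FAMILY` are hypothetical helpers introduced only for this illustration.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical stand-in for an HBase table:
# (row, column family, column qualifier) -> value stored as a string.
Table = Dict[Tuple[str, str, str], str]

POINTER_FAMILY = "pointer"  # assumed name of the pointer column family


def add_edge(table: Table, edge_type: str, u: str, v: str,
             weight: float, undirected: bool = True) -> None:
    """Store an edge as an adjacency-list cell (two cells if undirected)."""
    table[(u, edge_type, v)] = repr(weight)
    if undirected:
        table[(v, edge_type, u)] = repr(weight)


def neighbors(table: Table, node: str,
              edge_types: Iterable[str]) -> Dict[str, float]:
    """Return {incident node: edge weight} for the selected edge types."""
    wanted = set(edge_types)
    return {qualifier: float(value)
            for (row, family, qualifier), value in table.items()
            if row == node and family in wanted}


def set_pointer(table: Table, node: str, src: str,
                cost: float, next_hop: str) -> None:
    """Write the pointer string 'cost,next_hop' for the tree rooted at src."""
    table[(node, POINTER_FAMILY, src)] = f"{cost},{next_hop}"


def get_pointer(table: Table, node: str,
                src: str) -> Optional[Tuple[float, str]]:
    """Parse the pointer string into (cost(src, n), next hop), if present."""
    raw = table.get((node, POINTER_FAMILY, src))
    if raw is None:
        return None
    cost, next_hop = raw.split(",", 1)
    return float(cost), next_hop
```

Storing the edge type in the column family is what lets a caller restrict a traversal to selected edge types, as `neighbors` does with its `edge_types` argument.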

3.2 Parallel BFS to find a single-source shortest path tree

We implement our parallel BFS algorithm on HBase. The social network is mapped into a graph of nodes. Every node has two attributes: its node ID and its distance to the source node. The distance from the source node to itself is initialized to 0, and the distances of all other nodes are initialized to infinity. The system traverses all nodes iteratively, following BFS. In the map phase, the system emits the source node and sends a message to its adjacent nodes. The people represented by these adjacent nodes are typically close contacts, such as family members in real life. When these nodes receive the message, they update their distance to 1 (0 plus 1). The system then propagates each node's distance. In the reduce phase, the system checks the type of the received information to determine whether it is a node or a distance. It then reconstructs the network graph and finds the minimum distance. When each cycle finishes, the system checks whether all nodes have been visited. If not, the cycle repeats in the same way as the first one. If so, the cycle ends, and we obtain a tree annotated with each node's distance. We also apply the HBase coprocessor to the cycle. The work of the reduce phase, namely reconstructing the network graph and finding the minimum distance, can then be omitted, which shortens the time required by each cycle. In the end we obtain a simplified algorithm.
Its pseudo-code is as follows:

Algorithm 1 Pseudo-code of improved parallel BFS

Class Mapper
  Method Map(nid n, node N)
    seen = {}
    For all cell_i in N do
      cost = cell_i.value.cost
      pointer = cell_i.value.pointer
      neighbor = cell_i.neighbornode
      seen[pointer] = (neighbor, cost)
    If seen.has(src) and seen.has(dst) then
      totalcost = seen[src].cost + seen[dst].cost
      Emit(totalcost, N.key)

Class Reducer
  visited = {}
  pathlist = []
  Method Reduce(cost, cells)
    For all cell_i in cells do
      If emittedpaths <= k then
        If not visited.has(cell_i) then
          path = ReassemblePath(cell_i)
          visited.addall(path)
          pathlist.append(cell_i)
    For all path_i in pathlist do
      Emit(path_i)

As with Dijkstra's algorithm, assume that a connected, directed graph is stored as adjacency lists. The distance to each node is stored directly alongside the adjacency list of that node, and the distances of all nodes except the source node are initialized to ∞. As shown in the pseudo-code, n denotes the node id (an integer) and N denotes the node's corresponding data structure (adjacency list and current distance). The algorithm works by mapping over all nodes and emitting a key-value pair for each neighbor on the node's adjacency list. The key contains the node id of the neighbor, and the value is the current distance to the node plus one. If we can reach node n with distance d, then we must be able to reach all the nodes that are connected to n with distance d + 1. After shuffling and sorting, the reducers receive keys corresponding to the destination node ids, together with the distances of all paths leading to each node. The reducer selects the smallest of these distances and then updates the distance in the node data structure.

Parallel BFS is clearly an iterative algorithm. Each iteration corresponds to one MapReduce job. When the algorithm starts, the nodes connected to the source are identified. As the iterations proceed, further nodes that are connected to the identified nodes are discovered, together with their shortest distances. Like other MapReduce graph algorithms, which must pass the graph structure from the mappers to the reducers, the BFS method has both a map and a reduce phase. These phases burden computational efficiency, so we improve BFS by leveraging HBase coprocessors. Coprocessors can be loaded globally on all tables and regions hosted by the region server; these are known as system coprocessors. The administrator can also specify, on a per-table basis, which coprocessors should be loaded on all regions of a table; these are known as table coprocessors. We improve the BFS method with a coprocessor. In the traditional BFS method, the map phase is responsible for transmitting the path values and the network structure, whereas the coprocessor component architecture allows us to complete the update of each node's shortest path distance d within the map phase.
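The difference between the two iteration styles described above can be sketched on a single machine. This is only an illustration under stated assumptions, not the authors' Hadoop or HBase code: edges have unit weight, a Python dictionary stands in for the distance column, and the function names are hypothetical. `map_reduce_iteration` mimics the baseline (map emits distance + 1 for each neighbor, reduce keeps the minimum), while `in_place_iteration` mimics a map-only pass in which improvements are written to the shared table immediately, as a coprocessor-style update would allow.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Graph = Dict[int, List[int]]  # node id -> adjacency list (unit edge weights)


def map_reduce_iteration(graph: Graph, dist: Dict[int, float]) -> Dict[int, float]:
    """One baseline iteration: map emits (neighbor, d + 1), reduce takes the min."""
    emitted = defaultdict(list)
    for n, adjacency in graph.items():            # map phase
        if dist[n] < math.inf:
            for m in adjacency:
                emitted[m].append(dist[n] + 1)
    new_dist = dict(dist)
    for m, candidates in emitted.items():         # reduce phase, after shuffle/sort
        new_dist[m] = min(new_dist[m], min(candidates))
    return new_dist


def in_place_iteration(graph: Graph, dist: Dict[int, float]) -> bool:
    """One map-only pass that writes improvements straight into `dist`.

    Returns True if any distance changed during the pass.
    """
    changed = False
    for n, adjacency in graph.items():
        if dist[n] < math.inf:
            for m in adjacency:
                if dist[n] + 1 < dist[m]:
                    dist[m] = dist[n] + 1         # visible to later map calls
                    changed = True
    return changed


def run_baseline(graph: Graph, source: int) -> Tuple[Dict[int, float], int]:
    """Iterate the map/reduce version until no distance changes."""
    dist = {n: math.inf for n in graph}
    dist[source] = 0
    iterations = 0
    while True:
        new_dist = map_reduce_iteration(graph, dist)
        iterations += 1
        if new_dist == dist:
            return dist, iterations
        dist = new_dist


def run_in_place(graph: Graph, source: int) -> Tuple[Dict[int, float], int]:
    """Iterate the map-only version until a pass changes nothing."""
    dist = {n: math.inf for n in graph}
    dist[source] = 0
    iterations = 0
    while True:
        iterations += 1
        if not in_place_iteration(graph, dist):
            return dist, iterations
```

On the chain 0-1-2-3 with source 0, both versions return the same distances, but the baseline advances the frontier by one hop per iteration, whereas the in-place pass can carry an update across several hops when the nodes happen to be processed in a favourable order. How much is saved in practice depends on the order in which rows are scanned, which this sketch does not model.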
There are two advantages of this method. First, it needs only the map phase, so much of the time consumed by the sort-shuffle process is saved. Second, the data storage mode of HBase helps reduce the cost of data transfer. As a result, updates to the pointer tree reach some nodes before the mappers read their data. We refer to this property as pointer cascading. It means that a single iteration of the algorithm can extend the effective frontier of the shortest paths by more than one hop. To reduce the number of iterations, HBase is forced to write the updated data immediately after mapping.

3.3 K-shortest paths

To find new nodes, our application needs paths that are cycle free and that also meet an additional requirement: each path in the set of K-shortest paths contains a node that is not present on any other path in the set. In this way, new nodes can be found. This property is called node uniqueness. Once the single-source shortest path tree produced by our algorithm in Sect. 3.2 is stored in the HBase database, the K-shortest paths in the graph can be determined with one additional pass. In the Map function, we check each node to see whether it has a shortest path pointer to both the source and the destination. If it does, the path from the source through the node to the destination, formed by joining the two shortest paths, is a candidate for a K-shortest path from the source to the destination. To inform the reducer of this fact, we emit a key/value pair in which the key is the sum of the costs of both pointers and the value is the node's identifier. We have assumed that the graph is undirected; as a result, one of the two shortest paths can be reversed.

Courtesy of the sort-shuffle phase, our reducer receives the candidate shortest paths in cost order. Ties are broken by the node numbering; the lowest numbered paths are discovered first. The reducer must, however, do more than simply choose the first k candidates. We are interested in unique shortest paths that have at least one interesting node. Therefore our reducer keeps a set of the nodes that it has seen on previous paths. It accepts a candidate only if the candidate contributes a new node, and rejects shortest path suggestions that are composed solely of nodes it has seen before. The reducer must also guard against cycles that may be produced by the shortest path pointer algorithm. In the omitted function ReassemblePath, we reconstruct each proposed path from HBase and check it for cycles before accepting it. The pseudo-code of this process is as follows:

Algorithm 2 Pseudo-code of K Shortest Paths

Class Mapper
  Method Map(nid n, node N)
    seen = {}
    For all cell_i in N do
      cost = cell_i.value.cost
      pointer = cell_i.value.pointer
      neighbor = cell_i.neighbornode
      seen[pointer] = (neighbor, cost)
    If seen.has(src) and seen.has(dst) then
      totalcost = seen[src].cost + seen[dst].cost
      Emit(totalcost, N.key)

Class Reducer
  visited = {}
  pathlist = []
  Method Reduce(cost, cells)
    For all cell_i in cells do
      If emittedpaths <= k then
        If not visited.has(cell_i) then
          path = ReassemblePath(cell_i)
          visited.addall(path)
          pathlist.append(cell_i)
    For all path_i in pathlist do
      Emit(path_i)

3.4 Optimization sorting

From the previous step, we obtain the set of K shortest paths from the source node to the destination node.
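The way the K-shortest path step of Sect. 3.3 produces this set can be sketched in a few lines. The sketch is a single-process illustration, not the MapReduce job itself, and rests on assumptions: the two shortest path trees are given as hypothetical dictionaries `to_src` and `to_dst` that map each node to (cost, next hop), with `None` as the next hop at the root; sorting the candidate list stands in for the sort-shuffle phase; and `walk` and `reassemble_path` are illustrative stand-ins for the omitted ReassemblePath function.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical pointer table: node -> (cost to root, next hop toward root).
Pointers = Dict[str, Tuple[float, Optional[str]]]


def walk(pointers: Pointers, node: str) -> List[str]:
    """Follow next-hop pointers from `node` to the root of the tree."""
    path = [node]
    # The length bound stops the walk if the pointers ever form a cycle.
    while pointers[path[-1]][1] is not None and len(path) <= len(pointers):
        path.append(pointers[path[-1]][1])
    return path


def reassemble_path(via: str, to_src: Pointers,
                    to_dst: Pointers) -> Optional[List[str]]:
    """Join src..via and via..dst; return None if the result repeats a node."""
    path = walk(to_src, via)[::-1] + walk(to_dst, via)[1:]
    return path if len(set(path)) == len(path) else None


def k_shortest_paths(to_src: Pointers, to_dst: Pointers, k: int) -> List[List[str]]:
    """Pick up to k cycle-free paths, each through a node not seen before."""
    # "Map": nodes with pointers to both endpoints become (total cost, node).
    candidates = sorted((to_src[n][0] + to_dst[n][0], n)
                        for n in to_src if n in to_dst)
    visited = set()
    paths: List[List[str]] = []
    for _cost, node in candidates:            # "Reduce": candidates in cost order
        if len(paths) >= k:
            break
        if node in visited:                   # adds no new node, so skip it
            continue
        path = reassemble_path(node, to_src, to_dst)
        if path is None:                      # reject paths containing a cycle
            continue
        visited.update(path)
        paths.append(path)
    return paths
```

Because every node on an accepted path is added to `visited`, a later candidate is examined only if its via-node has not appeared on any earlier path, which is how the sketch enforces node uniqueness.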
The question in this phase is whether the user can easily read the information contained in the path set. The answer is no, because the path set is ordered only by path cost, and no other information is available. When calculating the paths, many important paths may be missed if the value of K is too small; conversely, if K is too large, there may be too many paths. Thus, we should choose an optimal value of K that yields an appropriate number of paths. There are many methods for optimizing paths in a network graph, and different combinations of algorithms can be used according to the needs of users. Here we choose betweenness (intermediary degree) and degree centrality to optimize the output of the cloud platform.

In practice, we want to know not only which nodes are on a path, but also whether someone in the relationship chain plays a key role, because such a person may be the one who is known by both parties. Thus, we choose betweenness to optimize the path set. Betweenness centrality, betweenness for short (Brandes 2001), is an important concept in social network analysis. It measures how many of the shortest paths between all pairs of nodes pass through a given node. Betweenness describes well the contribution that the nodes on each path make to the connection between two nodes. The larger the betweenness of a node, the more paths go through it and the more likely it is to connect other nodes.

Consider a weighted directed (multi)graph $G = (V, E)$ with $n = |V|$ and $m = |E|$. Let $SP_{st}$ denote the set of shortest paths between source $s$ and target $t$, and $SP_{st}(v)$ the subset of $SP_{st}$ consisting of paths that have $v$ in their interior. Then the betweenness centrality of node $v$ is

$$C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}} \qquad (1)$$

where $\sigma_{st} := |SP_{st}|$ and $\sigma_{st}(v) := |SP_{st}(v)|$.

In addition, we choose degree centrality as another optimization factor. The degree centrality of a node is defined as the number of edges that connect to this node (Qin and Li 2011). Generally, the higher the centrality of a node, the more popular or well connected the node is. In our algorithm, the degree centrality of node $v$ is written as

$$C_D(v) = \frac{\deg(v)}{n - 1} \qquad (2)$$

where $\deg(v)$ is the degree of $v$ and $n$ is the number of nodes.
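Equations (1) and (2) can be evaluated on a small graph with the following sketch. It makes simplifying assumptions that the text above does not: the graph is unweighted and given as a plain adjacency dictionary, and betweenness is accumulated with the breadth-first scheme of Brandes (2001). The function names are illustrative. The sum in Eq. (1) runs over ordered pairs (s, t), so on an undirected graph each pair of endpoints is counted once in each direction.

```python
from collections import deque
from typing import Dict, Hashable, List

Adjacency = Dict[Hashable, List[Hashable]]  # node -> list of neighbours


def betweenness_centrality(adj: Adjacency) -> Dict[Hashable, float]:
    """Eq. (1) for an unweighted graph, summed over ordered pairs (s, t)."""
    cb = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:                 # w is discovered for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:      # v precedes w on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back toward s.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                cb[w] += delta[w]
    return cb


def degree_centrality(adj: Adjacency) -> Dict[Hashable, float]:
    """Eq. (2): deg(v) / (n - 1); requires at least two nodes."""
    n = len(adj)
    return {v: len(neighbours) / (n - 1) for v, neighbours in adj.items()}
```

For the three-node chain a-b-c, node b lies on the only shortest path between a and c, so it receives all of the betweenness, and it is also the only node connected to every other node.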

4 The analysis of the experiment

We conducted an experiment to evaluate the performance of our algorithm. We built a cloud system and used simulated networks for the experiment.

4.1 Hardware configuration of the cloud system

We used Hadoop to configure the cloud platform. The cloud cluster contains 10 nodes, and each node has a 2.6-GHz CPU, 8 GB of RAM, and 1 TB of hard disk.

There are 128 mappers and 64 reducers by default. After a large number of tests, we set the ratio of mappers to reducers that gave the best running speed for the graph algorithm to 2:1.

4.2 Network model

It is hard to obtain real large-scale network data for our test. Thus, we used mathematical modeling to simulate network data. A social network is a random network with its own regularities, and it is a small-world network. Clustering appears throughout the network model, and the network obeys the power-law distribution. We therefore used the mathematical model developed by Holme and Kim (2002) to model a network with acceptable network properties. We used NetworkX, a Python software package for the generation, manipulation, and study of the structure, dynamics, and function of complex networks. The structure of a social network resembles a graph, in which people are represented by nodes and the relationships between people are represented by edges. NetworkX is therefore well suited to analyzing social network data. It can generate many kinds of network graphs, and we use one of them to generate the network data. Figure 1 shows the network pattern we used to simulate a social network. NetworkX can generate networks that follow a power-law distribution. The simulated network we used is fairly large: it contains 100 million nodes and 600 million edges.

Fig. 1 A network pattern we used to simulate social network

5 Results and discussion

5.1 Time complexity

Since each line receives a new message during the iterative process, the complexity with respect to node degree for each network graph is O(1), and its time complexity is O(log E). The total number of messages is O(E). The complexity of storing data in the HBase table structure is O(log E). In Fig. 2a, we compare the efficiency of our algorithm with that of parallel BFS. Our improved parallel BFS algorithm outperforms the traditional one. As shown in Fig. 2a, when the number of nodes is small, the two algorithms perform the same. As the number of nodes in the social network increases, our improved algorithm becomes more efficient than the traditional algorithm. The time complexity of the K-shortest path algorithm is O(V log V), and so is that of the BFS algorithm. The time complexity of the shuffle-sort phase is O(E log E). Figure 2b shows the running time of the K-shortest path algorithm for different numbers of nodes. From 1 to 4, the running time grows roughly linearly. From 5 to 10, the running time grows faster and eventually exceeds the running time shown in Fig. 2a.

5.2 Discussion

As shown in Table 1, the cascade phenomenon has a significant influence on the number of iterations, especially when data is forced to be written into HBase immediately after mapping. For the large network with 100 million nodes and 600 million edges, our method takes 6 h and 30 min to compute the shortest paths, which is acceptable. The parallel BFS algorithm accounted for half of the running time, and the K-shortest path algorithm for the rest. When K is 20, the running time of the optimization algorithm is trivial compared with that of the other steps.

Fig. 2 a Efficiency comparison of the two algorithms. b Running time of the K-shortest paths algorithm for different numbers of nodes

Table 1 Comparison of the number of BFS iterations with different numbers of nodes
Columns: Node number | Traditional iteration times | Improved iteration times | Cascade iteration times | Path length

6 Conclusion

This paper proposed a parallel BFS algorithm, based on HBase, to compute the shortest paths in a social network. The algorithm analyzes the social network model. It computes the shortest paths between different nodes of the network in a parallel environment, and it optimizes the output from the cloud platform using betweenness (intermediary degree) and degree centrality. We validated the efficiency of the proposed algorithm on a large simulated social network. The experimental results indicate that the proposed algorithm improves the efficiency of parallel BFS.

Acknowledgments This study was supported by the National Natural Science Foundation of China (Grant No , , ); Natural Science Foundation of Shanxi Province (Grant No ) and Programs for Science and Technology Development of Shanxi Province (Grant No ). This work was also supported in part by the US National Science Foundation (NSF) under Grant no. CNS and CCF

References

Barabási A-L, Albert R (1999) Emergence of scaling in random networks. Science 286:509
Barabási A-L, Albert R, Jeong H (1999) Mean-field theory for scale-free random networks. Physica A 272:173
Brandes U (2001) A faster algorithm for betweenness centrality. J Math Sociol 25(2):
Chang F et al (2006) Bigtable: a distributed storage system for structured data. OSDI
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):
Dijkstra EW (1959) A note on two problems in connexion with graphs. Numerische Mathematik 1(1):
Gu L, Huang HL, Zhang XD (2013) The clustering coefficient and the diameter of small-world networks. Acta Mathematica Sinica, English Series 29:
Holme P, Kim BJ (2002) Growing scale free networks with tunable clustering. Phys Rev E 65:
Hoque I, Gupta IC (2012) Disk layout techniques for online social network data. IEEE Internet Computing
Lakshman A, Malik P (2010) Cassandra: a decentralized structured storage system. SIGOPS 44(2):35–40
Lin J, Dyer C (2010) Data-intensive text processing with MapReduce, ser. Synthesis lectures on human language technologies. Morgan and Claypool Publishers, Florida
Lu Z, Fan L, Wu W, Thuraisingham B, Yang K (2014) Efficient influence spread estimation for influence maximization under the linear threshold model. To appear in Computational Social Networks
McCubbin C, Perozzi B (2011) Finding the needle: locating interesting nodes using the K-shortest paths algorithm in MapReduce. IEEE International Conference on Data Mining Workshops
Missen MMSC (2008) The small world of web network graphs. International Multi Topic Conference on Wireless Networks, Information Processing and Systems, IMTIC
Qin L, Li H (2011) Centrality analysis of BBS reply networks. International Conference on Information Technology, Computer Engineering and Management Sciences, ICM
Škrabálek J, Kunc P, Nguyen F (2013) Towards effective social network system implementation/new trends in databases and information systems. Springer Berlin, Heidelberg, pp
Taylor RC (2010) An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinform 11(Suppl 12):S1
Watts DJ, Strogatz SH (1998) Collective dynamics of small-world networks. Nature 393:440
