Delta-K²-tree for Compact Representation of Web Graphs


Yu Zhang 1,2, Gang Xiong 1,*, Yanbing Liu 1, Mengya Liu 1,2, Ping Liu 1, and Li Guo 1

1 Institute of Information Engineering, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
{zhangyu,xionggang,liuyanbing,liumengya,liuping,guoli}@iie.ac.cn

Abstract. The structure of the World Wide Web can be represented by a directed graph called the web graph. Web graphs are used in a wide range of applications. However, increasingly large web graphs pose great challenges to traditional memory-resident graph algorithms. In the literature, the K²-tree compresses web graphs efficiently while supporting fast queries on the compressed data. Inspired by the K²-tree, we propose the Delta-K²-tree compression approach, which exploits the similarity between neighboring nodes in web graphs. In addition, we design a node reordering algorithm to further improve the compression ratio. We compare our approach with state-of-the-art algorithms, including the K²-tree, WebGraph, and the BFS-based method of Apostolico and Drovandi. Experimental results of web graph compression on four datasets show that our Delta-K²-tree approach outperforms the other three in compression ratio (bits per link) while supporting fast forward and reverse queries on the graph.

Keywords: Web graphs, Compact data structures, Graph compression, Adjacency matrix

1 Introduction

In web management and mining applications, the structure of the World Wide Web can be represented by a directed graph in which each web page corresponds to a node and each hyperlink corresponds to an edge. Such a directed graph is known as a web graph. Many basic algorithms and operations rely on web graphs to analyze and mine the inner structure of the web. For example, some famous page ranking algorithms, such as PageRank [1] and HITS [2], which are used in the major search engines, are based on the web graph structure.
Their key techniques are computing the out-degree and in-degree of each node and analyzing the connections between nodes. With the explosive development of the Internet, web graphs are growing at an amazing speed. To meet the need for large-scale graph data management, recent years have seen a trend towards studying efficient compression techniques and fast querying algorithms.

* Corresponding author.

Traditional methods for storing and manipulating web graphs mostly store the graph as an adjacency matrix or an adjacency list. To guarantee efficient querying, the entire adjacency matrix or list must be loaded into memory. However, keeping increasingly large graphs with millions of nodes and edges memory-resident is not practical. According to the official report by CNNIC (China Internet Network Information Center) [3], the numbers of web pages and hyperlinks in China were about 86.6 billion and 1 trillion, respectively, by the end of 2012. Stored as an adjacency list, this web graph takes over 16TB. Such a huge memory requirement poses a great challenge to traditional storage methods.

Three lines of research address the excessive storage problem:
(1) Storing the graph in external memory, since external memory is much cheaper and larger than main memory [4, 5].
(2) Using a distributed system to partition the graph into small subgraphs and manipulating the subgraphs on different computers [6-8].
(3) Converting the graph to a compact form that requires less space while supporting fast querying [9-12].

In our research, we focus on the third line and aim to represent web graphs in a highly compact form, so that huge graphs can be manipulated in main memory. In practice, such a compression algorithm also benefits the other two lines of research. For the external memory scheme, locality of access improves because much more compressed graph data fits in main memory at one time. For the distributed scheme, a highly compact structure allows fewer computers to do the same work and reduces network traffic.

Among all graph compression algorithms, the K²-tree [11] is a representative algorithm with a high compression ratio and fast querying performance.
This algorithm uses an adjacency matrix to represent a graph and exploits its sparsity for effective compression. However, the K²-tree ignores an important characteristic, the similarity between adjacent rows or columns of the adjacency matrix, which can be exploited to improve the compression ratio.

In this paper, we propose a new tree-form structure named the Delta-K²-tree. A series of experiments indicates that our approach outperforms the K²-tree in compression ratio while still supporting fast querying. Furthermore, we propose a node reordering algorithm that makes better use of the similarity between nonadjacent rows or columns, which further improves the compression ratio of the Delta-K²-tree.

2 Related Works

Researchers in the field of web graph compression are mostly interested in compact representations that support efficient query operations, such as checking whether two page nodes are connected or extracting the successors of a page node.

The most influential representative of this trend is the WebGraph framework [9]. When WebGraph is used to compress web graphs made up of URLs (Uniform Resource Locators), the URLs are first sorted in lexicographical order so that similar URLs appear in adjacent positions. By exploiting the similarity between adjacent URLs, the method achieves a good trade-off between compression ratio and querying speed. Variants of WebGraph [13-16] further reduce the storage space through more effective encoding and reordering techniques.

Using the same reordering process, the method of [12] further exploits the structural characteristics of the web graph adjacency matrix. It extracts and compresses six kinds of regular sub-graphs to achieve a high compression ratio. However, finding all neighbors of a given page is particularly slow, since the query requires numerous accesses to all the sub-graphs.

Instead of using the lexicographical order, the algorithm proposed in [10] reorders the web graph nodes by Breadth First Search (BFS). By taking advantage of the similarity between adjacent nodes in the adjacency list after reordering, it is competitive with WebGraph in compression efficiency and querying speed.

All approaches mentioned above provide only forward queries, although they can be converted to support bidirectional queries. In [16], a web graph is divided into two sub-graphs, one containing all bidirectional edges and the other containing all unidirectional edges. The method compresses both sub-graphs together with the transpose of the unidirectional sub-graph. However, such extended methods require extra space to store the transposed graph.

In [11], Brisaboa et al. present the K²-tree structure, which offers forward and reverse queries without constructing the transposed graph. It exploits the large empty areas of the graph adjacency matrix and achieves a very good compression ratio.
In this paper, we improve the performance of the K²-tree by exploiting the similarity between adjacent nodes in the graph adjacency matrix and by reordering nonadjacent nodes to further improve the compression ratio. We compare our method with the best alternatives in the literature and offer a space/time analysis based on the experimental results.

3 Preliminary

3.1 Notation

In this paper, a directed graph $G = (V, E)$ denotes a web graph, where V is the set of nodes and E is the set of edges. Each node corresponds to a page and each edge corresponds to a link. We write $n = |V|$ for the number of nodes and $m = |E|$ for the number of edges. A square matrix $\{a_{i,j}\}$ containing only 0s and 1s denotes the adjacency matrix: $a_{i,j}$ is 1 if there is an edge from $v_i$ to $v_j$ and 0 otherwise.

3.2 K²-tree

In [11], an unbalanced tree structure named the K²-tree represents an adjacency matrix. Each node of the K²-tree stores one bit, 0 or 1. Every node in the last level represents an element of the matrix, and every other node represents a sub-matrix. Except in the last level, a node storing 1 corresponds to a sub-matrix containing at least one 1, and a node storing 0 corresponds to a sub-matrix containing only 0s.

To construct the K²-tree, the $n \times n$ adjacency matrix is divided into $K^2$ equal parts, each an $\frac{n}{K} \times \frac{n}{K}$ sub-matrix. Each sub-matrix corresponds to a child of the root. A child is 1 if and only if its sub-matrix contains at least one 1; otherwise the child is 0. Every sub-matrix whose child is 1 is divided recursively into $K^2$ equal parts until a sub-matrix contains only 0s or consists of a single element.

In real web graphs, m is far less than $n^2$, so the adjacency matrix is extremely sparse. Thanks to this sparsity, the K²-tree achieves a high compression ratio by using a single node to represent a sub-matrix containing only 0s. [11] proves that, in the worst case, the total space of the K²-tree is $K^2 m\left(\log_{K^2}\frac{n^2}{m} + O(1)\right)$ bits, which is asymptotically twice the information-theoretic lower bound necessary to represent all $n \times n$ matrices with m 1s.

To answer a query for two given nodes $v_i$ and $v_j$, we use the K²-tree to determine whether $a_{i,j}$ is 0 or 1. Starting with the root as the current node, we find the child that represents the sub-matrix containing $a_{i,j}$. If that child stores 0, then $a_{i,j}$ is 0. Otherwise the child becomes the current node, and we keep descending until the child stores 0 or the current node has no child. $a_{i,j}$ is 0 if the last node found stores 0, and 1 otherwise.
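The construction and link query described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it keeps the tree as nested Python lists instead of bit arrays, and the names `build_k2tree` and `has_link` are hypothetical. Cells outside the original matrix are treated as 0, which assumes the matrix is padded to the next power of K.

```python
def build_k2tree(matrix, k=2):
    """Build a K^2-tree of a 0/1 square matrix as nested lists.

    A leaf is the int 0 or 1; an internal node (which stands for a 1)
    is a list of k*k children in row-major order of the sub-matrices.
    """
    n = len(matrix)
    size = 1
    while size < n:          # side length padded to a power of k
        size *= k

    def cell(r, c):
        # Cells in the padded area are 0 (assumption of this sketch).
        return matrix[r][c] if r < n and c < n else 0

    def build(r0, c0, s):
        if s == 1:
            return cell(r0, c0)
        step = s // k
        children = [build(r0 + i * step, c0 + j * step, step)
                    for i in range(k) for j in range(k)]
        # A sub-matrix containing only 0s collapses into a single 0 node.
        if all(child == 0 for child in children):
            return 0
        return children

    return build(0, 0, size), size


def has_link(tree, size, i, j, k=2):
    """Return True if a[i][j] == 1, descending one child per level."""
    node, s = tree, size
    while isinstance(node, list):
        s //= k
        node = node[(i // s) * k + (j // s)]
        i, j = i % s, j % s
    return node == 1


if __name__ == "__main__":
    m = [[0, 1, 0],
         [0, 0, 1],
         [1, 0, 0]]
    tree, size = build_k2tree(m)
    print(has_link(tree, size, 0, 1), has_link(tree, size, 0, 0))
```

The descent stops as soon as it reaches a 0 node, which is why a query on a sparse matrix often touches fewer levels than the full height of the tree.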
In practice, if n is not a power of K, the matrix can be extended to $K^{\lceil \log_K n \rceil} \times K^{\lceil \log_K n \rceil}$ by adding 0s at the right and the bottom. The K²-tree is stored in two bit arrays, T and L. T stores all nodes except those in the last level, obtained by traversing the K²-tree level by level from left to right. L stores the nodes of the last level from left to right. Fig. 1 shows an adjacency matrix and the corresponding K²-tree for K = 2; the 0s in the grey area are added because n is not a power of K.

Fig. 1. The adjacency matrix and the corresponding K²-tree.

To find a child of a given node efficiently using T and L, T must support the Rank query. $Rank(T, i)$ ($0 \le i < |T|$) counts the number of 1s from position 0 up to position i in T, where the first position in T is 0. For example, the node represented by $T[i]$ has children if $T[i] = 1$, and its s-th child ($0 \le s < K^2$) is at position $Rank(T, i) \cdot K^2 + s$ of $T : L$, where $T : L$ denotes the concatenation of T and L. [17] proves that Rank can be calculated in constant time using sub-linear space.

3.3 Rank

The implementation of Rank in [17] achieves very good theoretical results, but it is complicated to realize. [18] proposes a simple implementation and shows that in many practical cases the simpler solutions are better in terms of time and extra space. For a bit array T, the method uses an array R that samples Rank every B positions, so that $R[\lfloor i/B \rfloor]$ is the number of 1s in T before position $\lfloor i/B \rfloor \cdot B$, and an array popc that stores the number of 1s of every possible b-bit pattern. Then

$$Rank(T, i) = R\left[\left\lfloor \tfrac{i}{B} \right\rfloor\right] + \sum_{k=0}^{\lfloor (i \bmod B)/b \rfloor - 1} popc\left[T\left[\left\lfloor \tfrac{i}{B} \right\rfloor B + kb,\ \left\lfloor \tfrac{i}{B} \right\rfloor B + (k+1)b - 1\right]\right] + popc\left[T\left[\left\lfloor \tfrac{i}{b} \right\rfloor b,\ \left\lfloor \tfrac{i}{b} \right\rfloor b + b - 1\right] \,\&\, \underbrace{1 \ldots 1}_{(i \bmod b) + 1}\underbrace{0 \ldots 0}_{b - (i \bmod b) - 1}\right]$$

where $T[i, j]$ denotes T from the i-th position to the j-th position and B is a multiple of b. When the length of T is t, the length of R is $\lceil t/B \rceil$ and the length of popc is $2^b$.

Because _mm_popcnt_u64 in SSE (Streaming SIMD Extensions) counts the number of 1s in a 64-bit integer, in practice we set b to 64 and store every 64 bits of the bit array in T, an array of 64-bit integers, as shown in Fig. 2. B is set to $2^w b$ for convenience of programming. As w increases, the computation increases and the space decreases.

    procedure Rank(T, i)
        result := R[i >> (6 + w)]                                // 2^6 is 64
        for (k := (i >> (6 + w)) << w; k < (i >> 6); k++)
            result += _mm_popcnt_u64(T[k])
        result += _mm_popcnt_u64(T[k] >> (0x3f - (i & 0x3f)))    // 0x3f is 63
        return result

Fig. 2. The Rank algorithm.

4 Delta-K²-tree

4.1 Motivation

By taking advantage of the sparsity of the adjacency matrix, the K²-tree compresses the web graph efficiently, and its space is $K^2 m\left(\log_{K^2}\frac{n^2}{m} + O(1)\right)$ bits in the worst case. We prove in Theorem 1 that, as m decreases, the worst-case total space of the K²-tree decreases when K and n are unchanged. According to this analysis, if we can reduce the number of 1s while keeping the size of the matrix unchanged, we can reduce the space of the K²-tree.
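As a quick numeric illustration of this motivation, the sketch below evaluates only the leading term $K^2 m \log_{K^2}(n^2/m)$ of the worst-case bound for a fixed n and a decreasing m. Dropping the O(1) term is an assumption made for the example, and both the function name and the values of n and m are hypothetical, not taken from the experiments.

```python
from math import log


def k2tree_worst_case_bits(n, m, k=2):
    """Leading term K^2 * m * log_{K^2}(n^2 / m) of the worst-case bound.

    The O(1) term inside the parentheses is omitted (assumption).
    """
    return k * k * m * log(n * n / m, k * k)


if __name__ == "__main__":
    n = 1_000_000  # hypothetical number of nodes
    for m in (40_000_000, 20_000_000, 10_000_000):
        # Fewer 1s in the matrix give a smaller worst-case bound.
        print(m, round(k2tree_worst_case_bits(n, m)))
```

The bound per edge grows slightly as the matrix gets sparser, but the total still shrinks, which is the effect the Delta-matrix is designed to exploit.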

Theorem 1. In the worst case, the space of the K²-tree of a sparse matrix decreases as the number of 1s decreases, provided n and K are unchanged.

Proof. For $y = K^2 m\left(\log_{K^2}\frac{n^2}{m} + O(1)\right)$, let $a = K^2$, $b = n^2$, $c = O(1)$, and $x = \frac{n^2}{m}$. Then $y = \frac{ab}{x}(\log_a x + c)$, and its derivative is

$$y' = \frac{ab}{x^2 \ln a}\left(\ln\frac{e}{a^c} - \ln x\right).$$

When $x > \frac{e}{a^c}$, we have $y' < 0$, so y decreases as x increases. Because the matrix is sparse, x is clearly greater than $\frac{e}{a^c}$.

4.2 Construction and Query

The similarity between the neighbors of different pages is well known and widely used in compression algorithms such as WebGraph [9] and the BFS-based method of [10]. We also use this characteristic to reduce the number of 1s. We use a matrix named the Delta-matrix to store the differences between adjacent rows or columns of the adjacency matrix. We take rows as the example. The Delta-matrix can be constructed with the method in Fig. 3, where Count1s(matrix[i]) is the number of 1s in the i-th row of the matrix and CountDif(matrix[i], matrix[j]) is the number of positions at which the i-th row and the j-th row differ. D in Fig. 3 is an n-bit array that records which rows of the Delta-matrix store differences. By construction, the number of 1s in the Delta-matrix is not greater than that in the adjacency matrix.

    procedure Delta-matrix_Construction(matrix[n][n])
        D[0] := 0
        Delta-matrix[0] := matrix[0]
        for (i := 1; i < n; i++)
            if (Count1s(matrix[i]) < CountDif(matrix[i], matrix[i-1]))
                D[i] := 0
                Delta-matrix[i] := matrix[i]
            else
                D[i] := 1
                create an n-bit array R
                for (k := 0; k < n; k++)
                    if (matrix[i][k] == matrix[i-1][k])
                        R[k] := 0
                    else
                        R[k] := 1
                Delta-matrix[i] := R
        return Delta-matrix, D

Fig. 3. The construction for Delta-matrix.

The Delta-matrix and the n-bit array D can be used instead of the adjacency matrix to represent web graphs. We use $\{a_{i,j}\}$ to represent the adjacency matrix and $\{a'_{i,j}\}$ to represent the Delta-matrix.
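The row-wise construction of Fig. 3 can be sketched in Python as follows. This is an illustrative version on plain lists of 0/1 values rather than bit arrays; the names `build_delta_matrix` and `original_row` are hypothetical, and the decoding helper is an assumption of this sketch that recovers a row by XOR-ing back to the nearest row stored without differencing.

```python
def build_delta_matrix(matrix):
    """Return (delta, d) following the row-wise construction of Fig. 3.

    d[i] == 0: delta[i] is row i itself.
    d[i] == 1: delta[i] is the XOR of row i and row i-1.
    """
    delta = [list(matrix[0])]
    d = [0]
    for i in range(1, len(matrix)):
        row, prev = matrix[i], matrix[i - 1]
        diff = [a ^ b for a, b in zip(row, prev)]
        if sum(row) < sum(diff):      # Count1s < CountDif: keep the raw row
            delta.append(list(row))
            d.append(0)
        else:                         # otherwise store the differences
            delta.append(diff)
            d.append(1)
    return delta, d


def original_row(delta, d, i):
    """Recover row i of the adjacency matrix from delta and d."""
    row = list(delta[i])
    while d[i] == 1:                  # d[0] is always 0, so this terminates
        i -= 1
        row = [a ^ b for a, b in zip(row, delta[i])]
    return row


if __name__ == "__main__":
    m = [[1, 1, 0, 1],
         [1, 1, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 1, 1]]
    delta, d = build_delta_matrix(m)
    print(d)
    print(delta)
```

Because each row keeps whichever of the two encodings has fewer 1s, the Delta-matrix never contains more 1s than the adjacency matrix, at the cost of a longer decoding chain when many consecutive rows are stored as differences.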
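Elements of the adjacency matrix can be obtained from the Delta-matrix and D by formula (1), where $\oplus$ denotes exclusive-or and s is the number of consecutive 1s in D counted from the i-th position toward the front, so that row $i - s$ is the nearest preceding row stored without differencing.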

$$a_{i,j} = \begin{cases} a'_{i,j}, & \text{if } D[i] = 0 \\ a'_{i,j} \oplus a'_{i-1,j} \oplus \cdots \oplus a'_{i-s,j}, & \text{if } D[i] = 1 \end{cases} \qquad (1)$$

We use the K²-tree to compress the Delta-matrix instead of the adjacency matrix to reduce the space. However, we then need to access the K²-tree of the Delta-matrix several times to obtain one element of the adjacency matrix. If the number of consecutive 1s in D is very large, a query becomes very time-consuming. To resolve this problem, we propose two methods:

(1) We replace the nodes in the last level of the K²-tree of the Delta-matrix with the elements at the same positions in the adjacency matrix. We call the modified K²-tree the Delta-K²-tree. In the example of Fig. 4, the dotted line marks the replaced nodes.
(2) When using the Delta-K²-tree, if we access a node storing 0 that is not in the last level, all elements of the sub-matrix represented by that node are 0s. So one access can obtain several elements.

In practical applications using these two methods, one query for an element of the adjacency matrix needs only about 2 accesses to the Delta-K²-tree on average. In addition, the Delta-K²-tree can exploit the similarity between adjacent columns in the same way as between adjacent rows; the choice can be made according to the actual situation.

Fig. 4. The K²-tree for the Delta-matrix and the corresponding Delta-K²-tree (panels: D, Delta-matrix, K²-tree for Delta-matrix, Delta-K²-tree).

4.3 Nodes Reordering

The Delta-K²-tree exploits the similarity between adjacent nodes in web graphs. In fact, similar nodes may not be adjacent. We can reorder the nodes of the web graph to make better use of this characteristic, that is, find an order of nodes that yields the Delta-matrix with the fewest 1s. We use a directed graph $G = (V, E)$ to represent the similarity of nodes in the matrix. In this subsection, G does not represent the web graph.
$v_i \in V$ represents the i-th node, and the weight of $e(v_i, v_j)$ for every two different vertices is the minimum of the number of the i-th node's neighbors and the number of differences between the i-th node's neighbors and the j-th node's neighbors. For a web graph with n nodes, the graph G contains n vertices and $n(n-1)$ edges. Every Hamiltonian path in G corresponds to an order of nodes in the web graph, and the weight of the path is the number of 1s in the Delta-matrix. The problem is thus transformed into the shortest Hamiltonian path problem. Finding the shortest Hamiltonian path is NP-hard, so we propose a heuristic algorithm. The algorithm randomly selects a starting vertex and visits every vertex once, always following the edge of the current vertex with the minimal weight among those leading to unvisited vertices. The order of vertices in the resulting path is the order of nodes in the web graph.

5 Experiments

5.1 Experimental environment and test data

Our test datasets are real web graphs obtained from the Laboratory for Web Algorithmics [9]. Table 1 lists the numbers of nodes and edges and the filenames on their website [19]. Our experiments run on Red Hat Enterprise Linux 6.0 Server (64-bit) with an Intel(R) Core(TM) i7-3820 CPU @ 3.60GHz and 32GB RAM. All tests use only one CPU core. The compilers used are gcc and java.

Table 1. Description of the real web graphs used for testing.

    Web graph   Nodes       Edges        Filename
    uk          100,000     3,050,615    uk-2007-05@100000
    cnr         325,557     3,216,152    cnr-2000
    eu          862,664     19,235,140   eu-2005
    in          1,382,908   16,917,053   in-2004

We compare the Delta-K²-tree with state-of-the-art algorithms, namely the K²-tree, WebGraph, and the BFS-based method of [10], in memory space and querying speed over the test data. We implement the K²-tree and the Delta-K²-tree in C++. The version of WebGraph we use is publicly available at [19]. The version of the BFS-based method we use is 0.3.2, which is publicly available at [20]. Both are implemented in Java.

5.2 Memory space comparison with different options

Table 2 compares the memory space of the K²-tree and the Delta-K²-tree with different options. Space is measured in bpe (bits per edge), computed by dividing the total space of the compressed data by the number of edges in the web graph.
We configure the K²-tree and the Delta-K²-tree with parameter K = 2, 4. Rank is configured with parameter B = 512. For the Delta-K²-tree, we test four different options. Variants that use the similarity between adjacent rows or adjacent columns of the adjacency matrix are labeled "row" and "column". Node reordering before compression is labeled "reorder". The results show that our proposal reduces space by about 40% compared with the K²-tree. Among the options, using the similarity of columns compresses better than using rows, and our node reordering method improves compression significantly.

Table 2. Memory space comparison (in bpe) between K²-tree and Delta-K²-tree with different options.

                                      uk           cnr          eu           in
                                      K=2   K=4    K=2   K=4    K=2   K=4    K=2   K=4
    K²-tree
    Delta-K²-tree (row)
    Delta-K²-tree (column)
    Delta-K²-tree (row+reorder)
    Delta-K²-tree (column+reorder)
    Reduction in space                44%   38%    34%   42%    39%   33%    41%   35%

5.3 Memory space comparison with other approaches

Table 3 compares the memory space of the K²-tree, WebGraph, the BFS-based method of [10], and the Delta-K²-tree. Space is measured in bpe. We configure WebGraph with parameters w = 7 and m = 3, the BFS-based method with parameter l = 1, and the K²-tree and the Delta-K²-tree with parameters K = 2 and B = 512, to favor compression over speed. Because WebGraph and the BFS-based method are based on adjacency lists, they support only forward queries. We use the technique proposed in [16], introduced in the related work, to add reverse queries at the cost of some extra space. The results show that our proposal uses the least space of all the algorithms while supporting both forward and reverse queries.

Table 3. Memory space comparison (in bpe) with other approaches (columns: web graph, K²-tree, WebGraph, BFS-based method, Delta-K²-tree; rows: uk, cnr, eu, in).

5.4 Space/speed trade-off comparison with other approaches

In this experiment, WebGraph and the BFS-based method of [10] support only forward queries and use no extra space. We test querying speed in two aspects: query for link and query for neighbors. Query for link checks whether two given nodes are connected. Query for neighbors obtains all neighbors of a given node. Space is measured in bpe. Speed is measured in nspe (nanoseconds per edge). The speed of query for link is the time of one query. The speed of query for neighbors is the time of one query divided by the number of neighbors returned.

Fig. 5 shows the space/speed trade-off of query for link, and Fig. 6 shows the space/speed trade-off of query for neighbors. We configure WebGraph with parameters (w, m) = (1, 1), (3, 3), (7, 3), the BFS-based method with parameters l = 4, 8, 16, 1, and the K²-tree and the Delta-K²-tree with parameters K = 2 and B = 64, 128, 256, 512. Our proposal has no advantage in raw querying speed. When querying for link, the K²-tree is the fastest. However, our proposal shows a better space/speed trade-off, especially when querying for link. If both high compression and fast link queries are needed, the Delta-K²-tree is the best choice.

Fig. 5. Space/speed trade-off of querying for link over uk, cnr, eu and in.

Fig. 6. Space/speed trade-off of querying for neighbors over uk, cnr, eu and in.

6 Conclusions and future work

We have presented a new compression method for web graphs, the Delta-K²-tree, which takes advantage of the similarity of hyperlinks and the sparsity of adjacency matrices, together with a node reordering algorithm that further improves compression. We compare it with the commonly used alternatives [9-11] in the field. Our experiments show that it achieves a high compression ratio while supporting fast forward and reverse queries. For checking whether two given pages are connected, it is a competitive method that satisfies the requirements of both high compression and fast querying.

The node reordering algorithm improves the compression of the Delta-K²-tree, but it cannot guarantee the optimal solution. Designing a new heuristic node reordering algorithm is therefore one of our possible future works. How to improve querying speed with the Delta-K²-tree is another direction we will consider.

Acknowledgement. This research was supported by the National Natural Science Foundation of China (No ); the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA6362); the National High Technology Research and Development of China (863 Program) (No. 211AA175, 212AA1252).

References

1. Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. In: Computer Networks and ISDN Systems 30(1) (1998)
2. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. In: Journal of the ACM (JACM) 46(5) (1999)
3. China Internet Network Information Center, research/bgxz/tjbg/2121/t212116_23668.html
4. Vitter, J.S.: External memory algorithms and data structures: Dealing with massive data. In: ACM Computing Surveys (CSUR) 33(2) (2001)
5. Vitter, J.S.: Algorithms and data structures for external memory. In: Foundations and Trends in Theoretical Computer Science 2(4) (2008)
6. Badue, C., Baeza-Yates, R., Ribeiro-Neto, B., Ziviani, N.: Distributed query processing using partitioned inverted files. In: SPIRE 2001. Proceedings. Eighth International Symposium on. IEEE (2001)
7. Tomasic, A., Garcia-Molina, H.: Performance of inverted indices in shared-nothing distributed text document information retrieval systems. In: Proceedings of the Second International Conference on Parallel and Distributed Information Systems. IEEE (1993)
8. Yu, G., Gu, Y., Bao, Y.B., Wang, Z.G.: Large scale graph data processing on cloud computing environments. In: Chinese Journal of Computers 34(1) (2011)
9. Boldi, P., Vigna, S.: The WebGraph Framework I: Compression techniques. In: the 13th International Conference on World Wide Web. ACM (2004)
10. Apostolico, A., Drovandi, G.: Graph compression by BFS. In: Algorithms 2(3) (2009)
11. Brisaboa, N.R., Ladra, S., Navarro, G.: k2-trees for compact web graph representation. In: String Processing and Information Retrieval. Springer Berlin Heidelberg (2009)
12. Asano, Y., Miyawaki, Y., Nishizeki, T.: Efficient compression of web graphs. In: Computing and Combinatorics. Springer Berlin Heidelberg (2008)
13. Boldi, P., Vigna, S.: The WebGraph Framework II: Codes for the World-Wide Web. In: the Conference on Data Compression, p. 528. IEEE Computer Society (2004)
14. Boldi, P., Santini, M., Vigna, S.: A large time-aware web graph. In: ACM SIGIR Forum 42(2). ACM (2008)
15. Boldi, P., Santini, M., Vigna, S.: Permuting web graphs. In: Algorithms and Models for the Web-Graph. Springer Berlin Heidelberg (2009)
16. Boldi, P., Rosa, M., Santini, M., Vigna, S.: Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In: the 20th International Conference on World Wide Web. ACM (2011)
17. Jacobson, G.: Space-efficient static trees and graphs. In: 30th Annual Symposium on Foundations of Computer Science. IEEE (1989)
18. Gonzalez, R., Grabowski, S., Makinen, V., Navarro, G.: Practical implementation of rank and select queries. In: Poster Proceedings Volume of 4th Workshop on Efficient and Experimental Algorithms (WEA'05) (2005)
19. Laboratory for Web Algorithmics, Homepage
20. Drovandi, G., PhD Web Site, software.php
