International Journal of Computer Engineering and Applications, Volume IX, Issue VIII, Sep. 15 www.ijcea.com ISSN 2321-3469

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

Atul Kumar Srivastava 1, Mitali Srivastava 2, Rakhi Garg 3, P. K. Mishra 4
1, 2, 4 Department of Computer Science, Faculty of Science, Banaras Hindu University, Varanasi, India
3 Computer Science Section, Mahila Maha Vidayalaya, Banaras Hindu University, Varanasi, India

ABSTRACT: Web search engines use several ranking algorithms to determine the ordering of web pages. PageRank has become one of the most popular and successful of these methods since the Google search engine adopted it to rank web pages. Because PageRank is computed iteratively, the Power method takes considerable computation time and resources. To reduce the computing time, many researchers have looked for efficient methods to compute PageRank scores for very large web graphs. Several researchers have used the algebraic Gauss-Seidel method to compute PageRank scores and observed that it takes fewer iterations than the Power method. In this paper, we experimentally compare the Power method and the Gauss-Seidel method, both implemented with a Hash-map data structure, and observe that the Gauss-Seidel method takes 40%-45% fewer iterations than the Power method to compute PageRank scores.

Keywords: Hash-map, PageRank method, Power method, Gauss-Seidel method, Experimental analysis of PageRank Power method and Gauss-Seidel method.

[1] INTRODUCTION

Today, the Web is one of the most popular media through which users access information. Because of the huge amount of data on the Web, it is crucial for users to be able to find relevant information quickly [3]. Several web page ranking methods are used to order web pages by relevance so that the user gets the required page.
There are two important web page ranking algorithms: PageRank, proposed by Brin & Page, and HITS, proposed by Jon Kleinberg [1, 3]. Both algorithms compute the ranks of web pages iteratively. PageRank ranks web pages on the basis of a single prestige score, while HITS takes two prestige scores into account, i.e. the hub score and the authority score [7, 13]. Brin & Page computed the ranks of web pages by the Power method [1]. PageRank computation is very compute-intensive and resource-intensive: it takes several days to compute the ranks of a billion web pages. Since the pages of many web sites are updated regularly, the ranks must be recomputed to keep search results relevant [5, 13]. For these reasons, an effective and efficient PageRank computation is needed.
Many researchers have tried to make PageRank computation efficient by exploiting the system architecture. Boldi and Vigna proposed a method to compute PageRank in main memory by compressing the large web graph [2]. Haveliwala [8] and Chen et al. [6] computed PageRank efficiently in external memory by minimizing the overhead of I/O operations. In addition, several researchers have used algebraic techniques to compute PageRank efficiently. For example, Kamvar et al. treated the large web graph as a set of local blocks connected by inter-domain hyperlinks, computed PageRank within these local blocks, and then combined the results to obtain the global rank [11]. They also avoided recomputing PageRank values that had already converged, and sped up the computation by periodically subtracting approximations of the non-principal eigenvectors from the current iterate [10]. Kamvar & Haveliwala [9] studied the eigenvalues of the PageRank equation in order to improve the convergence rate of the PageRank method. Arasu et al. [4] used the Gauss-Seidel method to compute PageRank scores because it converges faster than the Power method on large datasets.

In this paper, we compute PageRank using both the Power method and the Gauss-Seidel method. We compare the two methods on the number of iterations and the time taken to converge, and find that the Gauss-Seidel method is more effective for large datasets, as it takes approximately 40%-45% fewer iterations than the Power method.

The rest of the paper is organized as follows. Section 2 describes some basic graph terminology and the data structure used to store the hyperlink matrix. Section 3 formulates PageRank, and the sections that follow describe its computation by the Power method and by the Gauss-Seidel method. We then compare the two algorithms on the basis of the number of iterations and the time taken to converge, and finally conclude the paper.
[2] DATASET AND SOME BASIC TERMINOLOGY

To compute the PageRank scores, we store the web graph in a Hash-map data structure, because the hyperlink matrix is sparse and storing it in full wastes space on its many zero elements. For example, consider a small graph with only six nodes, as shown in [Figure-1]. The corresponding hyperlink matrix contains only 0 and 1 entries and is of order 6 × 6, i.e. n × n.

Figure 1: Web graph and corresponding hyperlink matrix of six nodes
Since the PageRank computation requires only the non-zero entries of the hyperlink matrix, we store only those entries in a Hash-map data structure. This not only reduces storage but also speeds up data access [13]. We need the following data structures during PageRank computation:

1. Hash-map (Key: Values), where the web page in Key points to the web pages in Values. From the six-node graph in [Figure 1], the key for web page 1 has the values 2 and 3; the values for keys 2, 3, 4, 5 and 6 are obtained similarly and are shown in [Figure 2]. The value -1 in the Hash-map denotes that key 2 is a dangling node.

Figure 2: Data-structure to store web-graph

2. Reverse Hash-map (Key: Values), where the web page in Key is pointed to by the web pages in Values. From [Figure-1], web page 1 is pointed to by web page 3. The values for keys 2, 3, 4, 5 and 6 are obtained similarly, as shown in [Figure-2].

3. A single-column array holding the out-degree of every web page in the graph. If a node is dangling, its out-degree is taken to be the number of nodes in the web graph, i.e. n. From Figure 1 we obtain the following array for web pages 1, 2, 3, 4, 5 and 6:
Out-degree: [2 6 3 2 2 1]

4. A single-column array holding the dangling nodes. From Figure 1, node 2 is the only dangling node:
[2]

By using the Hash-map data structure we store only the non-zero entries of each row. In the above example the hyperlink matrix takes n × n, i.e. 36, storage elements, while the Hash-map takes only 10. As Figure 2 suggests, for large datasets the Hash-map data structure is better than the hyperlink matrix in terms of both element access and storage [3, 13].
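The four structures described above can be sketched in a few lines of Python. This is only an illustration, not the authors' Java and Guava implementation. Figure 1 is not reproduced here, so the out-links of pages 3 to 6 are assumed, chosen to agree with the stated out-degree array and with the statement that page 1 is pointed to only by page 3. The names `out_links` and `build_structures`, and the use of an empty list rather than -1 for the dangling page, are likewise choices made for this sketch.

```python
from collections import defaultdict

# Six-node example. "Page 1 -> 2, 3" and "page 2 is dangling" come from the
# text; the out-links of pages 3-6 are ASSUMED so that they match the stated
# out-degree array [2 6 3 2 2 1] and "page 1 is pointed to by page 3".
out_links = {
    1: [2, 3],
    2: [],          # dangling page (the paper marks it with -1)
    3: [1, 2, 5],
    4: [5, 6],
    5: [4, 6],
    6: [4],
}


def build_structures(out_links):
    """Return (reverse map, out-degree list, dangling list) for a web graph."""
    n = len(out_links)
    in_links = defaultdict(list)   # reverse Hash-map: page -> pages linking to it
    for page, targets in out_links.items():
        for target in targets:
            in_links[target].append(page)
    # A dangling page is given out-degree n, as if it linked to every page.
    out_degree = [len(out_links[p]) or n for p in sorted(out_links)]
    dangling = [p for p in sorted(out_links) if not out_links[p]]
    return dict(in_links), out_degree, dangling


in_links, out_degree, dangling = build_structures(out_links)
stored = sum(len(targets) for targets in out_links.values())  # non-zero entries
print(out_degree, dangling, stored, len(out_links) ** 2)
```

For this graph the sketch counts 10 stored entries against 36 for the full matrix, which agrees with the figures quoted in the text.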
We have implemented the Hash-map data structure in Java using the Guava library provided by Google [14].

[3] COMPUTATION OF PAGERANK METHOD

The PageRank method was proposed by Brin & Page, the founders of the Google search engine, in the late 1990s and has been applied in that search engine [1]. Specifically, it is
computed from the number of incoming links to a web page as well as the ranks of the web pages from which those links originate. PageRank computes the ranks of web pages offline and is not influenced by the user's search query. Recently, PageRank has also been applied to rank many other objects in order of significance, e.g. scientific articles or manuscripts linked by citation, authors linked by co-authorship, and proteins in biological systems [3, 5, 13].

To formalize the above concepts, we treat the Web as a directed graph G = (V, E) in which web pages are nodes and hyperlinks are edges. The total number of web pages in the web graph is denoted by n = |V|. The PageRank score of web page i is defined by Brin & Page as [13]:

$$P(i) = \sum_{(j,i) \in E} \frac{P(j)}{O_j}$$

where $O_j$ is the out-degree of web page j. Mathematically, we have n linear equations in n unknown variables. Let A be the adjacency matrix of the web graph, defined by

$$A_{ij} = \begin{cases} \dfrac{1}{O_i} & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases}$$

We can then write the system of n linear equations as

$$P = A^{T} P$$

In the above equation P is the PageRank vector. This is the characteristic equation of an eigensystem, in which the solution P is an eigenvector with corresponding eigenvalue 1. Because the equation is defined circularly, an iterative method is used to solve it. There are two issues with this iterative procedure on a web graph: the rank sink issue and the cycle problem [7, 13]. After resolving these two issues, the PageRank score of a web page is computed as

$$P(i) = \frac{1-\alpha}{n} + \alpha \sum_{(j,i) \in E} \frac{P(j)}{O_j}$$

Here α denotes the damping factor, which takes values between 0 and 1 (normally α = 0.85).

[4.1] PAGERANK POWER METHOD

In this section we discuss the Power method, the basic method used by Brin & Page to compute the PageRank vector. The Power method is one of the simplest and oldest iterative methods for finding the dominant eigenvalue and eigenvector of a sparse hyperlink matrix
[13]. The following equation is used to compute the PageRank of web pages by the Power method:

$$P^{(k+1)}(i) = \frac{1-\alpha}{n} + \alpha \sum_{(j,i) \in E} \frac{P^{(k)}(j)}{O_j}$$

where $P^{(k)}(i)$ is the rank of web page i after the k-th iteration, $O_j$ is the out-degree of web page j, and the sum runs over the pages j that link to page i.

In the Power method, we initially assign rank 1/n to every web page in the PageRank vector. The iteration starts from this initial assignment and ends when the PageRank values no longer change much between successive iterations, at which point it has converged to a particular PageRank vector. The convergence criterion, i.e. the threshold value, is taken as $\varepsilon = 10^{-7}$. The algorithm to compute PageRank by the Power method, as proposed by Brin & Page, is given below:

Figure 3: PageRank Power method Algorithm

The PageRank Power method is run on the following datasets, collected from the Stanford Large Network Dataset Collection website, which contains various types of datasets crawled from social network sites, road networks, autonomous system graphs, etc. [12]:

Table 1: Description of Datasets
Dataset      Number of nodes   Dangling nodes
Dataset 1    8846              4996
Dataset 2    22687             16466
Dataset 3    36692             0

The results obtained after executing the above algorithm on these datasets are analysed on
the basis of two factors, i.e. the number of iterations and the time taken for the algorithm to converge. From Figure 4 and Figure 5 we can see that as the damping factor increases, the number of iterations and the time taken to converge also increase. For damping factors up to 0.5 there is only a slight change in the number of iterations and the convergence time, while between 0.5 and 0.9 there is a large increase.

Figure 4: Number of iterations taken by the Power method to converge
Figure 5: Time taken by the Power method to converge

[5] GAUSS-SEIDEL PAGERANK ALGORITHM

The Gauss-Seidel method is also an iterative method. It solves a linear system of equations one equation at a time, in sequence, and uses results computed earlier in the current iteration as soon as they are available. The only difference between the Power method and the Gauss-Seidel method is that in the Power method the ranks obtained in the k-th iteration are first used in the (k+1)-th iteration, whereas in the Gauss-Seidel method a rank obtained in the k-th iteration is used within that same iteration for the web pages that follow. The following equation is used to compute PageRank using the Gauss-Seidel method [4]:

$$P^{(k+1)}(i) = \frac{1-\alpha}{n} + \alpha \sum_{\substack{(j,i) \in E \\ j < i}} \frac{P^{(k+1)}(j)}{O_j} + \alpha \sum_{\substack{(j,i) \in E \\ j \ge i}} \frac{P^{(k)}(j)}{O_j}$$

where $O_j$ is the out-degree of web page j and the pages are updated in increasing order of index.
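The difference between the two update rules can be made concrete with one sweep of each over a small graph. The sketch below is an illustration under assumptions, not the authors' code: it uses a hypothetical six-page graph in which page 2 is dangling and is treated as linking to every page, it assumes the teleportation term $(1-\alpha)/n$, and the helper names `new_rank`, `power_sweep` and `gauss_seidel_sweep` are invented here.

```python
ALPHA = 0.85  # damping factor; 0.85 is the usual value quoted in the text

# Hypothetical six-page graph (assumed edges); page 2 has no out-links.
OUT_LINKS = {1: [2, 3], 2: [], 3: [1, 2, 5], 4: [5, 6], 5: [4, 6], 6: [4]}
N = len(OUT_LINKS)
IN_LINKS = {i: [j for j in OUT_LINKS if i in OUT_LINKS[j]] for i in OUT_LINKS}
DANGLING = [j for j in OUT_LINKS if not OUT_LINKS[j]]


def new_rank(i, ranks):
    """Rank of page i computed from whatever values `ranks` currently holds."""
    linked = sum(ranks[j] / len(OUT_LINKS[j]) for j in IN_LINKS[i])
    from_dangling = sum(ranks[j] / N for j in DANGLING)  # dangling pages link to all
    return (1 - ALPHA) / N + ALPHA * (linked + from_dangling)


def power_sweep(ranks):
    """Power method: every page is updated from the previous iteration only."""
    return {i: new_rank(i, ranks) for i in sorted(ranks)}


def gauss_seidel_sweep(ranks):
    """Gauss-Seidel: pages are updated in place, so page i already sees the
    new ranks of the pages with a smaller index."""
    ranks = dict(ranks)  # work on a copy so the caller's vector is untouched
    for i in sorted(ranks):
        ranks[i] = new_rank(i, ranks)
    return ranks


start = {i: 1 / N for i in OUT_LINKS}   # initial rank 1/n for every page
after_power = power_sweep(start)
after_gs = gauss_seidel_sweep(start)
for i in sorted(start):
    print(i, round(after_power[i], 6), round(after_gs[i], 6))
```

In this sketch page 1 receives the same value from both sweeps, because no lower-numbered page links to it, whereas page 2 already differs, because the Gauss-Seidel sweep uses the freshly updated rank of page 1.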
We initialize the rank score of every web page in the PageRank vector to 1/n. The Gauss-Seidel method starts from this initial assignment and computes the rank values iteratively using the above formula, using results already computed in the current iteration as soon as they become available. [Figure-6] presents the algorithm proposed by Arasu et al. [4] to compute the PageRank vector by the Gauss-Seidel method, here using the Hash-map data structure.

Figure 6: PageRank Computation using Gauss-Seidel Algorithm

We observed the results of computing PageRank with the Gauss-Seidel method on the datasets. From [Figure-7] and [Figure-8] we can say that the number of iterations and the time taken for the Gauss-Seidel method to converge increase as the damping factor increases.

Figure 7: Number of iterations taken to converge by Gauss-Seidel method
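One way to experiment with the behaviour described above is to run both methods to convergence on a toy graph and count the iterations. The following sketch makes several assumptions that are not taken from the paper: the six-page graph is hypothetical, convergence is measured as the sum of absolute rank changes between successive iterations (the paper states only the threshold $10^{-7}$), and the function name `pagerank` with its `in_place` switch is invented for illustration. Counts on such a tiny graph say nothing about the datasets in Table 1.

```python
def pagerank(out_links, alpha=0.85, eps=1e-7, in_place=False, max_iter=1000):
    """Iterate to convergence; in_place=False is the Power method,
    in_place=True is the Gauss-Seidel variant. Returns (ranks, iterations)."""
    pages = sorted(out_links)
    n = len(pages)
    in_links = {i: [j for j in pages if i in out_links[j]] for i in pages}
    dangling = [j for j in pages if not out_links[j]]
    ranks = {i: 1.0 / n for i in pages}      # initial rank 1/n for every page

    for iteration in range(1, max_iter + 1):
        previous = dict(ranks)
        # Gauss-Seidel reads from the vector it is writing to; the Power
        # method reads only from the previous iteration's vector.
        source = ranks if in_place else previous
        for i in pages:
            linked = sum(source[j] / len(out_links[j]) for j in in_links[i])
            from_dangling = sum(source[j] / n for j in dangling)
            ranks[i] = (1 - alpha) / n + alpha * (linked + from_dangling)
        change = sum(abs(ranks[i] - previous[i]) for i in pages)
        if change < eps:
            return ranks, iteration
    raise RuntimeError("did not converge within max_iter iterations")


# Hypothetical six-page graph (assumed edges); page 2 is dangling.
graph = {1: [2, 3], 2: [], 3: [1, 2, 5], 4: [5, 6], 5: [4, 6], 6: [4]}

for alpha in (0.5, 0.85):
    power_ranks, power_iters = pagerank(graph, alpha, in_place=False)
    gs_ranks, gs_iters = pagerank(graph, alpha, in_place=True)
    print(f"alpha={alpha}: power={power_iters} iterations, "
          f"gauss-seidel={gs_iters} iterations")
```

Both variants solve the same fixed-point equation, so in this sketch they agree on the ranks up to the tolerance and differ only in how many sweeps they need.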
Figure 8: Time taken by Gauss-Seidel method

[6] OBSERVATIONS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

We have implemented these two methods in Java and used the Guava library to implement the Hash-map and Immutable Multi-map data structures. We ran the experiments on a single Linux machine (Ubuntu 14.04 LTS) with an Intel Core i5 CPU at 3.2 GHz. From [Figure 9] it is clear that, for each of dataset 1, dataset 2 and dataset 3, there is only a minute difference between the numbers of iterations taken by the Gauss-Seidel and Power methods for damping factors α in the range [0.1, 0.6], but a large gap for α in the range [0.6, 0.9]. [Figure 9] also shows that for α = 0.85 the Gauss-Seidel method takes about 40% to 45% fewer iterations than the Power method to converge.
Figure 9 (a, b, c) shows the comparison of the Gauss-Seidel and Power methods for the different datasets with tolerance value $10^{-7}$

[7] CONCLUSION

Web search engines use several ranking algorithms to determine the ordering of web pages, and PageRank is one of the most widely used. Computing PageRank scores calls for a data structure that takes little storage and allows fast access. We observed that for large web graphs the hyperlink matrix takes more storage and access time than the Hash-map data structure. From our experiments on different datasets we observed that, as the web graph gets larger, the Power method takes more iterations than the Gauss-Seidel method to compute PageRank scores. We conclude that for large web graphs the Gauss-Seidel method is preferable to the Power method for computing PageRank.

REFERENCES
[1] S. Brin, L. Page (1998), The Anatomy of a Large-scale Hypertextual Web Search Engine, Proceedings of the Seventh International World Wide Web Conference, Page(s): 107-117.
[2] Boldi, Paolo, and Sebastiano Vigna. "The webgraph framework I: compression techniques." Proceedings of the 13th international conference on World Wide Web. ACM, 2004.
[3] Pavel Berkhin (2005), A survey on PageRank computing, Internet Mathematics 2, Vol. 1, Page(s): 73-120.
[4] Arasu, Arvind, et al. "PageRank computation and the structure of the web: Experiments and algorithms." Proceedings of the Eleventh International World Wide Web Conference, Poster Track. 2002.
[5] Pretto, L.: A theoretical analysis of Google's PageRank. In: Laender, A.H.F., Oliveira, A.L. (eds.) SPIRE 2002. LNCS, vol. 2476, pp. 131-144. Springer, Heidelberg (2002).
[6] Chen, Yen-Yu, Qingqing Gan, and Torsten Suel. "I/O-efficient techniques for computing PageRank." Proceedings of the eleventh international conference on Information and knowledge management.
ACM, 2002.
[7] Srivastava, Atul Kumar, et al. "International Journal of Emerging Technologies in Computational and Applied Sciences (IJETCAS) www. iasir. net." algorithms 3.7: 14.
[8] Haveliwala, Taher. "Efficient computation of PageRank." (1999).
[9] Haveliwala, Taher, and Sepandar Kamvar. "The second eigenvalue of the Google matrix." Stanford University Technical Report (2003).
[10] Kamvar, Sepandar, Taher Haveliwala, and Gene Golub. "Adaptive methods for the computation of PageRank." Linear Algebra and its Applications 386 (2004): 51-65.
[11] Kamvar, Sepandar, et al. "Exploiting the block structure of the web for computing pagerank." Stanford University Technical Report (2003).
[12] Jure Leskovec and Andrej Krevl, Stanford Large Network Dataset Collection, http://snap.stanford.edu/data, june-2014.
[13] Langville, A.N., Meyer, C.D.: Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, Princeton (2006).
[14] http://code.google.com/p/guava-libraries/