A Parallel Algorithm based on Monte Carlo for Computing the Inverse and other Functions of a Large Sparse Matrix


A Parallel Algorithm based on Monte Carlo for Computing the Inverse and other Functions of a Large Sparse Matrix

Patrícia Isabel Duarte Santos

Thesis to obtain the Master of Science Degree in Information Systems and Computer Engineering

Supervisors: Prof. José Carlos Alves Pereira Monteiro
             Prof. Juan António Acebron de Torres

Examination Committee
Chairperson: Prof. Alberto Manuel Rodrigues da Silva
Supervisor: Prof. José Carlos Alves Pereira Monteiro
Member of the Committee: Prof. Luís Manuel Silveira Russo

November 2016


To my parents: Alda and Fernando; To my brother: Pedro.


Resumo

Nowadays, matrix inversion plays an important role in several areas of knowledge, for example, when we analyze specific characteristics of a complex network such as node centrality or communicability. In order to avoid the explicit computation of the inverse matrix, or other computationally heavy matrix operations, there are several efficient methods for solving systems of linear algebraic equations whose result is the inverse matrix or another matrix function. However, these methods, whether direct or iterative, have a high cost when the dimension of the matrix increases. In this context, we present an algorithm based on Monte Carlo methods as an alternative for obtaining the inverse and other functions of a large sparse matrix. The main advantage of this algorithm is that it allows computing only one row of the result matrix, avoiding the instantiation of the entire matrix. This solution was parallelized using OpenMP. Among the parallelized versions developed, a version that is scalable for the tested matrices was obtained using the omp declare reduction directive.

Keywords: Monte Carlo methods, OpenMP, parallel algorithm, matrix operations, complex networks


Abstract

Nowadays, matrix inversion plays an important role in several areas, for instance, when we analyze specific characteristics of a complex network such as node centrality and communicability. In order to avoid the explicit computation of the inverse matrix, or other matrix functions, which is costly, there are several efficient methods for solving linear systems of algebraic equations that yield the inverse matrix or other matrix functions. However, these methods, whether direct or iterative, have a high computational cost when the size of the matrix increases. In this context, we present an algorithm based on Monte Carlo methods as an alternative to obtain the inverse matrix and other functions of a large-scale sparse matrix. The main advantage of this algorithm is the possibility of obtaining the matrix function for only one row of the result matrix, avoiding the instantiation of the entire result matrix. Additionally, this solution is parallelized using OpenMP. Among the developed parallelized versions, a version that is scalable for the tested matrices was obtained using the directive omp declare reduction.

Keywords: Monte Carlo methods, OpenMP, parallel algorithm, matrix functions, complex networks


Contents

Resumo
Abstract
List of Figures
1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Contributions
  1.4 Thesis Outline
2 Background and Related Work
  2.1 Application Areas
  2.2 Matrix Inversion with Classical Methods
    2.2.1 Direct Methods
    2.2.2 Iterative Methods
  2.3 The Monte Carlo Methods
    2.3.1 The Monte Carlo Methods and Parallel Computing
    2.3.2 Sequential Random Number Generators
    2.3.3 Parallel Random Number Generators
  2.4 The Monte Carlo Methods Applied to Matrix Inversion
  2.5 Language Support for Parallelization
    2.5.1 OpenMP
    2.5.2 MPI
    2.5.3 GPUs
  2.6 Evaluation Metrics
3 Algorithm Implementation
  3.1 General Approach
  3.2 Implementation of the Different Matrix Functions
  3.3 Matrix Format Representation
  3.4 Algorithm Parallelization using OpenMP
    3.4.1 Calculating the Matrix Function Over the Entire Matrix
    3.4.2 Calculating the Matrix Function for Only One Row of the Matrix
4 Results
  4.1 Instances
    4.1.1 Matlab Matrix Gallery Package
    4.1.2 CONTEST toolbox in Matlab
    4.1.3 The University of Florida Sparse Matrix Collection
  4.2 Inverse Matrix Function Metrics
  4.3 Complex Networks Metrics
    4.3.1 Node Centrality
    4.3.2 Node Communicability
  4.4 Computational Metrics
5 Conclusions
  5.1 Main Contributions
  5.2 Future Work
Bibliography


List of Figures

2.1 Centralized methods to generate random numbers - Master-Slave approach
2.2 Process 2 (out of a total of 7 processes) generating random numbers using the Leapfrog technique
2.3 Example of a matrix B = I − A and A, and the theoretical result B^{-1} = (I − A)^{-1} of the application of this Monte Carlo method
2.4 Matrix with value factors v_ij for the given example
2.5 Example of stop probabilities calculation (bold column)
2.6 First random play of the method
2.7 Situating all elements of the first row given its probabilities
2.8 Second random play of the method
2.9 Third random play of the method
3.1 Algorithm implementation - Example of a matrix B = I − A and A, and the theoretical result B^{-1} = (I − A)^{-1} of the application of this Monte Carlo method
3.2 Initial matrix A and respective normalization
3.3 Vector with value factors v_i for the given example
3.4 Code excerpt in C with the main loops of the proposed algorithm
3.5 Example of one play with one iteration
3.6 Example of the first iteration of one play with two iterations
3.7 Example of the second iteration of one play with two iterations
3.8 Code excerpt in C with the sum of all the gains for each position of the inverse matrix
3.9 Code excerpt in C with the necessary operations to obtain the inverse matrix of one single row
3.10 Code excerpt in C with the necessary operations to obtain the matrix exponential of one single row
Code excerpt in C with the parallel algorithm when calculating the matrix function over the entire matrix
Code excerpt in C with the function that generates a random number between 0 and 1
Code excerpt in C with the parallel algorithm when calculating the matrix function for only one row of the matrix, using omp atomic
Code excerpt in C with the parallel algorithm when calculating the matrix function for only one row of the matrix, using omp declare reduction
Code excerpt in C with omp declare reduction declaration and combiner
Code excerpt in Matlab with the transformation needed for the algorithm convergence
Minnesota sparse matrix format
inverse matrix function - Relative Error (%) for row 17 of matrix
inverse matrix function - Relative Error (%) for row 33 of matrix
inverse matrix function - Relative Error (%) for row 26 of matrix
inverse matrix function - Relative Error (%) for row 51 of matrix
inverse matrix function - Relative Error (%) for row 33 of matrix and row 51 of matrix
node centrality - Relative Error (%) for row 71 of pref matrix
node centrality - Relative Error (%) for row 71 of pref matrix
node centrality - Relative Error (%) for row 71 of and pref matrices
node centrality - Relative Error (%) for row 71 of smallw matrix
node centrality - Relative Error (%) for row 71 of smallw matrix
node centrality - Relative Error (%) for row 71 of and smallw matrices
node centrality - Relative Error (%) for row 71 of minnesota matrix
node communicability - Relative Error (%) for row 71 of pref matrix
node communicability - Relative Error (%) for row 71 of pref matrix
node communicability - Relative Error (%) for row 71 of and pref matrix
node communicability - Relative Error (%) for row 71 of smallw matrix
node communicability - Relative Error (%) for row 71 of smallw matrix
node communicability - Relative Error (%) for row 71 of and smallw matrix
node communicability - Relative Error (%) for row 71 of minnesota matrix
omp atomic version - Efficiency (%) for row 71 of pref matrix
omp atomic version - Efficiency (%) for row 71 of pref matrix
omp atomic version - Efficiency (%) for row 71 of smallw matrix
omp atomic version - Efficiency (%) for row 71 of smallw matrix
omp declare reduction version - Efficiency (%) for row 71 of pref matrix
omp declare reduction version - Efficiency (%) for row 71 of pref matrix
omp declare reduction version - Efficiency (%) for row 71 of smallw matrix
omp declare reduction version - Efficiency (%) for row 71 of smallw matrix
omp atomic and omp declare reduction versions - Speedup relative to the number of threads for row 71 of pref matrix


Chapter 1

Introduction

The present document describes an algorithm to obtain the inverse and other functions of a large-scale sparse matrix, in the context of a master's thesis. We start by presenting the motivation behind this algorithm, the objectives we intend to achieve, the main contributions of our work, and the outline of the remaining chapters of the document.

1.1 Motivation

Matrix inversion is an important matrix operation that is widely used in several areas such as financial calculation, electrical simulation, cryptography, and complex networks.

One area of application of this work is complex networks. These can be represented by a graph (e.g., the Internet, social networks, transport networks, neural networks, etc.), and a graph is usually represented by a matrix. In complex networks there are many features that can be studied, such as the importance of a node in a given network (node centrality) and the communicability between a pair of nodes, which measures how well two nodes can exchange information with each other. These metrics are important when we want to study the topology of a complex network.

There are several algorithms over matrices that allow us to extract important features of these systems. However, some properties require the use of the inverse matrix, or other matrix functions, which is impractical to calculate for large matrices. Existing methods, whether direct or iterative, are costly in terms of computational effort and memory for such problems. Therefore, Monte Carlo methods represent a viable alternative approach to this problem, since they can be easily parallelized in order to obtain good performance.

1.2 Objectives

The main goal of this work, considering what was stated in the previous section, is to develop a parallel algorithm based on Monte Carlo for computing the inverse and other matrix functions of large sparse matrices in an efficient way, i.e., with good performance. With this in mind, our objectives are:

To implement an algorithm proposed by J. von Neumann and S. M. Ulam [1] that makes it possible to obtain the inverse matrix and other matrix functions based on Monte Carlo methods;
To develop and implement a modified algorithm, based on the item above, that has its foundation in the Monte Carlo methods;
To demonstrate that this new approach improves the performance of matrix inversion when compared to existing algorithms;
To implement a parallel version of the new algorithm using OpenMP.

1.3 Contributions

The main contributions of our work include:

The implementation of a modified algorithm based on the Monte Carlo methods to obtain the inverse matrix and other matrix functions;
The parallelization of the modified algorithm, using OpenMP, when we want to obtain the matrix function over the entire matrix;
Two parallelized versions of the algorithm when we want to obtain the matrix function for only one row of the matrix: one using omp atomic, and another using omp declare reduction;
A parallelized version of the algorithm, using omp declare reduction, that is scalable for the tested matrices.

All the implementations stated above were successfully executed, with special attention to the version that calculates the matrix function for a single row of the matrix using omp declare reduction, which is scalable and capable of reducing the computational effort compared with other existing methods, at least for the synthetic matrices tested. This is due to the fact that, instead of requiring the calculation of the matrix function over the entire matrix, it calculates the matrix function for only one row of the matrix. It has a direct application, for example, when a study of the topology of a complex network is required, being able to effectively retrieve the node centrality (the importance of a node in a given network) and the communicability between a pair of nodes.
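As a preview of the two OpenMP strategies named in the contributions above, the following minimal sketch (our own illustration, not the thesis code) contrasts accumulating a shared array of gains with omp atomic against accumulating it with a user-defined omp declare reduction; the type and function names (gains_t, gains_add, gains_zero) are hypothetical.

    #include <stdio.h>

    #define N 8   /* illustrative number of accumulated positions */

    typedef struct { double g[N]; } gains_t;

    /* Combiner and initializer used by the user-defined reduction. */
    static void gains_add(gains_t *out, const gains_t *in) {
        for (int i = 0; i < N; i++) out->g[i] += in->g[i];
    }
    static void gains_zero(gains_t *v) {
        for (int i = 0; i < N; i++) v->g[i] = 0.0;
    }

    #pragma omp declare reduction(gain_add : gains_t : \
            gains_add(&omp_out, &omp_in)) initializer(gains_zero(&omp_priv))

    int main(void) {
        /* Strategy 1: each thread keeps a private copy, merged at the end. */
        gains_t total;
        gains_zero(&total);
        #pragma omp parallel for reduction(gain_add : total)
        for (int play = 0; play < 1000; play++)
            total.g[play % N] += 1.0;   /* stand-in for the gain of one play */

        /* Strategy 2: a shared array updated with omp atomic on every play. */
        double shared[N] = { 0.0 };
        #pragma omp parallel for
        for (int play = 0; play < 1000; play++) {
            #pragma omp atomic
            shared[play % N] += 1.0;
        }

        for (int j = 0; j < N; j++)
            printf("%g %g\n", total.g[j], shared[j]);
        return 0;
    }

Both variants produce the same totals; the reduction avoids contention on the shared array, which is one plausible reason for the scalability difference reported later in the thesis.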

1.4 Thesis Outline

The rest of this document is structured as follows. In Chapter 2, we present existing application areas, some background knowledge regarding classical matrix inversion methods, the Monte Carlo methods and some parallelization techniques, as well as previous work on algorithms that aim to increase the performance of matrix inversion using the Monte Carlo methods and parallel programming. In Chapter 3, we describe our solution: an algorithm to perform matrix inversion and other matrix functions, as well as the underlying methods and techniques used in its implementation. In Chapter 4, we present the results and specify the procedures and measures that were used to evaluate the performance of our work. Finally, in Chapter 5, we summarize the highlights of our work and present some possibilities for future work.


Chapter 2

Background and Related Work

In this chapter we cover many aspects related to the computation of matrix inversion. Such aspects are important to situate our work, to understand the state of the art, and to see what we can learn and improve upon in order to accomplish our goals.

2.1 Application Areas

Nowadays, there are many areas where efficient matrix functions, such as matrix inversion, are required. For example, in image reconstruction applied to computed tomography [2] and astrophysics [3], and in bioinformatics to solve the problem of protein structure prediction [4]. This work mainly focuses on complex networks, but it can easily be applied to other application areas.

A complex network [5] is a graph (network) of very large dimension with non-trivial topological features that models a real system. These real systems can be, for example: the Internet and the World Wide Web; biological systems; chemical systems; neural networks.

A graph G = (V, E) is composed of a set of nodes (vertices) V and edges (links) E represented by unordered pairs of vertices. Every network is naturally associated with a graph G = (V, E), where V is the set of nodes in the network and E is the collection of connections between nodes, that is, E = {(i, j) : there is an edge between node i and node j in G}.
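As a small, hedged illustration of how such a network is turned into the matrix the rest of this chapter works with, the following C sketch (our own example, not from the thesis; the names build_adjacency, edges and n_nodes are illustrative) builds a dense adjacency matrix from an undirected edge list.

    #include <stdio.h>
    #include <stdlib.h>

    /* Build a dense n x n adjacency matrix A for an undirected graph given
     * as an edge list. Node indices are 0-based; A[i][j] = 1 if (i, j) in E. */
    double *build_adjacency(int n_nodes, const int (*edges)[2], int n_edges) {
        double *A = calloc((size_t)n_nodes * n_nodes, sizeof(double));
        if (A == NULL) return NULL;
        for (int e = 0; e < n_edges; e++) {
            int i = edges[e][0], j = edges[e][1];
            A[i * n_nodes + j] = 1.0;   /* edge i -> j */
            A[j * n_nodes + i] = 1.0;   /* undirected: also j -> i */
        }
        return A;
    }

    int main(void) {
        int edges[][2] = { {0, 1}, {1, 2}, {2, 0}, {2, 3} };
        double *A = build_adjacency(4, edges, 4);
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 4; j++) printf("%.0f ", A[i * 4 + j]);
            printf("\n");
        }
        free(A);
        return 0;
    }

For a large sparse network, a dense array like this would of course be replaced by a sparse format, as discussed later in Section 3.3 of the thesis.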

One of the hardest and most important tasks in the study of the topology of such complex networks is to determine the importance of a node in a given network, a concept that may change from application to application. This measure is normally referred to as node centrality [5]. Regarding node centrality and the use of matrix functions, Klymko et al. [5] show that the matrix resolvent plays an important role. The resolvent of an n × n matrix A is defined as

R(α) = (I − αA)^{-1}    (2.1)

where I is the identity matrix and α ∈ C, excluding the values that satisfy det(I − αA) = 0, with 0 < α < 1/λ_1, where λ_1 is the maximum eigenvalue of A. The entries of the matrix resolvent count the number of walks in the network, penalizing longer walks. This can be seen by considering the power series expansion of (I − αA)^{-1}:

(I − αA)^{-1} = I + αA + α^2 A^2 + ... + α^k A^k + ... = Σ_{k=0}^{∞} α^k A^k    (2.2)

Here, [(I − αA)^{-1}]_{ij} counts the total number of walks from node i to node j, weighting walks of length k by α^k. The bounds on α (0 < α < 1/λ_1) ensure that the matrix I − αA is invertible and that the power series in (2.2) converges to its inverse.

Another property that is important when we are studying a complex network is the communicability between a pair of nodes i and j. This measures how well two nodes can exchange information with each other. According to Klymko et al. [5], it can be obtained using the matrix exponential function [6] of a matrix A, defined by the following infinite series:

e^A = I + A + A^2/2! + A^3/3! + ... = Σ_{k=0}^{∞} A^k/k!    (2.3)

with I being the identity matrix and with the convention that A^0 = I. In other words, the entries of the matrix, [e^A]_{ij}, count the total number of walks from node i to node j, penalizing longer walks by scaling walks of length k by the factor 1/k!.

As a result, the development and implementation of efficient matrix functions is an area of great interest, since complex networks are becoming more and more relevant.

2.2 Matrix Inversion with Classical Methods

The inverse of a square matrix A is the matrix A^{-1} that satisfies the following condition:

A A^{-1} = I    (2.4)

where I is the identity matrix. Matrix A only has an inverse if the determinant of A is not equal to zero, det(A) ≠ 0. If a matrix has an inverse, it is also called non-singular or invertible.

To calculate the inverse of an n × n matrix A, the following expression is used

A^{-1} = (1/det(A)) C    (2.5)

where C is the transpose of the matrix formed by all of the cofactors of matrix A. For example, to calculate the inverse of a 2 × 2 matrix

A = [a b; c d]

the following expression is used

A^{-1} = (1/det(A)) [d −b; −c a] = (1/(ad − bc)) [d −b; −c a]    (2.6)

and to calculate the inverse of a 3 × 3 matrix

A = [a_11 a_12 a_13; a_21 a_22 a_23; a_31 a_32 a_33]

we use the following expression, in which each entry |p q; r s| denotes the 2 × 2 determinant ps − qr:

A^{-1} = (1/det(A)) ·
  [ |a_22 a_23; a_32 a_33|   |a_13 a_12; a_33 a_32|   |a_12 a_13; a_22 a_23| ]
  [ |a_23 a_21; a_33 a_31|   |a_11 a_13; a_31 a_33|   |a_13 a_11; a_23 a_21| ]
  [ |a_21 a_22; a_31 a_32|   |a_12 a_11; a_32 a_31|   |a_11 a_12; a_21 a_22| ]    (2.7)

The computational effort needed increases with the size of the matrix, as we can see in the previous examples with 2 × 2 and 3 × 3 matrices. So, instead of computing the explicit inverse matrix, which is costly, we can obtain the inverse of an n × n matrix by solving a linear system of algebraic equations of the form

Ax = b  =>  x = A^{-1} b    (2.8)

where A is an n × n matrix, b is a given n-vector, and x is the unknown n-vector solution to be determined. The methods to solve such linear systems can be either direct or iterative [6, 7], and they are presented in the next subsections.
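For concreteness, the short C sketch below (our own illustration, not part of the thesis) applies the 2 × 2 formula of Equation (2.6); the function name inverse_2x2 and the sample matrix are assumptions made for the example.

    #include <stdio.h>

    /* Invert a 2x2 matrix using Equation (2.6).
     * Returns 0 on success, -1 if the matrix is singular (det(A) = 0). */
    int inverse_2x2(const double a[2][2], double inv[2][2]) {
        double det = a[0][0] * a[1][1] - a[0][1] * a[1][0];
        if (det == 0.0) return -1;
        inv[0][0] =  a[1][1] / det;
        inv[0][1] = -a[0][1] / det;
        inv[1][0] = -a[1][0] / det;
        inv[1][1] =  a[0][0] / det;
        return 0;
    }

    int main(void) {
        double a[2][2] = { {4.0, 7.0}, {2.0, 6.0} }, inv[2][2];
        if (inverse_2x2(a, inv) == 0)
            printf("A^-1 = [%g %g; %g %g]\n",
                   inv[0][0], inv[0][1], inv[1][0], inv[1][1]);
        return 0;
    }

The same cofactor approach quickly becomes impractical as n grows, which is exactly why the linear-system formulation (2.8) and, later, the Monte Carlo approach are preferred.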

2.2.1 Direct Methods

Direct methods for solving linear systems provide an exact solution (assuming exact arithmetic) in a finite number of steps. However, many operations need to be executed, which takes a significant amount of computational power and memory. For dense matrices, even sophisticated algorithms have a complexity close to

T_direct = O(n^3).    (2.9)

Regarding direct methods, there are many ways of solving linear systems, such as Gauss-Jordan elimination and Gaussian elimination, also known as LU factorization or LU decomposition (see Algorithm 1) [6, 7].

Algorithm 1 LU Factorization
1: Initialize: U = A, L = I
2: for k = 1 : n−1 do
3:   for i = k+1 : n do
4:     L(i, k) = U(i, k) / U(k, k)
5:     for j = k+1 : n do
6:       U(i, j) = U(i, j) − L(i, k) U(k, j)
7:     end for
8:   end for
9: end for

2.2.2 Iterative Methods

Iterative methods for solving linear systems consist of successive approximations that converge to the desired solution x_k. An iterative method is considered good depending on how quickly x_k converges. Theoretically, an infinite number of iterations is needed to obtain the exact solution; in practice, the iteration stops when some norm of the residual error, ||b − Ax||, is as small as desired. Considering Equation (2.8), for dense matrices these methods have a complexity of

T_iter = O(n^2 k)    (2.10)

where k is the number of iterations.

The Jacobi method (see Algorithm 2) and the Gauss-Seidel method [6, 7] are well-known iterative methods, but they do not always converge, because the matrix needs to satisfy certain conditions for that to happen (e.g., being diagonally dominant by rows for the Jacobi method, or symmetric and positive definite for the Gauss-Seidel method). The Jacobi method has an unacceptably slow convergence rate, and the Gauss-Seidel method,

despite being capable of converging more quickly than the Jacobi method, is often still too slow to be practical.

Algorithm 2 Jacobi method
Input: A = (a_ij), b, x^(0), TOL (tolerance), N (maximum number of iterations)
1: Set k = 1
2: while k ≤ N do
3:   for i = 1, 2, ..., n do
4:     x_i = (1 / a_ii) [ −Σ_{j=1, j≠i}^{n} a_ij x_j^(0) + b_i ]
5:   end for
6:   if ||x − x^(0)|| < TOL then
7:     OUTPUT(x_1, x_2, ..., x_n); STOP
8:   end if
9:   Set k = k + 1
10:  for i = 1, 2, ..., n do
11:    x_i^(0) = x_i
12:  end for
13: end while
14: OUTPUT(x_1, x_2, ..., x_n); STOP

2.3 The Monte Carlo Methods

The Monte Carlo methods [8] are a wide class of computational algorithms that use statistical sampling and estimation techniques, applied to synthetically constructed random populations with appropriate parameters, in order to evaluate the solutions to mathematical problems (whether or not they have a probabilistic background). These methods have many advantages, especially when the problems are very large and computationally hard to deal with, i.e., hard to solve analytically.

There are many applications of the Monte Carlo methods in a variety of problems in optimization, operations research, and systems analysis, such as: integrals of arbitrary functions; predicting future values of stocks; solving partial differential equations; sharpening satellite images;

modeling cell populations; and finding approximate solutions to NP-hard problems.

The underlying mathematical concept is related to the mean value theorem, which states that

I = ∫_a^b f(x) dx = (b − a) f̄    (2.11)

where f̄ represents the mean (average) value of f(x) in the interval [a, b]. Because of this, the Monte Carlo methods estimate the value of I by evaluating f(x) at n points selected from a uniform random distribution over [a, b]. The Monte Carlo methods obtain an estimate for f̄ that is given by

f̄ ≈ (1/n) Σ_{i=0}^{n−1} f(x_i)    (2.12)

The error of the Monte Carlo estimate decreases by a factor of 1/√n, i.e., the accuracy increases at the same rate.

2.3.1 The Monte Carlo Methods and Parallel Computing

Another advantage of choosing the Monte Carlo methods is that they are usually easy to migrate onto parallel systems. In this case, with p processors, we can obtain an estimate p times faster and decrease the error by a factor of √p compared to the sequential approach. However, this enhancement depends on the random numbers being statistically independent, so that each sample can be processed independently. Thus, it is essential to develop or use good parallel random number generators and to know which characteristics they should have.

2.3.2 Sequential Random Number Generators

The Monte Carlo methods rely on efficient random number generators. The random number generators available today are, in fact, pseudo-random number generators, because their operation is deterministic and the produced sequences are predictable. Consequently, when we refer to random number generators, we are referring, in fact, to pseudo-random number generators.

Ideally, random number generators are characterized by the following properties:

1. uniformly distributed, i.e., each possible number is equally probable;
2. the numbers are uncorrelated;

3. it never cycles, i.e., the numbers do not repeat themselves;
4. it satisfies any statistical test for randomness;
5. it is reproducible;
6. it is machine-independent, i.e., the generator produces the same sequence of numbers on any computer;
7. if the seed value is changed, the sequence changes too;
8. it is easily split into independent sub-sequences;
9. it is fast;
10. it has limited memory requirements.

Observing the properties stated above, we can conclude that no random number generator adheres to all these requirements. For example, since a random number generator may take only a finite number of states, there will be a time when the numbers it produces begin to repeat themselves.

There are two important classes of random number generators [8]:

Linear congruential: produces a sequence X of random integers using the formula

X_i = (a X_{i−1} + c) mod M    (2.13)

where a is the multiplier, c is the additive constant, and M is the modulus. The sequence X depends on the seed X_0, and its period is at most M. This method may also be used to generate floating-point numbers x_i in [0, 1] by dividing X_i by M.

Lagged Fibonacci: produces a sequence X in which each element is defined as

X_i = X_{i−p} ⊙ X_{i−q}    (2.14)

where p and q are the lags, p > q, and ⊙ is any binary arithmetic operation, such as exclusive-or or addition modulo M. The sequence X can consist of either integer or floating-point numbers. When using this method it is important to choose the seed values, M, p and q well, in order to obtain sequences with very long periods and good randomness.

2.3.3 Parallel Random Number Generators

Regarding parallel random number generators, they should ideally have the following properties:

1. no correlations among the numbers in different sequences;
2. scalability;
3. locality, i.e., a process should be able to spawn a new sequence of random numbers without interprocess communication.

The techniques used to transform a sequential random number generator into a parallel one are the following [8]:

Centralized methods. In the master-slave approach, as Fig. 2.1 shows, there is a master process whose task is to generate random numbers and distribute them among the slave processes that consume them. This approach is not scalable and is communication-intensive, so other methods are considered next.

Figure 2.1: Centralized methods to generate random numbers - Master-Slave approach.

Decentralized methods. The leapfrog method is comparable in certain respects to a cyclic allocation of data to tasks. Assuming that this method is running on p processes, the random samples interleave every p-th element of the sequence, beginning with X_i, as shown in Fig. 2.2.

Figure 2.2: Process 2 (out of a total of 7 processes) generating random numbers using the Leapfrog technique.

This method has disadvantages: despite having low correlation overall, the elements of the leapfrog subsequence may be correlated for certain values of p; moreover, this method does not support the dynamic creation of new random number streams.
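To make the leapfrog idea concrete, the following hedged C sketch (our own example, not the thesis code) combines the linear congruential formula of Equation (2.13) with leapfrog partitioning: process `rank` out of `nprocs` produces X_rank, X_{rank+nprocs}, X_{rank+2·nprocs}, ... The constants and names (LCG_A, leapfrog_init, etc.) are assumptions made for illustration.

    #include <stdio.h>
    #include <stdint.h>

    #define LCG_A 1103515245ULL
    #define LCG_C 12345ULL
    #define LCG_M (1ULL << 31)

    typedef struct {
        uint64_t state;   /* current X_i                        */
        uint64_t a_p;     /* a^p mod M (p-step multiplier)      */
        uint64_t c_p;     /* c (a^{p-1} + ... + a + 1) mod M    */
    } leapfrog_lcg;

    /* Advance the seed `rank` single steps, then precompute the coefficients
     * that jump p = nprocs steps of the LCG at a time. */
    static void leapfrog_init(leapfrog_lcg *g, uint64_t seed, int rank, int nprocs) {
        uint64_t a_p = 1, c_p = 0, x = seed % LCG_M;
        for (int i = 0; i < rank; i++)
            x = (LCG_A * x + LCG_C) % LCG_M;        /* move to X_rank */
        for (int i = 0; i < nprocs; i++) {          /* compose p single steps */
            c_p = (LCG_A * c_p + LCG_C) % LCG_M;
            a_p = (a_p * LCG_A) % LCG_M;
        }
        g->state = x; g->a_p = a_p; g->c_p = c_p;
    }

    /* Return the next number of this process's subsequence, scaled to [0, 1). */
    static double leapfrog_next(leapfrog_lcg *g) {
        double u = (double)g->state / (double)LCG_M;
        g->state = (g->a_p * g->state + g->c_p) % LCG_M;
        return u;
    }

    int main(void) {
        leapfrog_lcg g;
        leapfrog_init(&g, 42, 2, 7);   /* process 2 out of 7, as in Fig. 2.2 */
        for (int i = 0; i < 5; i++) printf("%f\n", leapfrog_next(&g));
        return 0;
    }

Each process generates its own disjoint subsequence with no communication, which is exactly the locality property listed above.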

Sequence splitting is similar to a block allocation of data to tasks. Considering that the random number generator has period P, the first P numbers generated are divided into equal, non-overlapping parts, one per process.

Independent sequences consist in having each process run a separate sequential random number generator. This tends to work well as long as each task uses a different seed.

Random number generators, especially for parallel computers, should not be trusted blindly. Therefore, the best approach is to run simulations with two or more different generators and compare the results, in order to check whether the random number generator is introducing a bias, i.e., a tendency.

2.4 The Monte Carlo Methods Applied to Matrix Inversion

The result of the application of these statistical sampling methods depends on how an infinite sum of finite sums is handled. An example of such methods is the random walk, a Markov chain Monte Carlo algorithm, which consists of a series of random samples that represents a random walk through the possible configurations. This fact leads to a variety of Monte Carlo estimators.

The algorithm implemented in this thesis is based on a classic paper that describes a Monte Carlo method for inverting a class of matrices, devised by J. von Neumann and S. M. Ulam [1]. This method can be used to invert a class of n-th order matrices, and it is capable of obtaining a single element of the inverse matrix without determining the rest of the matrix. To better understand how this method works, we present a concrete example and all the necessary steps involved.

Figure 2.3: Example of a matrix B = I − A and A, and the theoretical result B^{-1} = (I − A)^{-1} of the application of this Monte Carlo method.

Firstly, there are some restrictions that, if satisfied, guarantee that the method produces a correct solution. Let us consider as an example the n × n matrices A and B in Fig. 2.3. The restrictions are:

Let B be a matrix of order n whose inverse is desired, and let A = I − B, where I is the identity matrix. For any matrix M, let λ_r(M) denote the r-th eigenvalue of M, and let m_ij denote the element of

M in the i-th row and j-th column. The method requires that

max_r |1 − λ_r(B)| = max_r |λ_r(A)| < 1.    (2.15)

When (2.15) holds, it is known that

(B^{-1})_{ij} = ([I − A]^{-1})_{ij} = Σ_{k=0}^{∞} (A^k)_{ij}.    (2.16)

All elements of matrix A (1 ≤ i, j ≤ n) have to be non-negative, a_ij ≥ 0. Let us define p_ij ≥ 0 and v_ij, the corresponding value factors, which satisfy the following:

p_ij v_ij = a_ij;    (2.17)

Σ_{j=1}^{n} p_ij < 1.    (2.18)

In the example considered, we can see in Fig. 2.4 and Fig. 2.5 that all of this is verified, except that the sum of the second row of matrix A is not inferior to 1, i.e., a_21 + a_22 + a_23 ≥ 1 (see Fig. 2.3). In order to guarantee that the sum of the second row is inferior to 1, we divide all the elements of the second row by the total sum of that row plus some normalization constant (let us assume 0.8), so the value factor will be 2, and therefore the second row of V will be filled with 2 (Fig. 2.4).

Figure 2.4: Matrix with value factors v_ij for the given example.

Figure 2.5: Example of stop probabilities calculation (bold column).

In order to define a probability matrix given by p_ij, an extra column is added to the initial matrix A. This column corresponds to the stop probabilities, which are defined by the relation (see Fig. 2.5):

p_i = 1 − Σ_{j=1}^{n} p_ij    (2.19)

Secondly, once all the restrictions are met, the method proceeds in the same way to calculate each element of the inverse matrix. So, we are only going to explain how it works for one element of the inverse matrix, namely the element (B^{-1})_11. As stated in [1], the Monte Carlo method to compute (B^{-1})_{ij} is to play a solitaire game whose expected payment is (B^{-1})_{ij}; according to a result by Kolmogoroff [9] on the strong law of large numbers, if one plays such a game repeatedly, the average

payment for N successive plays will converge to (B^{-1})_{ij} as N → ∞, for almost all sequences of plays. Taking all this into account, to calculate one element of the inverse matrix we will need N plays, with N sufficiently large for an accurate solution. Each play has its own gain, i.e., its contribution to the final result, and the gain of one play is given by

GainOfPlay = v_{i_0 i_1} v_{i_1 i_2} ... v_{i_{k−1} j}    (2.20)

considering a route i = i_0 → i_1 → i_2 → ... → i_{k−1} → j. Finally, assuming N plays, the total gain from all the plays is given by the expression

TotalGain = ( Σ_{k=1}^{N} (GainOfPlay)_k ) / (N p_j)    (2.21)

which coincides with the expectation value in the limit N → ∞, and is therefore (B^{-1})_{ij}.

To calculate (B^{-1})_11, one play of the game is explained with an example in the following steps, knowing that the initial gain is equal to 1:

1. Since the position we want to calculate is in the first row, the algorithm starts in the first row of matrix A (see Fig. 2.6). Then it is necessary to generate a random number uniformly between 0 and 1. Once we have the random number, let us consider 0.28, we need to know to which drawn position of matrix A it corresponds. To see which position we have drawn, we start with the value of the first position of the current row, a_11, and compare it with the random number. The search only stops when the random number is inferior to the accumulated value. In this case 0.28 > 0.2, so we have to continue accumulating the values of the visited positions in the current row. Now, we are at position a_12 and we see that 0.28 < a_11 + a_12 = 0.4, so the position a_12 has been drawn (see Fig. 2.7), and we have to jump to the second row and execute the same operation. Finally, the gain of this random play is the initial gain multiplied by the value of the matrix with value factors corresponding to the position of a_12, which in this case is 1, as we can see in Fig. 2.4.

Figure 2.6: First random play of the method.

Figure 2.7: Situating all elements of the first row given its probabilities.

2. In the second random play, we are in the second row and a new random number is generated, let us assume 0.1, which corresponds to the drawn position a_21 (see Fig. 2.8). By the same reasoning, we have to jump to the first row. The gain at this point is equal to multiplying the existing

value of the gain by the value of the matrix with value factors corresponding to the position of a_21, which in this case is 2, as we can see in Fig. 2.4.

Figure 2.8: Second random play of the method.

3. In the third random play, we are in the first row and we generate a new random number, let us assume 0.6, which corresponds to the stop probability (see Fig. 2.9). The drawing of the stop probability has two particular properties concerning the gain of the play:

If the stop probability is drawn in the first random play, the gain is 1;
In the remaining random plays, the stop probability gain is 0 (if i ≠ j) or p_j^{-1} (if i = j), i.e., the inverse of the stop probability value from the row containing the position we want to calculate.

Thus, in this example, we see that the stop probability is not drawn in the first random play, but it is situated in the same row as the position whose inverse matrix value we want to calculate, so the gain of this play is GainOfPlay = v_12 v_21 = 1 × 2 = 2. To obtain an accurate result, N plays are needed, with N sufficiently large, and the TotalGain is given by Equation (2.21).

Figure 2.9: Third random play of the method.

Although the method explained in the previous paragraphs is expected to converge rapidly, it can be inefficient due to having many plays where the gain is 0. Our solution takes this into consideration in order to reduce waste.

There are other Monte Carlo algorithms that aim to enhance the performance of solving linear algebra problems [10, 11, 12]. These algorithms are similar to the one explained above, and it has been shown that, when some parallelization techniques are applied, the obtained results have great potential. One of these methods [11] is used as a pre-conditioner, as a consequence of the costly approach of direct and iterative methods, and it has been shown that the Monte Carlo methods

present better results than the former classic methods. Consequently, our solution will exploit these parallelization techniques, explained in the next subsections, to improve our method.

2.5 Language Support for Parallelization

Parallel computing [8] is the use of a parallel computer (i.e., a multiple-processor computer system supporting parallel programming) to reduce the time needed to solve a single computational problem. It is a standard way to solve problems like the one presented in this work. In order to use these parallelization techniques, we need a programming language that allows us to explicitly indicate how different portions of the computation may be executed concurrently by different processors. In the following subsections we present several kinds of parallel programming languages.

2.5.1 OpenMP

OpenMP [13] is an extension of programming languages tailored for a shared-memory environment. It is an Application Program Interface (API) that consists of a set of compiler directives and a library of support functions. OpenMP was developed for Fortran, C, and C++.

OpenMP is simple, portable and appropriate for programming on multiprocessors. However, it has the limitation of not being suitable for generic multicomputers, since it can only be used on shared-memory systems. On the other hand, OpenMP allows programs to be incrementally parallelized, i.e., the parallelization of an existing program is introduced as a sequence of incremental changes, parallelizing one loop at a time. Following each transformation, the program is tested to ensure that its behavior does not change compared to the original program. Programs are usually not much longer than the modified sequential code.

2.5.2 MPI

Message Passing Interface (MPI) [14] is a standard specification for message-passing libraries (i.e., a form of communication used in parallel programming, in which communications are completed by the sending of messages - functions, signals and data packets - to recipients). MPI is virtually supported on every commercial parallel computer, and free libraries meeting the MPI standard are available for home-made commodity clusters.

MPI allows the portability of programs to different parallel computers, although the performance of a particular program may vary widely from one machine to another. It is suitable for programming on multicomputers. However, it requires extensive rewriting of the sequential programs.

2.5.3 GPUs

The Graphics Processing Unit (GPU) [15] is a dedicated processor for graphics rendering. It is specialized for compute-intensive, parallel computation, and is therefore designed in such a way that more transistors are devoted to data processing rather than to data caching and flow control. In order to use the power of a GPU, a parallel computing platform and programming model that leverages the parallel compute engine in NVIDIA GPUs can be used: CUDA (Compute Unified Device Architecture). This platform is designed to work with programming languages such as C, C++ and Fortran.

2.6 Evaluation Metrics

To determine the performance of a parallel algorithm, evaluation is important, since it helps us to understand the barriers to higher performance and to estimate how much improvement our program will obtain when the number of processors increases.

When we aim to analyse our parallel program, we can use the following metrics [8]:

Speedup is used when we want to know how much faster the execution time of a parallel program is when compared with the execution time of a sequential program. The general formula is the following:

Speedup = (Sequential execution time) / (Parallel execution time)    (2.22)

However, parallel program operations can be put into three categories: computations that must be performed sequentially; computations that can be performed in parallel; and parallel overhead (communication operations and redundant computations). With these categories in mind, the speedup is denoted as ψ(n, p), where n is the problem size and p is the number of tasks. Taking into account the three aspects of parallel programs, we have: σ(n) as the inherently sequential portion of the computation; ϕ(n) as the portion of the computation that can be executed in parallel; and κ(n, p) as the time required for parallel overhead.

The previous formula for speedup has the optimistic assumption that the parallel portion of the computation can be divided perfectly among the processors. If this is not the case, the parallel execution time will be larger, and the speedup will be smaller. Hence the actual speedup will be less

than or equal to the ratio between sequential execution time and parallel execution time, as we have defined previously. Then, the complete expression for speedup is given by:

ψ(n, p) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p + κ(n, p))    (2.23)

The efficiency is a measure of processor utilization, represented by the following general formula:

Efficiency = (Sequential execution time) / (Processors used × Parallel execution time) = Speedup / Processors used    (2.24)

Using the same notation as for the speedup, the efficiency is denoted as ε(n, p) and has the following definition:

ε(n, p) ≤ (σ(n) + ϕ(n)) / (p σ(n) + ϕ(n) + p κ(n, p))    (2.25)

where 0 ≤ ε(n, p) ≤ 1.

Amdahl's Law can help us understand the global impact of local optimization, and it is given by:

ψ(n, p) ≤ 1 / (f + (1 − f)/p)    (2.26)

where f is the fraction of sequential computation in the original sequential program.

Gustafson-Barsis's Law is a way to evaluate the performance of a parallel program as it scales in size, and it is given by:

ψ(n, p) ≤ p + (1 − p)s    (2.27)

where s is the fraction of sequential computation in the parallel program.

The Karp-Flatt metric, e, can help decide whether the principal barrier to speedup is the amount of inherently sequential code or the parallel overhead, and it is given by the following formula:

e = (1/ψ(n, p) − 1/p) / (1 − 1/p)    (2.28)

The isoefficiency metric is a way to evaluate the scalability of a parallel algorithm executing on a parallel computer, and it can help us choose the design that will achieve higher performance when the number of processors increases. The metric says that if we wish to maintain a constant level of efficiency as p increases, the fraction

ε(n, p) / (1 − ε(n, p))    (2.29)

is a constant C, and the simplified formula is

T(n, 1) ≥ C T_0(n, p)    (2.30)

where T_0(n, p) is the total amount of time spent by all processes doing work not done by the sequential algorithm, and T(n, 1) represents the sequential execution time.
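To make these formulas concrete, the short C sketch below (our own illustration, not from the thesis) evaluates the measured speedup, efficiency and Karp-Flatt metric, Equations (2.22), (2.24) and (2.28), from a sequential time and a set of parallel timings; the timing values are invented for the example.

    #include <stdio.h>

    /* Evaluate speedup (2.22), efficiency (2.24) and the Karp-Flatt
     * metric (2.28) from measured execution times, for p >= 2. */
    static void report(double t_seq, double t_par, int p) {
        double speedup    = t_seq / t_par;
        double efficiency = speedup / p;
        double karp_flatt = (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p);
        printf("p=%d  speedup=%.2f  efficiency=%.2f  e=%.3f\n",
               p, speedup, efficiency, karp_flatt);
    }

    int main(void) {
        double t_seq   = 120.0;                 /* hypothetical timings (s) */
        int    procs[] = { 2, 4, 8 };
        double t_par[] = { 63.0, 34.0, 20.0 };
        for (int i = 0; i < 3; i++)
            report(t_seq, t_par[i], procs[i]);
        return 0;
    }

A rising Karp-Flatt value e as p grows would indicate that parallel overhead, rather than inherently sequential code, is the main barrier to further speedup.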

Chapter 3

Algorithm Implementation

In this chapter we present the implementation of our proposed algorithm to obtain the matrix function, all the tools needed, the issues found, and the solutions to overcome them.

3.1 General Approach

The algorithm we propose is based on the algorithm presented in Section 2.4. For this reason all the assumptions are the same, except that our algorithm does not have the extra column corresponding to the stop probabilities, and the matrix with value factors v_ij is in this case a vector v_i, where all values are the same for the same row. This new approach aims to reuse every single play, i.e., the gain of each play is never zero, and it also makes it possible to control the number of plays. It can be used as well to compute more general functions of a matrix.

Coming back to the example of Section 2.4, the algorithm starts by ensuring that the sum of all the elements of each row is equal to 1. So, if the sum of a row is different from 1, each element of that row is divided by the sum of all elements of that row, and the vector v_i will contain the values, the value factors, used to normalize the rows of the matrix. This process is illustrated in Fig. 3.1, Fig. 3.2 and Fig. 3.3.

Figure 3.1: Algorithm implementation - Example of a matrix B = I − A and A, and the theoretical result B^{-1} = (I − A)^{-1} of the application of this Monte Carlo method.

Figure 3.2: Initial matrix A and respective normalization.

Figure 3.3: Vector with value factors v_i for the given example.

Then, once we have the matrix written in the required form, the algorithm can be applied. The algorithm, as we can see in Fig. 3.4, has four main loops. The first loop defines the row that is being computed. The second loop defines the number of iterations, i.e., random jumps inside the probability matrix, and this relates to the power of the matrix in the corresponding series expansion. Then, for each number of iterations, N plays, i.e., the sample size of the Monte Carlo method, are executed for a given row. Finally, the remaining loop generates each random play with the number of random jumps given by the number of iterations.

for (i = 0; i < rowsize; i++) {
    for (q = 0; q < NUM_ITERATIONS; q++) {
        for (k = 0; k < NUM_PLAYS; k++) {
            currentrow = i;
            vp = 1;
            for (p = 0; p < q; p++) {
                /* ... one random jump per iteration ... */
            }
        }
    }
}

Figure 3.4: Code excerpt in C with the main loops of the proposed algorithm.

In order to better understand the algorithm's behavior, two examples will be given:

1. In the case where we have one iteration, one possible play is the example of Fig. 3.5. It follows the same reasoning as the algorithm presented in Section 2.4, except for the matrix element where the gain is stored, i.e., the position of the inverse matrix in which the gain is accumulated. This depends on the column where the last iteration stops and on the row where the play starts (first loop). The gain is accumulated in the position corresponding to the row in which the play started and the column in which it finished. Let us assume that it started in row 3 and ended in column 1; the element to which the gain is added would be (B^{-1})_31. In this particular instance, it stops in the second column while it started in the first row, so the gain will be added to the element (B^{-1})_12.

2. When we have two iterations, one possible play is the example of Fig. 3.6 for the first

iteration, and Fig. 3.7 for the second iteration. In this case, it stops in the third column and it started in the first row, so the gain will count for the position (B^{-1})_13 of the inverse matrix.

Figure 3.5: Example of one play with one iteration.

Figure 3.6: Example of the first iteration of one play with two iterations.

Figure 3.7: Example of the second iteration of one play with two iterations.

Finally, after the algorithm computes all the plays for each number of iterations, if we want to obtain the inverse matrix, we must retrieve the total gain for each position. This process consists in the sum of all the gains for each number of iterations, divided by the N plays, as we can see in Fig. 3.8.

for (i = 0; i < rowsize; i++) {
    for (j = 0; j < columnsize; j++) {
        for (q = 0; q < NUM_ITERATIONS; q++) {
            inverse[i][j] += aux[q][i][j];
        }
    }
}
for (i = 0; i < rowsize; i++) {
    for (j = 0; j < columnsize; j++) {
        inverse[i][j] = inverse[i][j] / (NUM_PLAYS);
    }
}

Figure 3.8: Code excerpt in C with the sum of all the gains for each position of the inverse matrix.

The proposed algorithm was implemented in C, since it is a good programming language for controlling memory usage, and it provides language constructs that efficiently map to machine in-

structions as well. Another reason is the fact that it is compatible and adaptable with all the parallelization techniques presented in Section 2.5. Concerning the parallelization technique, we used OpenMP, since it is the simplest and easiest way to transform a serial program into a parallel program.

3.2 Implementation of the Different Matrix Functions

The algorithm we propose, depending on how we aggregate the output results, is capable of obtaining different matrix functions as a result. In this thesis, we are interested in obtaining the inverse matrix and the matrix exponential, since these functions give us important complex network metrics: node centrality and node communicability, respectively (see Section 2.1).

In Fig. 3.9, we can see how we obtain the inverse matrix of one single row, according to Equation (2.2). In Fig. 3.10 we can observe how we obtain the matrix exponential, taking into account Equation (2.3). If we iterate this process a number of times equal to the number of rows (the first dimension of the matrix), we get the results for the full matrix.

for (j = 0; j < columnsize; j++) {
    for (q = 0; q < NUM_ITERATIONS; q++) {
        inverse[j] += aux[q][j];
    }
}
for (j = 0; j < columnsize; j++) {
    inverse[j] = inverse[j] / (NUM_PLAYS);
}

Figure 3.9: Code excerpt in C with the necessary operations to obtain the inverse matrix of one single row.

for (j = 0; j < columnsize; j++) {
    for (q = 0; q < NUM_ITERATIONS; q++) {
        exponential[j] += aux[q][j] / factorial(q);
    }
}
for (j = 0; j < columnsize; j++) {
    exponential[j] = exponential[j] / (NUM_PLAYS);
}

Figure 3.10: Code excerpt in C with the necessary operations to obtain the matrix exponential of one single row.
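The excerpts above assume that each play has already stored its per-iteration gains in aux. The step the text describes only in prose - choosing the next column of a random jump by accumulating the normalized row values until they exceed a uniform random number - can be sketched as follows. This is our own hedged illustration, not the thesis code; the function name select_next_column and the sample row are assumptions, and column indices are 0-based.

    #include <stdio.h>

    /* Given one row of the normalized matrix (its entries sum to 1) and a
     * uniform random number u in [0, 1), return the drawn column: the first
     * index at which the running sum of the row exceeds u, as in the worked
     * example of Section 2.4. */
    static int select_next_column(const double *row, int columnsize, double u) {
        double acc = 0.0;
        for (int j = 0; j < columnsize; j++) {
            acc += row[j];
            if (u < acc)
                return j;              /* column j has been drawn */
        }
        return columnsize - 1;         /* guard against rounding error */
    }

    int main(void) {
        /* A normalized row consistent with the worked example of Section 2.4. */
        double row[3] = { 0.2, 0.2, 0.6 };
        double u = 0.28;               /* the random number used in that example */
        printf("drawn column: %d\n", select_next_column(row, 3, u));
        return 0;
    }

With u = 0.28 the sketch selects column 1 (the second column), matching the example in which a_12 is drawn; inside the algorithm, the gain of the play would then be multiplied by the value factor v_i of the row that was left before jumping to the drawn row.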


More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 Advance Encryption Standard (AES) Rijndael algorithm is symmetric block cipher that can process data blocks of 128 bits, using cipher keys with lengths of 128, 192, and 256

More information

PARALLEL AND DISTRIBUTED COMPUTING

PARALLEL AND DISTRIBUTED COMPUTING PARALLEL AND DISTRIBUTED COMPUTING 2013/2014 1 st Semester 2 nd Exam January 29, 2014 Duration: 2h00 - No extra material allowed. This includes notes, scratch paper, calculator, etc. - Give your answers

More information

HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE

HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER AVISHA DHISLE PRERIT RODNEY ADHISLE PRODNEY 15618: PARALLEL COMPUTER ARCHITECTURE PROF. BRYANT PROF. KAYVON LET S

More information

Scalability of Processing on GPUs

Scalability of Processing on GPUs Scalability of Processing on GPUs Keith Kelley, CS6260 Final Project Report April 7, 2009 Research description: I wanted to figure out how useful General Purpose GPU computing (GPGPU) is for speeding up

More information

Analytical Modeling of Parallel Programs

Analytical Modeling of Parallel Programs 2014 IJEDR Volume 2, Issue 1 ISSN: 2321-9939 Analytical Modeling of Parallel Programs Hardik K. Molia Master of Computer Engineering, Department of Computer Engineering Atmiya Institute of Technology &

More information

GPU Programming Using NVIDIA CUDA

GPU Programming Using NVIDIA CUDA GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics

More information

Monte Carlo Integration and Random Numbers

Monte Carlo Integration and Random Numbers Monte Carlo Integration and Random Numbers Higher dimensional integration u Simpson rule with M evaluations in u one dimension the error is order M -4! u d dimensions the error is order M -4/d u In general

More information

High Performance Computing

High Performance Computing The Need for Parallelism High Performance Computing David McCaughan, HPC Analyst SHARCNET, University of Guelph dbm@sharcnet.ca Scientific investigation traditionally takes two forms theoretical empirical

More information

Hashing. Hashing Procedures

Hashing. Hashing Procedures Hashing Hashing Procedures Let us denote the set of all possible key values (i.e., the universe of keys) used in a dictionary application by U. Suppose an application requires a dictionary in which elements

More information

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication

More information

CS321 Introduction To Numerical Methods

CS321 Introduction To Numerical Methods CS3 Introduction To Numerical Methods Fuhua (Frank) Cheng Department of Computer Science University of Kentucky Lexington KY 456-46 - - Table of Contents Errors and Number Representations 3 Error Types

More information

Fast Automated Estimation of Variance in Discrete Quantitative Stochastic Simulation

Fast Automated Estimation of Variance in Discrete Quantitative Stochastic Simulation Fast Automated Estimation of Variance in Discrete Quantitative Stochastic Simulation November 2010 Nelson Shaw njd50@uclive.ac.nz Department of Computer Science and Software Engineering University of Canterbury,

More information

BİL 542 Parallel Computing

BİL 542 Parallel Computing BİL 542 Parallel Computing 1 Chapter 1 Parallel Programming 2 Why Use Parallel Computing? Main Reasons: Save time and/or money: In theory, throwing more resources at a task will shorten its time to completion,

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

Linear Equation Systems Iterative Methods

Linear Equation Systems Iterative Methods Linear Equation Systems Iterative Methods Content Iterative Methods Jacobi Iterative Method Gauss Seidel Iterative Method Iterative Methods Iterative methods are those that produce a sequence of successive

More information

Contents. F10: Parallel Sparse Matrix Computations. Parallel algorithms for sparse systems Ax = b. Discretized domain a metal sheet

Contents. F10: Parallel Sparse Matrix Computations. Parallel algorithms for sparse systems Ax = b. Discretized domain a metal sheet Contents 2 F10: Parallel Sparse Matrix Computations Figures mainly from Kumar et. al. Introduction to Parallel Computing, 1st ed Chap. 11 Bo Kågström et al (RG, EE, MR) 2011-05-10 Sparse matrices and storage

More information

TABLES AND HASHING. Chapter 13

TABLES AND HASHING. Chapter 13 Data Structures Dr Ahmed Rafat Abas Computer Science Dept, Faculty of Computer and Information, Zagazig University arabas@zu.edu.eg http://www.arsaliem.faculty.zu.edu.eg/ TABLES AND HASHING Chapter 13

More information

A METHOD TO MODELIZE THE OVERALL STIFFNESS OF A BUILDING IN A STICK MODEL FITTED TO A 3D MODEL

A METHOD TO MODELIZE THE OVERALL STIFFNESS OF A BUILDING IN A STICK MODEL FITTED TO A 3D MODEL A METHOD TO MODELIE THE OVERALL STIFFNESS OF A BUILDING IN A STICK MODEL FITTED TO A 3D MODEL Marc LEBELLE 1 SUMMARY The aseismic design of a building using the spectral analysis of a stick model presents

More information

(Sparse) Linear Solvers

(Sparse) Linear Solvers (Sparse) Linear Solvers Ax = B Why? Many geometry processing applications boil down to: solve one or more linear systems Parameterization Editing Reconstruction Fairing Morphing 2 Don t you just invert

More information

Outline of High-Speed Quad-Precision Arithmetic Package ASLQUAD

Outline of High-Speed Quad-Precision Arithmetic Package ASLQUAD Outline of High-Speed Quad-Precision Arithmetic Package ASLQUAD OGATA Ryusei, KUBO Yoshiyuki, TAKEI Toshifumi Abstract The ASLQUAD high-speed quad-precision arithmetic package reduces numerical errors

More information

Fast Fuzzy Clustering of Infrared Images. 2. brfcm

Fast Fuzzy Clustering of Infrared Images. 2. brfcm Fast Fuzzy Clustering of Infrared Images Steven Eschrich, Jingwei Ke, Lawrence O. Hall and Dmitry B. Goldgof Department of Computer Science and Engineering, ENB 118 University of South Florida 4202 E.

More information

1 2 (3 + x 3) x 2 = 1 3 (3 + x 1 2x 3 ) 1. 3 ( 1 x 2) (3 + x(0) 3 ) = 1 2 (3 + 0) = 3. 2 (3 + x(0) 1 2x (0) ( ) = 1 ( 1 x(0) 2 ) = 1 3 ) = 1 3

1 2 (3 + x 3) x 2 = 1 3 (3 + x 1 2x 3 ) 1. 3 ( 1 x 2) (3 + x(0) 3 ) = 1 2 (3 + 0) = 3. 2 (3 + x(0) 1 2x (0) ( ) = 1 ( 1 x(0) 2 ) = 1 3 ) = 1 3 6 Iterative Solvers Lab Objective: Many real-world problems of the form Ax = b have tens of thousands of parameters Solving such systems with Gaussian elimination or matrix factorizations could require

More information

High Performance Computing. Introduction to Parallel Computing

High Performance Computing. Introduction to Parallel Computing High Performance Computing Introduction to Parallel Computing Acknowledgements Content of the following presentation is borrowed from The Lawrence Livermore National Laboratory https://hpc.llnl.gov/training/tutorials

More information

Recurrent Neural Network Models for improved (Pseudo) Random Number Generation in computer security applications

Recurrent Neural Network Models for improved (Pseudo) Random Number Generation in computer security applications Recurrent Neural Network Models for improved (Pseudo) Random Number Generation in computer security applications D.A. Karras 1 and V. Zorkadis 2 1 University of Piraeus, Dept. of Business Administration,

More information

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization

More information

Loopy Belief Propagation

Loopy Belief Propagation Loopy Belief Propagation Research Exam Kristin Branson September 29, 2003 Loopy Belief Propagation p.1/73 Problem Formalization Reasoning about any real-world problem requires assumptions about the structure

More information

SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND

SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND Student Submission for the 5 th OpenFOAM User Conference 2017, Wiesbaden - Germany: SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND TESSA UROIĆ Faculty of Mechanical Engineering and Naval Architecture, Ivana

More information

Superdiffusion and Lévy Flights. A Particle Transport Monte Carlo Simulation Code

Superdiffusion and Lévy Flights. A Particle Transport Monte Carlo Simulation Code Superdiffusion and Lévy Flights A Particle Transport Monte Carlo Simulation Code Eduardo J. Nunes-Pereira Centro de Física Escola de Ciências Universidade do Minho Page 1 of 49 ANOMALOUS TRANSPORT Definitions

More information

Review of previous examinations TMA4280 Introduction to Supercomputing

Review of previous examinations TMA4280 Introduction to Supercomputing Review of previous examinations TMA4280 Introduction to Supercomputing NTNU, IMF April 24. 2017 1 Examination The examination is usually comprised of: one problem related to linear algebra operations with

More information

Parallelization Principles. Sathish Vadhiyar

Parallelization Principles. Sathish Vadhiyar Parallelization Principles Sathish Vadhiyar Parallel Programming and Challenges Recall the advantages and motivation of parallelism But parallel programs incur overheads not seen in sequential programs

More information

Structured Parallel Programming

Structured Parallel Programming Structured Parallel Programming Patterns for Efficient Computation Michael McCool Arch D. Robison James Reinders ELSEVIER AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO SAN FRANCISCO

More information

Dense Matrix Algorithms

Dense Matrix Algorithms Dense Matrix Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text Introduction to Parallel Computing, Addison Wesley, 2003. Topic Overview Matrix-Vector Multiplication

More information

Structured Parallel Programming Patterns for Efficient Computation

Structured Parallel Programming Patterns for Efficient Computation Structured Parallel Programming Patterns for Efficient Computation Michael McCool Arch D. Robison James Reinders ELSEVIER AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO SAN FRANCISCO

More information

Introduction to Multithreaded Algorithms

Introduction to Multithreaded Algorithms Introduction to Multithreaded Algorithms CCOM5050: Design and Analysis of Algorithms Chapter VII Selected Topics T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein. Introduction to algorithms, 3 rd

More information

HPC Algorithms and Applications

HPC Algorithms and Applications HPC Algorithms and Applications Dwarf #5 Structured Grids Michael Bader Winter 2012/2013 Dwarf #5 Structured Grids, Winter 2012/2013 1 Dwarf #5 Structured Grids 1. dense linear algebra 2. sparse linear

More information

CS 475: Parallel Programming Introduction

CS 475: Parallel Programming Introduction CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.

More information

Lab # 2 - ACS I Part I - DATA COMPRESSION in IMAGE PROCESSING using SVD

Lab # 2 - ACS I Part I - DATA COMPRESSION in IMAGE PROCESSING using SVD Lab # 2 - ACS I Part I - DATA COMPRESSION in IMAGE PROCESSING using SVD Goals. The goal of the first part of this lab is to demonstrate how the SVD can be used to remove redundancies in data; in this example

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

An Investigation into Iterative Methods for Solving Elliptic PDE s Andrew M Brown Computer Science/Maths Session (2000/2001)

An Investigation into Iterative Methods for Solving Elliptic PDE s Andrew M Brown Computer Science/Maths Session (2000/2001) An Investigation into Iterative Methods for Solving Elliptic PDE s Andrew M Brown Computer Science/Maths Session (000/001) Summary The objectives of this project were as follows: 1) Investigate iterative

More information

A Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT

A Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT A Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT Daniel Schlifske ab and Henry Medeiros a a Marquette University, 1250 W Wisconsin Ave, Milwaukee,

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA Chapter 1 : BioMath: Transformation of Graphs Use the results in part (a) to identify the vertex of the parabola. c. Find a vertical line on your graph paper so that when you fold the paper, the left portion

More information

Contents. I The Basic Framework for Stationary Problems 1

Contents. I The Basic Framework for Stationary Problems 1 page v Preface xiii I The Basic Framework for Stationary Problems 1 1 Some model PDEs 3 1.1 Laplace s equation; elliptic BVPs... 3 1.1.1 Physical experiments modeled by Laplace s equation... 5 1.2 Other

More information

Exam Design and Analysis of Algorithms for Parallel Computer Systems 9 15 at ÖP3

Exam Design and Analysis of Algorithms for Parallel Computer Systems 9 15 at ÖP3 UMEÅ UNIVERSITET Institutionen för datavetenskap Lars Karlsson, Bo Kågström och Mikael Rännar Design and Analysis of Algorithms for Parallel Computer Systems VT2009 June 2, 2009 Exam Design and Analysis

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank Page rank computation HPC course project a.y. 2012-13 Compute efficient and scalable Pagerank 1 PageRank PageRank is a link analysis algorithm, named after Brin & Page [1], and used by the Google Internet

More information

Figure 6.1: Truss topology optimization diagram.

Figure 6.1: Truss topology optimization diagram. 6 Implementation 6.1 Outline This chapter shows the implementation details to optimize the truss, obtained in the ground structure approach, according to the formulation presented in previous chapters.

More information

SPARSE COMPONENT ANALYSIS FOR BLIND SOURCE SEPARATION WITH LESS SENSORS THAN SOURCES. Yuanqing Li, Andrzej Cichocki and Shun-ichi Amari

SPARSE COMPONENT ANALYSIS FOR BLIND SOURCE SEPARATION WITH LESS SENSORS THAN SOURCES. Yuanqing Li, Andrzej Cichocki and Shun-ichi Amari SPARSE COMPONENT ANALYSIS FOR BLIND SOURCE SEPARATION WITH LESS SENSORS THAN SOURCES Yuanqing Li, Andrzej Cichocki and Shun-ichi Amari Laboratory for Advanced Brain Signal Processing Laboratory for Mathematical

More information

Efficient Solution Techniques

Efficient Solution Techniques Chapter 4 The secret to walking on water is knowing where the rocks are. Herb Cohen Vail Symposium 14 poster Efficient Solution Techniques In the previous chapter, we introduced methods for implementing

More information

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004 A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

Curriculum Map: Mathematics

Curriculum Map: Mathematics Curriculum Map: Mathematics Course: Honors Advanced Precalculus and Trigonometry Grade(s): 11-12 Unit 1: Functions and Their Graphs This chapter will develop a more complete, thorough understanding of

More information

Parallel Programming Models. Parallel Programming Models. Threads Model. Implementations 3/24/2014. Shared Memory Model (without threads)

Parallel Programming Models. Parallel Programming Models. Threads Model. Implementations 3/24/2014. Shared Memory Model (without threads) Parallel Programming Models Parallel Programming Models Shared Memory (without threads) Threads Distributed Memory / Message Passing Data Parallel Hybrid Single Program Multiple Data (SPMD) Multiple Program

More information

Writing Parallel Programs; Cost Model.

Writing Parallel Programs; Cost Model. CSE341T 08/30/2017 Lecture 2 Writing Parallel Programs; Cost Model. Due to physical and economical constraints, a typical machine we can buy now has 4 to 8 computing cores, and soon this number will be

More information

OpenMP and MPI. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico.

OpenMP and MPI. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico. OpenMP and MPI Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico November 15, 2010 José Monteiro (DEI / IST) Parallel and Distributed Computing

More information

Stochastic Simulation: Algorithms and Analysis

Stochastic Simulation: Algorithms and Analysis Soren Asmussen Peter W. Glynn Stochastic Simulation: Algorithms and Analysis et Springer Contents Preface Notation v xii I What This Book Is About 1 1 An Illustrative Example: The Single-Server Queue 1

More information

x = 12 x = 12 1x = 16

x = 12 x = 12 1x = 16 2.2 - The Inverse of a Matrix We've seen how to add matrices, multiply them by scalars, subtract them, and multiply one matrix by another. The question naturally arises: Can we divide one matrix by another?

More information

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 1 Today Characteristics of Tasks and Interactions (3.3). Mapping Techniques for Load Balancing (3.4). Methods for Containing Interaction

More information

Parallel Computing for Process Mining

Parallel Computing for Process Mining Parallel Computing for Process Mining Rui Miguel Monte Pegado dos Santos Thesis to obtain the Master of Science Degree in Information Systems and Computer Engineering Supervisor: Prof. Diogo Manuel Ribeiro

More information

Performance Comparison between Blocking and Non-Blocking Communications for a Three-Dimensional Poisson Problem

Performance Comparison between Blocking and Non-Blocking Communications for a Three-Dimensional Poisson Problem Performance Comparison between Blocking and Non-Blocking Communications for a Three-Dimensional Poisson Problem Guan Wang and Matthias K. Gobbert Department of Mathematics and Statistics, University of

More information

Modelling and Quantitative Methods in Fisheries

Modelling and Quantitative Methods in Fisheries SUB Hamburg A/553843 Modelling and Quantitative Methods in Fisheries Second Edition Malcolm Haddon ( r oc) CRC Press \ y* J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of

More information