A Parallel Algorithm with Embedded Load Balancing for the Computation of Autocorrelation Matrix

S.R. Subramanya
Department of Electrical Engineering and Computer Science, The George Washington University, Washington, DC

Abstract. The computation of the autocorrelation matrix is used heavily in several areas, including signal and image processing, where parallel architectures are also being increasingly used. An efficient scheme to compute the autocorrelation matrix on parallel architectures therefore has substantial benefits. In this paper, an efficient parallel algorithm for the computation of the autocorrelation matrix on a 2-D mesh is presented. The computational requirements of the elements of the autocorrelation matrix are highly skewed, and the proposed algorithm balances the computational load without requiring an external load-balancing algorithm or processor; in this sense, the load balancing is embedded within the algorithm. Communication and computation complexities are analyzed separately. The proposed algorithm is shown to provide a speedup of up to 375% over the straightforward parallel algorithm.

Keywords: Autocorrelation matrix, Parallel algorithm, 2-D mesh, Load balancing.

1 Introduction

The computation of the autocorrelation matrix is central to several applications, including signal and image processing. For example, it is used to compute the coefficients of the ARMA (autoregressive moving average) model, which is used for modeling stationary signals. Non-stationary signals are sometimes approximated by considering windows of the signal and modeling each window as a stationary signal with suitable parameters. Given a matrix X = (x_{i,j}), 0 ≤ i, j ≤ N−1, the autocorrelation matrix is A = (a_{k,l}), 0 ≤ k, l ≤ N−1, where

$$a_{k,l} = \frac{1}{(N-k)(N-l)} \sum_{i=0}^{N-1-k} \sum_{j=0}^{N-1-l} x_{i,j}\, x_{i+k,\,j+l}$$

We consider a 2-D mesh as the architecture for which parallel algorithms for autocorrelation matrix computation will be developed. The basic architecture consists of processing elements (PEs) arranged in a 2-D array with nearest-neighbor interconnections. Each PE has a simple structure capable of arithmetic operations and of basic communication operations (send and receive) to and from its directly connected neighbors, and has local memory for its exclusive use. In the subsequent discussion, the terms PE and processor are used synonymously. Since the proposed algorithm has the same asymptotic complexity as the straightforward parallel algorithm, we compute the exact number of steps required by the two algorithms, counting computation and communication steps separately. The next section gives the notation and assumptions used in the rest of the paper; Sections 3 and 4 describe and analyze the straightforward parallel algorithm and the proposed algorithm, respectively, including the speedup of the proposed algorithm; conclusions follow.
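For concreteness, the definition can be transcribed directly into code. The following Python sketch (an illustration, not part of the paper; the function name and the use of NumPy are my own) computes A serially from the definition, in O(N^4) work, which is the cost the parallel algorithms set out to distribute:

```python
import numpy as np

def autocorrelation_matrix(X):
    """Serial transcription of the definition:
    A[k, l] = (1 / ((N-k)(N-l))) * sum_{i,j} X[i, j] * X[i+k, j+l]."""
    N = X.shape[0]
    A = np.zeros((N, N))
    for k in range(N):
        for l in range(N):
            # Products over the (N-k) x (N-l) overlap region, then normalize.
            A[k, l] = np.sum(X[:N - k, :N - l] * X[k:, l:]) / ((N - k) * (N - l))
    return A
```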

2 Notations and Assumptions

In the algorithms and discussion to follow, the following notation and assumptions are used. The input matrix X is of size N × N, as are the 2-D mesh used in the computation and the matrix A that holds the resulting autocorrelation matrix. The indices of both the matrices and the PEs range from 0 to N−1. Each PE (i, j) initially contains X[i, j], and each PE has enough local memory to hold the entire matrix X. An all-to-all broadcast is first performed so that the entire matrix X is built up in each processor [2]. The computation of the autocorrelation matrix then proceeds, and at the end of the computation each PE (i, j) contains the element A[i, j].

3 The Straightforward Parallel Algorithm

We first give the straightforward parallel algorithm, to aid the understanding of the underlying parallelism, and analyze its communication and computation complexity.

Algorithm 3.1 ParAutoCorr (X, A)
1. begin
2. for k = 0 to N−1 do
3.   for l = 0 to N−1 do
4.     for i = 0 to N−1−k pardo
5.       for j = 0 to N−1−l pardo
6.         P(k+i, l+j) does: S[k+i, l+j] ← X[i, j] × X[i+k, j+l];
7.       endfor
8.     endfor
9.     for i = k to N−1 pardo
10.      Sum all S[i, j], l ≤ j ≤ N−1, using parallel reduction, and store the result in S[i, l].
11.    endfor
12.    Sum all S[i, l], k ≤ i ≤ N−1, using parallel reduction, and store the result in A[k, l].
13.  endfor
14. endfor
15. end

3.1 Analysis of the Straightforward Parallel Algorithm

The computational requirements of the autocorrelation matrix elements are easily seen to be highly skewed: a_{0,0} takes N^2 multiplications, N^2 − 1 additions, and 2(N−1) communication steps, while a_{N−1,N−1} takes just one multiplication and no additions or communication steps; the number of steps required for the remaining elements lies in between. The load on the processors is thus highly skewed, which is the motivating factor for our proposed algorithm with embedded load balancing. Since the proposed algorithm has the same asymptotic complexity as the straightforward parallel algorithm, we compute the exact numbers of computation and communication steps taken by the two algorithms and compare them.
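The parallel reductions in steps 10 and 12 are standard recursive-doubling sums. Below is a serial Python sketch of one row reduction (an illustration of the technique under the paper's synchronous-step model, not the paper's code; on the mesh, each combining step at stride s also costs s nearest-neighbor hops):

```python
def row_reduce(vals, l):
    """Sum vals[l..] into vals[l] with pairwise combining, as in step 10 of
    Algorithm 3.1. Each pass of the while-loop is one parallel addition step;
    ceil(log2(n)) passes suffice for n terms."""
    n = len(vals) - l
    stride = 1
    while stride < n:
        # PEs whose partial sums are `stride` apart combine in parallel.
        for j in range(l, len(vals) - stride, 2 * stride):
            vals[j] += vals[j + stride]
        stride *= 2
    return vals[l]

assert row_reduce(list(range(8)), 2) == sum(range(2, 8))
```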

For the computation of an element A[k, l], the processors in the rectangular block with top-left corner (k, l) and bottom-right corner (N−1, N−1) (of height H = N−k and width W = N−l) operate in parallel, and the result is stored in PE P(k, l). All the multiplications are carried out in parallel in one step. The additions require ⌈log W⌉ + ⌈log H⌉ computation steps and (N−1−l) + (N−1−k) communication steps. The exact computation time (derived in [2]) is

$$N^2 + 2N[N \log N - N + 1]$$

The communication time consists essentially of the time for the product terms to move from PE to PE for the additions. Computing a_{k,l} requires a communication time of (N−l−1) + (N−k−1), the time for the product term from the farthest PE, P(N−1, N−1), to reach P(k, l). The total communication time is shown in [2] to be N^2(N−1). The execution time of the straightforward parallel algorithm, which is the sum of its computation and communication times, is therefore

$$N^2 + 2N[N \log N - N + 1] + N^2(N-1) = N^3 + 2N[N \log N - N + 1]$$

4 Parallel Algorithm with Embedded Load Balancing

In the above algorithm, each a_{k,l} uses one block of PEs working in parallel. Since the computation and communication requirements of the elements are highly skewed, we propose an algorithm that computes several a_{k,l}'s concurrently, using different blocks of processors working in parallel for the different a_{k,l}'s, without interference between blocks. Since the attempt is to balance the load on each PE without using any external load-balancing algorithm or processor, we say that the load balancing is embedded in the algorithm. In this algorithm N is assumed to be odd, although only a slight modification is required for even N. The algorithm has four phases; Table 1 summarizes the elements computed concurrently in each phase and the corresponding blocks of processors used in their computation.

Table 1: Blocks of processors used in the computation of the elements of A

Phase  Range             Element computed   Block top-left   Height   Width
1      --                (0, 0)             (0, 0)           N        N
2      1 <= l <= N/2     (0, l)             (0, 0)           N        N-l
                         (0, N-l)           (0, N-l)         N        l
3      1 <= k <= N/2     (k, 0)             (0, 0)           N-k      N
                         (N-k, 0)           (N-k, 0)         k        N
4      1 <= k, l <= N/2  (k, l)             (0, 0)           N-k      N-l
                         (k, N-l)           (0, N-l)         N-k      l
                         (N-k, l)           (N-k, 0)         k        N-l
                         (N-k, N-l)         (N-k, N-l)       k        l

The computation pattern is easy to visualize; Figure 1 shows the pattern for the fourth phase.

[Figure 1: Pattern of computation in the proposed algorithm (fourth phase). The dark nodes indicate the elements being computed, which are also the nodes where the results are stored; the rectangles enclosing the dark nodes indicate the blocks of PEs that participate in the computation of the corresponding elements.]
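The phase-4 rows of Table 1 assert that the four blocks partition the mesh exactly, which is what makes interference-free concurrency possible. A small verification sketch (hypothetical, mine; not part of the paper):

```python
def phase4_blocks(N, k, l):
    """The four (top, left, height, width) blocks of Table 1, phase 4."""
    return [
        (0,     0,     N - k, N - l),   # computes a[k, l]
        (0,     N - l, N - k, l),       # computes a[k, N-l]
        (N - k, 0,     k,     N - l),   # computes a[N-k, l]
        (N - k, N - l, k,     l),       # computes a[N-k, N-l]
    ]

def tiles_mesh(N, k, l):
    """True iff every PE of the N x N mesh lies in exactly one block."""
    count = [[0] * N for _ in range(N)]
    for top, left, h, w in phase4_blocks(N, k, l):
        for i in range(top, top + h):
            for j in range(left, left + w):
                count[i][j] += 1
    return all(c == 1 for row in count for c in row)

# N odd, as the algorithm assumes; 1 <= k, l <= N/2.
assert all(tiles_mesh(9, k, l) for k in range(1, 5) for l in range(1, 5))
```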

With this scheme, all the multiplications for all four elements can be done in one step. The addition of all the product terms is then done by a series of communication and addition steps.

Algorithm 4.1 ParAutoCorrWithLoadBal (X, A)
1. begin
   {Phase 1}
2. Compute a_{0,0}.
   {Phase 2}
3. for l = 1 to N/2 do
4.   Do concurrently:
5.     Compute a_{0,l} and Compute a_{0,N−l}.
6.   enddo
7. endfor
   {Phase 3}
8. for k = 1 to N/2 do
9.   Do concurrently:
10.    Compute a_{k,0} and Compute a_{N−k,0}.
11.  enddo
12. endfor
   {Phase 4}
13. for k = 1 to N/2 do
14.  for l = 1 to N/2 do
15.    Do concurrently:
16.      Compute a_{k,l}, Compute a_{k,N−l}, Compute a_{N−k,l}, and Compute a_{N−k,N−l}.
17.    enddo
18.  endfor
19. endfor
20. end

Algorithm 4.2 The four Compute routines

Compute a_{k,l}
1. begin
2. MultiplyBlock (k, l, N−k, N−l, 0, 0).
3. AddBlock (k, l, N−k, N−l, 0, 0).
4. end

Compute a_{k,N−l}
1. begin
2. MultiplyBlock (k, N−l, N−k, l, 0, N−l).
3. AddBlock (k, N−l, N−k, l, 0, N−l).
4. end

Compute a_{N−k,l}
1. begin
2. MultiplyBlock (N−k, l, k, N−l, N−k, 0).
3. AddBlock (N−k, l, k, N−l, N−k, 0).
4. end

Compute a_{N−k,N−l}
1. begin
2. MultiplyBlock (N−k, N−l, k, l, N−k, N−l).
3. AddBlock (N−k, N−l, k, l, N−k, N−l).
4. end

Algorithm 4.3 MultiplyBlock (k, l, H, W, x, y)
{The block of processors with top-left corner at (x, y), of height H and width W, is used in the computation of element a_{k,l}.}
1. begin
2. for i = 0 to H−1 pardo
3.   for j = 0 to W−1 pardo
4.     P(x+i, y+j) does:
5.       prod ← X[i, j] × X[k+i, l+j];
6.   endfor
7. endfor
8. end
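A serial Python rendering of MultiplyBlock and of one phase-4 group of Algorithm 4.2 may help fix the block geometry (function names and array layout are mine; on the mesh, the products of all four calls are formed in a single parallel step):

```python
def multiply_block(X, k, l, H, W, x, y, prod):
    """Algorithm 4.3: PE (x+i, y+j) forms the product term X[i,j] * X[k+i, l+j]
    of element a[k, l]; the block has top-left (x, y), height H, width W."""
    for i in range(H):
        for j in range(W):
            prod[x + i][y + j] = X[i][j] * X[k + i][l + j]

def phase4_products(X, N, k, l, prod):
    # The four concurrent MultiplyBlock calls of Algorithm 4.2, serialized.
    multiply_block(X, k,     l,     N - k, N - l, 0,     0,     prod)  # a[k, l]
    multiply_block(X, k,     N - l, N - k, l,     0,     N - l, prod)  # a[k, N-l]
    multiply_block(X, N - k, l,     k,     N - l, N - k, 0,     prod)  # a[N-k, l]
    multiply_block(X, N - k, N - l, k,     l,     N - k, N - l, prod)  # a[N-k, N-l]
```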

Algorithm 4.4 AddBlock (k, l, H, W, x, y)
{The block of processors with top-left corner at (x, y), of height H and width W, is used in the computation of element a_{k,l}.}
1. begin
2. for i = 0 to H−1 pardo
3.   for m = 1 to ⌈log W⌉ do
4.     Sum, in parallel, the terms in nodes that are a distance 2^{m−1} apart, in such a way that the sum of any row i is available in PE (i, l) at the end.
5.   endfor
6. endfor   {The sum of row i is now available in A[i, l].}
7. for m = 1 to ⌈log H⌉ do
8.   Sum, in parallel, the terms in the nodes of column l that are a distance 2^{m−1} apart, in such a way that the sum of column l is available in PE (k, l) at the end.
9. endfor   {The sum of column l is now available in A[k, l].}
10. end

The communication time T_c required for the addition of all the elements of any row in a processor block of width W, with the result stored in column l, is given in Table 2.

Table 2: Communication steps T_c as a function of the column index l of the element being computed

   l < W/2:   T_c = W − 1 − l
   l = W/2:   T_c = l + (⌈W/2⌉ − ⌊W/2⌋)
   l > W/2:   T_c = l

A similar table holds for the communication time required for the addition of the elements of any column in a processor block of height H, with the result stored in row k. This table is central to the computation of the communication times during the additions. In phase four, during the concurrent computation of a_{k,l}, a_{k,N−l}, a_{N−k,l}, and a_{N−k,N−l}, the corresponding blocks of PEs work independently of the other blocks, and the communication time is dominated by that of the computation of a_{k,l}; hence it suffices to calculate only that.

4.1 Analysis of the Proposed Algorithm

We determine the time taken by the proposed algorithm by computing the exact number of steps required for computation and communication. Note that during the computation of any a_{k,l}, all the required multiplications are done in one step; the remaining time is taken by the additions, which require both addition and communication steps. The total number of multiplication steps over all the a_{k,l}'s being N^2, we consider only the addition times in the calculations below and add N^2 at the end to get the total time. The computation and communication steps for the four phases, derived in [2], are tabulated below.

Phase   Computation time
1       2⌈log N⌉
2, 3    (N/2)⌈log N⌉ + Σ_{i=N/2}^{N−1} ⌈log i⌉   (each)
4       N Σ_{i=N/2}^{N−1} ⌈log i⌉

Phase   Communication time
1       2(N − 1)
2, 3    (N/24)(19N − 26)   (each)
4       (7/24) N^2 (N − 2)
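Table 2 reads directly as a function. A sketch (mine, assuming integer l and the ceilings and floors exactly as tabulated):

```python
from math import ceil, floor

def comm_steps_row(W, l):
    """T_c of Table 2: communication steps to sum a row of W terms into
    column l of the block (column indices 0 .. W-1 within the block)."""
    if l < W / 2:
        return W - 1 - l          # terms arriving from the right dominate
    if l == W / 2:                # only possible when W is even,
        return l + (ceil(W / 2) - floor(W / 2))  # so the correction is 0
    return l                      # terms arriving from the left dominate
```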

The total number of computation steps is

$$2\lceil \log N \rceil + N\lceil \log N \rceil + (N+2) \sum_{i=N/2}^{N-1} \lceil \log i \rceil$$

and the total communication time is

$$2(N-1) + \frac{N}{24}(19N-26) + \frac{N}{24}(19N-26) + \frac{7}{24}N^2(N-2) = \frac{7}{24}N^3 + N^2 - \frac{N}{6} - 2$$

The execution time of the proposed algorithm is the sum of its computation and communication times. The speedup provided by the proposed algorithm over the straightforward parallel algorithm is shown in [2] to satisfy

$$\text{speedup} > \frac{N^3 + 2N^2 \log N - 2N^2 + 2N}{\frac{7}{24}N^3 + N^2 - \frac{N}{6} + (2N+4)\log N}$$

The speedup is shown in Figure 2 for various values of N (the matrix size).

[Figure 2: Speedup of the proposed algorithm over the straightforward parallel algorithm, plotted against the matrix size.]

5 Conclusions

In this paper, a parallel algorithm for the computation of the autocorrelation matrix on a 2-D mesh was presented, with the load balancing embedded in the algorithm. The computation and communication times were computed separately. The proposed algorithm was shown to provide a speedup of up to 375% over the straightforward parallel algorithm.

References

[1] Akl, S.G. Design and Analysis of Parallel Algorithms. Prentice-Hall.
[2] Subramanya, S.R. 'A Parallel Algorithm with Embedded Load Balancing for the Computation of Autocorrelation Matrix on 2-D Meshes'. Unpublished manuscript.
[3] Hedetniemi, S.M. 'A Survey of Gossiping and Broadcasting in Communication Networks'. Networks, Vol. 18, 1988.
[4] Brockwell, P.J. and Davis, R.A. Time Series: Theory and Methods. Springer-Verlag.
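As a closing numerical note on the bound of Section 4.1: assuming log means log2 and dropping the ceilings, the two closed forms can be compared directly (hypothetical script, mine; it evaluates the analytical bound, not a measured speedup):

```python
from math import log2

def t_straightforward(N):
    # N^3 + 2N(N log N - N + 1): execution time of the straightforward algorithm.
    return N**3 + 2 * N * (N * log2(N) - N + 1)

def t_proposed_bound(N):
    # Denominator of the speedup bound: (7/24)N^3 + N^2 - N/6 + (2N+4) log N.
    return 7 / 24 * N**3 + N**2 - N / 6 + (2 * N + 4) * log2(N)

for N in (15, 63, 255, 1023):
    print(N, round(t_straightforward(N) / t_proposed_bound(N), 2))
```

The leading terms of numerator and denominator are N^3 and (7/24)N^3, so this particular bound tends to 24/7 ≈ 3.43 as N grows.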
