PAPER Design of Optimal Array Processors for Two-Step Division-Free Gaussian Elimination


Shietung PENG and Stanislav G. SEDUKHIN, Nonmembers

SUMMARY The design of array processors for solving linear systems using the two-step division-free Gaussian elimination method is considered. The two-step method can be used to improve systems based on the one-step method in terms of numerical stability as well as the requirements for high precision. In spite of the rather complicated computations needed at each iteration of the two-step method, we develop an innovative parallel algorithm whose data dependency graph meets the requirements for regularity and locality. We then derive two-dimensional array processors by adopting a systematic approach to investigate the set of all admissible solutions, and obtain the optimal array processors under linear time-space scheduling. The array processors are optimal in terms of the number of processing elements used.

key words: linear system, parallel algorithm, parallel architecture, systolic array processors

1. Introduction

In the area of high-performance computing, the design of division-free array processors has been attracting research attention recently [5], [6], [8], [10]. The main reason is that the division unit in high-performance special-purpose processors is both time and space consuming. Besides, as far as numerical stability is concerned, the cumulative effect of round-off error in division operations also makes division-free algorithms much more appealing.

The one-step division-free Gaussian elimination method was used in [8], [10] to develop optimal array processors. This method may drive the system to high-precision requirements and instability, as described in [1], [3]. The problem arises from the fact that the absolute values of the elements of the updated matrix increase rapidly and eventually reduce the numerical stability of the algorithm for large input matrices.
To increase numerical stability, Bareiss proposed a multistep division-free Gaussian elimination method and showed that the multistep method gives better numerical stability than the one-step method [1]. Because the formulation of the multistep method is irregular and complicated, it is a nontrivial task to design an efficient parallel algorithm and a corresponding array processor based on this method. In this paper we design a parallel algorithm for the two-step method through proper partitioning and re-indexing. In the design process we show how to circumvent the irregularity of the original algorithm step by step. Then highly parallel array processors based on this parallel algorithm are systematically designed and analyzed. Two 2-D array processors that are optimal in terms of the number of PEs are shown in this paper. The key features of these array processors are that (1) they are division-free and (2) their numerical stability is enhanced by the two-step method compared with counterparts based on the one-step method.

The rest of the paper is organized into five sections. Section 2 gives some background for the two-step division-free Gaussian elimination method. A new parallel algorithm based on the two-step method is presented in Sect. 3. In Sect. 4 the 3-D data dependency graph of the new algorithm and its analysis are given. In Sect. 5 two optimal 2-D array processors based on the new algorithm and a systematic approach [9] are described, and their performance is investigated. Finally, in the last section, concluding remarks and further research directions are discussed.

Manuscript received July. Manuscript revised June. The authors are with the University of Aizu, Aizu-Wakamatsu-shi, Japan.

2. Two-Step Division-Free Gaussian Elimination

Let a generalized linear system of equations be given by $AX = B$, where $A = [a_{ij}]$, $1 \le i, j \le n$; $B = [b_{ij}]$, $1 \le i \le n$, $n+1 \le j \le m$; and $X = [x_{ij}]$, $1 \le i \le n$, $1 \le j \le m-n$. To solve $AX = B$, the matrix $A$ should be reduced to diagonal form, or to triangular form with subsequent back substitution.
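In general, systolization of the algorithm that reduces $A$ to diagonal form is more difficult than systolization of the one that reduces it to triangular form. In this paper we consider the algorithm that reduces the matrix $A$ to diagonal form using the two-step Gaussian elimination method. The simplest (one-step) division-free algorithm for diagonalization is given by the following recurrence formula:

$$a_{ij}^{(0)} = a_{ij}, \quad 1 \le i, j \le n; \qquad a_{ij}^{(0)} = b_{ij}, \quad 1 \le i \le n,\ n+1 \le j \le m;$$

$$a_{ij}^{(k)} = \begin{cases} a_{kj}^{(k-1)} & \text{if } i = k; \\[4pt] \begin{vmatrix} a_{kk}^{(k-1)} & a_{kj}^{(k-1)} \\ a_{ik}^{(k-1)} & a_{ij}^{(k-1)} \end{vmatrix} & \text{otherwise;} \end{cases} \qquad 1 \le k \le n,\ 1 \le i \le n,\ k \le j \le m;$$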

2 1504 A is a determinant of matrix A; ii ii 1 k n 1 i k 1. (1) Notice that 0 for i j < k and i j. T he advantage of this formula is the absence of division operations. Hence the division-free algorithm avoids division round-off error and thus is more numerically stable than the classical Gaussian elimination algorithm. However the one-step equation (1) suffers from rapid increase of absolute values of the updated matrix eventually requires high-precision computation for large input matrices. To circumvent this problem a multistep approach has been proposed by Bareiss [1]. The equations for the two-step division-free Gaussian elimination is given as follows: a 1 i n 1 j m; if i k or k 1; a(k 1) if i k; if i k 1; n k i n k 1 j m; 2 ii ii n k i k 2; 2 if n is odd a (n) a (n 1) nn a (n 1) nj a (n 1) in a (n 1) 1 i n 1 n+1 j m. (2) It is instructive to obtain Eq. (2) directly from (1) by applying the one-step equation twice and simplifying the result. The simplified result can be expressed as follows:. (3) Disregarding the factor in Eq. (3) yields (2). Therefore the coefficients of (2) are smaller by a factor and in addition can be obtained from more efficiently than those of (1) because some terms cancel and need not be calculated. The effect of the above transformation on the matrix [ ] can be described as follows. The first equation of (2) for when formally extended to j k 1 and j k reduces the elements i > k and i<k 1 to zero for two columns and leaves the elements and a(k 2) unchanged. It remains to transform and a(k 2) to zero. However once the elements i > k have been determined the element can be transformed to zero by (1) for updating as shown in (2). Similarly the element can be transformed to zero by (1) for updating. Notice that if n is odd then additional computations to transform the element a (n 2) in i n to zero by (1) for a (n 1) i n n j m are needed. 3. 
A Parallel Algorithm

Regularity is one of the most important factors in designing parallel algorithms and architectures. In [3] it was shown that irregularity makes parallelism hard to expose. In [7] Kung stated the importance of regularity and modularity for designing systolic arrays. In this paper we define the regularity of a parallel algorithm explicitly. Intuitively, a regular parallel algorithm is one that can be used to construct array processors systematically. The definition of the regularity of an indexed algorithm is based on the data dependency graph (DDG) of the algorithm, defined below (see also [8]).

The index space of an indexed algorithm is the set of all index points $p = (i, j, k)^T$ in the 3-D case, where each index point is associated with a single computation.

A data dependency vector (DDV) is the difference between the index point where a variable is used as an input variable and the index point where that variable is generated as an output variable.

The DDG of an indexed algorithm is a directed graph with index points as vertices and DDVs as edges.

A parallel algorithm is regular if its DDG satisfies the following two conditions: 1) there do not exist any opposite edges in any dimension; 2) there does not exist any cycle.

The main concerns in the design of a regular parallel algorithm using the two-step algorithm are the computations for $i = k$ or $i = k-1$ in Eq. (2), since they involve quite complicated computation patterns. We will show how to reformulate this part of the computations step by step using partitioning and re-indexing techniques.
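The two regularity conditions can be checked mechanically on a small DDG. The sketch below is illustrative only and is not taken from the paper: the function names, the edge representation (producer point, consumer point), and the two toy graphs are assumptions. It also assumes one particular reading of condition 1, namely that along each axis the DDV components never occur with both signs.

```python
from collections import defaultdict, deque


def ddv(src, dst):
    """DDV of an edge: point where the value is used minus point where it is generated."""
    return tuple(d - s for s, d in zip(src, dst))


def has_opposite_edges(edges):
    """Assumed reading of condition 1: some axis carries DDV components of both signs."""
    vectors = [ddv(s, d) for s, d in edges]
    for axis in range(3):
        positive = any(v[axis] > 0 for v in vectors)
        negative = any(v[axis] < 0 for v in vectors)
        if positive and negative:
            return True
    return False


def has_cycle(edges):
    """Condition 2, checked with Kahn's topological sort."""
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for s, d in edges:
        successors[s].append(d)
        indegree[d] += 1
        nodes.update((s, d))
    queue = deque(v for v in nodes if indegree[v] == 0)
    visited = 0
    while queue:
        v = queue.popleft()
        visited += 1
        for w in successors[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)
    # Any node never reaching indegree 0 lies on or behind a cycle.
    return visited != len(nodes)


def is_regular(edges):
    return not has_opposite_edges(edges) and not has_cycle(edges)


# Hypothetical toy graphs: two points feeding each other inside one layer,
# and the same two points feeding a consumer in the next layer instead.
bidirectional = [((1, 1, 0), (1, 2, 0)), ((1, 2, 0), (1, 1, 0))]
layered = [((1, 1, 0), (1, 2, 1)), ((1, 2, 0), (1, 2, 1))]

print(is_regular(bidirectional), is_regular(layered))
```

The first toy graph fails both conditions, while moving the consumer to the next layer removes both the opposite edges and the cycle.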

3 PENG and SEDUKHIN: DESIGN OF OPTIMAL ARRAY PROCESSORS 1505 First to simplify the computation and to reduce the number of multiplications three indexed variables c(k 2) are introduced to hold the intermediate results of the computations in the two-step method. We assume that an indexed variable v k is held at index point p (i j k) T. + c(k 2) 1 i n i k 1 i k k +1 j m. (4) and c(k 2) in (4) for k 2. Since the data item i1 is needed for computing c (0) i2 and the data item a(0) i2 is needed for computing c (0) there are bidirectional edges between index Consider the computations of i1 points p i (i 1 0) T and q i (i 2 0) T 1 i n i 1 i 2. That is there are cycles in the corresponding DDG. To solve this problem we divide the computations into two layers and put the variables and in points (i k k 1) T and (i k k) T respectively to eliminate the cycles. Therefore the first step of our algorithm is to re-index the variables and into c (k) and c(k 1) respectively and partition the computations at each iteration into two layers as following to guarantee that the corresponding DDG is acyclic. Notice that after this partitioning the computation at each index point p (i j k) T of the algorithm involves up to two multiplications and one addition/subtraction. c (k) + c (k) c(k 1) where ; 1 i n i k 1 i k k +1 j m. (5) Next to eliminate the opposite edges in i and j directions in the DDG of Eq. (5) we adopt the similar technique used in [8]. We shift the (k 1)th and kth rows to the (n + k 1)th and (n + k)th rows in the (k 1)th and kth layers respectively. The resulting algorithm is shown in Eq. (6). k +1 i n + k 2; k +1 i n + k 2 j>k; k +1 i n + k 2; c (k) + c(k 1) c (k) where. (6) Then we properly distribute the other one-step computations in Eq. (2) into the (k 1)th and/or kth layers. If n is odd then 2 n 2 2n 1. In this case the nth layer is constructed using the one-step algorithm. 
Finally to make the diagonal elements a (n) ii next to the elements a (n) in+1 for effective data transmission and computations at the last step of the algorithm we rearrange the initial data to reserve the (n + 1)th column as free space and shift the (k 1)th and kth diagonal elements into this column in the (k 1)th and kth layers respectively. After diagonalization of matrix A one more step (a division step) is used to get the solution of the generalized linear systems. A complete description of the regular two-step division-free (2SDF) algorithm for solving generalized linear systems AX B is shown below. Algorithm 2SDF begin /*Initialization*/ forall 1 i n 1 j n do a ; /*Reserve the (n + 1)th column as free space*/ forall 1 i n n +2 j m +1do a ij 1 ; /*Internal computations*/ for k n 2 do begin k /*Computation at the (k 1)th layer*/ a(k 2) a(k 2) ; (a) type

4 1506 forall k +1 i n + k 2 do a(k 2) a(k 2) kk ; (b) type n+k 1n+1 a(k 2) ; forall k 1 i n + k 2 j k 1 or k do ; forall k +1 i n + k 2 k+1 j m +1 j n +1do + c(k 1) ; (c) type forall k +1 j m +1j n +1do begin a(k 1) a(k 1) ; (d) type n+ a(k 2) kk ; end /*Computation at the kth layer*/ forall k +1 i n + k 2 do c (k) a(k 1) a(k 1) (e) type a(k 1) ; (f) type n+kn+1 c(k 1) ; forall k +1 j m +1j n +1do n+kj a(k) ; forall k +1 i n + k 2 k+1 j m +1 j n +1do c (k) forall n +1 i n + k 2 do a(k 1) ; (g) type /*Computation for ii in+1 c(k 1) in+1 ; forall k j m +1j n +1do n+ a(k 1) kj ii */ (h) type ; (i) type 4. Data Dependency Graph of Algorithm 2SDF In this section we derive a localized DDG of Algorithm 2SDF and then analyze it. First we need a strategy of data pipeline for shared variables so that computations in Algorithm 2SDF involve only local data transfer. The strategy is straightforward and the resulting localized DDG are shown in Figs. 1 and 2 for the (k 1)th and kth layers respectively. If n is an odd integer then one extra layer (the nth layer) is needed. This extra layer of the localized DDG for odd n is shown in Fig. 3. Finally the (n + 1)th layer of the DDG for the output computations is shown in Fig. 4. From Algorithm 2SDF it is easy to see that nine different types of nodes are needed for computations in the (k 1)th and kth layers. The function and the end k /*Computation at the nth layer when n is an odd integer*/ if n is odd then forall n<i 2n 1 n+1<j m +1do begin a (n) a (n 1) nn a (n 1) a (n 1) nj a (n 1) in ; /*Computation for a (n) ii a (n) in+1 a(n 1) nn a (n 1) in+1 ; a (n 1) nn a (n 1) ii */ end /*Output computations*/ forall n +1 i 2n n +2 j m +1do end x (n+1) i nj n 1 a(n) ij /a(n) in+1 ; Fig. 1 The (k 1)th layer of DDG. Fig. 2 The kth layer of DDG.

location of each type of node in the DDG are listed as follows (see Algorithm 2SDF and Fig. 5).

Type (a) node: index point $(k, k, k-1)^T$;
Type (b) nodes: index points $(i, k, k-1)^T$, $k+1 \le i \le n+k-2$;
Type (c) nodes: index points $(i, j, k-1)^T$, $k+1 \le i \le n+k-2$, $k+1 \le j \le m+1$, $j \ne n+1$;
Type (d) nodes: index points $(k, j, k-1)^T$, $k+1 \le j \le m+1$, $j \ne n+1$;
Type (e) nodes: index points $(n+k-1, j, k-1)^T$, $k+1 \le j \le m+1$, $j \ne n+1$;
Type (f) nodes: index points $(i, k, k)^T$, $k+1 \le i \le n+k-2$;
Type (g) nodes: index points $(i, j, k)^T$, $k+1 \le i \le n+k-2$, $k+1 \le j \le m+1$, $j \ne n+1$;
Type (h) nodes: index points $(i, n+1, k)^T$, $k+1 \le i \le n+k-2$;
Type (i) nodes: index points $(n+k, j, k)^T$, $k+1 \le j \le m+1$, $j \ne n+1$.

Notice that empty nodes are included in the DDG for proper local transmission of shared data.

Fig. 3 The nth layer of the DDG if n is odd.
Fig. 4 The (n+1)th layer of the DDG for output computation.
Fig. 5 The nine types of nodes at the (k-1)th and kth layers of the DDG.

Let the index space of the algorithm be $P$. Then

$P$ can be expressed as $P = P_{in} \cup P_{int} \cup P_{out}$, where $P_{in}$, $P_{int}$, and $P_{out}$ are the index sets of input, internal, and output computations, respectively. From Algorithm 2SDF and the localized DDG we have

$$P_{in} = \{(i, j, 0)^T \mid 1 \le i \le n,\ 1 \le j \le m+1,\ j \ne n+1\} \subset Z^2 \times \{0\};$$

$$P_{int} = \begin{cases} \bigcup_{1 \le l \le n/2,\ k = 2l} (P_{k-1} \cup P_k) & \text{if } n \text{ is even;} \\ \bigcup_{1 \le l \le \lfloor n/2 \rfloor,\ k = 2l} (P_{k-1} \cup P_k) \cup P_{odd} & \text{otherwise,} \end{cases}$$

where

$$P_{k-1} = \{(i, j, k-1)^T \mid k-1 \le i \le n+k-1,\ k-1 \le j \le m+1\},$$
$$P_k = \{(i, j, k)^T \mid k \le i \le n+k,\ k-1 \le j \le m+1\},$$
$$P_{odd} = \{(i, j, n)^T \mid n \le i \le 2n,\ n \le j \le m+1\};$$

$$P_{out} = \{(i, j, n+1)^T \mid n+1 \le i \le 2n,\ n+1 \le j \le m+1\} \subset Z^2 \times \{n+1\}.$$

Next we show that the number of multiplications in Algorithm 2SDF is $N = 3n^2(2m-n)/4 + O(n^2 + mn)$. It is easy to see that the number of multiplications in Algorithm 2SDF is dominated by the computations of types (c) and (g). That is, the number of multiplications in the computations of all other types is $O(n^2 + mn)$. For each fixed $k$, the computations of types (c) and (g) have the same number of iterations, $(n-2)(m-k)$, and together take 3 multiplications per iteration. Therefore the total number of multiplications for the computations of types (c) and (g) is

$$\sum_{l=1}^{n/2} 3(n-2)(m-2l) = 3(n-2)nm/2 - 6(n-2)\sum_{l=1}^{n/2} l = 3n^2 m/2 - 3n^3/4 + O(mn + n^2).$$

This completes the proof.

The longest path in the DDG is $\rho = p_{min} \to p_{max}$, where $p_{min} = (1, 1, 1)^T$ and $p_{max} = (2n, m+1, n+1)^T$. Therefore we have $|\rho| = 3n + m - 1$, where $|\rho|$, the length of the path $\rho$, is the Manhattan distance between $p_{min}$ and $p_{max}$.

A timing (step) function $step(p): P_{int} \to Z_+$, which assigns a computational time step to each index point $p \in P_{int}$, is defined as follows:

$$step(p) = i + j + k - 3.$$

The step function can also be specified in the linear form $step(p) = \lambda p + \gamma$, where $\lambda = (1\ 1\ 1)$ and $\gamma = -3$. This function defines a set of hyperplanes orthogonal to the schedule vector $\lambda$ on the index space of the algorithm. Equal values of the timing function are shown by dashed lines in the figures. The minimal computation time of the algorithm is $T(m, n) = T_{min}(DDG) = step(p_{max}) = 3n + m - 1$, assuming $step(p_{min}) = 0$. From the discussion above we have $T(m, n) = |\rho|$.
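The counting and timing claims above are easy to check numerically. The following sketch is not part of the paper; the function names are assumptions chosen for illustration. It evaluates the dominant multiplication sum for even $n$, compares it with the leading term $3n^2(2m-n)/4$, and confirms that the step function reaches $3n + m - 1$ at $p_{max}$.

```python
def step(p):
    """Timing function step(p) = i + j + k - 3 for an index point p = (i, j, k)."""
    i, j, k = p
    return i + j + k - 3


def longest_path_length(n, m):
    """Manhattan distance between p_min = (1, 1, 1) and p_max = (2n, m+1, n+1)."""
    p_min = (1, 1, 1)
    p_max = (2 * n, m + 1, n + 1)
    return sum(b - a for a, b in zip(p_min, p_max))


def dominant_multiplications(n, m):
    """Sum over l = 1..n/2 of 3(n-2)(m-2l); assumes n is even."""
    return sum(3 * (n - 2) * (m - 2 * l) for l in range(1, n // 2 + 1))


def leading_term(n, m):
    """Leading term 3 n^2 (2m - n) / 4 of the multiplication count."""
    return 3 * n * n * (2 * m - n) / 4


n, m = 100, 150
exact = dominant_multiplications(n, m)
print(exact, leading_term(n, m), exact / leading_term(n, m))
print(longest_path_length(5, 7), step((2 * 5, 7 + 1, 5 + 1)))
```

For $n = 100$ and $m = 150$ the exact sum stays within a few percent of the leading term, which is what the $O(mn + n^2)$ remainder predicts.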
The allocation function $place(p): Z^3 \to Z^2$ is defined in the linear form $place(p) = \Lambda_\eta p$, where $\Lambda_\eta$ is a $(2 \times 3)$ matrix of the linear transformation corresponding to a projection vector $\eta \in \ker \Lambda_\eta$. Notice that to obtain the correct input/output data flows in 2-D array processors, we have to find new positions of the input/output matrix elements in the 3-D index space, i.e., redefine the $P_{in}$ and $P_{out}$ domains [9].

5. Design of Optimal Array Processors

Many 2-D array processors can be derived by projecting the localized 3-D DDG along different admissible projection vectors. A projection vector $\eta$ is admissible if and only if $\lambda\eta \ne 0$. This condition guarantees that each PE executes at most one computation (one index point in the DDG) at any given time step. In order to find an optimal design of a 2-D array processor from the given localized DDG, we need to select from the space of all admissible solutions based on some integrated criteria, such as the number of PEs, data pipelining period, computation time, array topology, number of I/O ports, etc. In this paper, for the given algorithm (and its DDG), we say that an array processor is optimal if (1) it uses the minimum number of PEs among all array processors obtained by admissible projections and (2) its total computation time equals the length of the longest path in the DDG ($3n + m - 1$ in this case).

It can be shown that for the localized DDG we adopted there are 13 admissible projections from 17 possible ones [9]. After having obtained and tested all admissible array processors, it can be shown that the two solutions generated by projecting the localized DDG along the $i$ axis and the $j$ axis have the minimal number of PEs. Each of the two array processors has the minimal number of PEs in a certain range of $m$ and $n$. We discuss this in the rest of the section.

The 2-D array processor $S_{(010)}$ that is generated by mapping the localized DDG along the direction $\eta_1 = (0\ 1\ 0)^T$ is shown in Fig. 6 for the case of $n = 5$ and $m = 7$.
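The admissibility condition and the projection along $\eta_1$ can be illustrated with a short sketch. It is not the authors' design tool: the allocation matrices below are simply the most direct choices whose kernels contain the stated projection vectors, and the cubic test index space and all names are assumptions made for illustration.

```python
from itertools import product

LAMBDA = (1, 1, 1)  # schedule vector from step(p) = lambda * p - 3


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def is_admissible(eta, lam=LAMBDA):
    """A projection vector eta is admissible iff lambda * eta != 0."""
    return dot(lam, eta) != 0


def place(alloc, p):
    """Linear allocation place(p) = Lambda_eta * p, with alloc a 2x3 matrix."""
    return tuple(dot(row, p) for row in alloc)


def step(p):
    return dot(LAMBDA, p) - 3


def is_conflict_free(alloc, points):
    """True if no PE is assigned two index points at the same time step."""
    seen = set()
    for p in points:
        key = (place(alloc, p), step(p))
        if key in seen:
            return False
        seen.add(key)
    return True


# Assumed allocation matrices: each has the named projection vector in its kernel.
ALLOC_010 = ((1, 0, 0), (0, 0, 1))  # eta_1 = (0, 1, 0): PE coordinates (i, k)
ALLOC_100 = ((0, 1, 0), (0, 0, 1))  # eta_2 = (1, 0, 0): PE coordinates (j, k)
ALLOC_BAD = ((1, 1, 0), (0, 0, 1))  # kernel spanned by (1, -1, 0), not admissible

# Hypothetical cubic index space, used only to exercise the check.
points = list(product(range(1, 5), repeat=3))

print(is_admissible((0, 1, 0)), is_conflict_free(ALLOC_010, points))
print(is_admissible((1, -1, 0)), is_conflict_free(ALLOC_BAD, points))
```

With the non-admissible direction $(1, -1, 0)$, index points such as $(1, 2, 1)$ and $(2, 1, 1)$ land on the same PE at the same step, which is the conflict the condition $\lambda\eta \ne 0$ rules out.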
This array processor is an array of PEs and programmable register-latches (PRLs). The number of PEs in this array processor is

$$N^{P}_{(010)} = \begin{cases} n^2 + \dfrac{n}{2} & \text{if } n \text{ is even;} \\[6pt] \left\lceil \dfrac{n}{2} \right\rceil (2n - 1) & \text{otherwise,} \end{cases}$$

which is independent of $m$. It is easy to verify from the DDG and the selected projection that there are eight types of PEs and four types of PRLs. The eight types of PEs are depicted in Figs. 6 (a)-(h). A functional description of each type of PE can be obtained easily from Figs. 5 (a)-(i) as well as from the projection of Figs. 1-4 along the direction $\eta_1$.

Fig. 6 The optimal array processor $S_{(010)}$ and its PE types.

The rhombic array processor $S_{(010)}$ simulates the 3-D DDG without time extension, i.e., it solves a single task in time $T_{(010)}(1, n, m) = T_{min}(DDG) = 3n + m - 1$. The data pipelining period, defined as the time interval separating neighboring items of input or output data, is $\alpha = \lambda\eta_1 = 1$, and the block pipelining period, defined as the time interval between the initiations of two successive task instances, is $\beta = m + 1$, i.e., the next task can be pushed into this array processor after $m + 1$ time steps. The number of I/O ports is $2n$. Also, it is evident that for $l$ tasks

$$T_{(010)}(l, n, m) = T_{(010)}(1, n, m) + (l - 1)\beta = 3n + l(m + 1) - 2.$$

Another optimal design can be obtained by mapping the DDG along the direction $\eta_2 = (1\ 0\ 0)^T$. The corresponding array processor, PE types, and input/output data flows are shown in Fig. 7 for the case of $n = 5$ and $m = 7$. The number of PEs of the array is

$$N^{P}_{(100)} = \begin{cases} mn - \dfrac{n^2}{2} + m - \dfrac{n}{2} & \text{if } n \text{ is even;} \\[6pt] \left\lceil \dfrac{n}{2} \right\rceil (2m - n) & \text{otherwise.} \end{cases}$$

There are eight types of PEs (see Figs. 7 (a)-(h)) and only one type of register-latch. It is not difficult to show that

$$N^{P}_{(100)} \le N^{P}_{(010)} \quad \text{if} \quad m \le \min\left\{ \frac{3n^2 + 2n}{2(n + 1)},\ \frac{3n - 1}{2} \right\}.$$

This means that the array processor $S_{(100)}$ is optimal in terms of the number of PEs if $m \le (3n - 1)/2$. The functions carried in each of the eight types of

PEs are relatively simple considering the complexity of Algorithm 2SDF. The computations carried out at each time step are no more than two multiplications and one addition/subtraction. The total computation time of this array is $T_{(100)}(1, n, m) = T_{min}(DDG) = 3n + m - 1$. The data pipelining period $\alpha$ equals $\lambda\eta_2 = 1$. The block pipelining period $\beta$ equals $m$, i.e., for $l$ tasks $T_{(100)}(l, n, m) = 3n + lm - 1$.

Fig. 7 The optimal array processor $S_{(100)}$ and its PE types.

6. Concluding Remarks

The design of new array processors in this paper shows that a numerical algorithm involving rather high computational irregularity can be arranged to perform pipelined computations in an elegant way. The array processors designed in this paper are a step towards designing efficient special-purpose array processors based on an irregular algorithm, namely the two-step division-free Gaussian elimination method. The techniques used in our design should be applicable to other numerical algorithms with rather complicated structure. For example, with minor modification, the techniques used here can also be used to develop array processors for two-step fraction-free (integer-preserving) solutions of linear systems with integer coefficients.

Another direction for further research is to consider asynchronous array models. Asynchronous arrays, e.g., wavefront array processors [7], are good alternatives for efficiency and flexibility in massively parallel processing. Finally, mapping the proposed parallel algorithm effectively onto existing massively parallel computers using proper partitioning techniques is also worth further research [2].

Acknowledgment

We would like to thank the reviewers for their constructive comments to improve the quality of this paper.

References

[1] E.H. Bareiss, "Sylvester's identity and multistep integer-preserving Gaussian elimination," Mathematics of Computation, vol.22.
[2] X. Chen and G.M. Megson, "A general methodology of partitioning and mapping for given regular arrays," IEEE Trans. Parallel & Distributed Systems, vol.6, no.10.
[3] G.C. Fox, R.D. Williams, and P.C. Messina, Parallel Computing Works, Morgan Kaufmann Publishers, San Francisco, 1994.
[4] L. Fox, An Introduction to Numerical Linear Algebra, Clarendon Press, Oxford, 1964.
[5] E.N. Frantzeskakis and K.J.R. Liu, "A class of square root and division-free algorithms and architectures for QRD-based adaptive signal processing," IEEE Trans. Signal Processing, vol.42, no.9.
[6] J. Gotze and J. Schwegelshohn, "A square root and division-free Givens rotation for solving least squares problems on systolic arrays," SIAM J. Sci. Statist. Comput., vol.12, July.
[7] S.Y. Kung, VLSI Array Processors, Prentice Hall.
[8] S. Peng and S.G. Sedukhin, "Array processors design for division-free linear system solving," The Computer Journal, vol.39, no.8.
[9] S.G. Sedukhin and I.S. Sedukhin, "Systematic approach and software tool for systolic design," Lecture Notes in Computer Science, vol.854.
[10] S.G. Sedukhin, "An algorithm and array processors for solving the systems of linear equations," Proc. Intern. Conf. Parallel and Distributed Processing Techniques and Applications (PDPTA'95), Athens, Georgia, Nov.

Stanislav G. Sedukhin is a professor of computer science at the University of Aizu. His research interests are in distributed and parallel high-performance computing and communications, parallel algorithms, and architectural synthesis of application-specific VLSI processors. He received his Candidate of Sciences (Ph.D.) and Doctor of Physical & Mathematical Sciences (Dr.Sci.) degrees in Computer Science from the Russian (former USSR) Academy of Sciences in 1982 and 1993, respectively. Prof. Sedukhin is a member of the IEEE Computer Society, ACM, SIAM, and AMS.

Shietung Peng received his M.S. and Ph.D. degrees in Computer Science from the University of Texas in 1984 and 1986, respectively. He was with the University of Maryland from 1986. He is currently with the University of Aizu. His research interests include parallel and distributed processing, parallel algorithms, parallel architectures, and high-performance computing. He is a member of ACM and the IEEE Computer Society.
