Complexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks


Journal of Computing and Information Technology - CIT 8, 2000, 1

Eunice E. Santos
Department of Electrical Engineering and Computer Science, Lehigh University

We consider the problem of designing optimal and efficient algorithms for solving tridiagonal linear systems with multiple right-hand side vectors on two-dimensional mesh interconnection networks. We derive asymptotic upper and lower bounds for these solvers using odd-even cyclic reduction. We present various important lower bounds on execution time for solving these systems, including general lower bounds which are independent of initial data assignment, and lower bounds based on classifications of initial data assignments which classify assignments via the proportion of initial data assigned amongst processors. Finally, different algorithms are designed in order to achieve running times that are within a small constant factor of the lower bounds provided.

1. Introduction

In this paper, we consider the problem of designing algorithms for solving tridiagonal linear systems. Such algorithms are referred to as tridiagonal solvers. A method for solving these systems in which designers take particular interest is the well-known odd-even cyclic reduction method, which is a direct method. Due to this interest, we chose to focus our attention on odd-even cyclic reduction. Thus, our results will be applicable to tridiagonal solvers designed on a two-dimensional mesh utilizing cyclic reduction.

Much research has been spent exploring this problem, which deals with designing and analyzing algorithms that solve these systems on specific types of interconnection networks [1, 5, 7, 8, 9, 10, 11] such as hypercube or butterfly. However, very little has been done on determining lower bounds for solving tridiagonal linear systems [4, 6, 10, 11] on any type of interconnection network or on specific general parallel models [9].
Moreover, few of these papers consider the case of multiple right-hand side (RHS) vectors, and the authors know of no papers which derive lower bounds on parallel run-time for odd-even cyclic solvers with multiple RHS vectors.

The main objective of this paper is to present asymptotic upper and lower bounds on the running time for solving tridiagonal systems with multiple right-hand sides which utilize odd-even cyclic reduction on 2-dimensional meshes. Our decision to work with meshes is based on the simple fact that they are very common and frequently used interconnection networks. The results obtained in this paper will provide not only a means for measuring efficiency of existing algorithms but also a means of pinpointing algorithm design issues, data layouts and communication patterns in order to achieve optimal or near-optimal running times. While we have already derived results for tridiagonal solvers with only a single vector on a 2-D mesh [10], the results were not easily adaptable to multiple vectors. In fact, multiple vectors produced several layers of complexity in the analysis of this problem.

(Research partially supported by an NSF CAREER Grant.)

Some of the interesting results we shall show include the following: The skewness in the proportion of data will significantly affect running time. Another interesting result is the proof on the threshold for

processor utilization. For some problem sizes, using common data layouts and straightforward communication patterns does not result in significantly higher complexities than assuming that all processors have access to all data items regardless of communication pattern. In many cases, common data layouts can be used to obtain optimal running times. However, there is no one algorithm, data layout or communication pattern which will provide optimal run-times for all cases. In fact, we present more than 6 algorithms/subalgorithms which are needed in order to show how to achieve optimal or near-optimal running times for all cases.

The paper is divided as follows. Section 2 contains a description of the 2-D mesh topology. In Section 3 we discuss the odd-even cyclic reduction method for solving tridiagonal systems. In Section 4, we derive various important lower bounds on execution time. First, we derive general lower bounds for solving tridiagonal systems, i.e. the bounds hold regardless of data assignment. We follow this by deriving lower bounds which rely on categorizing data layouts via the proportion of data assigned amongst processors. Furthermore, we describe commonly used data layouts designers utilize for this problem. Lastly, we present a variety of algorithms and subalgorithms which will produce running times within a small constant factor of the lower bounds derived. Section 6 gives the conclusion and summary of results.

2. 2-Dimensional Mesh Interconnection Network

A mesh network is a parallel model on $P$ processors in which processors are grouped as two types: border processors and interior processors. Each interior processor is linked with exactly four neighbors. All but four of the border processors have three neighbors each. The remaining four border processors have exactly two neighbors.
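The neighbor structure just described can be sketched in a few lines. This is an illustrative sketch only: the function names `mesh_neighbors` and `mesh_distance` and the parameter `side` (the number of processors along one edge of a square mesh) are assumptions made here, not names from the paper.

```python
def mesh_neighbors(i, j, side):
    """Neighbors of processor (i, j) on a side x side mesh, 1-based indices."""
    candidates = [(i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)]
    # Keep only the positions that actually lie on the mesh.
    return [(a, b) for a, b in candidates if 1 <= a <= side and 1 <= b <= side]


def mesh_distance(a, b):
    """Number of links on a shortest path between processors a and b."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


if __name__ == "__main__":
    side = 4
    degrees = [len(mesh_neighbors(i, j, side))
               for i in range(1, side + 1) for j in range(1, side + 1)]
    # Count of corner, other border, and interior processors.
    print(degrees.count(2), degrees.count(3), degrees.count(4))
    print(mesh_distance((1, 1), (side, side)))
```

Because a message crosses one link per time step, `mesh_distance` is also the minimum number of steps a single item needs to travel between two processors, which is why distances on the order of the mesh side length appear in the communication lower bounds.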
More precisely: Denote processors by $p_{i,j}$ where $1 \le i, j \le \sqrt{P}$.
- For $1 < i, j < \sqrt{P}$, the four neighbors of $p_{i,j}$ are $p_{i+1,j}$, $p_{i-1,j}$, $p_{i,j+1}$, and $p_{i,j-1}$.
- For $i = 1$ and $1 < j < \sqrt{P}$, the three neighbors of $p_{i,j}$ are $p_{i+1,j}$, $p_{i,j+1}$, and $p_{i,j-1}$.
- For $i = \sqrt{P}$ and $1 < j < \sqrt{P}$, the three neighbors of $p_{i,j}$ are $p_{i-1,j}$, $p_{i,j+1}$, and $p_{i,j-1}$.
- For $j = 1$ and $1 < i < \sqrt{P}$, the three neighbors of $p_{i,j}$ are $p_{i+1,j}$, $p_{i-1,j}$, and $p_{i,j+1}$.
- For $j = \sqrt{P}$ and $1 < i < \sqrt{P}$, the three neighbors of $p_{i,j}$ are $p_{i+1,j}$, $p_{i-1,j}$, and $p_{i,j-1}$.
- For $i = 1$ and $j = 1$, the two neighbors of $p_{i,j}$ are $p_{i+1,j}$ and $p_{i,j+1}$.
- For $i = 1$ and $j = \sqrt{P}$, the two neighbors of $p_{i,j}$ are $p_{i+1,j}$ and $p_{i,j-1}$.
- For $i = \sqrt{P}$ and $j = 1$, the two neighbors of $p_{i,j}$ are $p_{i-1,j}$ and $p_{i,j+1}$.
- For $i = \sqrt{P}$ and $j = \sqrt{P}$, the two neighbors of $p_{i,j}$ are $p_{i-1,j}$ and $p_{i,j-1}$.

Communication between neighbor processors requires exactly 1 time step. By this, we mean that if a processor transmits a message to its neighbor processor at the beginning of time $x$, its neighbor processor will receive the message at the beginning of time $x + 1$. Moreover, we assume that processors can be receiving [transmitting] a message while performing a local operation.

3. Odd-Even Cyclic Reduction Method

The problem: Given $M$, which is an $N \times N$ matrix, and vectors $b_1, b_2, \ldots, b_R$, each of size $N$. For each $s$ ($1 \le s \le R$), solve for $x_s$ where $M x_s = b_s$. We assume that $b_1, b_2, \ldots, b_R$ can be stored in one matrix $B$ of size $N \times R$ where the $s$-th column of $B$ represents vector $b_s$. We make an analogous assumption for $x_1, x_2, \ldots, x_R$. Moreover, we assume for the sake of simplicity that $1 < P = 2^{2k} \le N$ where $k$ is an integer, and $R = 2^r$.

An algorithm is simply a set of arithmetic operations such that each processor is assigned a sequential list of these operations. An initial assignment of data to the processors is called a data layout. A list of message transmissions and receptions between processors is called a communication pattern. These three

components (algorithm, data layout, communication pattern) are needed in order to determine running time.

Fig. 1. Row dependency for odd-even cyclic reduction for $N = 15$ and $R = 1$.

Odd-even cyclic reduction [2, 3, 7] is a recursive method for solving tridiagonal systems of size $N = 2^n - 1$. This method is divided into two parts: reduction and back substitution. The first step of reduction is to remove each odd-indexed $x_{s,i}$ and create a tridiagonal system of size $2^{n-1} - 1$. We then do the same to this new system and continue in the same manner until we are left with a system of size 1. This requires $n$ phases. We refer to the tridiagonal matrix of phase $j$ as $M^j$ and the vector as $b^j_s$. The original $M$ and $b_s$ are denoted $M^0$ and $b^0_s$. The three non-zero items of each row $i$ in $M^j$ are denoted $l^j_i$, $m^j_i$, $r^j_i$ (left, middle, right). Below is the list of operations needed to determine the items of row $i$ in matrix $M^j$ (which we refer to as $M^j_i$):

$$e^j_i = -\frac{l^{j-1}_i}{m^{j-1}_{i-2^{j-1}}}, \qquad f^j_i = -\frac{r^{j-1}_i}{m^{j-1}_{i+2^{j-1}}}$$

$$l^j_i = e^j_i \, l^{j-1}_{i-2^{j-1}}, \qquad r^j_i = f^j_i \, r^{j-1}_{i+2^{j-1}}$$

$$m^j_i = m^{j-1}_i + e^j_i \, r^{j-1}_{i-2^{j-1}} + f^j_i \, l^{j-1}_{i+2^{j-1}}$$

$$b^j_{s,i} = b^{j-1}_{s,i} + e^j_i \, b^{j-1}_{s,i-2^{j-1}} + f^j_i \, b^{j-1}_{s,i+2^{j-1}}$$

The results in this paper assume that determining the value of each $e^j_i$, $f^j_i$, $l^j_i$, $r^j_i$, $m^j_i$ or $b^j_{s,i}$ is atomic, i.e. if a processor is computing, for example, $e^j_i$, then the processor must compute all the operations needed to satisfy the equation for $e^j_i$ given above.

Clearly, each system is dependent on the previous systems. The lower half of Figure 1 shows the dependency between all the $M^j$'s in the reduction phase. The nodes in level $j$ of the figure represent the rows of $M^j$. A directed path from node $a$ in level $k$ to node $b$ in level $j$ implies that row $a$ in $M^k$ is needed in order to determine the entries for row $b$ in $M^j$. In fact, Figure 1 also pinpoints the dependencies between all the $b^j_{s,i}$'s.
Since there are no dependencies between any $b^j_{s_1,i}$ and any $b^y_{s_2,z}$ (where $s_1 \ne s_2$), the only dependencies remaining are those between the $M^j$'s and the $b^y_s$'s. From the definition of the problem, we see that for each $s$, $b^j_{s,i}$ is dependent on $M^j_i$.
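As a rough serial illustration of the method, the sketch below applies the reduction recurrences to every row that survives a phase and then recovers the unknowns by the usual back-substitution pass. It is a minimal sketch, not the parallel algorithm studied in the paper: the function name `cyclic_reduction_solve`, the list-based storage, and the use of exact `Fraction` arithmetic are choices made here for clarity, and the code assumes no pivot $m$ becomes zero (for example, a diagonally dominant matrix).

```python
from fractions import Fraction


def cyclic_reduction_solve(l, m, r, B):
    """Solve M x_s = b_s for each right-hand side b_s in B.

    l, m, r hold the sub-, main and super-diagonal of M as lists of
    length N = 2**n - 1; l[0] and r[-1] lie outside M and are ignored.
    """
    N = len(m)
    n = (N + 1).bit_length() - 1
    if 2 ** n - 1 != N:
        raise ValueError("N must have the form 2**n - 1")

    # 1-based working copies, padded so that indices 0 and N + 1 exist.
    zero = Fraction(0)
    lo = [zero, zero] + [Fraction(v) for v in l[1:]] + [zero]
    mid = [Fraction(1)] + [Fraction(v) for v in m] + [Fraction(1)]
    up = [zero] + [Fraction(v) for v in r[:-1]] + [zero, zero]
    rhs = [[zero] + [Fraction(v) for v in b] + [zero] for b in B]

    # Reduction: in phase j, rows that are multiples of 2**j absorb
    # their neighbors at distance h = 2**(j - 1).
    for j in range(1, n):
        h = 2 ** (j - 1)
        for i in range(2 * h, N + 1, 2 * h):
            e = -lo[i] / mid[i - h]
            f = -up[i] / mid[i + h]
            mid[i] += e * up[i - h] + f * lo[i + h]
            for b in rhs:  # one update per right-hand side vector
                b[i] += e * b[i - h] + f * b[i + h]
            lo[i] = e * lo[i - h]
            up[i] = f * up[i + h]

    # Back substitution: start from the single remaining equation and
    # fill in the unknowns removed in each earlier phase.
    solutions = []
    for b in rhs:
        x = [zero] * (N + 2)
        for j in range(n, 0, -1):
            h = 2 ** (j - 1)
            for i in range(h, N + 1, 2 * h):
                x[i] = (b[i] - lo[i] * x[i - h] - up[i] * x[i + h]) / mid[i]
        solutions.append(x[1:N + 1])
    return solutions
```

Note that the matrix coefficients are reduced once while every right-hand side is updated separately, which mirrors the observation that the vectors depend on the matrix rows but not on each other.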

In this paper, for simplicity, we assume that if an algorithm employs odd-even cyclic reduction, a processor computes the items of whole rows of a matrix (i.e. the three non-zero data items).

The back substitution phase is initiated after the system of one equation has been determined. We recursively determine the values of the $x_{s,i}$'s. The first operation is

$$x_{s,2^{n-1}} = \frac{b^{n-1}_{s,2^{n-1}}}{m^{n-1}_{2^{n-1}}}.$$

For the remaining variables $x_{s,i}$, let $j$ denote the last phase in the reduction step before $x_{s,i}$ is removed; then

$$x_{s,i} = \frac{b^{j-1}_{s,i} - l^{j-1}_i \, x_{s,i-2^{j-1}} - r^{j-1}_i \, x_{s,i+2^{j-1}}}{m^{j-1}_i}.$$

Again, we are assuming the operations to compute the $x_{s,i}$'s are atomic for $R = 1$. However, for the case of $R > 1$, we divide the operations for $x_{s,i}$ into the following atomic operations:

$$temp_i = \frac{l^{j-1}_i \, x_{s,i-2^{j-1}} + r^{j-1}_i \, x_{s,i+2^{j-1}}}{m^{j-1}_i}$$

and

$$x_{s,i} = \frac{b^{j-1}_{s,i}}{m^{j-1}_i} - temp_i.$$

Returning to Figure 1, we see that the back substitution is represented by the top half of the graph. In essence, the entire tridiagonal system's dependency graph is $R + 1$ copies of the graph in Figure 1 (one for the matrix $M$, and the remaining ones for the vectors) with some additional edges. Clearly, for any two vectors of $b$, their respective dependency graphs will not have any edges between them. In fact, the only edges to be added are from the sub-dependency graph of $M$ to every vector of $b$. In other words, for all vectors $b_s$, each node in the $M$ sub-dependency graph will have an edge to the mirror node in the sub-dependency graph of $b_s$.

The serial complexity of this method is denoted by $S(N, R, P)$ where $P = 1$. $R = 1$ is a special case, i.e. $S(N, 1, 1) = 19N - 14\log(N + 1) - 4$. For $R > 1$, $S(N, R, 1) = 15N - 10n - 5 + R(6N - 4n - 1)$.

In Section 4, we derive several important lower bounds on running time for odd-even cyclic reduction algorithms on 2-D mesh networks.
Specifically, we derive general lower bounds independent of data layout, lower bounds which are based on categorization of data layouts, and lower bounds on running time for common data layouts for this problem. In Section 5.6 we present optimal algorithm running times on 2-D mesh topologies.

4. Lower Bounds for Odd-Even Cyclic Reduction

Definition 4.1. The class of all communication patterns is denoted by C. The class of all data layouts is denoted by D. A data layout D is said to be a single-item layout if each nonzero matrix-item is initially assigned to a unique processor.

In the following sections we shall provide lower bounds based on certain types of data layouts as well as general lower bounds. The lower bounds hold regardless of the choice of communication pattern.

4.1. A General Lower Bound for Odd-Even Cyclic Reduction

In this subsection we assume the data layout is the one in which each processor has a copy of every non-zero entry of $M^0$ and $b^0_s$. We denote this layout by D. Since D is the best data layout possible, a lower bound based on this data layout will provide a general lower bound.

Remark 4.1. We will denote the (parallel) computation lower bound for odd-even cyclic reduction on a system of size $N$ with $R$ vectors, utilizing $P$ processors, by $S(N, R, P)$.

Remark 4.2. Let A be an odd-even cyclic reduction algorithm. A computation lower bound for A, given $P$ processors and assuming none of the processors are totally idle, is $S(N, R, P) = \Omega\left(\frac{NR}{P} + \log N\right)$.

Theorem 4.1. Let A be an odd-even cyclic reduction algorithm. The following is a lower bound for A regardless of data layout and communication pattern:

$$\begin{cases} \max\left(S(N, R, P),\ \sqrt{P/R}\right) = \Omega\left(\frac{NR}{P} + \log N + \sqrt{P/R}\right) & \text{if } P \le R N^{2/3} \\ \Omega\left(N^{1/3}\right) & \text{otherwise} \end{cases}$$

Proof. We will begin by assuming that all $P$ processors will be used in computation and/or communication. Clearly, $S(N, R, P)$ is a lower bound for the problem. Furthermore, we see that there are groups of processors which work together in order to solve the items in the dependency graph of $M$ or in any of the $b_s$'s graphs. Denote these (not necessarily disjoint) groups of processors by $M_1, M_2, \ldots, M_z$. Let $P' = \max_{i \le z} |M_i|$ ($\le \min(N, P)$). Since these processors must ultimately produce only one row of $M$ in the back substitution phase, it is clear there is some type of all-to-one broadcast among $P'$ processors in order to complete the back-substitution phase of $M$ and/or $b_s$. Thus, on a two-dimensional mesh, this will require communication of at least $\sqrt{P'}$. Moreover, this will also require computation of at least $N/P'$. Therefore, a lower bound, assuming all processors are used, is $\max\left(S(N, R, P),\ N/P',\ \sqrt{P'}\right)$. This leads to the fact that $P' = \max(1, P/R)$ in order to minimize the lower bound. Furthermore, we see that using more than $\Omega(R N^{2/3})$ processors will reduce efficiency; i.e. the threshold for processor utilization is $\Omega(R N^{2/3})$.

In Section 5.6 we will provide optimal algorithms, i.e. the running times are within a small constant factor of the lower bounds. Therefore, we note that our results show that when $P$ is sufficiently large, i.e.
$P = \Omega(R N^{2/3})$, using more than $R N^{2/3}$ processors will not lead to any substantial improvements in running time if utilizing data layout D.

4.2. Lower Bounds for O on c-data layouts

Many algorithms designed for solving tridiagonal linear systems assume that the data layout is single-item and that each processor is assigned roughly $\frac{1}{P}$-th of the rows of $M^0$ and the items of $b$, where $P$ is the number of processors available. However, in order to determine whether skewness of proportion of data has an effect on running time and, if so, exactly how much, we classify data layouts by initial assignment proportions to each processor. In other words, in this section, we consider single-item data layouts in which each processor is assigned at most a fraction $c$ of the rows of $M$ and items of $b$, where $\frac{1}{P} \le c \le \frac{N + NR - P + 1}{N + NR}$.

Definition 4.2. Consider $c$ where $\frac{1}{P} \le c < \frac{N + NR - P + 1}{N + NR}$. A data layout D on $P$ processors is said to be a c-data layout if (a) D is single-item, (b) no processor is assigned more than a fraction $c$ of the rows of $M^0$ or items of $b$, (c) at least one processor is assigned exactly a fraction $c$ of the rows of $M^0$ or items of $b$, and (d) each processor is assigned at least one row of $M^0$ or item of $b$. Denote the class of c-data layouts by D(c).

Theorem 4.2. If D $\in$ D(c), then for any A $\in$ O for $R$ vectors, a lower bound is

$$\max\left(S(N, R, P),\ \frac{\sqrt{P}}{8},\ \frac{Nc}{6}\right) = \Omega\left(\frac{NR}{P} + \log N + \sqrt{P} + Nc\right).$$

Proof. The computational lower bound must also be a lower bound for the problem regardless of communication. Thus $S(N, R, P)$ is a lower bound. Moreover, since at least one processor is assigned $Nc$ pieces of data, this processor (denoted as $p'$) can either choose to use each data item in a binary arithmetic operation or send out the item to another processor. Therefore, it is clear that $\frac{Nc}{6}$ must also be a lower bound. We will now describe the argument for why $\frac{\sqrt{P}}{8}$ must also be a lower bound. We will use a contradiction argument. Consider any two data items from $M$ and/or $B$. These two items are said to be "related" if they

are both required by some computation in order to provide final solutions. Clearly, any initial data item of $M$, i.e. of $M^0$, will be related to any other initial item of $M$. Suppose $\frac{\sqrt{P}}{8}$ is not a lower bound. This implies that no two processors which are both assigned initial items of $M$ can be at a distance of $\frac{\sqrt{P}}{4}$ or more from each other on the 2-D mesh. Furthermore, consider a processor $p$ which computes $M^{n-1}$. Processor $p$ will require data from all of the items of $M^0$. Clearly, only processors which are within at most a distance of $\frac{\sqrt{P}}{8} - 1$ of $p$ can be assigned items from $M^0$. Furthermore, expanding the distance from $\frac{\sqrt{P}}{8} - 1$ to $\frac{\sqrt{P}}{4} - 2$ from $p$ will now contain all the processors which will utilize any item from $M^0$ for computation. Finally, expanding the distance from $p$ to $\frac{3\sqrt{P}}{8} - 3$ will cover all the processors that are assigned data items from not only $M^0$ but also items of $B^0$ which are related to any item in $M^0$. Every processor is assigned at least one initial piece of data and the layout is single-item. Moreover, every item in $B^0$ is related to at least one item of $M^0$. Thus, every processor assigned an item of $B^0$ or $M^0$ must be inside this region. However, the region does not include all the processors in the mesh. Therefore, this is a contradiction.

Analyzing the result given in the above theorem, we see that $\frac{1}{P}$-data layouts will result in the best lower bounds for this class of data layouts, i.e.:

Corollary 4.1. If D $\in$ D($\frac{1}{P}$), then for any A $\in$ O for $R$ vectors, a lower bound is

$$\max\left(S(N, R, P),\ \frac{\sqrt{P}}{8}\right) = \Omega\left(\frac{NR}{P} + \log N + \sqrt{P}\right).$$

Comparing results against the general lower bound, we see that for sufficiently large $N$ (i.e. $N \gg P$, or more precisely $P \le (NR)^{2/3}$), the lower bound for $\frac{1}{P}$-data layouts is precisely equal to the general lower bound. The complexity of any algorithm A $\in$ O using a $\frac{1}{P}$-data layout is $\Omega\left(\log N + \frac{NR}{P} + \sqrt{P}\right)$.
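The threshold in this comparison follows from a one-line calculation. The derivation below is a worked check, writing $P$ for the number of processors and $R$ for the number of right-hand side vectors (symbol names assumed here for readability):

```latex
\frac{NR}{P} \ge \sqrt{P}
\;\iff\; P^{3/2} \le NR
\;\iff\; P \le (NR)^{2/3},
\qquad\text{and}\qquad
\sqrt{P/R} \le \sqrt{P} \quad \text{for } R \ge 1 .
```

So whenever $P \le (NR)^{2/3}$, the communication terms $\sqrt{P}$ and $\sqrt{P/R}$ are both dominated by the computation term $NR/P$, and the single-item bound and the general bound both reduce to $\Omega\left(\frac{NR}{P} + \log N\right)$.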
In Section 5.6, we present algorithms that are within a small constant factor of the bounds presented in this and previous subsections.

5. Optimal Algorithms, Data Layouts, and Communication Schedules

In this section, we design the algorithms which will produce running times which are within a small constant factor of the lower bounds derived in the last section. We begin by discussing the data layouts we plan to use. We follow this with a discussion of some of the communication schedules (mostly dealing with distribution/broadcast of the data to processors that require this information for computation). Lastly, we present the optimal algorithms along with their running times. We will discuss under which circumstances one algorithm should be used over another in order to achieve optimal or near-optimal running times.

Definitions of Data Layouts

In the previous sections, we have derived various lower bounds on running time for tridiagonal solvers which utilize odd-even cyclic reduction on a 2-dimensional mesh interconnection topology. In order to design algorithms which are within a constant factor of these lower bounds, we must determine which types of data layouts we will use. In this section, we present definitions of data layouts in order to use these layouts in the next subsections in our algorithm designs. We would like to point out that while we can easily use the best data layout D to obtain the general lower bounds, in this section we will present data layouts which are not so powerful but which still achieve the lower bounds stated. In fact, in the cases where $N \gg P$, no processor is assigned more than $O\left(\frac{NR}{P}\right)$ data items. We begin by presenting partial layouts and, in each subsection, we discuss a data layout we will utilize in our optimal algorithms.

Definition 5.1. Let $P$ be an integer where $1 \le P \le N$ and $2^i - 1 \le P < 2^{i+1} - 1$.
A data layout on $P$ processors, $p_1, \ldots, p_P$, is an M-replicative-blocked data layout if for all $j \le 2^i - 1$, $p_j$ is assigned the nonzero items in rows $(j-1)2^{n-i} + 1$ to $(j+1)2^{n-i} - 1$ of $M^0$. We denote this layout by $D_M$.
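To make the row ranges concrete, the following sketch lists the block of rows each processor holds under this layout. It assumes the reading of the definition in which processors $p_1, \ldots, p_{2^i - 1}$ each hold rows $(j-1)2^{n-i} + 1$ through $(j+1)2^{n-i} - 1$ of a system of size $N = 2^n - 1$; the function name `replicative_blocked_rows` is hypothetical.

```python
def replicative_blocked_rows(n, i):
    """Map processor index j to its (first_row, last_row) range, 1-based.

    Assumes N = 2**n - 1 rows and processors j = 1 .. 2**i - 1.
    """
    block = 2 ** (n - i)
    return {j: ((j - 1) * block + 1, (j + 1) * block - 1)
            for j in range(1, 2 ** i)}


if __name__ == "__main__":
    # N = 15 rows spread over 3 processors.
    for j, (first, last) in replicative_blocked_rows(4, 2).items():
        print(j, first, last)
```

For $n = 4$ and $i = 2$ this gives the ranges 1 to 7, 5 to 11, and 9 to 15: consecutive processors overlap, which is the replication the name refers to. Under this reading, each processor holds a complete sub-pyramid of the dependency graph and can carry out the first $n - i$ reduction phases on its middle row without receiving any data.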

Definition 5.2. Let $P$ be an integer where $1 \le P \le N$ and $2^i - 1 \le P < 2^{i+1} - 1$. A data layout on $P$ processors, $p_1, \ldots, p_P$, is a $b_s$-replicative-blocked data layout if for all $j \le 2^i - 1$, $p_j$ is assigned $b_{s,l}$ where $(j-1)2^{n-i} + 1 \le l \le (j+1)2^{n-i} - 1$. We denote this layout by $D_{b_s}$.

Definition 5.3. Let $P$ be an integer where $1 \le P \le N$ and $P = 2^i$. A data layout on $P$ processors, $p_1, \ldots, p_P$, is an M-single-item-blocked data layout if (a) for all $j < P - 1$, $p_j$ is assigned the nonzero items in rows $(j-1)\frac{N}{P-1} + 1$ to $j\frac{N}{P-1}$ of $M^0$, (b) $p_{P-1}$ is assigned the nonzero items in rows $(P-2)\frac{N}{P-1} + 1$ to $(P-1)\frac{N}{P-1} - 1$ of $M^0$, and (c) $p_P$ is assigned the nonzero items in row $N$ of $M^0$. We denote this layout by $D'_M$.

Definition 5.4. Let $P$ be an integer where $1 \le P \le N$ and $2^i - 1 \le P < 2^{i+1} - 1$. A data layout on $P$ processors, $p_1, \ldots, p_P$, is a $b_s$-single-item-blocked data layout if (a) for all $j < P - 1$, $p_j$ is assigned the vector items $(j-1)\frac{N}{P-1} + 1$ to $j\frac{N}{P-1}$ of $b_s$, (b) $p_{P-1}$ is assigned the vector items $(P-2)\frac{N}{P-1} + 1$ to $(P-1)\frac{N}{P-1} - 1$ of $b_s$, and (c) $p_P$ is assigned the vector item $N$ of $b_s$. We denote this layout by $D'_{b_s}$.

Data Layout 1

If $P \ge R N^{2/3}$, then we will only use $R \cdot 2^{2\lfloor (n-1)/3 \rfloor}$ processors and, in this case, we will simply assume $P = R \cdot 2^{2\lfloor (n-1)/3 \rfloor}$. This data layout will be utilized in order to achieve the general lower bounds when $R \le P$. We begin by partitioning the processors into $R$ groups of $\frac{P}{R} = 2^{2k-r}$ processors. For the sake of simplicity, we will assume that $r$ is even. These groups will be referred to as $P_{i,j}$ where $1 \le i, j \le \sqrt{R}$. The processors assigned to group $P_{s,t}$ are $p_{i,j}$ where $(s-1)2^{k-\frac{r}{2}} + 1 \le i \le s \, 2^{k-\frac{r}{2}}$ and $(t-1)2^{k-\frac{r}{2}} + 1 \le j \le t \, 2^{k-\frac{r}{2}}$. We will create a new notation for the mesh processors in order to utilize the partial data layouts defined in this section.
Consider the processors in group $P_{s,t}$. These processors will be denoted by $p^{s,t}_1, p^{s,t}_2, \ldots, p^{s,t}_{P/R}$ where $p_{(s-1)2^{k-\frac{r}{2}}+i,\ (t-1)2^{k-\frac{r}{2}}+j}$ is denoted by
- $p^{s,t}_{(i-1)\sqrt{P/R}+j}$ if $i$ is odd
- $p^{s,t}_{(i-1)\sqrt{P/R}+\sqrt{P/R}-j+1}$ if $i$ is even

For each group $P_{s,t}$, we will allocate the items of $M$ using data layout $D_M$. Moreover, this group of processors will be allocated vector $b_{(s-1)\sqrt{R}+t}$ using data layout $D_{b_{(s-1)\sqrt{R}+t}}$.

Data Layout 2

This data layout will be utilized in order to achieve the general lower bounds when $R > P$. We begin by partitioning the assignment of the $b_s$ vectors to the processors. Each processor will be assigned $\frac{R}{P}$ whole vectors. In particular, $p_{i,j}$ will be assigned the items in vectors $b_l$ where

$$(i-1)\frac{R}{\sqrt{P}} + (j-1)\frac{R}{P} + 1 \le l \le (i-1)\frac{R}{\sqrt{P}} + j\frac{R}{P}.$$

Moreover, every processor will be assigned the items in matrix $M$.

Data Layout 3

This is a single-item data layout and will be utilized in order to achieve the single-item data layout lower bounds when [...] $< N$. We will create a new notation for the mesh processors in order to utilize the partial data layouts defined in this section. The processors will be denoted by $p_1, p_2, \ldots, p_P$ where $p_{i,j}$ is denoted by
- $p_{(i-1)\sqrt{P}+j}$ if $i$ is odd
- $p_{(i-1)\sqrt{P}+\sqrt{P}-j+1}$ if $i$ is even

We will allocate the items of $M$ using data layout $D'_M$. Moreover, the processors will be allocated each vector $b_s$ using data layout $D'_{b_s}$. Clearly, this is a c-data layout where c is constant.
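The renumbering used in these layouts is a row-wise "snake" order: odd rows are numbered left to right and even rows right to left. The sketch below illustrates it under the assumption that the factor multiplying $(i-1)$ is the side length of the (sub)mesh being numbered; the names `snake_index` and `side` are introduced here for illustration only.

```python
def snake_index(i, j, side):
    """1-based linear index of mesh processor (i, j) in snake order."""
    if i % 2 == 1:
        return (i - 1) * side + j          # odd rows run left to right
    return (i - 1) * side + side - j + 1   # even rows run right to left


def snake_positions(side):
    """Mesh coordinates listed in increasing snake index."""
    cells = [(i, j) for i in range(1, side + 1) for j in range(1, side + 1)]
    return sorted(cells, key=lambda c: snake_index(c[0], c[1], side))


if __name__ == "__main__":
    order = snake_positions(4)
    # Consecutive indices always sit on physically adjacent processors.
    print(all(abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
              for a, b in zip(order, order[1:])))
```

The practical point of this ordering is that processors with consecutive indices are always mesh neighbors, so any step in which $p_j$ exchanges data only with $p_{j-1}$ and $p_{j+1}$ costs a single hop per message.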

Data Layout 4

This is a single-item data layout and will be utilized in order to achieve the single-item data layout lower bounds when [...] $> N$ and [...]. We begin by partitioning the processors into $R$ groups of $\frac{P}{R} = 2^{2k-r}$ processors. These groups will be referred to as $P_i$ where $1 \le i \le R$. If $R \le \sqrt{P}$, then the processors assigned to group $P_l$ are $p_{i,j}$ where $(l-1)2^{k-r} + 1 \le i \le l \, 2^{k-r}$ and $1 \le j \le \sqrt{P}$. We will create a new notation for the mesh processors in order to utilize the partial data layouts defined in this section. Consider the processors in group $P_l$. These processors will be denoted by $p^l_1, p^l_2, \ldots, p^l_{P/R}$ where $p_{(l-1)2^{k-r}+i,\ j}$ is denoted by
- $p^l_{(i-1)\sqrt{P}+j}$ if $i$ is odd
- $p^l_{(i-1)\sqrt{P}+\sqrt{P}-j+1}$ if $i$ is even

However, if $R > \sqrt{P}$, then the processors assigned to group $P_l$, where $l = x\frac{R}{\sqrt{P}} + y$, are $p_{i,j}$ where $i = x + 1$ and $(y-1)\frac{P}{R} + 1 \le j \le y\frac{P}{R}$. Again we create a new notation for the mesh processors in order to utilize the partial data layouts defined in this section. Consider the processors in group $P_l$. These processors will be denoted by $p^l_1, p^l_2, \ldots, p^l_{P/R}$ where $p_{(l-1)2^{k-r}+i,\ j}$ is denoted by $p^l_{(i-1)\sqrt{P}+j}$.

Regardless of the size of $R$, for group $P_1$, we will allocate the items of $M$ using data layout $D'_M$. Moreover, each group $P_l$ will be allocated vector $b_l$ using data layout $D'_{b_l}$. Clearly, this is a c-data layout where c is constant.

Data Layout 5

This is a single-item data layout and will be utilized in order to achieve the single-item data layout lower bounds when [...] $>$ [...] $> N$. We begin by partitioning the assignment of the $b_s$ vectors to the processors. Each processor will be assigned $\frac{R}{P}$ full vectors. In particular, $p_{i,j}$ will be assigned the items in vectors $b_l$ where

$$(i-1)\frac{R}{\sqrt{P}} + (j-1)\frac{R}{P} + 1 \le l \le (i-1)\frac{R}{\sqrt{P}} + j\frac{R}{P}.$$

Moreover, processor $p_{1,1}$ will be assigned the items in matrix $M$. Clearly, this is a c-data layout where c is constant.

The data layouts presented in this subsection will be used to design optimal or near-optimal algorithms for solving tridiagonal systems on a 2-dimensional mesh.
We again note that while we could have simply assumed the best data layout for designing our algorithms to achieve the general lower bounds, we instead have listed data layouts in which no processor has been assigned more than $O\left(\frac{NR}{P}\right)$ initial data items. Furthermore, we have presented several single-item data layouts which we will be using in our optimal algorithm designs in later subsections. In the following subsection, we present some communication schedules which we will use in our algorithm designs. These schedules are used to address data redistribution.

Communication Schedules

For some of the algorithms we will design in the next subsection, it is important that we are able to redistribute the data from one layout to another. In this section, we present communication schedules which perform the needed redistributions.

Schedule 1

In this communication schedule, we are redistributing data layout 3 into the layout in which the matrix $M$ is distributed using $D_M$ and each $b_s$ is distributed using $D_{b_s}$. Clearly, only neighbor processors (as ordered in the discussion of data layout 3) need to communicate. This would be accomplished by running the following:

For all $j \le P - 2$, processors $p_j$ do in parallel
    send all initially assigned data items to processor $p_{j+1}$
For all $j \ge 2$, processor $p_j$ do in parallel

    send all initially assigned data items to processor $p_{j-1}$

Running Time: $O\left(\frac{NR}{P}\right)$

Schedule 2

In this communication schedule, we are redistributing data layout 4 into the layout in which the matrix $M$ is distributed to each group using layout $D_M$. Moreover, we also wish to redistribute the $b$ vectors such that, instead of $D'_b$, the distribution is $D_b$. Two slightly different schedules for this redistribution are needed: one for $R \le \sqrt{P}$ and one for $R > \sqrt{P}$. We will discuss the schedule for $R \le \sqrt{P}$. The redistribution schedule for the other case is very similar and, for brevity, we will omit its discussion.

Consider $R \le \sqrt{P}$. We begin by redistributing so that each group is assigned the matrix $M$ using $D'_M$. Clearly, only processors above and below (on the mesh) a processor need the same data, so only these need to communicate. Collecting the data items of $M$ in each column to the first processor in the column and then sending out the information to every other processor in the column would effectively accomplish the first step of the redistribution. The remaining redistributions can be easily taken care of using Schedule 1 above. This would be accomplished by running the following:

For all $j$ do in parallel
    For all $2 \le i \le \sqrt{P}$, every processor $p_{i,j}$ do in parallel
        send all initial data items of $M$ to $p_{1,j}$ (pipelined broadcast)
    $p_{1,j}$ does a one-to-all (pipelined) broadcast of the items of $M$ to processors $p_{2,j}, \ldots, p_{\sqrt{P},j}$
For each group $P_l$ do in parallel
    Run Schedule 1 for the assigned vector
    Run Schedule 1 for $M$

Running Time: $O\left(\frac{NR}{P} + \sqrt{P}\right)$

Schedule 3

In this communication schedule, we are redistributing data layout 5 into the layout in which the matrix $M$ is distributed to every processor. This is simply a one-to-all broadcast of $3(N - 1)$ items. Clearly, this can be accomplished in $O(N - 1 + \sqrt{P})$ steps.
Due to the assumptions made in layout 5, the running time is $O\left(\frac{NR}{P} + \sqrt{P} + \log N\right)$.

The Optimal Algorithms

In the previous subsections, we have discussed the data layouts and some communication redistribution schedules we plan to use in our algorithms. In this section, we present the algorithms which will produce running times within at most a small constant factor of the lower bounds presented in the previous section.

SubAlgorithm 1 for Best Data Layout

This algorithm solves a tridiagonal system by sequentially computing values from values the processor has already computed itself, rather than from values sent by another processor. This subalgorithm will be a building block for the full algorithms. Assume $y$ of the vectors of $b$ are assigned to this group of $P' = 2^q$ processors. Moreover, assume the processors are ordered using the ordering given in data layout 3. The items of row $i$ of level $j$ are considered to be $l^j_i$, $r^j_i$, $m^j_i$ and $b^j_{s,i}$ (for all $y$ vectors). Lastly, we will utilize only $P' - 1$ processors.

SubAlgorithm 1 for Data Layout D

Every processor $p_i$ ($1 \le i < 2^q$) do in parallel:
    for $s = 1$ to $n - q$ do
        for $z = 2^{n-q-s}(i-1) + 1$ to $2^{n-q-s}(i+1) - 1$ do
            compute (serially) the items of row $z$ of level $s$
    send items of row $i$ of level $n - q$ to $p_{i+1}$ and receive from $p_{i-1}$

    /* if $i = P' - 1$ do not send, and if $i = 1$ do not receive */
    send items of row $i$ of level $n - q$ to $p_{i-1}$ and receive from $p_{i+1}$
    /* if $i = 2^q - 1$ do not receive, and if $i = 1$ do not send */
    for $s = 1$ to $q - 1$ do
        if $i / 2^s$ is an integer then
            compute the items of row $2^{-s} i$ of level $n - q + s$
            if $s \le \frac{1}{2}\log P'$ then
                send items of row $i 2^{-s}$ of level $n - q + s$ to $p_{i+2^s}$, routed through the path ($p_{i+1}, p_{i+2}, \ldots, p_{i+k}, p_{i+k+1}, \ldots, p_{i+2^s-1}$), and receive from $p_{i-2^s}$
                /* if $i + 2^s > 2^q - 1$ do not send, and if $i - 2^s < 1$ do not receive */
                send items of row $i 2^{-s}$ of level $n - q + s$ to $p_{i-2^s}$, routed through the path ($p_{i-1}, p_{i-2}, \ldots, p_{i-k}, p_{i-k-1}, \ldots, p_{i-(2^s-1)}$), and receive from $p_{i+2^s}$
                /* if $i + 2^s > 2^q - 1$ do not receive, and if $i - 2^s < 1$ do not send */
            else ($s = \frac{1}{2}\log P' + k$)
                send items of row $i 2^{-s}$ of level $n - q + s$ to $p_{i+2^s}$, routed through the path ($p_{i+\sqrt{P'}}, p_{i+2\sqrt{P'}}, \ldots, p_{i+j\sqrt{P'}}, p_{i+(j+1)\sqrt{P'}}, \ldots, p_{i+(2^k-1)\sqrt{P'}}$), and receive from $p_{i-2^s}$
                /* if $i + 2^s > 2^q - 1$ do not send, and if $i - 2^s < 1$ do not receive */
                send items of row $i 2^{-s}$ of level $n - q + s$ to $p_{i-2^s}$, routed through the path ($p_{i-\sqrt{P'}}, p_{i-2\sqrt{P'}}, \ldots, p_{i-j\sqrt{P'}}, p_{i-(j+1)\sqrt{P'}}, \ldots, p_{i-(2^k-1)\sqrt{P'}}$), and receive from $p_{i+2^s}$
                /* if $i + 2^s > 2^q - 1$ do not receive, and if $i - 2^s < 1$ do not send */
    For back substitution, if $p_i$ was the last processor to handle items of row $j$, then $p_i$ computes all $x_{s,i}$ where $b_s$ is assigned to $p_i$ (and, if necessary, sends and receives the appropriate values of all $x_{s,m}$)

Running Time: $O\left(\frac{Ny}{P'} + y \log P'\right)$

5.7. Algorithm 1 for Data Layout 1

This data layout is used when $R \le P$. This algorithm simply solves the tridiagonal system with one vector. We have partitioned the processors into $R$ groups. Since the best data layout is actually not needed for SubAlgorithm 1 and our data layout is in fact sufficient, we simply call SubAlgorithm 1 for each group in parallel with $P' = \frac{P}{R}$, $y = 1$. Therefore the total running time is

$$O\left(\frac{NR}{P} + \log\frac{P}{R} + \sqrt{P/R}\right) = O\left(\frac{NR}{P} + \log N + \sqrt{P/R}\right).$$

As we stated in the subsection defining Data Layout 1, if $P \ge R N^{2/3}$, we simply assume $P = R N^{2/3}$.
Therefore the running time is clearly $O(N^{1/3})$.

5.8. Algorithm 2 for Data Layout 2

This data layout is used when $R > P$. This algorithm simply solves the tridiagonal system sequentially with $\frac{R}{P}$ right-hand side vectors. We have partitioned the vectors into $P$ groups. Each processor, in parallel, serially solves its assigned system. Therefore the total running time is

$$O\left(\frac{NR}{P} + \log N\right) = O\left(\frac{NR}{P} + \log N + \sqrt{P/R}\right).$$

5.9. Algorithm 3 for Data Layout 3

This algorithm first calls Schedule 1, then it simply solves the tridiagonal system with $R$ vectors. We consider the case $P < \frac{N}{\log N}$. Since the best data layout is actually not needed for SubAlgorithm 1 and our data layout is in fact sufficient (after Schedule 1), we simply call SubAlgorithm 1 with $P' = P$, $y = R$. Therefore the total running time is the time for Schedule 1 plus the time for SubAlgorithm 1:

$$O\left(\frac{NR}{P} + R\log N + \sqrt{P}\right).$$

Since $P < \frac{N}{\log N}$, the run-time is clearly $O\left(\frac{NR}{P} + \sqrt{P} + \log N\right)$. For the case where $P \ge \frac{N}{\log N}$, SubAlgorithm 1 is easily modified so that the tasks

11 Comlexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks 11 in the last hases of reduction and the first hases of back substitution are relicated so that at least half the rocessors are always working. O( N This will roduce a running time of: + log N + ): Algorithm 4 for Data Layout 4 This algorithm first calls schedule 2 then it is simly solving the tridiagonal system with one vector. We have artitioned the rocessors into grous. Since the best data layout is actually not needed for subalgorithm 1 and in fact our data layout is sufficient (after schedule 2), we simly call SubAlgorithm 1 for each grou in arallel with 0 =, y = 1. Therefore the total running time is O( N + log N + ): Algorithm 5 for Data Layout 5 This algorithm first calls schedule 3, then it is simly sequentially solving the tridiagonal system with right-hand side vectors. We have artitioned the vectors into grous. Each rocessor in arallel, serially solves their assigned system. Therefore the total running time is O( N + log N + ): In this section, we designed algorithms for each of the data layouts which cover all roblem sizes. Analyzing the running times of our algorithms, we see that they are all within a small constant factor of the lower bounds rovided. In articular, the algorithms for Data Layouts 1 and 2 satisfy the general lower bounds. The algorithms for the remaining Data Layouts all satisfy the single-item bounds. Moreover, analyzing the design of the algorithms, we see that to achieve the single-item bounds simly requires redistribution of a single-item layout to the layouts used to achieve the general lower bounds. Furthermore, no layout assigned any rocessor more than O( N ) initial data items. 6. Conclusion In this aer we examined the roblem of designing otimal tridiagonal solvers on 2-dimensional mesh interconnection networks with multile right hand sides using odd-even cyclic reduction. 
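For readers who want the underlying method in concrete form, the following is a minimal serial sketch of odd-even cyclic reduction with multiple right-hand sides. It is not the mesh algorithm analyzed above: there is no data layout, scheduling, or communication, and the function name `cyclic_reduction_solve` and its argument conventions are assumptions made only for this illustration. Each recursive call corresponds to one reduction level, and the loop after the recursive call is the matching back-substitution step. No pivoting is done, so the sketch assumes a system (for example, a diagonally dominant one) for which the eliminated diagonal entries stay nonzero.

```python
def cyclic_reduction_solve(a, b, c, d):
    """Serial odd-even cyclic reduction (illustrative sketch, no pivoting).

    Solves a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for i = 0..n-1.
    a, b, c: sub-, main and super-diagonal, each of length n
             (a[0] and c[n-1] lie outside the matrix and are never used
             in the solution).
    d:       n rows, each a list of R right-hand-side values.
    Returns x as n rows of R values.
    """
    n = len(b)
    if n == 1:
        return [[v / b[0] for v in d[0]]]

    # Reduction: keep the odd-indexed equations (0-based) and eliminate
    # their even-indexed neighbours, giving a tridiagonal system of
    # about half the size.
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        alpha = -a[i] / b[i - 1]          # cancels x[i-1]
        na = alpha * a[i - 1]
        nb = b[i] + alpha * c[i - 1]
        nc = 0.0
        nd = [v + alpha * u for v, u in zip(d[i], d[i - 1])]
        if i + 1 < n:
            gamma = -c[i] / b[i + 1]      # cancels x[i+1]
            nb += gamma * a[i + 1]
            nc = gamma * c[i + 1]
            nd = [v + gamma * w for v, w in zip(nd, d[i + 1])]
        ra.append(na)
        rb.append(nb)
        rc.append(nc)
        rd.append(nd)

    # Solve the reduced system (the next level of the reduction).
    x_odd = cyclic_reduction_solve(ra, rb, rc, rd)

    # Back substitution: recover the even-indexed unknowns from the
    # original equations and the already known odd-indexed unknowns.
    num_rhs = len(d[0])
    zero = [0.0] * num_rhs
    x = [None] * n
    for k, i in enumerate(range(1, n, 2)):
        x[i] = x_odd[k]
    for i in range(0, n, 2):
        left = x[i - 1] if i > 0 else zero
        right = x[i + 1] if i + 1 < n else zero
        x[i] = [(d[i][r] - a[i] * left[r] - c[i] * right[r]) / b[i]
                for r in range(num_rhs)]
    return x
```

Calling `cyclic_reduction_solve(a, b, c, d)` with `d` holding one row per equation and one column per right-hand side returns the solutions in the same row-by-column shape. The recursion depth is about log2 of the system size, which mirrors the logarithmic number of reduction levels that the parallel algorithms distribute across processors.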
This is the first paper to have tackled and solved this problem asymptotically. We were able to derive asymptotic lower bounds on the execution time. Moreover, we were able to provide algorithms whose running times are within a small constant factor of these lower bounds.

Specifically, we proved that the complexity of solving tridiagonal linear systems regardless of data layout (i.e. a general parallel lower bound on a 2-dimensional mesh) is Ω(N^{1/3}) when the number of processors is Ω(N^{2/3}); otherwise the bound is Ω( N +log N+q ). This shows that utilizing more than Ω(N^{2/3}) processors will not result in any substantial run-time improvements. In fact, utilizing more processors may result in much slower run-times.

When we added the realistic assumption that the data layouts are single-item and the number of data items assigned to a processor is bounded, we derived lower bounds for classes of data layouts. The asymptotic lower bound is Ω( N + log N + ). We provided three types of data layouts which yield optimal running times. In other words, we showed that the best types of layouts in general are those in which either (a) each processor is assigned an equal number of rows of M 0 and b, (b) the processors are grouped into groups in which every group is assigned of the vectors, or (c) each processor is assigned whole vectors of b. Therefore, we see that the skewness of the proportion of data will significantly affect performance. Comparing the lower bounds for the c-layouts with the general lower bounds, we see that restricting the proportion of data items assigned to a processor to N does not result in a significantly higher complexity than assuming all processors have all the data items for sufficiently large N (i.e. N 2 3).

Lastly, we showed that there are algorithms, data layouts, and communication patterns whose running times are within a constant factor of the lower bounds provided. This provides us with the Θ-bounds stated above. To achieve the general lower bound, i.e. the complexity for these methods regardless of data layout, we could have used D, the best data layout, i.e. the data layout in which every processor is assigned all the data items, but we did not. We presented two data layouts in which no processor was assigned more than O( N ) data items. Two types of algorithms produce optimal running times: (a) algorithms in which processors are partitioned into sets such that each set is in essence an independent tridiagonal system, or (b) algorithms in which each processor sequentially solves of the vectors.

As stated previously, three different data layouts were provided in order to design optimal algorithms for single-item layouts. The three types of algorithms are: (a) algorithms which in essence utilize block layouts and solve the tridiagonal system times, (b) algorithms in which processors are partitioned into sets such that each set is in essence an independent tridiagonal system, or (c) algorithms in which each processor sequentially solves of the vectors. For (b)-(c), the algorithms provided are very similar to those that achieve the optimal running times (regardless of data layout). In fact, the difference in run-time is primarily due to the need to redistribute data in order to run the optimal algorithms.

Finally, it is worthwhile to observe that for sufficiently large N, the single-item lower bounds are asymptotic to the general lower bound.

References

[1] C. AMODIO AND N. MASTRONARDI, A parallel version of the cyclic reduction algorithm on a hypercube, Parallel Computing, 19,
[2] D. HELLER, A survey of parallel algorithms in numerical linear algebra, SIAM J. Numer. Anal., 29(4),
[3] A. W. HOCKNEY AND C. R. JESSHOPE, Parallel Computers, Adam-Hilger,
[4] S. L. JOHNSSON, Solving tridiagonal systems on ensemble architectures, SIAM J. Sci. Stat. Comput., 8,
[5] S. P. KUMAR, Solving tridiagonal systems on the butterfly parallel computer, International J. Supercomputer Applications, 3,
[6] S. LAKSHMIVARAHAN AND S. D. DHALL, A Lower Bound on the Communication Complexity for Solving Linear Tridiagonal Systems on Cube Architectures, In Hypercubes 1987,
[7] S. LAKSHMIVARAHAN AND S. D. DHALL, Analysis and Design of Parallel Algorithms: Arithmetic and Matrix Problems, McGraw-Hill,
[8] F. T. LEIGHTON, Introduction to Parallel Algorithms and Architectures: Arrays-Trees-Hypercubes, Morgan Kaufmann,
[9] E. E. SANTOS, Optimal Parallel Algorithms for Solving Tridiagonal Linear Systems, In Springer-Verlag Lecture Notes in Computer Science #1300 (Proceedings of Euro-Par),
[10] E. E. SANTOS, Optimal Tridiagonal Solvers on Mesh Interconnection Networks, In Springer-Verlag Lecture Notes in Computer Science #1557 (Proceedings of ACPC),
[11] E. E. SANTOS, Solving Tridiagonal Linear Systems on a Torus, Proceedings of the International Conference on Parallel and Distributed Processing and Techniques,

Received: May 15, 1999
Accepted in revised form: January 21, 2000

Contact address:
Eunice E. Santos
Department of Electrical Engineering and Computer Science
19 Memorial Drive West
Lehigh University
Bethlehem, PA
Phone: +1 (610)
Fax: +1 (610)
santos@eecs.lehigh.edu

EUNICE E. SANTOS is an Assistant Professor in the Department of Electrical Engineering and Computer Science at Lehigh University. She joined the faculty at Lehigh in August of 1995 after receiving a Ph.D. in Computer Science from the University of California, Berkeley in May of that year. Dr. Santos has also obtained B.S. and M.S. degrees in both Mathematics and Computer Science. Dr. Santos's research interests include: parallel and distributed processing, algorithm design and analysis, parallel complexity, scientific and numerical computing, optimization, computational science, and evolutionary computing. She has over 30 technical publications in parallel processing alone. In 1996, Dr. Santos was awarded an NSF CAREER Grant in order to pursue her research interests.
