Time-Optimal Algorithms for Generalized Dominance Computation and Related Problems on Mesh Connected Computers and Meshes with Multiple Broadcasting

Ion Stoica

Abstract. The generalized dominance computation (GDC) problem is stated as follows: let $A = \{a_1, a_2, \ldots, a_n\}$ be a set of triplets, i.e. $a_i = (x_i, y_i, f_i)$, let "$<$" be a linear order relation defined on the $x$-components, let "$<$" be a linear order relation defined on the $y$-components, and let "$\otimes$" be an abelian operator defined on the $f$-components. It is required to compute for every $a_i \in A$ the expression $D(a_i) = f_{j_1} \otimes f_{j_2} \otimes \cdots \otimes f_{j_k}$, where $\{j_1, j_2, \ldots, j_k\}$ is the set of all indices $j$ such that $a_j \in A$, $x_j < x_i$ and $y_j < y_i$. First, this paper presents a time-optimal algorithm that solves the GDC problem in $O(\sqrt{n})$ time on a mesh connected computer of size $\sqrt{n} \times \sqrt{n}$. To demonstrate the generality of our approach, we show how a number of computational geometry problems, such as ECDF (empirical cumulative distribution function) searching and two-set dominance counting, can be derived from the GDC problem. Second, we define a natural extension of the GDC, called multiple-query generalized dominance computation (MQGDC), on meshes with multiple broadcasting. Using the multiple query (MQ) paradigm of Bokka et al. [3, 4, 6], we devise a time-optimal algorithm that solves an MQGDC problem involving a set $A$ of $n$ items and a set $Q$ of $m$ queries in $O(n^{1/6} m^{1/3})$ time on a mesh with multiple broadcasting of size $\sqrt{n} \times \sqrt{n}$.

Keywords: mesh connected computers, broadcasting, multiple buses, computational geometry, parallel algorithms, generalized dominance computation, multiple query, generalized multiple search, generalized prefix computation.

Department of Computer Science, Old Dominion University, Norfolk, VA 23529-0162 (stoica@cs.odu.edu).
1 Introduction

A mesh connected computer (simply known as a mesh) of size $n_1 \times n_2$ consists of $n_1 n_2$ identical processors arranged on an $n_1 \times n_2$ grid, where each processor is connected to its four neighbors by bidirectional links. Each processor has a fixed number of registers, each of size $O(\log n_1 n_2)$, and can perform standard arithmetic and boolean operations in unit time. Each processor can also send the contents of a register to one of its neighbors and receive data from a neighbor into a special register in unit time. A mesh is assumed to function in SIMD mode: all processors are synchronized and operate under the control of a single instruction stream issued by a control unit. Due to their simple interconnection topology and to the fact that many problems can be easily mapped onto them, mesh connected computers have become a popular choice for solving a large number of problems in image processing, computational geometry and pattern recognition [12, 11]. Unfortunately, meshes suffer from major limitations when data need to be transferred over long distances. A natural solution to this problem was to add row and column buses to the existing meshes [5, 8, 9, 10]. These meshes, known as meshes with multiple broadcasting (MMB for short), have already been implemented and are currently available [13]. At any time only one processor can broadcast its data on a given bus. On the other hand, all processors connected to a bus can concurrently read the data broadcast on that bus. Throughout this paper the communication along column and row buses is assumed to take unit time, independent of the length of the bus [2, 5, 8, 13].

The GDC problem is a generalization of the well-known empirical cumulative distribution function (ECDF) searching problem introduced by Springsteel and Stojmenovic in [14].
The ECDF problem can be formulated as follows: given a set $S = \{s_1, s_2, \ldots, s_n\}$ of $n$ points in the plane, count for every point $s_i \in S$ the number of points in $S$ that are dominated by $s_i$ (we say that a point $s_i$ dominates a point $s_j$ if and only if the $x$-coordinate of $s_i$ is larger than the $x$-coordinate of $s_j$ and the $y$-coordinate of $s_i$ is larger than the $y$-coordinate of $s_j$). The GDC problem generalizes both the linear order relations on the $x$ and $y$ coordinates and the counting operation. Formally, the GDC problem is stated as follows: let $A = \{a_1, a_2, \ldots, a_n\}$ be a set of triplets, i.e. $a_i = (x_i, y_i, f_i)$, let "$<$" be a linear order relation defined on the $x$-components, let "$<$" be a linear order relation defined on the $y$-components, and let "$\otimes$" be an abelian operator defined on the $f$-components. The problem requires computing for every $a_i \in A$ the expression $D(a_i) = f_{j_1} \otimes f_{j_2} \otimes \cdots \otimes f_{j_k}$, where $\{j_1, j_2, \ldots, j_k\}$ is the set of all indices $j$ such that $a_j \in A$, $x_j < x_i$ and $y_j < y_i$.

In this paper, we present a time-optimal algorithm that solves the GDC problem, involving a set $A$ of $n$ items, in $O(\sqrt{n})$ time on a mesh connected computer of size $\sqrt{n} \times \sqrt{n}$. To demonstrate the generality of our approach, we show how a number of computational geometry problems, including two-set dominance counting and maximal vectors, can be derived from the GDC problem. Next, to take advantage of the architecture of meshes with multiple broadcasting, we define a natural extension of the GDC, called multiple-query generalized dominance computation (MQGDC). Using the multiple query (MQ [6]¹) paradigm, we devise a time-optimal algorithm that solves an MQGDC problem involving a set $A$ of $n$ items and a set $Q$ of $m$ queries in $O(n^{1/6} m^{1/3})$ time on a mesh with multiple broadcasting of size $\sqrt{n} \times \sqrt{n}$.

The remainder of this paper is organized as follows: Section 2 presents a time-optimal algorithm that solves the GDC problem on mesh connected computers; Section 3 presents some applications of the GDC paradigm to several problems in computational geometry; Section 4 defines the MQGDC problem and presents a time-optimal algorithm on meshes with multiple broadcasting; Section 5 summarizes our findings and indicates some possible directions for future work.

2 A Time-Optimal GDC Algorithm on Mesh Connected Computers

We define the generalized dominance computation (GDC) problem as follows: let $A = \{a_1, a_2, \ldots, a_n\}$ be a set of $n$ items, where every item in $A$ is a triplet, i.e. $a_k = (x_k, y_k, f_k)$ ($1 \le k \le n$).
Further, consider:

- an abelian operator "$\otimes$" defined on the set of $f$-components, $f_1, f_2, \ldots, f_n$;

- a linear order relation "$<$" defined on the set of $x$-components, $x_1, x_2, \ldots, x_n$;

¹In an early draft of [6] the MQ paradigm was known as the generalized multiple search (GMS) paradigm, and this is the name under which it was referred to in some previous papers (e.g., [7]).
- a linear order relation "$<$" defined on the set of $y$-components, $y_1, y_2, \ldots, y_n$.

The problem requires computing for every item $a_i \in A$ the expression $D(a_i) = f_{j_1} \otimes f_{j_2} \otimes \cdots \otimes f_{j_k}$, where $\{j_1, j_2, \ldots, j_k\}$ is the set of all indices $j$ such that $a_j \in A$, $x_j < x_i$ and $y_j < y_i$. In this section, we present a time-optimal algorithm that solves the GDC problem involving a set of $n$ items, $A = \{a_i \mid 1 \le i \le n\}$, on a mesh connected computer of size $\sqrt{n} \times \sqrt{n}$.

It is convenient to interpret every triplet $a_k \in A$ as a point in the plane, where $x_k$ and $y_k$ are the point coordinates and $f_k$ is some value associated with it (e.g., if the point represents a pixel of an image, then its value can be the pixel intensity). We say that an item $a_i$ dominates an item $a_j$, and we write $a_j \prec a_i$, if and only if $x_j < x_i$ and $y_j < y_i$. For any subset $A'$ of $A$, let $D(a_m, A')$ be the expression $f_{j_1} \otimes f_{j_2} \otimes \cdots \otimes f_{j_k}$, where $\{j_1, j_2, \ldots, j_k\}$ is the set of all indices $j$ such that $a_j \in A'$, $x_j < x_m$ and $y_j < y_m$ (in the following we use the notations $D(a_k, A)$ and $D(a_k)$ interchangeably). For example, in Figure 1, $D(a_7) = f_3 \otimes f_4 \otimes f_5 \otimes f_8 \otimes f_{14}$ and $D(a_7, A') = f_3 \otimes f_{14}$ for the subset $A'$ shown in the figure.

Figure 1: An instance of the GDC problem.

Lemma 1 Given two disjoint subsets $A'$, $A''$ of $A$, then $D(a_k, A' \cup A'') = D(a_k, A') \otimes D(a_k, A'')$ for every $a_k \in A$.
Proof. First, for every value $f_i$ in the expression $D(a_k, A' \cup A'')$ we have, from the definition, $a_i \in A' \cup A''$, $x_i < x_k$ and $y_i < y_k$. Thus, $f_i$ is included either in $D(a_k, A')$ or in $D(a_k, A'')$, and therefore $f_i$ is also included, precisely once, in $D(a_k, A') \otimes D(a_k, A'')$. Conversely, for every value $f_i$ in the expression $D(a_k, A') \otimes D(a_k, A'')$, we have $x_i < x_k$, $y_i < y_k$ and either $a_i \in A'$ or $a_i \in A''$ (i.e., $a_i \in A' \cup A''$), and therefore $f_i$ is also in $D(a_k, A' \cup A'')$. Thus, both $D(a_k, A' \cup A'')$ and $D(a_k, A') \otimes D(a_k, A'')$ contain the same values, and every value occurs exactly once in each expression. Since the operator $\otimes$ is both associative and commutative, this ensures that $D(a_k, A' \cup A'') = D(a_k, A') \otimes D(a_k, A'')$. □

Figure 2: An example of partitioning a set of 16 items in the $xy$-plane by 5 vertical and 5 horizontal lines. $X_{33} = A_{31} \cup A_{32}$, $Y_3 = A_{13} \cup A_{23} \cup A_{33} \cup A_{43}$ and $S_{33} = A_{11} \cup A_{12} \cup A_{21} \cup A_{22}$.

The idea of the algorithm is to partition the set $A$ into disjoint subsets on both the $x$- and $y$-components (coordinates). Intuitively, this can be viewed as a partition of the $xy$-plane by $m_h$ horizontal and $m_v$ vertical lines, such that all items in $A$ lie between the extreme lines (see Figure 2). More precisely, let $h_1 < h_2 < \cdots < h_{m_h}$ be a sorted sequence of $y$-coordinates and $v_1 < v_2 < \cdots < v_{m_v}$ be a sorted sequence of $x$-coordinates, such that all the points in $A$ lie in the region delimited by $h_1$, $h_{m_h}$, $v_1$, $v_{m_v}$, i.e. for every $a_k \in A$,
$v_1 < x_k < v_{m_v}$ and $h_1 < y_k < h_{m_h}$. Next, let us denote by $A_{ij}$ the set of points that lie in the region delimited by $h_i$, $h_{i+1}$ and $v_j$, $v_{j+1}$, i.e. $A_{ij} = \{a_k \in A \mid v_j < x_k < v_{j+1} \text{ and } h_i < y_k < h_{i+1}\}$. It is clear that for any item $a_k \in A_{ij}$, all the points it dominates ($D(a_k)$) are contained in the sets $A_{lm}$ with $l \le i$ and $m \le j$ (see Figure 2). Further, for every set $A_{ij}$ we define the following related sets (see Figure 2):

$S_{ij} = \bigcup_{1 \le l < i,\, 1 \le m < j} A_{lm}$;  $X_i = \bigcup_{1 \le m \le m_v} A_{im}$;  $Y_j = \bigcup_{1 \le l \le m_h} A_{lj}$;  $X_{ij} = \bigcup_{1 \le m < j} A_{im}$;  $Y_{ij} = \bigcup_{1 \le l < i} A_{lj}$.   (1)

It is easy to see (Figure 2) that $S_{ij}$, $Y_j$ and $X_{ij}$ are disjoint and that their union contains all the sets $A_{lm}$ that can contain items of $A$ dominated by an item in $A_{ij}$. Using Lemma 1, the solution for every item $a_k \in A_{ij}$ can be written as:

$D(a_k) = D(a_k, Y_j) \otimes D(a_k, X_{ij}) \otimes D(a_k, S_{ij})$.   (2)

The above equation is in fact the core of our algorithm, which can be divided into three stages:

1. for every point $a_k \in A$ such that $a_k \in A_{ij}$, compute the partial solution $D(a_k) = D(a_k, Y_j)$;

2. for every point $a_k \in A$ such that $a_k \in A_{ij}$, compute $D(a_k, X_{ij})$ and update the partial solution $D(a_k) = D(a_k) \otimes D(a_k, X_{ij})$;

3. for every point $a_k \in A$ such that $a_k \in A_{ij}$, compute $D(a_k, S_{ij})$ and the final solution $D(a_k) = D(a_k) \otimes D(a_k, S_{ij})$.

Notice that all the items in $S_{ij}$ are dominated by every item in $A_{ij}$. Therefore, it is enough to perform stage 3 only once for all the items in $A_{ij}$. The remainder of this section shows how each stage of the algorithm is implemented on a mesh connected computer of size $\sqrt{n} \times \sqrt{n}$. Every processor of the mesh stores one item $a_i$ of $A$. For computation purposes, every item $a_k \in A$, besides its three components $x_k$, $y_k$ and $f_k$, contains two additional components: $col_k$ and $row_k$. These components
represent the indices of the set $A_{ij}$ to which $a_k$ belongs, i.e. $a_k \in A_{row_k\, col_k}$ ($row_k = i$, $col_k = j$).

Stage 1. First, we sort all items $a_i$ by their $x$-component in column major order (i.e., $P(1,1)$ contains the $a_i$ with the smallest $x_i$, $P(2,1)$ contains the $a_i$ with the second smallest $x_i$, etc.) and initialize the solutions $D(a_i)$ ($1 \le i \le n$) to the identity element of "$\otimes$" (see Figure 3.a). Now, consider the natural partition of the items into the sets $Y_j$ according to columns: if $a_k \in P(i,j)$, then $a_k \in Y_j$. Therefore, if $a_k \in P(i,j)$, $col_k$ is initialized to $j$. Next, we propagate a copy of every item along the mesh columns, such that every processor receives a copy of each item stored by a processor in that column. Let $a_k$ be the item stored on processor $P(i,j)$ and $a_l$ be a copy of an item it receives during the above operations. If $a_l \prec a_k$, then $P(i,j)$ updates $D(a_k)$: $D(a_k) = D(a_k) \otimes f_l$. It is easy to see that after $P(i,j)$ has been visited by all the items stored by processors in the same column, $D(a_k) = D(a_k, Y_j)$, and the first stage is completed. Since the sorting operation can be performed in $O(\sqrt{n})$ time on a mesh connected computer of size $\sqrt{n} \times \sqrt{n}$ ([15]), and the propagation of an item to all the other processors in the same column also takes $O(\sqrt{n})$ time, it is clear that stage 1 takes $O(\sqrt{n})$ time.

Figure 3: The items in Figure 1 sorted in column major order by their $x$-component (a) and in row major order by their $y$-component (b). $P(1,1)$ is the top-leftmost processor and $P(4,4)$ is the bottom-rightmost processor of the mesh.

Stage 2. This stage is very similar to the previous one. We sort all points $a_i$ by their $y$-component in row major order (i.e., $P(1,1)$ contains the $a_i$ with the smallest $y_i$, $P(1,2)$
contains the $a_i$ with the second smallest $y_i$, etc.) and we consider the natural partition of the items into the sets $X_i$ according to rows: if $a_k \in P(i,j)$, then $a_k \in X_i$ (Figure 3.b). Therefore, if $a_k \in P(i,j)$, $row_k$ is initialized to $i$. Notice that at this point the sets $A_{ij}$, $1 \le i, j \le \sqrt{n}$, are well defined: $A_{ij} = \{a_k \in A \mid row_k = i \wedge col_k = j\}$. Next, analogously to the previous stage, we propagate a copy of every item along the mesh rows to every other processor in that row. Now, let $a_k$ be the item stored on processor $P(i,j)$ and $a_l$ be a copy of an item it receives. If $a_l \prec a_k$ and $col_l < col_k$, then $P(i,j)$ updates $D(a_k)$: $D(a_k) = D(a_k) \otimes f_l$. Notice that in this case an additional test, $col_l < col_k$, is performed. This ensures that only the items in $X_{ij}$ are considered, and therefore at the end of this stage we have $D(a_k) = D(a_k, Y_j) \otimes D(a_k, X_{ij})$. Like the previous stage, stage 2 requires $O(\sqrt{n})$ time.

Stage 3. This stage computes the last term of $D(a_k)$ in equation (2), i.e. $D(a_k, S_{ij})$. For this, every processor maintains two local variables, $b_{ij}$ and $s_{ij}$, initialized to the identity element of "$\otimes$". This stage consists of two phases. First, a copy of every item is propagated along the mesh rows to every processor in that row. When processor $P(i,j)$ receives a copy of an item $a_k$, it checks whether $col_k < j$. If this is true, then it updates $b_{ij}$: $b_{ij} = b_{ij} \otimes f_k$. Since all the items in the same row have equal row-components, at the end of this phase $b_{ij} = f_{l_1} \otimes f_{l_2} \otimes \cdots \otimes f_{l_m}$, where $X_{ij} = \{a_{l_1}, a_{l_2}, \ldots, a_{l_m}\}$. Notice that $b_{ij}$ could have been computed concurrently with $D(a_k, X_{ij})$ in the previous stage. The only reason we have not combined these computations is to increase the clarity of the algorithm. In the second phase, all the values $b_{ij}$ are propagated along their corresponding column $j$. In this way, every processor $P(i,j)$ receives all the values $b_{lj}$ with $l < i$.
Upon receiving $b_{lj}$, processor $P(i,j)$ updates its variable $s_{ij}$: $s_{ij} = s_{ij} \otimes b_{lj}$. Therefore, at the end of this phase, $s_{ij} = b_{1j} \otimes b_{2j} \otimes \cdots \otimes b_{i-1,j}$. Since $S_{ij} = X_{1j} \cup X_{2j} \cup \cdots \cup X_{i-1,j}$, from Lemma 1 we obtain $D(a_k, S_{ij}) = s_{ij}$. Next, every value $s_{ij}$ is propagated to every processor in the same row. Upon receiving $s_{ij}$, every processor $P(i,l)$ which stores an item $a_k$ checks whether $col_k = j$. If this is the case, then the final value of $D(a_k)$ is computed: $D(a_k) = D(a_k) \otimes s_{ij} = D(a_k, Y_j) \otimes D(a_k, X_{ij}) \otimes D(a_k, S_{ij})$. Since this stage requires only propagations along rows or columns, it takes $O(\sqrt{n})$ time, and therefore all the stages can be performed in $O(\sqrt{n})$ time. To prove that the algorithm is
time-optimal is straightforward. Consider an initial distribution of the items on the mesh such that the item $a_k$ stored on processor $P(1,1)$ is dominated by the item $a_l$ stored on processor $P(\sqrt{n}, \sqrt{n})$. To compute $D(a_l)$ we need $f_k$, but since the distance between $P(1,1)$ and $P(\sqrt{n}, \sqrt{n})$ is $O(\sqrt{n})$, this cannot be done faster than $\Omega(\sqrt{n})$ time. Thus, we have the following result.

Theorem 1 The GDC problem involving a set $A$ of $n$ items can be solved in $O(\sqrt{n})$ time on a mesh connected computer of size $\sqrt{n} \times \sqrt{n}$. Moreover, this time is optimal.

3 Some GDC Applications on Mesh Connected Computers

To demonstrate the power of the GDC, we now give some examples of computational geometry problems for sets of points in the plane that can be reduced to the GDC problem. Let $S = \{s_1, s_2, \ldots, s_n\}$ be a set of points in the plane and let $x(s_i)$, $y(s_i)$ be the $x$- and $y$-coordinates, respectively, of $s_i$. We say that a point $s_i$ dominates a point $s_j$ if and only if $x(s_j) < x(s_i)$ and $y(s_j) < y(s_i)$.

1. ECDF (empirical cumulative distribution function) searching problem. Determine for every point $s_i \in S$ the total number of points in $S$ dominated by $s_i$. The corresponding instance of the GDC problem has the following parameters: $x_k = x(s_k)$, $y_k = y(s_k)$; $f_k = 1$; both "$<$" relations are the usual "$<$" defined on $\mathbb{R}$; "$\otimes$" is "$+$". The result for $s_i$ is $D(a_i)$.

2. Two-set dominance counting problem. Given two disjoint sets of points in the plane, $S_1$ and $S_2$, determine for every point $s_i \in S_2$ the number of points in $S_1$ dominated by $s_i$. Denoting $S = S_1 \cup S_2$, the corresponding instance of the GDC problem has the following parameters: $f_k = 1$ if $s_k \in S_1$, and $f_k = 0$ if $s_k \in S_2$;
$x_k = x(s_k)$, $y_k = y(s_k)$ for every $s_k \in S$; both "$<$" relations are the usual "$<$" defined on $\mathbb{R}$; "$\otimes$" is "$+$". The result for $s_i \in S_2$ is $D(a_i)$.

3. Maximal vectors. Determine all the points in $S$ that are not dominated by any other point in $S$. The corresponding instance of the GDC problem has the following parameters: $x_k = x(s_k)$, $y_k = y(s_k)$; $f_k = 1$; both "$<$" relations are "$>$" defined on $\mathbb{R}$; "$\otimes$" is "$+$". A point $s_i$ is maximal if and only if $D(a_i) = 0$. Notice that in this case, for every $s_i \in S$, the corresponding $D(a_i)$ represents the number of points $s_j \in S$ such that $x(s_j) > x(s_i)$ and $y(s_j) > y(s_i)$, i.e. the number of points in $S$ that dominate $s_i$. Therefore, $D(a_i) = 0$ if and only if $s_i$ is not dominated by any other point in $S$.

Although these problems can also be solved using the generalized prefix computation (GPC) technique, as shown in [1] and [14], we believe that our approach is more direct and elegant for the above examples.

4 A Time-Optimal MQGDC Algorithm on Meshes with Multiple Broadcasting

Although meshes with multiple broadcasting handle data transfer operations over large distances much faster than mesh connected computers, they cannot significantly speed up the algorithms for dense problems such as GDC. To see why, let us take an instance of the GDC problem with an input $A$ of size $n$, partitioned into two equal-sized sets $A'$ and $A''$, such that every item $a_i \in A'$ does not dominate any other item $a_j \in A$, and every item $a_i \in A''$ is a maximal element of $A$ (i.e., it is not dominated by any other element in $A$) and dominates exactly one item in $A'$ (see Figure 4). Clearly, the solution for every $a_i \in A''$ is $D(a_i) = f_j$, where $a_j$ is the item in $A'$ dominated by $a_i$. Now, assume that the items in $A'$ are stored one per processor in the first $\sqrt{n}/2$ columns, and the items in $A''$
are stored one per processor in the last $\sqrt{n}/2$ columns of a mesh with multiple broadcasting of size $\sqrt{n} \times \sqrt{n}$ (see Figure 4).

Figure 4: An instance of the GDC problem used to prove the lower bound on a mesh with multiple broadcasting: a) every item of the set $A''$ dominates exactly one item of the set $A'$; b) all the items of the set $A'$ are stored in the first $\sqrt{n}/2$ columns, and all the items of the set $A''$ are stored in the last $\sqrt{n}/2$ columns of a mesh with multiple broadcasting of size $\sqrt{n} \times \sqrt{n}$.

To compute the final solution for every $a_i \in A''$, it is clear that either $a_i$ or the corresponding dominated item $a_j$ must cross the line $P$ that separates the first $\sqrt{n}/2$ columns from the last $\sqrt{n}/2$ columns. Since there are $n/2$ such pairs and only $O(\sqrt{n})$ items can traverse the line $P$ at any moment, it is clear that any algorithm that correctly solves this instance of the GDC problem takes at least $\Omega(\sqrt{n})$ time. Thus, we have the following result:

Lemma 2 Any algorithm that correctly solves the GDC problem, involving a set $A$ of $n$ items, on a mesh with multiple broadcasting of size $\sqrt{n} \times \sqrt{n}$ takes at least $\Omega(\sqrt{n})$ time.

However, in many practical applications we are not interested in computing the solutions for all the items in the set $A$, but rather for a subset of items. Therefore, as a natural extension of the GDC problem, we define the multiple-query generalized dominance computation (MQGDC)² as follows: let $A = \{a_1, a_2, \ldots, a_n\}$ be a set of items and

²As we will show, MQGDC also helps us take advantage of the mesh with multiple broadcasting architecture.
$Q = \{q_1, q_2, \ldots, q_m\}$ ($1 \le m \le n$) be a set of queries, where every $a_i$ is a triplet $(x_i, y_i, f_i)$ and every $q_i$ is a pair $(x_i, y_i)$. Further, consider:

- an abelian operator "$\otimes$" defined on the set of $f$-components of the items in $A$;

- a linear order relation "$<$" defined on the set of $x$-components of the items in both $A$ and $Q$;

- a linear order relation "$<$" defined on the set of $y$-components of the items in both $A$ and $Q$.

The problem requires computing for every query $q_i \in Q$ the expression $D(q_i) = f_{j_1} \otimes f_{j_2} \otimes \cdots \otimes f_{j_k}$, where $\{j_1, j_2, \ldots, j_k\}$ is the set of all indices $j$ of items in $A$ for which $x_j < x_i$ and $y_j < y_i$. Notice that this problem can be viewed as a generalization of the two-set dominance counting problem in the same sense in which GDC can be viewed as a generalization of the ECDF problem. Therefore, we can again interpret every triplet $a_k \in A$ as a point in the plane, where $x_k$ and $y_k$ are the point coordinates and $f_k$ is some value associated with it. In the same manner, every pair $q_k \in Q$ is interpreted as a point with coordinates $x_k$ and $y_k$. Then, the solution of the MQGDC problem is to determine for every query $q_k \in Q$ the $\otimes$-sum over the values $f_i$ of all the points $a_i \in A$ dominated by $q_k$. Now it is obvious that if we take "$\otimes$" to be the summation operator ("$+$") and $f_k = 1$ for every $a_k \in A$, we can derive the two-set dominance counting problem.

4.1 Multiple Querying on Meshes with Multiple Broadcasting

In solving the MQGDC problem on a mesh with multiple broadcasting, we use a powerful paradigm recently developed by Bokka et al. [3, 4, 6] to solve the multiple query (MQ) problem on MMBs. The MQ problem is stated as follows [6]: consider collections $A = \{a_1, a_2, \ldots, a_n\}$ of items and $Q = \{q_1, q_2, \ldots, q_m\}$ ($1 \le m \le n$) of queries, and a decision problem $\Phi : Q \times A \rightarrow \{$"yes", "no"$\}$. For every $i$ ($1 \le i \le m$), let $S_i$ be the set of items $a_j \in A$ for which $\Phi(q_i, a_j) =$ "yes", and let $f$ be an abelian semigroup-type function operating on $S_i$.
The problem requires determining for every $q_i$ ($1 \le i \le m$) the corresponding $f(S_i)$ (where $f(S_i)$ is called the solution of $q_i$).
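Stated sequentially, the MQ problem is simply a fold of $f$ over the items that satisfy $\Phi$ for each query. The following Python sketch (illustrative names; it models only the problem statement above, not the mesh algorithm) makes this concrete:

```python
# Sequential sketch of the MQ problem statement (not the MMB algorithm).
# `phi` plays the role of the decision problem, and `op`/`identity` the
# abelian semigroup-type function f. All names here are illustrative.

def multiple_query(items, queries, phi, op, identity):
    """For every query q, fold `op` over the f-components of the items a
    with phi(q, a) == True, i.e. compute f(S_i)."""
    solutions = []
    for q in queries:
        acc = identity
        for a in items:
            if phi(q, a):
                acc = op(acc, a[2])  # a = (x, y, f)
        solutions.append(acc)
    return solutions

# Example instance: dominance counting (phi = dominance, op = +, f = 1).
items = [(1, 1, 1), (2, 3, 1), (4, 2, 1)]
queries = [(3, 4), (5, 5)]
dominates = lambda q, a: a[0] < q[0] and a[1] < q[1]
print(multiple_query(items, queries, dominates, lambda u, v: u + v, 0))
```

This sequential version takes $O(nm)$ time; the point of the MQ paradigm is to reach the $\Omega(n^{1/6} m^{1/3})$ bound of Theorem 2 below on a mesh with multiple broadcasting.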
For completeness, we state the lower-bound theorem for MQ and outline the algorithm for solving MQ optimally [6]:

Theorem 2 ([6]) Any algorithm that correctly solves the MQ problem involving a set $A$ of $n$ items stored one per processor and a set $Q$ of $m$ ($1 \le m \le n$) queries stored one per processor in the first $m/\sqrt{n}$ columns of a mesh with multiple broadcasting of size $\sqrt{n} \times \sqrt{n}$ must take at least $\Omega(n^{1/6} m^{1/3})$ time.

Let $M$ be a mesh with multiple broadcasting of size $\sqrt{n} \times \sqrt{n}$. The items $a_i \in A$ are stored one per processor in $M$, while the queries $q_i \in Q$ are stored one per processor in the first $m/\sqrt{n}$ columns of $M$. For simplicity, let us denote $s = n^{1/6} m^{1/3}$. Next, consider a partition of the initial mesh $M$ into square submeshes $M_{ij}$ ($1 \le i, j \le \sqrt{n}/s$) of size $s \times s$, where $M_{ij}$ contains the processors located in rows $(i-1)s+1, \ldots, is$ and columns $(j-1)s+1, \ldots, js$ of $M$ (see Figure 5).

Figure 5: Essential data movement involved in the generic MQ algorithm: a) queries are stored one per processor in the first $m/\sqrt{n}$ columns of the original mesh $M$; b) queries are replicated on every submesh $M_{ij}$ of size $s \times s$; c) after each submesh $M_{ij}$ solves its local problem, the solutions are combined and the final results are stored one per processor in the first $m/\sqrt{n}$ columns of the original mesh $M$.

The algorithm to solve any instance of an MQ problem consists of three stages [6]:

1. Replicate all $m$ queries in every submesh $M_{ij}$. This can be done, as shown in [3, 4, 6], in $O(n^{1/6} m^{1/3})$ time. Notice that at this point the original problem is partitioned into several instances, each of them on a submesh $M_{ij}$ (see Figure 5.a-b).
2. Compute in parallel, for every submesh $M_{ij}$, the solution to the local instance of the MQ problem.

3. Combine the solutions of the local instances of the MQ problem obtained in stage 2, and compute the final solution to the MQ problem (see Figure 5.b-c). This can also be done in $O(n^{1/6} m^{1/3})$ time [3, 4, 6].

4.2 The Algorithm

The MQGDC problem can be formulated as an instance of MQ with the following parameters:

- the set $A = \{a_1, a_2, \ldots, a_n\}$ of items, where every $a_i$ is a triplet $(x_i, y_i, f_i)$;

- the set $Q = \{q_1, q_2, \ldots, q_m\}$ of queries, where every $q_i$ is a pair $(x_i, y_i)$;

- the decision problem $\Phi : Q \times A \rightarrow \{$"yes", "no"$\}$ is such that $\Phi(q_i, a_j) =$ "yes" if and only if $q_i$ dominates $a_j$, i.e. $a_j \prec q_i$;

- for every $i$ ($1 \le i \le m$), let $S_i = \{a_{j_1}, a_{j_2}, \ldots, a_{j_k}\}$ be the set of items $a_j \in A$ for which the answer to $\Phi(q_i, a_j)$ is "yes". We take $f(S_i) = f_{j_1} \otimes f_{j_2} \otimes \cdots \otimes f_{j_k}$.

Our algorithm to solve the MQGDC problem is based on the generic MQ algorithm. Since stages 1 and 3 are basically the same for any instance of the MQ problem, the remainder of this section is devoted to the implementation of stage 2. After stage 1, every $M_{ij}$ contains a local instance of the original MQGDC problem involving the sets $A_{ij} = \{a_{k_1}, a_{k_2}, \ldots, a_{k_{s^2}}\}$ and $Q = \{q_1, q_2, \ldots, q_m\}$, where $A_{ij}$ is the subset of items in $A$ stored on the submesh $M_{ij}$. Next, we show how the local instances of the MQGDC can be solved in parallel by applying GDC. Let $A' = \{a'_1, a'_2, \ldots, a'_{s^2+m}\}$ be a set of triplets $a'_i = (x'_i, y'_i, f'_i)$ ($1 \le i \le s^2+m$), such that $a'_i = a_{k_i}$ (i.e. $x'_i = x_{k_i}$, $y'_i = y_{k_i}$, $f'_i = f_{k_i}$) for every $1 \le i \le s^2$, and $a'_{s^2+i} = q_i$ (i.e. $x'_{s^2+i} = x_i$, $y'_{s^2+i} = y_i$, and $f'_{s^2+i}$ is the identity element of "$\otimes$") for every $1 \le i \le m$.
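The mapping above can be sketched sequentially in Python (an illustrative sanity check, with an assumed tuple layout and a brute-force GDC solver rather than the mesh implementation): each query becomes an item whose $f$-component is the identity of "$\otimes$", so it contributes nothing to any $\otimes$-expression, and solving plain GDC on the padded set answers every query.

```python
# Sketch of the MQGDC-to-GDC mapping: append each query (x, y) as an item
# whose f-component is the identity of the abelian operator.
from functools import reduce

def mqgdc_to_gdc(items, queries, identity):
    """Return the padded GDC input A' of size s^2 + m: each query (x, y)
    becomes an item whose f-component is the identity of the operator."""
    return list(items) + [(x, y, identity) for (x, y) in queries]

def gdc_bruteforce(A, op, identity):
    """D(a_i) for every a_i in A, straight from the definition (O(n^2))."""
    return [reduce(op, (fj for (xj, yj, fj) in A if xj < xi and yj < yi),
                   identity)
            for (xi, yi, _) in A]

items = [(1, 1, 5), (2, 2, 7)]   # (x, y, f)
queries = [(3, 3)]
padded = mqgdc_to_gdc(items, queries, 0)
# D(q_i) is the solution of the padded item at position s^2 + i.
answers = gdc_bruteforce(padded, lambda u, v: u + v, 0)[len(items):]
print(answers)
```

In the example, the query $(3,3)$ dominates both items, so its answer is $5 + 7 = 12$; the identity-valued entries never alter the solutions of the original items.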
It is easy to see that using the above mapping scheme we have reduced the local instance of the MQGDC problem to an instance of the GDC problem of size $m + s^2$ which, according to Theorem 1, can be solved in $O(\sqrt{m+s^2})$ time on a mesh connected computer of size $\sqrt{m+s^2} \times \sqrt{m+s^2}$.
But, as proved in [7], any algorithm with an input of size $n$ that takes $O(f(n))$ time to run on a mesh connected computer of size $n_r \times n_c$ also takes $O(f(n))$ time to run on a mesh connected computer of size $(n_r/a) \times (n_c/b)$, where $a$ and $b$ are two constants. By taking $a = b = \sqrt{m+s^2}/s$ (since $s^2 \ge m$, we have $1 \le a, b < 2$), it is clear that the algorithm that solves the GDC problem of size $m + s^2$ takes $O(s)$ time on a mesh connected computer of size $s \times s$. Therefore, we can solve the local instance of the MQGDC problem on every submesh $M_{ij}$ in $O(s)$ time. Because our algorithm is designed for mesh connected computers, it does not use the row and column buses, and therefore every submesh $M_{ij}$ can compute its local solution in parallel. Finally, we have the following result:

Theorem 3 An instance of the MQGDC problem involving a set $A$ of $n$ items stored one per processor and a set $Q$ of $m$ ($1 \le m \le n$) queries stored one per processor in the first $m/\sqrt{n}$ columns of a mesh with multiple broadcasting of size $\sqrt{n} \times \sqrt{n}$ can be solved in $O(n^{1/6} m^{1/3})$ time. Moreover, this time is optimal.

Proof. The first part of the proof follows directly from the algorithm. The proof of optimality is similar to that of Theorem 2 (see [6] for details) and is based on the observation that the computation cannot terminate until some $m$ processors learn about all the ordered pairs of the Cartesian product $Q \times A$. We prove this claim by contradiction. Assume the information about a particular ordered pair $(q_l, a_m)$ is not propagated to some $m$ processors (those that compute the final results). Then we cannot compute the final solution $D(q_l)$, since there is no way to know whether $D(q_l)$ depends on the value of the $f$-component of $a_m$ (this is because we can arbitrarily choose $a_m$ either to be dominated or not to be dominated by $q_l$). But, as shown in [6], merely learning about all the ordered pairs $(q_l, a_m)$ takes $\Omega(n^{1/6} m^{1/3})$ time, which completes the proof. □
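As a closing sanity check, the GDC definition and the Section 3 reductions can be exercised with a short sequential brute-force sketch (Python, illustrative; it models the problem definitions, not the mesh algorithms):

```python
# Brute-force GDC (O(n^2)), parameterized by the two order relations and
# the abelian operator, used to check the Section 3 reductions.

def gdc(A, op, identity, lt_x, lt_y):
    """D(a_i) = fold of f_j over all j with x_j < x_i and y_j < y_i,
    where "<" on each axis is the supplied order relation."""
    out = []
    for (xi, yi, _) in A:
        acc = identity
        for (xj, yj, fj) in A:
            if lt_x(xj, xi) and lt_y(yj, yi):
                acc = op(acc, fj)
        out.append(acc)
    return out

S = [(1, 2), (2, 1), (3, 3), (0, 0)]
plus = lambda u, v: u + v

# ECDF searching: f_k = 1, the usual "<" on both coordinates, op = +.
ecdf = gdc([(x, y, 1) for (x, y) in S], plus, 0,
           lambda p, q: p < q, lambda p, q: p < q)

# Maximal vectors: the same, but with ">" as the order relation; a point
# is maximal iff no other point dominates it, i.e. its result is 0.
dominated = gdc([(x, y, 1) for (x, y) in S], plus, 0,
                lambda p, q: p > q, lambda p, q: p > q)
maximal = [s for s, d in zip(S, dominated) if d == 0]
print(ecdf, maximal)
```

Here $(3,3)$ dominates the other three points (so its ECDF count is 3) and is the only maximal vector of the sample set.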
5 Conclusions

In this paper we have introduced the generalized dominance computation (GDC) problem, and we have given a time-optimal algorithm that solves any instance of the GDC problem, involving a set $A$ of size $n$, in $O(\sqrt{n})$ time on a mesh connected computer of size
$\sqrt{n} \times \sqrt{n}$. Next, we have demonstrated the power of the GDC paradigm by deriving from it several well-known computational geometry problems, such as ECDF searching and maximal vectors. Although all of these problems can also be solved using the generalized prefix computation (GPC) technique [1, 14], our solutions for this type of problem are simpler. Due to their large communication diameter, mesh connected computers tend to be slow when data transfer operations over large distances must be handled. In an attempt to solve this problem, mesh connected computers have recently been enhanced by the addition of row and column buses. Accordingly, as a natural extension of the GDC problem, we have introduced the multiple-query generalized dominance computation (MQGDC) problem. Using the multiple query (MQ) paradigm [4, 6], we have devised a time-optimal algorithm that solves any instance of the MQGDC problem, involving a set $A$ of $n$ items and a set $Q$ of $m$ ($1 \le m \le n$) queries, in $O(n^{1/6} m^{1/3})$ time on a mesh with multiple broadcasting of size $\sqrt{n} \times \sqrt{n}$.

6 Acknowledgements

I am grateful to Prof. Stephan Olariu of Old Dominion University, who helped and constantly encouraged me during my work. Special thanks to Prof. Larry Wilson for many insights and discussions that helped improve the paper. I thank Vasu Bokka for his patience in explaining the MQ paradigm and for many stimulating discussions.

References

[1] S. G. Akl and K. A. Lyons, Parallel Computational Geometry, Prentice Hall, 1993.

[2] A. Bar-Noy and D. Peleg, "Square meshes are not always optimal," IEEE Transactions on Computers, vol. C-40, 1991, pp. 196-204.

[3] D. Bhagavathi, V. Bokka, H. Gurla, R. Lin, S. Olariu, J. L. Schwing, W. Shen, and L. Wilson, "Time-optimal rank computations on meshes with multiple broadcasting," Proc. International Conference on Parallel Processing, St. Charles, Illinois, August 1994, vol. III, pp. 35-38.
[4] V. Bokka, H. Gurla, S. Olariu, J. L. Schwing, and L. Wilson, "A framework for solving geometric problems on enhanced meshes," Proc. International Conference on Parallel Processing, Oconomowoc, Wisconsin, August 1995, vol. III, pp. 172-175.

[5] S. H. Bokhari, "Finding maximum on an array processor with a global bus," IEEE Transactions on Computers, vol. C-33, 1984, pp. 133-139.

[6] V. Bokka, "A Computational Paradigm on Network-based Multiprocessor Systems," Doctoral Dissertation, in preparation, Old Dominion University, 1995.

[7] I. Stoica, "A time-optimal multiple-query nearest-neighbor algorithm on meshes with multiple broadcast," to appear in International Journal of Pattern Recognition and Artificial Intelligence, vol. 9, no. 4, 1995.

[8] V. P. Kumar and C. S. Raghavendra, "Array processor with multiple broadcasting," Journal of Parallel and Distributed Computing, vol. 2, 1987, pp. 173-190.

[9] V. P. Kumar and D. I. Reisis, "Image computation on meshes with multiple broadcasting," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 11, 1989, pp. 1194-1201.

[10] H. Li and M. Maresca, "Polymorphic-torus network," IEEE Transactions on Computers, vol. C-38, no. 9, 1989, pp. 1345-1351.

[11] R. Miller and Q. F. Stout, "Mesh computer algorithms for computational geometry," IEEE Transactions on Computers, vol. 38, 1989, pp. 321-340.

[12] D. Nassimi and S. Sahni, "Finding connected components and connected ones on a mesh-connected parallel computer," SIAM Journal on Computing, vol. 9, 1980, pp. 744-757.

[13] D. Parkinson, D. J. Hunt, and K. S. MacQueen, "The AMT DAP 500," Proc. 33rd IEEE Computer Society International Conference, 1988, pp. 196-199.

[14] F. Springsteel and I. Stojmenovic, "Parallel general prefix computation with geometric, algebraic and other applications," International Journal of Parallel Programming, vol. 18, no. 6, December 1989, pp. 485-503.
[15] C. D. Thompson and H. T. Kung, "Sorting on a mesh-connected parallel computer," Communications of the ACM, vol. 20, no. 4, April 1977, pp. 263-271.