

Time-Optimal Algorithms for Generalized Dominance Computation and Related Problems on Mesh Connected Computers and Meshes with Multiple Broadcasting

Ion Stoica*

Abstract

The generalized dominance computation (GDC) problem is stated as follows. Let $A = \{a_1, a_2, \ldots, a_n\}$ be a set of triplets $a_i = (x_i, y_i, f_i)$, let $<_x$ be a linear order relation defined on the x-components, $<_y$ a linear order relation defined on the y-components, and $\oplus$ an abelian operator defined on the f-components. It is required to compute, for every $a_i \in A$, the expression $D(a_i) = f_{j_1} \oplus f_{j_2} \oplus \cdots \oplus f_{j_k}$, where $\{j_1, j_2, \ldots, j_k\}$ is the set of all indices $j$ such that $a_j \in A$, $x_j <_x x_i$ and $y_j <_y y_i$.

First, this paper presents a time-optimal algorithm that solves the GDC problem in $O(\sqrt{n})$ time on a mesh connected computer of size $\sqrt{n} \times \sqrt{n}$. To demonstrate the generality of our approach, we show how a number of computational geometry problems, such as ECDF (empirical cumulative distribution function) searching and two-set dominance counting, can be derived from the GDC problem. Second, we define a natural extension of GDC, called multiple-query generalized dominance computation (MQGDC), on meshes with multiple broadcasting. Using the multiple querying (MQ) paradigm of Bokka et al. [3, 4, 6], we devise a time-optimal algorithm that solves an MQGDC problem involving a set $A$ of $n$ items and a set $Q$ of $m$ queries in $O(n^{1/6} m^{1/3})$ time on a mesh with multiple broadcasting of size $\sqrt{n} \times \sqrt{n}$.

Keywords: mesh-connected computers, broadcasting, multiple buses, computational geometry, parallel algorithms, generalized dominance computation, multiple query, generalized multiple search, generalized prefix computation.

* Department of Computer Science, Old Dominion University, Norfolk, VA 23529-0162 (stoica@cs.odu.edu).

1 Introduction

A mesh-connected computer (or simply a mesh) of size $n_1 \times n_2$ consists of $n_1 n_2$ identical processors arranged on an $n_1 \times n_2$ grid, where each processor is connected to its (at most four) neighbors by bidirectional links. Each processor has a fixed number of registers, each of size $O(\log n_1 n_2)$, and can perform standard arithmetic and boolean operations in unit time. Each processor can also send the contents of a register to one of its neighbors, and receive data from a neighbor in a special register, in unit time. A mesh is assumed to operate in SIMD mode: all processors are synchronized and operate under the control of a single instruction stream issued by a control unit.

Due to their simple interconnection topology, and to the fact that many problems map easily onto them, mesh-connected computers have become a popular choice for solving a large number of problems in image processing, computational geometry and pattern recognition [12, 11]. Unfortunately, meshes suffer from major limitations when data need to be transferred over long distances. A natural solution to this problem was to add row and column buses to the existing meshes [5, 8, 9, 10]. These meshes, known as meshes with multiple broadcasting (MMB for short), have already been implemented and are currently available [13]. At any time only one processor can broadcast its data on a given bus. On the other hand, all processors connected to a bus can concurrently read the data broadcast on that bus. Throughout this paper, communication along column and row buses is assumed to take unit time, independent of the length of the bus [2, 5, 8, 13].

The GDC problem is a generalization of the well-known empirical cumulative distribution function (ECDF) problem introduced by Springsteel and Stojmenovic in [14].
The ECDF problem can be formulated as follows: given a set $S = \{s_1, s_2, \ldots, s_n\}$ of $n$ points in the plane, count, for every point $s_i \in S$, the number of points in $S$ that are dominated by $s_i$. (We say that a point $s_i$ dominates a point $s_j$ if and only if the x-coordinate of $s_i$ is larger than the x-coordinate of $s_j$ and the y-coordinate of $s_i$ is larger than the y-coordinate of $s_j$.) The GDC problem generalizes both the linear order relations on the x- and y-coordinates and the counting operation. Formally, the GDC problem is stated as follows. Let $A = \{a_1, a_2, \ldots, a_n\}$ be a set of triplets $a_i = (x_i, y_i, f_i)$, let $<_x$ be a linear order relation defined on the x-components, let $<_y$ be a linear order relation defined on the y-components, and let $\oplus$ be an abelian operator defined on the f-components. The problem is to compute, for every $a_i \in A$, the expression $D(a_i) = f_{j_1} \oplus f_{j_2} \oplus \cdots \oplus f_{j_k}$, where $\{j_1, j_2, \ldots, j_k\}$ is the set of all indices $j$ such that $a_j \in A$, $x_j <_x x_i$ and $y_j <_y y_i$.

In this paper, we present a time-optimal algorithm that solves the GDC problem, involving a set $A$ of $n$ items, in $O(\sqrt{n})$ time on a mesh connected computer of size $\sqrt{n} \times \sqrt{n}$. To demonstrate the generality of our approach, we show how a number of computational geometry problems, including two-set dominance counting and maximal vectors, can be derived from the GDC problem. Next, to take advantage of the architecture of meshes with multiple broadcasting, we define a natural extension of GDC, called multiple-query generalized dominance computation (MQGDC). Using the multiple querying (MQ) paradigm [6]¹, we devise a time-optimal algorithm that solves an MQGDC problem involving a set $A$ of $n$ items and a set $Q$ of $m$ queries in $O(n^{1/6} m^{1/3})$ time on a mesh with multiple broadcasting of size $\sqrt{n} \times \sqrt{n}$.

¹ In an early draft of [6] the MQ paradigm was known as the generalized multiple search (GMS) paradigm, and this is the name under which it was referred to in some previous papers (e.g., [7]).

The remainder of this paper is organized as follows: Section 2 presents a time-optimal algorithm for the GDC problem on mesh connected computers; Section 3 presents some applications of the GDC paradigm to several problems in computational geometry; Section 4 defines the MQGDC problem and presents a time-optimal algorithm on meshes with multiple broadcasting; Section 5 summarizes our findings and indicates some possible directions for future work.

2 A Time-Optimal GDC Algorithm on Mesh Connected Computers

We define the generalized dominance computation (GDC) problem as follows. Let $A = \{a_1, a_2, \ldots, a_n\}$ be a set of $n$ items, where every item in $A$ is a triplet $a_k = (x_k, y_k, f_k)$ ($1 \le k \le n$). Further, consider:

- an abelian operator $\oplus$ defined on the set of f-components, $f_1, f_2, \ldots, f_n$;
- a linear order relation $<_x$ defined on the set of x-components, $x_1, x_2, \ldots, x_n$;
- a linear order relation $<_y$ defined on the set of y-components, $y_1, y_2, \ldots, y_n$.

The problem is to compute, for every item $a_i \in A$, the expression $D(a_i) = f_{j_1} \oplus f_{j_2} \oplus \cdots \oplus f_{j_k}$, where $\{j_1, j_2, \ldots, j_k\}$ is the set of all indices $j$ such that $a_j \in A$, $x_j <_x x_i$ and $y_j <_y y_i$.

In this section, we present a time-optimal algorithm that solves the GDC problem involving a set of $n$ items, $A = \{a_i \mid 1 \le i \le n\}$, on a mesh connected computer of size $\sqrt{n} \times \sqrt{n}$. It is convenient to interpret every triplet $a_k \in A$ as a point in the plane, where $x_k$ and $y_k$ are the point coordinates and $f_k$ is some value associated with it (e.g., if the point represents a pixel of an image, then its value can be the pixel intensity). We say that an item $a_i$ dominates an item $a_j$, and we write $a_j \prec a_i$, if and only if $x_j <_x x_i$ and $y_j <_y y_i$. For any subset $A'$ of $A$, let $D(a_m; A')$ be the expression $f_{j_1} \oplus f_{j_2} \oplus \cdots \oplus f_{j_k}$, where $\{j_1, j_2, \ldots, j_k\}$ is the set of all indices $j$ such that $a_j \in A'$, $x_j <_x x_m$ and $y_j <_y y_m$ (in what follows we use the notations $D(a_k; A)$ and $D(a_k)$ interchangeably). For example, in Figure 1, $D(a_7) = f_3 \oplus f_4 \oplus f_5 \oplus f_8 \oplus f_{14}$ and $D(a_7; A') = f_3 \oplus f_{14}$.

Figure 1: An instance of the GDC problem.

Lemma 1 Given two disjoint subsets $A'$, $A''$ of $A$, we have $D(a_k; A' \cup A'') = D(a_k; A') \oplus D(a_k; A'')$ for every $a_k \in A$.
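As a point of reference, the definition of $D(a_i)$ and the statement of Lemma 1 can be checked with a direct quadratic-time computation. The following is a minimal sequential sketch in Python, not the mesh algorithm of this paper. The function name `gdc`, its `subset` parameter and the sample data are illustrative assumptions, and the operator is assumed to come with an identity element that serves as the value of an empty combination.

```python
from functools import reduce
import operator


def gdc(items, op, identity, lt_x=operator.lt, lt_y=operator.lt, subset=None):
    """Return [D(a_i; subset)] for every triplet a_i = (x, y, f) in `items`.

    `subset` defaults to `items` itself, which gives D(a_i) = D(a_i; A).
    `gdc` is a hypothetical helper name, not something defined in the paper.
    """
    pool = items if subset is None else subset
    result = []
    for (xi, yi, _) in items:
        # f-components of all items in the pool that a_i dominates
        dominated = [f for (xj, yj, f) in pool if lt_x(xj, xi) and lt_y(yj, yi)]
        result.append(reduce(op, dominated, identity))
    return result


# Sample instance (made-up data): four points with weights, combined with +.
A = [(1, 1, 10), (2, 3, 20), (3, 2, 30), (4, 4, 40)]
total = gdc(A, operator.add, 0)

# Lemma 1: splitting A into two disjoint subsets and combining the partial
# results gives the same answer as working on A directly.
left, right = A[:2], A[2:]
combined = [
    p + q
    for p, q in zip(
        gdc(A, operator.add, 0, subset=left),
        gdc(A, operator.add, 0, subset=right),
    )
]
assert combined == total
```

Passing `operator.gt` for both `lt_x` and `lt_y` reverses the two orders, which is one way to experiment with other linear order relations.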

Proof. First, for every value $f_i$ in the expression $D(a_k; A' \cup A'')$ we have, from the definition, $a_i \in A' \cup A''$, $x_i <_x x_k$ and $y_i <_y y_k$. Since $A'$ and $A''$ are disjoint, $f_i$ is included in exactly one of $D(a_k; A')$ and $D(a_k; A'')$, and therefore occurs precisely once in $D(a_k; A') \oplus D(a_k; A'')$. Conversely, for every value $f_i$ in the expression $D(a_k; A') \oplus D(a_k; A'')$ we have $x_i <_x x_k$, $y_i <_y y_k$ and either $a_i \in A'$ or $a_i \in A''$ (i.e., $a_i \in A' \cup A''$), and therefore $f_i$ is also in $D(a_k; A' \cup A'')$. Thus, both $D(a_k; A' \cup A'')$ and $D(a_k; A') \oplus D(a_k; A'')$ contain the same values, and every value occurs exactly once in each expression. Since the operator $\oplus$ is both associative and commutative, this ensures that $D(a_k; A' \cup A'') = D(a_k; A') \oplus D(a_k; A'')$.

Figure 2: An example of partitioning a set of 16 items in the xy-plane by 5 vertical and 5 horizontal lines. $X_{33} = A_{31} \cup A_{32}$, $Y_3 = A_{13} \cup A_{23} \cup A_{33} \cup A_{43}$ and $S_{33} = A_{11} \cup A_{12} \cup A_{21} \cup A_{22}$.

The idea of the algorithm is to partition the set $A$ into disjoint subsets on both the x- and y-components (coordinates). Intuitively, this can be viewed as a partition of the xy-plane by $m_h$ horizontal and $m_v$ vertical lines, such that all items in $A$ lie between the extreme lines (see Figure 2). More precisely, let $h_1 < h_2 < \cdots < h_{m_h}$ be a sorted sequence of y-coordinates and $v_1 < v_2 < \cdots < v_{m_v}$ be a sorted sequence of x-coordinates, such that all the points in $A$ lie in the region delimited by $h_1$, $h_{m_h}$, $v_1$, $v_{m_v}$, i.e., for every $a_k \in A$, $v_1 < x_k < v_{m_v}$ and $h_1 < y_k < h_{m_h}$. Next, let us denote by $A_{ij}$ the set of points that lie in the region delimited by $h_i$, $h_{i+1}$ and $v_j$, $v_{j+1}$, i.e., $A_{ij} = \{a_k \in A \mid v_j < x_k < v_{j+1} \text{ and } h_i < y_k < h_{i+1}\}$. It is clear that for any item $a_k \in A_{ij}$, all the points it dominates are contained in the sets $A_{lm}$, where $l \le i$ and $m \le j$ (see Figure 2). Further, for every set $A_{ij}$ we define the following related sets (see Figure 2):

$$S_{ij} = \bigcup_{1 \le l < i,\; 1 \le m < j} A_{lm}; \qquad X_i = \bigcup_{1 \le m \le m_v} A_{im}; \qquad Y_j = \bigcup_{1 \le l \le m_h} A_{lj};$$
$$X_{ij} = \bigcup_{1 \le m < j} A_{im}; \qquad Y_{ij} = \bigcup_{1 \le l < i} A_{lj} \tag{1}$$

It is easy to see (Figure 2) that $S_{ij}$, $Y_j$ and $X_{ij}$ are disjoint and that their union contains all sets $A_{lm}$ that can contain items of $A$ dominated by an item in $A_{ij}$. Using Lemma 1, the solution for every item $a_k \in A_{ij}$ can be written as:

$$D(a_k) = D(a_k; Y_j) \oplus D(a_k; X_{ij}) \oplus D(a_k; S_{ij}) \tag{2}$$

The above equation is in fact the core of our algorithm, which divides naturally into three stages:

1. for every point $a_k \in A$, with $a_k \in A_{ij}$, compute the partial solution $D(a_k) = D(a_k; Y_j)$;
2. for every point $a_k \in A$, with $a_k \in A_{ij}$, compute $D(a_k; X_{ij})$ and update the partial solution, $D(a_k) = D(a_k) \oplus D(a_k; X_{ij})$;
3. for every point $a_k \in A$, with $a_k \in A_{ij}$, compute $D(a_k; S_{ij})$ and the final solution, $D(a_k) = D(a_k) \oplus D(a_k; S_{ij})$.

Notice that all items in $S_{ij}$ are dominated by every item in $A_{ij}$. Therefore, it is enough to perform stage 3 only once for all items in $A_{ij}$.

The remainder of this section shows how each stage of the algorithm is implemented on a mesh connected computer of size $\sqrt{n} \times \sqrt{n}$. Every processor of the mesh stores one item $a_i$ from $A$. For computation purposes, every item $a_k \in A$ carries, besides its three components $x_k$, $y_k$ and $f_k$, two further components, $\mathrm{col}_k$ and $\mathrm{row}_k$. These components represent the indices of the set $A_{ij}$ to which $a_k$ belongs, i.e., $a_k \in A_{\mathrm{row}_k, \mathrm{col}_k}$ ($\mathrm{row}_k = i$, $\mathrm{col}_k = j$).

Stage 1. First, we sort all items $a_i$ by their x-component in column-major order (i.e., $P(1,1)$ contains the $a_i$ with the smallest $x_i$, $P(2,1)$ contains the $a_i$ with the second smallest $x_i$, etc.) and initialize the solutions $D(a_i)$ ($1 \le i \le n$) to the identity element of $\oplus$ (see Figure 3.a). Now, consider the natural partition of the items into the sets $Y_j$ according to columns: if $a_k$ is stored in $P(i,j)$, then $a_k \in Y_j$. Therefore, if $a_k$ is stored in $P(i,j)$, $\mathrm{col}_k$ is initialized to $j$. Next, we propagate a copy of every item along the mesh columns, such that every processor receives a copy of each item stored by a processor in the same column. Let $a_k$ be the item stored on processor $P(i,j)$ and $a_l$ be a copy of an item it receives during the above operation. If $a_l \prec a_k$, then $P(i,j)$ updates $D(a_k)$: $D(a_k) = D(a_k) \oplus f_l$. It is easy to see that after $P(i,j)$ has been visited by all items stored by processors in the same column, $D(a_k) = D(a_k; Y_j)$ and the first stage is completed. Since sorting can be performed in $O(\sqrt{n})$ time on a mesh connected computer of size $\sqrt{n} \times \sqrt{n}$ [15], and the propagation of an item to all the other processors in the same column also takes $O(\sqrt{n})$ time, stage 1 takes $O(\sqrt{n})$ time.

Figure 3: The items in Figure 1 sorted in column-major order by their x-component (a) and in row-major order by their y-component (b). $P(1,1)$ is the top-leftmost processor and $P(4,4)$ is the bottom-rightmost processor of the mesh.

Stage 2. This stage is very similar to the previous one. We sort all points $a_i$ by their y-component in row-major order (i.e., $P(1,1)$ contains the $a_i$ with the smallest $y_i$, $P(1,2)$ contains the $a_i$ with the second smallest $y_i$, etc.) and we consider the natural partition of the items into the sets $X_i$ according to rows: if $a_k$ is stored in $P(i,j)$, then $a_k \in X_i$ (Figure 3.b). Therefore, if $a_k$ is stored in $P(i,j)$, $\mathrm{row}_k$ is initialized to $i$. Notice that at this point the sets $A_{ij}$, $1 \le i, j \le \sqrt{n}$, are well defined: $A_{ij} = \{a_k \in A \mid \mathrm{row}_k = i \wedge \mathrm{col}_k = j\}$. Next, analogously to the previous stage, we propagate a copy of every item along the mesh rows to every other processor in that row. Now, let $a_k$ be the item stored on processor $P(i,j)$ and $a_l$ be a copy of an item it receives. If $a_l \prec a_k$ and $\mathrm{col}_l < \mathrm{col}_k$, then $P(i,j)$ updates $D(a_k)$: $D(a_k) = D(a_k) \oplus f_l$. Notice that in this case an additional test, $\mathrm{col}_l < \mathrm{col}_k$, is performed. This ensures that only the items in $X_{ij}$ are considered, and therefore at the end of this stage we have $D(a_k) = D(a_k; Y_j) \oplus D(a_k; X_{ij})$. Like the previous stage, stage 2 requires $O(\sqrt{n})$ time.

Stage 3. This stage computes the last term of $D(a_k)$ in equation (2), i.e., $D(a_k; S_{ij})$. For this, every processor maintains two local variables, $b_{ij}$ and $s_{ij}$, initialized to the identity element of $\oplus$. The stage consists of two phases. First, a copy of every item is propagated along the mesh rows to each processor in that row. When processor $P(i,j)$ receives a copy of the item $a_k$, it checks whether $\mathrm{col}_k < j$. If so, it updates $b_{ij}$: $b_{ij} = b_{ij} \oplus f_k$. Since all the items in the same row have equal row-components, at the end of this phase $b_{ij} = f_{l_1} \oplus f_{l_2} \oplus \cdots \oplus f_{l_m}$, where $X_{ij} = \{a_{l_1}, a_{l_2}, \ldots, a_{l_m}\}$. Notice that $b_{ij}$ could be computed concurrently with $D(a_k; X_{ij})$ in the previous stage. The only reason we have not merged these computations is to keep the presentation of the algorithm clear. In the second phase, all values $b_{ij}$ are propagated along their corresponding column $j$. In this way, every processor $P(i,j)$ receives all the values $b_{lj}$ with $l < i$.
Upon receiving $b_{lj}$, processor $P(i,j)$ updates its variable $s_{ij}$: $s_{ij} = s_{ij} \oplus b_{lj}$. Therefore, at the end of this phase, $s_{ij} = b_{1j} \oplus b_{2j} \oplus \cdots \oplus b_{i-1,j}$. Since $S_{ij} = X_{1j} \cup X_{2j} \cup \cdots \cup X_{i-1,j}$, Lemma 1 gives $D(a_k; S_{ij}) = s_{ij}$. Next, every value $s_{ij}$ is propagated to every processor in the same row. Upon receiving $s_{ij}$, every processor $P(i,l)$ that stores an item $a_k$ checks whether $\mathrm{col}_k = j$. If this is the case, the final value of $D(a_k)$ is computed: $D(a_k) = D(a_k) \oplus s_{ij} = D(a_k; Y_j) \oplus D(a_k; X_{ij}) \oplus D(a_k; S_{ij})$. Since this stage requires only propagations along rows or columns, it takes $O(\sqrt{n})$ time, and therefore all three stages can be performed in $O(\sqrt{n})$ time.

Proving that the algorithm is time-optimal is straightforward. Consider an initial distribution of the items on the mesh such that the item $a_k$ stored on processor $P(1,1)$ is dominated by the item $a_l$ stored on processor $P(\sqrt{n}, \sqrt{n})$. Computing $D(a_l)$ requires $f_k$, and since the distance between $P(1,1)$ and $P(\sqrt{n}, \sqrt{n})$ is $\Theta(\sqrt{n})$, this cannot be done in less than $\Omega(\sqrt{n})$ time. Thus, we have the following result.

Theorem 1 The GDC problem involving a set $A$ of $n$ items can be solved in $O(\sqrt{n})$ time on a mesh connected computer of size $\sqrt{n} \times \sqrt{n}$. Moreover, this time is optimal.

3 Some GDC Applications on Mesh Connected Computers

To demonstrate the power of GDC, we now give some examples of computational geometry problems for sets of points in the plane that can be reduced to the GDC problem. Let $S = \{s_1, s_2, \ldots, s_n\}$ be a set of points in the plane, and let $x(s_i)$ and $y(s_i)$ be the x- and y-coordinates of $s_i$, respectively. We say that a point $s_i$ dominates a point $s_j$ if and only if $x(s_j) < x(s_i)$ and $y(s_j) < y(s_i)$.

1. ECDF (empirical cumulative distribution function) searching problem. Determine for every point $s_i \in S$ the total number of points in $S$ dominated by $s_i$. The corresponding instance of the GDC problem has the following parameters: $x_k = x(s_k)$, $y_k = y(s_k)$; $f_k = 1$; $<_x$ and $<_y$ are both the usual order $<$ defined on $\mathbb{R}$; $\oplus = +$. The result for $s_i$ is $D(a_i)$.

2. Two-set dominance counting problem. Given two disjoint sets of points in the plane, $S_1$ and $S_2$, determine for every point $s_i \in S_2$ the number of points in $S_1$ dominated by $s_i$. Writing $S = S_1 \cup S_2$, the corresponding instance of the GDC problem has the following parameters: $f_k = 1$ if $s_k \in S_1$ and $f_k = 0$ if $s_k \in S_2$; $x_k = x(s_k)$, $y_k = y(s_k)$ for every $s_k \in S$; $<_x$ and $<_y$ are both the usual order $<$ defined on $\mathbb{R}$; $\oplus = +$. The result for $s_i \in S_2$ is $D(a_i)$.

3. Maximal vectors. Determine all the points in $S$ that are not dominated by any other point in $S$. The corresponding instance of the GDC problem has the following parameters: $x_k = x(s_k)$, $y_k = y(s_k)$; $f_k = 1$; $<_x$ and $<_y$ are both the order $>$ defined on $\mathbb{R}$; $\oplus = +$. Here $s_i$ is maximal if and only if $D(a_i) = 0$. Notice that, in this case, for every $s_i \in S$ the corresponding $D(a_i)$ is the number of points $s_j \in S$ such that $x(s_j) > x(s_i)$ and $y(s_j) > y(s_i)$, i.e., the number of points of $S$ that dominate $s_i$. Therefore, $D(a_i) = 0$ if and only if $s_i$ is not dominated by any other point in $S$.

Although these problems can also be solved using the generalized prefix computation (GPC) technique, as shown in [1] and [14], we think that our approach is more direct and elegant for the above examples.

4 A Time-Optimal MQGDC Algorithm on Meshes with Multiple Broadcasting

Although meshes with multiple broadcasting handle data transfers over large distances much faster than mesh connected computers, they cannot significantly speed up algorithms for dense problems such as GDC. To see why, consider an instance of the GDC problem with an input $A$ of size $n$, partitioned into two equal-sized sets $A'$ and $A''$, such that every item $a_i \in A'$ does not dominate any other item $a_j \in A$, and every item $a_i \in A''$ is a maximal element of $A$ (i.e., it is not dominated by any other element of $A$) and dominates exactly one item in $A'$ (see Figure 4). Clearly, the solution for every $a_i \in A''$ is $D(a_i) = f_j$, where $a_j$ is the item in $A'$ dominated by $a_i$. Now, suppose that the items in $A'$ are stored one per processor in the first $\sqrt{n}/2$ columns, and the items in $A''$ are stored one per processor in the last $\sqrt{n}/2$ columns, of a mesh with multiple broadcasting of size $\sqrt{n} \times \sqrt{n}$ (see Figure 4).

Figure 4: An instance of the GDC problem used to prove the lower bound on a mesh with multiple broadcasting: a) every item of set $A''$ dominates exactly one item of set $A'$; b) all items of set $A'$ are stored in the first $\sqrt{n}/2$ columns, and all items of set $A''$ are stored in the last $\sqrt{n}/2$ columns, of a mesh with multiple broadcasting of size $\sqrt{n} \times \sqrt{n}$.

To compute the final solution for every $a_i \in A''$, either $a_i$ or the corresponding dominated point $a_j$ must cross the line $P$ that separates the first $\sqrt{n}/2$ columns from the last $\sqrt{n}/2$. Since there are $n/2$ such pairs and only $O(\sqrt{n})$ items can cross the line $P$ at any one time, any algorithm that correctly solves the GDC problem takes at least $\Omega(\sqrt{n})$ time. Thus, we have the following result:

Lemma 2 Any algorithm that correctly solves the GDC problem, involving a set $A$ of $n$ items, on a mesh with multiple broadcasting of size $\sqrt{n} \times \sqrt{n}$ takes at least $\Omega(\sqrt{n})$ time.

However, in many practical applications we are not interested in computing the solutions for all the items in the set $A$, but only for a subset of items. Therefore, as a natural extension of the GDC problem, we define the multiple-query generalized dominance computation (MQGDC)² as follows. Let $A = \{a_1, a_2, \ldots, a_n\}$ be a set of items and $Q = \{q_1, q_2, \ldots, q_m\}$ ($1 \le m \le n$) a set of queries, where every $a_i$ is a triplet $(x_i, y_i, f_i)$ and every $q_i$ is a pair $(x_i, y_i)$. Further, consider:

- an abelian operator $\oplus$ defined on the set of f-components of the items in $A$;
- a linear order relation $<_x$ defined on the set of x-components of the items in both $A$ and $Q$;
- a linear order relation $<_y$ defined on the set of y-components of the items in both $A$ and $Q$.

² As we will show, MQGDC also helps us take advantage of the architecture of meshes with multiple broadcasting.

The problem is to compute, for every query $q_i \in Q$, the expression $D(q_i) = f_{j_1} \oplus f_{j_2} \oplus \cdots \oplus f_{j_k}$, where $\{j_1, j_2, \ldots, j_k\}$ is the set of all indices $j$ of items in $A$ for which $x_j <_x x_i$ and $y_j <_y y_i$, with $(x_i, y_i)$ the components of $q_i$. Notice that this problem can be viewed as a generalization of the two-set dominance problem in the same sense in which GDC can be viewed as a generalization of the ECDF problem. We can therefore again interpret every triplet $a_k \in A$ as a point in the plane, where $x_k$ and $y_k$ are the point coordinates and $f_k$ is some value associated with it. In the same manner, every pair $q_k \in Q$ is interpreted as a point with coordinates $x_k$ and $y_k$. The solution of the MQGDC problem then determines, for every query $q_k \in Q$, the $\oplus$-sum over the values $f_i$ of all points $a_i \in A$ dominated by $q_k$. In particular, taking $\oplus$ to be the summation operator ($+$) and $f_k = 1$ for every $a_k \in A$ yields the two-set dominance problem.

4.1 Multiple Querying on Meshes with Multiple Broadcasting

In solving the MQGDC problem on a mesh with multiple broadcasting, we use a powerful paradigm recently developed by Bokka et al. [3, 4, 6] to solve the multiple querying (MQ) problem on MMBs. The MQ problem is stated as follows [6]: Consider collections $A = \{a_1, a_2, \ldots, a_n\}$ of items and $Q = \{q_1, q_2, \ldots, q_m\}$ ($1 \le m \le n$) of queries, and a decision problem $\phi : Q \times A \to \{\text{"yes"}, \text{"no"}\}$. For every $i$ ($1 \le i \le m$), let $S_i$ be the set of items $a_j \in A$ for which $\phi(q_i, a_j) = \text{"yes"}$, and let $f$ be an abelian semigroup-type function operating on $S_i$. The problem is to determine, for every $q_i$ ($1 \le i \le m$), the corresponding $f(S_i)$ (where $f(S_i)$ is called the solution of $q_i$).
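The MQ problem statement above can be read as a plain sequential computation: for each query, select the items on which the decision problem answers "yes" and combine them. The following Python sketch illustrates only the problem definition, not the mesh algorithm of [6]. The names `solve_mq`, `value`, `combine` and `dominates` are hypothetical, the sample data is made up, and an empty set $S_i$ is reported as `None` because a semigroup need not have an identity element.

```python
from functools import reduce
import operator


def solve_mq(items, queries, phi, value, combine):
    """For every query q, combine value(a) over S = {a in items : phi(q, a)}.

    `phi` plays the role of the decision problem (True stands for "yes"),
    and `combine` is assumed to be associative and commutative.
    """
    answers = []
    for q in queries:
        selected = [value(a) for a in items if phi(q, a)]
        # None marks an empty S_i: a semigroup has no guaranteed identity.
        answers.append(reduce(combine, selected) if selected else None)
    return answers


def dominates(q, a):
    """Illustrative decision problem: query q = (x, y) dominates item a = (x, y, f)."""
    return a[0] < q[0] and a[1] < q[1]


# Made-up instance: three weighted points and three query points.
items = [(1, 1, 5), (2, 5, 7), (4, 2, 11)]
queries = [(3, 3), (5, 6), (0, 0)]
answers = solve_mq(items, queries, dominates, lambda a: a[2], operator.add)
```

With `dominates` as the decision problem and addition as the combining function, this instance is exactly an MQGDC instance as defined at the start of this section.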

For completeness the theorem for lower bound of MQ is stated and the algorithm for solving MQ optimally is outlined [6]: Theorem 2 (BOK94) Any algorithm that correctly solves the MQ problem involving a set A of n items stored one per processor and a set Q of m (1 m n) queries stored m one per processor in the rst pn columns of a mesh with multiple broadcasting of size p p 1 n n must take at least (n 6 m 1 3 ) time. Let M be a mesh with multiple broadcasting of size p n p n. The items a i 2 A are stored one per-processor in M, while the queries q i 2 Q are stored one per processor in the rst m pn columns of M. For simplicity, let us denote to s = n 1 6 m 1 3. Next, consider a partition of the initial mesh M in square submeshes of size s s denoted M ij (1 i; j p n s ), where M ij contains the processors located in (i? 1)s + 1; : : : ; is rows and (j? 1)s + 1; : : : ; js columns in M (see gure 5). n m/ n s m/ n M 11 M M 12 13 s M 11 M12 M13 M 11 M M 12 13 n M M M 21 22 23 M M M 21 22 23 M M M 21 22 23 M M M 31 32 33 M31 M32 M33 M31 M32 M33 a) b) c) Figure 5: Essential data movement involved in the MQ generic algorithm: a) queries are stored one per processor in the rst m=n 1 2 columns of the original mesh M; b) queries are replicated on every submesh M ij of size s s; c) after each submesh M ij solves the local problem, the solutions are combined and the nal results are stored one per processor in the rst m=n 1 2 columns of the original mesh M. The algorithm to solve any instance of a MQ problem consists of three stages [6]: 1. Replicate all m queries in every submesh M ij. This can be done as shown in [3, 4, 6] in O(n 1 6 m 1 3 ) time. Notice that, at this point, the original problem is partitioned into several instances, each of them on a submesh M ij (see gure 5.a-b). 13

2. Compute in parallel, for every submesh $M_{ij}$, the solution to the local instance of the MQ problem.

3. Combine the solutions of the local instances of the MQ problem, obtained in stage 2, and compute the final solution to the MQ problem (see Figure 5.b-c). This can also be done in $O(n^{1/6} m^{1/3})$ time [3, 4, 6].

4.2 The algorithm

The MQGDC problem can be formulated as an instance of MQ with the following parameters:

- the set $A = \{a_1, a_2, \ldots, a_n\}$ of items, where every $a_i$ is a triplet $(x_i, y_i, f_i)$;
- the set $Q = \{q_1, q_2, \ldots, q_m\}$ of queries, where every $q_i$ is a pair $(x_i, y_i)$;
- the decision problem $\phi : Q \times A \to \{\text{"yes"}, \text{"no"}\}$ is such that $\phi(q_i, a_j) = \text{"yes"}$ if and only if $q_i$ dominates $a_j$, i.e. $a_j \prec q_i$;
- for every i ($1 \le i \le m$), let $S_i = \{a_{j_1}, a_{j_2}, \ldots, a_{j_k}\}$ be the set of items $a_j$ in A for which $\phi(q_i, a_j) = \text{"yes"}$. We take $f(S_i) = f_{j_1} \oplus f_{j_2} \oplus \cdots \oplus f_{j_k}$.

Our algorithm to solve the MQGDC problem is based on the generic MQ algorithm. Since stages 1 and 3 are basically the same for any instance of the MQ problem, the remainder of this section is devoted to the implementation of stage 2. After stage 1, every $M_{ij}$ contains a local instance of the original MQGDC problem involving the sets $A_{ij} = \{a_{k_1}, a_{k_2}, \ldots, a_{k_{s^2}}\}$ and $Q = \{q_1, q_2, \ldots, q_m\}$, where $A_{ij}$ is the subset of items in A stored on the submesh $M_{ij}$. Next, we show how the local instances of MQGDC can be solved in parallel by applying GDC. Let $A' = \{a'_1, a'_2, \ldots, a'_{s^2+m}\}$ be a set of triplets $a'_i = (x'_i, y'_i, f'_i)$ ($1 \le i \le s^2 + m$), such that $a'_i = a_{k_i}$ (i.e. $x'_i = x_{k_i}$, $y'_i = y_{k_i}$, $f'_i = f_{k_i}$) for every $1 \le i \le s^2$, and $a'_{s^2+i} = q_i$ (i.e. $x'_{s^2+i} = x_i$, $y'_{s^2+i} = y_i$, $f'_{s^2+i}$ = identity element of "$\oplus$") for every $1 \le i \le m$.
It is easy to see that, using the above mapping scheme, we have reduced the MQGDC problem to an instance of the GDC problem of size $m + s^2$, which according to Theorem 1 can be solved in $O(\sqrt{m + s^2})$ time on a mesh connected computer of size $\sqrt{m + s^2} \times \sqrt{m + s^2}$.
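The reduction can be illustrated sequentially. The sketch below is a minimal model under stated assumptions: `gdc_bruteforce` is a hypothetical quadratic-time stand-in for the mesh GDC algorithm of Theorem 1, and it assumes that GDC asks, for every point of a single set, for the combination of the f-components of all points it strictly dominates. The function `solve_local_mqgdc` builds the merged set of local items and queries, gives every query the identity element as its f-component, and reads the query answers off the last m positions.

```python
import operator

def gdc_bruteforce(points, combine, identity):
    """Stand-in for GDC (assumed definition): for every point p, combine
    the f-components of all points strictly dominated by p."""
    out = []
    for (x, y, _) in points:
        acc = identity
        for (xj, yj, fj) in points:
            if xj < x and yj < y:
                acc = combine(acc, fj)
        out.append(acc)
    return out

def solve_local_mqgdc(local_items, queries, combine, identity):
    """Reduce a local MQGDC instance to one GDC instance of size s^2 + m.

    Queries are appended as triplets whose f-component is the identity,
    so they never change the answer of any other point, even when one
    query dominates another."""
    merged = list(local_items) + [(x, y, identity) for (x, y) in queries]
    answers = gdc_bruteforce(merged, combine, identity)
    return answers[len(local_items):]  # positions s^2 + 1, ..., s^2 + m

local_items = [(1, 1, 5), (2, 3, 7), (4, 2, 1)]
queries = [(3, 4), (5, 5), (1, 1)]
print(solve_local_mqgdc(local_items, queries, operator.add, 0))
# prints [12, 13, 0]
```

The choice of the identity element for the query triplets is what makes the reduction safe: the second query above dominates the first, yet the first contributes nothing to its result.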

But, as proved in [7], any algorithm $\mathcal{A}$ with an input of size n that takes O(f(n)) time to run on a mesh connected computer of size $n_r \times n_c$ also takes O(f(n)) time to run on a mesh connected computer of size $\frac{n_r}{a} \times \frac{n_c}{b}$, where a and b are two constants. By taking $a = b = \frac{\sqrt{m + s^2}}{s}$ (since $m \le n$ gives $s^2 = n^{1/3} m^{2/3} \ge m$, we have $1 \le a, b < 2$), it is clear that the algorithm to solve the GDC problem of size $m + s^2$ takes O(s) time on a mesh connected computer of size $s \times s$. Therefore, we can solve the local instance of the MQGDC problem on every submesh $M_{ij}$ in O(s) time. Because our algorithm is designed for mesh connected computers, it does not use the row and column buses, and therefore every submesh $M_{ij}$ can compute its local solution in parallel. Finally, we have the following result:

Theorem 3 An instance of the MQGDC problem involving a set A of n items stored one per processor and a set Q of m ($1 \le m \le n$) queries stored one per processor in the first $m/\sqrt{n}$ columns of a mesh with multiple broadcasting of size $\sqrt{n} \times \sqrt{n}$ can be solved in $O(n^{1/6} m^{1/3})$ time. Moreover, this time is optimal.

Proof. The first part of the proof follows directly from the algorithm. The proof of optimality is similar to that of Theorem 2 (see [6] for details) and is based on the observation that the computation cannot terminate until some m processors learn about all the ordered pairs of the Cartesian product $Q \times A$. We prove this claim by contradiction. Assume the information about a particular ordered pair $(q_l, a_m)$ is not propagated to the m processors that compute the final results. Then we cannot compute the final solution $D(q_l)$, since there is no way to know whether $D(q_l)$ depends on the value of the f-component of $a_m$ (this is because we can arbitrarily choose $a_m$ to be either dominated or not dominated by $q_l$). But, as shown in [6], merely learning about all ordered pairs $(q_l, a_m)$ takes $\Omega(n^{1/6} m^{1/3})$ time, which completes the proof.
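The structure of the three-stage algorithm behind Theorem 3 can be mimicked by a sequential simulation. This is a sketch under assumptions, not a mesh implementation: it models neither processors nor buses and says nothing about running time. The names `submesh_side` and `simulate_mqgdc` are hypothetical, the items are assumed to be given as a square grid of triplets, and the side s is taken as the smallest integer with $s^6 \ge n m^2$, an integer rounding of $s = n^{1/6} m^{1/3}$ that the text does not specify.

```python
import operator

def submesh_side(n, m):
    """Smallest integer s with s**6 >= n * m**2, i.e. s = n^(1/6) m^(1/3)
    rounded up (the rounding rule is an assumption of this sketch)."""
    target = n * m * m
    s = 1
    while s ** 6 < target:
        s += 1
    return s

def simulate_mqgdc(grid, queries, combine, identity):
    """Sequentially mimic the three stages on a sqrt(n) x sqrt(n) grid of
    (x, y, f) triplets, one per simulated processor."""
    side = len(grid)
    s = min(side, submesh_side(side * side, len(queries)))
    results = [identity] * len(queries)
    for r0 in range(0, side, s):
        for c0 in range(0, side, s):
            # Stage 1: submesh M_ij sees its own items and ALL queries.
            block = [grid[r][c]
                     for r in range(r0, min(r0 + s, side))
                     for c in range(c0, min(c0 + s, side))]
            # Stage 2: solve the local instance (brute force here).
            local = []
            for (qx, qy) in queries:
                acc = identity
                for (x, y, f) in block:
                    if x < qx and y < qy:
                        acc = combine(acc, f)
                local.append(acc)
            # Stage 3: combine local solutions; valid because the operator
            # is associative and commutative and the blocks partition A.
            results = [combine(t, l) for t, l in zip(results, local)]
    return results

grid = [[(r, c, 4 * r + c) for c in range(4)] for r in range(4)]
print(simulate_mqgdc(grid, [(2, 2), (4, 4)], operator.add, 0))
# prints [10, 120]
```

With n = 16 and m = 2 the simulation uses s = 2, so the grid is split into four 2 x 2 submeshes, each of which answers both queries on its own items before the partial answers are merged.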
5 Conclusions

In this paper we have introduced the generalized dominance computation (GDC) problem and we have given a time-optimal algorithm that solves any instance of the GDC problem, involving a set A of size n, in $\Theta(\sqrt{n})$ time on a mesh connected computer of size $\sqrt{n} \times \sqrt{n}$. Next, we have demonstrated the power of the GDC paradigm by deriving from it solutions to several well-known computational geometry problems, such as ECDF searching and maximal vectors. Although all of these problems can be solved using the generalized prefix computation (GPC) technique [1, 14], our solutions for this type of problem are simpler.

Due to their large communication diameter, mesh connected computers tend to be slow when data transfer operations over large distances must be handled. In an attempt to solve this problem, mesh connected computers have recently been enhanced by the addition of row and column buses. Further, as a natural extension of the GDC problem, we have introduced the multiple-query generalized dominance computation (MQGDC) problem. By using the generalized multiple search (MQ) paradigm [4, 6], we have devised a time-optimal algorithm that solves any instance of the MQGDC problem, involving a set A of n items and a set Q of m ($1 \le m \le n$) queries, in $O(n^{1/6} m^{1/3})$ time on a mesh with multiple broadcasting of size $\sqrt{n} \times \sqrt{n}$.

6 Acknowledgements

I am grateful to Prof. Stephan Olariu of Old Dominion University, who helped and constantly encouraged me during my work. Special thanks to Prof. Larry Wilson for many insights and discussions that helped in improving the paper. I thank Vasu Bokka for his patience in explaining the MQ paradigm and for many stimulating discussions.

References

[1] S. G. Akl and K. A. Lyons, "Parallel Computational Geometry," Prentice Hall, 1993.
[2] A. Bar-Noy and D. Peleg, "Square meshes are not always optimal," IEEE Trans. on Computers, C-40, 1991, 196-204.
[3] D. Bhagavathi, V. Bokka, H. Gurla, R. Lin, S. Olariu, J. L. Schwing, W. Shen, and L. Wilson, "Time-Optimal Rank Computations on Meshes with Multiple Broadcasting," Proc. International Conference on Parallel Processing, St. Charles, Illinois, August 1994, III, 35-38.

[4] V. Bokka, H. Gurla, S. Olariu, J. L. Schwing, and L. Wilson, "A Framework for Solving Geometric Problems on Enhanced Meshes," Proc. International Conference on Parallel Processing, Oconomowoc, Wisconsin, August 1995, III, 172-175.
[5] S. H. Bokhari, "Finding maximum on an array processor with a global bus," IEEE Trans. Comput., vol. C-33, 1984, 133-139.
[6] V. Bokka, "A Computational Paradigm on Network-based Multiprocessor Systems," Doctoral Dissertation, in preparation, Old Dominion University, 1995.
[7] I. Stoica, "A Time-Optimal Multiple-Query Nearest-Neighbor Algorithm on Meshes with Multiple Broadcast," to appear in International Journal of Pattern Recognition and Artificial Intelligence, vol. 9, no. 4, 1995.
[8] V. P. Kumar and C. S. Raghavendra, "Array processor with multiple broadcasting," Journal of Parallel and Distributed Computing, vol. 2, 1987, pp. 173-190.
[9] V. P. Kumar and D. I. Reisis, "Image computation on meshes with multiple broadcasting," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 11, no. 11, 1989, pp. 1194-1201.
[10] H. Li and M. Maresca, "Polymorphic-torus network," IEEE Transactions on Computers, vol. C-38, no. 9, 1989, pp. 1345-1351.
[11] R. Miller and Q. F. Stout, "Mesh Computer Algorithms for Computational Geometry," IEEE Trans. on Computers, vol. 38, 1989, pp. 321-340.
[12] D. Nassimi and S. Sahni, "Finding connected components and connected ones on a mesh-connected parallel computer," SIAM Journal on Computing, vol. 9, 1980, pp. 744-757.
[13] D. Parkinson, D. J. Hunt, and K. S. MacQueen, "The AMT DAP 500," Proc. 33rd IEEE Comp. Soc. International Conf., 1988, pp. 196-199.
[14] F. Springsteel and I. Stojmenovic, "Parallel general prefix computation with geometric, algebraic and other applications," International Journal of Parallel Programming, vol. 18, no. 6, December 1989, pp. 485-503.

[15] C. D. Thompson and H. T. Kung, "Sorting on a Mesh-Connected Parallel Computer," Communications of the ACM, vol. 20, no. 4, April 1977, pp. 263-271.