Efficient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines

B. B. Zhou, R. P. Brent and A. Tridgell
Computer Sciences Laboratory
The Australian National University
Canberra, ACT 0200, Australia

Abstract

The problem of merging two sequences of elements which are stored separately in two processing elements (PEs) occurs in the implementation of many existing sorting algorithms. We describe efficient algorithms for the merging problem on asynchronous distributed-memory machines. The algorithms reduce the cost of the merge operation and of communication, as well as partly solving the problem of load balancing. Experimental results on a Fujitsu AP1000 are reported.

(Appeared in Proc. International Conference on Parallel and Distributed Systems (Hsinchu, Taiwan, Dec. 1994), IEEE Computer Society Press, California, 1994, 12-16. Copyright (c) 1994, IEEE.)

1 Introduction

This paper considers an important aspect of parallel sorting algorithms on asynchronous distributed-memory machines. The algorithms of interest to us first perform a local sort on each processing element (PE), then perform a sequence of merges to globally sort the data. Each merge involves pairs of PEs. The two PEs in a pair merge their two sorted sequences, and each PE keeps half of the merged sequence. It is known that this "merge-split" scheme can be applied to achieve high efficiency if a large sorting problem is to be solved on a machine with many relatively small processing elements [1, 3].

The most straightforward method for merging is first to transfer a sequence of elements from one PE to the second PE, then to merge the two sequences in the second PE, and finally to transfer the upper half (or lower half) back to the first PE. This method is inefficient, both because of its memory requirements, and because of the load imbalance: only one PE in each pair is active during the merging.

The efficiency can be improved by taking advantage of the following observations. To merge two sequences of elements, it is only necessary for each PE to transfer a portion of its sequence to the associated PE. In many sorting methods, e.g., odd-even transposition sort [3], Batcher's merge-exchange sort [2], and parallel Shell sort [6], the entire set of elements becomes more nearly sorted after each iteration, so the portion of the sequence to be transferred from a PE tends to decrease. If we have an algorithm which can efficiently find the exact number of elements to be transferred between any pair of PEs, we may reduce the cost of the merge operation and communication, as well as partly solving the problem of load balancing.

Our first algorithm (Algorithm 1 below) finds the exact number of elements to be transferred in log_2 N communication steps, where N is the length of the sequence stored in each PE's local memory. At each communication step one element is sent and received by each PE. The algorithm works well on distributed-memory machines such as the Thinking Machines CM5 and the Fujitsu AP1000 [7], because these machines have a small "startup" time for communication and a small message latency, so the time for running the procedure is small in comparison to the total communication time. On a machine with a high message latency, the algorithm would be costly. Our second algorithm (Algorithm 2 below) requires only log_m N communication steps (for m >= 2) to find the exact number of elements to transfer, if we allow m - 1 elements to be transferred at each step. By properly choosing m, the running time of the algorithm can be reduced.
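To make the baseline concrete, here is a minimal sequential sketch of the straightforward merge-split step described above (our illustration, not code from the paper; the function name and the use of Python are ours). It gathers both runs in one place, merges them, and splits the result in half, which is exactly the behaviour whose memory use and load imbalance the algorithms below avoid.

    # Illustrative sketch (not from the paper): the straightforward merge-split
    # step. One PE would do all of the merging work while the other idles.
    import heapq

    def naive_merge_split(a, b):
        """Merge two sorted lists of equal length N; return (lower half, upper half)."""
        merged = list(heapq.merge(a, b))   # all 2N elements gathered and merged in one PE
        n = len(a)
        return merged[:n], merged[n:]

    # Example: PE1 keeps the lower half, PE2 the upper half.
    lo, hi = naive_merge_split([1, 4, 7, 9], [2, 3, 8, 10])
    assert lo == [1, 2, 3, 4] and hi == [7, 8, 9, 10]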
In Section 2, algorithms for finding the exact number of elements to be transferred are derived. Experimental results on the Fujitsu AP1000 are given in Section 3. We use the odd-even transposition sorting method as an example to show the efficiency gained by using the algorithm.

The odd-even transposition sort requires only nearest-neighbour communication in a one-dimensional array, so it is applicable to most special-purpose machines with restricted communication topologies. Conclusions are drawn in Section 4.

2 Algorithms

To simplify our discussion, we assume in the following that the elements are distinct, are sorted in increasing order in each PE, and that each PE has the same number (N) of elements. A processing element is referred to as PE1 (or PE2) if it stores the first half (or the second half) of the total elements after a merge. The elements in each PE are enumerated from 0 to N - 1. The (k + 1)th element in PE1 (or PE2) is referred to as e^{(1)}_k (or e^{(2)}_k), for 0 <= k < N. In the following lemma we set e^{(1)}_{-1} and e^{(2)}_{-1} to -infinity, and e^{(1)}_N and e^{(2)}_N to +infinity, so that the inequalities still hold in the two extreme cases: when k = 0, that is, no element in PE1 is greater than any element in PE2; and when k = N, that is, no element in PE1 is smaller than any element in PE2.

Lemma 1 To merge two sorted sequences of N elements each stored in one PE, the exact number of elements to be transferred between the two PEs is k, for 0 <= k <= N, if and only if the following inequalities are satisfied:

    e^{(1)}_{N-k} > e^{(2)}_{k-1},    e^{(1)}_{N-k-1} < e^{(2)}_{k}.    (1)

Proof. Our aim is to merge the two sorted sequences and store the first half in PE1 and the second half in PE2. Suppose that k is chosen so that the two inequalities in (1) are satisfied. Since the original sequences are sorted, we have e^{(1)}_{N-k} > e^{(1)}_{N-k-1} and e^{(2)}_{k} > e^{(2)}_{k-1}. We transfer the last k elements from PE1 to PE2 and the first k elements from PE2 to PE1. It is easy to see that, after the transfer, the largest element in PE1 is max(e^{(1)}_{N-k-1}, e^{(2)}_{k-1}), and the smallest element in PE2 is min(e^{(1)}_{N-k}, e^{(2)}_{k}). Thus, no element in PE1 is greater than any element in PE2. On the other hand, if k is chosen so that either of the two inequalities in (1) is not satisfied, there must be at least one element in PE1 that is greater than the smallest element in PE2 after the transfer.

Corollary 1 Given an arbitrary index j, 0 <= j <= N, we have

    e^{(1)}_{N-j} < e^{(2)}_{j-1}   if k < j <= N;
    e^{(1)}_{N-j-1} > e^{(2)}_{j}   if 0 <= j < k.    (2)

[Figure 1: A graphical expression of Corollary 1. (a) k < j <= N. (b) 0 <= j < k.]

Proof. An illustration of the proof is given in Fig. 1. The two vertical lines represent the two sequences, while an arrow pointing from the left (or right) sequence to the right (or left) sequence indicates e^{(1)}_x > e^{(2)}_{N-x-1} (or e^{(1)}_x < e^{(2)}_{N-x-1}). Assume that there is an index j with k < j <= N. Since the original sequences are sorted, we have e^{(2)}_{j-1} >= e^{(2)}_{k} and e^{(1)}_{N-j} <= e^{(1)}_{N-k-1}. However, it is known from (1) that e^{(1)}_{N-k-1} < e^{(2)}_{k}. Thus the element e^{(1)}_{N-j} must be smaller than e^{(2)}_{j-1}. The proof of the second inequality is similar.

Corollary 2 The index k in (1) is unique.

[Figure 2: A graphical expression of Corollary 2. (a) j > k. (b) j < k.]

Proof. Suppose that there is another index j satisfying the inequalities (1), as shown in Fig. 2. If j > k, we have j - 1 >= k, and thus the element e^{(2)}_{j-1} is greater than e^{(1)}_{N-j}, by Corollary 1. Similarly, e^{(2)}_{j} is smaller than e^{(1)}_{N-j-1} if j < k. In either case the result is a contradiction. Thus j satisfies the inequalities (1) if and only if j = k.
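Lemma 1 can be checked mechanically. The following linear-scan reference sketch (our illustration; the function and constant names are ours) returns the unique k of Lemma 1, with sentinels standing in for e_{-1} and e_N:

    # Illustrative sketch (not from the paper): a direct, linear-time check of
    # Lemma 1. NEG_INF/POS_INF play the role of the sentinels e_{-1} and e_N.
    NEG_INF, POS_INF = float("-inf"), float("inf")

    def exact_transfer_count(pe1, pe2):
        """Return the unique k of Lemma 1 for two sorted, equal-length lists."""
        n = len(pe1)
        e1 = lambda i: NEG_INF if i < 0 else POS_INF if i >= n else pe1[i]
        e2 = lambda i: NEG_INF if i < 0 else POS_INF if i >= n else pe2[i]
        for k in range(n + 1):
            # Inequalities (1): e1[N-k] > e2[k-1] and e1[N-k-1] < e2[k].
            if e1(n - k) > e2(k - 1) and e1(n - k - 1) < e2(k):
                return k
        raise AssertionError("unreachable for sorted inputs with distinct elements")

    # Example: merging [1,4,7,9] and [2,3,8,10] requires exchanging k = 2 elements.
    assert exact_transfer_count([1, 4, 7, 9], [2, 3, 8, 10]) == 2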

[Figure 3: Deciding the new search interval for the next step. (a) e^{(1)}_{N-mid-1} < e^{(2)}_{mid}. (b) e^{(1)}_{N-mid-1} > e^{(2)}_{mid}.]

Algorithm 1

Using the above lemmas, a simple "bisection" algorithm can be derived to find the exact number of elements to be transferred between two PEs, or more specifically, to find the unique index k which satisfies the inequalities (1). We outline the algorithm. At any stage the index k is known to lie in a certain range. The boundary indices, that is, the first and last elements of the range, are called top and bottom respectively. (Note that with our conventions top < bottom.) Thus, k is known to be in (top, bottom]. PE1 sends the element e^{(1)}_{N-mid-1}, where mid is the middle index of the interval, to PE2. This element is then compared with e^{(2)}_{mid} in PE2, and the result is sent back to PE1. If e^{(2)}_{mid} is greater than e^{(1)}_{N-mid-1}, the index must be in the half-open interval (top, mid]; otherwise it must be in the half-open interval (mid, bottom], by Corollary 1. This is illustrated in Fig. 3. In either case the interval (top, bottom] can be updated. The procedure is applied until there is only one element in the interval. It is easy to see that about log_2 N steps are required to find the index. (A single step requires communication in both directions between PE1 and PE2. By alternating the roles of PE1 and PE2, s steps could be performed with s + 1 communications.)

This algorithm has been implemented on both the CM5 and the Fujitsu AP1000 [7]. The results are good because both machines have a small message latency, so the time for finding k is small in comparison with the total communication time. On a machine with a high message latency, the communication costs due to multiple small messages would be considerable.

A modification of Algorithm 1 can reduce the search interval by more than a factor of two at each step, and thus reduce communication startup costs. The modification follows from Lemma 2. In the lemma, if no index i with the stated property exists, we define i = bottom in the first case and i = top in the second.

[Figure 4: A graphical expression of Lemma 2.]

Lemma 2 Suppose that k is within the interval (top, bottom] and that j is an index in the interval. If e^{(1)}_{N-j-1} is greater than e^{(2)}_{j}, and e^{(2)}_{i} is the first element in PE2 which is greater than e^{(1)}_{N-j-1}, for i in the interval and j < i, then the index k must be in the half-open interval (j, i]. If e^{(1)}_{N-j-1} is smaller than e^{(2)}_{j}, and e^{(2)}_{i} is the last element in PE2 which is smaller than e^{(1)}_{N-j-1}, for i in the interval and i < j, then the index k must be in the half-open interval (i, j].

Proof. We only prove the case e^{(1)}_{N-j-1} < e^{(2)}_{j}. The proof for e^{(1)}_{N-j-1} > e^{(2)}_{j} is similar. Since e^{(1)}_{N-j-1} < e^{(2)}_{j}, we have k <= j by Corollary 1. Since e^{(2)}_{i} < e^{(1)}_{N-j-1} and e^{(1)}_{N-j-1} <= e^{(1)}_{N-i-1}, we have e^{(1)}_{N-i-1} > e^{(2)}_{i}, and hence k > i by Corollary 1. This gives i < k <= j (see Fig. 4).

Using Lemma 2, a search procedure is required at each step in order to find the index i. Since the search interval may be reduced by more than half, the number of steps may be less than the number required by Algorithm 1. Note that the number of steps may still be close to log_2 N in the worst case, which may occur when k is close to 0 or N. In the next paragraph we describe a new algorithm (Algorithm 2), in which m - 1 elements (m >= 2) are allowed to be transferred from PE1 to PE2 at each step, and the total number of steps is about log_m N. By properly choosing m, the time to find k can be reduced significantly.
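Under the same conventions, the bisection of Algorithm 1 can be sketched as follows (our illustration, assuming both sequences are visible to one function; on the machine each loop iteration is one probe message from PE1 and one reply from PE2):

    # Illustrative sketch (not from the paper): the bisection of Algorithm 1.
    # Each loop iteration models one communication step: PE1 sends e1[N-mid-1]
    # to PE2, and PE2 compares it with e2[mid] and returns the result.
    def algorithm1(pe1, pe2):
        """Find the k of Lemma 1 in about log2(N) probe steps."""
        n = len(pe1)
        top, bottom = -1, n                # k is known to lie in (top, bottom]
        while bottom - top > 1:
            mid = (top + bottom) // 2
            probe = pe1[n - mid - 1]       # the element PE1 sends to PE2
            if probe < pe2[mid]:           # comparison performed on PE2
                bottom = mid               # k is in (top, mid]
            else:
                top = mid                  # k is in (mid, bottom]
        return bottom

    assert algorithm1([1, 4, 7, 9], [2, 3, 8, 10]) == 2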

Algorithm 2

Divide the search interval (top, bottom] into m smaller intervals, with the ith of these intervals containing c_i elements. We thereby obtain m - 1 dividing indices in the interval: the first of these is top + c_1, and the lth is j_l = sum_{i=0}^{l} c_i, where c_0 = top. The m - 1 corresponding elements e^{(1)}_{N-j_l-1} are sent from PE1 to PE2. Once the elements are received by PE2, a procedure similar to that described previously for finding the exact number is applied to find, from the m - 1 elements, the dividing index j_l which satisfies the two inequalities

    e^{(1)}_{N-j_l-1} < e^{(2)}_{j_l},    e^{(1)}_{N-j_{l-1}-1} > e^{(2)}_{j_{l-1}},    (3)

where j_{l-1} = j_l - c_l is the (l-1)th dividing index. The only difference between this procedure and Algorithm 1 is that all computations are performed locally, and no extra communication is required within the procedure, since the m - 1 elements have already been sent from PE1 to PE2.

Lemma 3 If the above procedure is applied, the index k must be in the half-open interval (j_l - c_l, j_l].

Proof. We prove that the two inequalities in (3) cannot both be satisfied if the index k is not in the interval. If j_l < k, we have e^{(1)}_{N-j_l-1} > e^{(2)}_{j_l} by Corollary 1. We also have e^{(1)}_{N-j_{l-1}-1} < e^{(2)}_{j_{l-1}} if j_{l-1} >= k, by (1) and Corollary 1. It is easy to see that in either case the inequalities in (3) cannot both be satisfied. Therefore, the index k must be in the interval j_l - c_l < k <= j_l.

Since m - 1 elements are sent to PE2 at each step, a much smaller search interval can be determined for the next step, and the total number of steps required to find the exact number is decreased. Supposing that all intervals have equal size at each step, the total number of steps is only log_m N. If m is not very large, the "startup" time for communication will dominate the running time of Algorithm 2. Therefore, Algorithm 2 will be more efficient than Algorithm 1, by a factor of about log_2 m. Our experimental results on the Fujitsu AP1000 confirm this prediction. It is worth noting that the two algorithms are exactly the same if m is set to two, so that only one element is sent at each step, and the two intervals at each step are equally divided in Algorithm 2. Therefore, Algorithm 1 is just a special case of Algorithm 2.
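A possible realisation of Algorithm 2 with (near-)equally divided subintervals is sketched below (our illustration; the paper does not prescribe this code, and the names are ours). Each outer iteration models one communication step carrying m - 1 probe elements; the interval is then narrowed locally on PE2, as Lemma 3 describes.

    # Illustrative sketch (not from the paper): Algorithm 2 with near-equal
    # subintervals. Each outer iteration models one communication step in which
    # PE1 sends m-1 probe elements; PE2 then narrows the interval locally.
    def algorithm2(pe1, pe2, m):
        """Find the k of Lemma 1, sending m-1 probe elements per step."""
        assert m >= 2
        n = len(pe1)
        top, bottom = -1, n                          # k lies in (top, bottom]
        while bottom - top > 1:
            width = bottom - top
            # Dividing indices j_1 < ... < j_{m-1} strictly inside the interval.
            js = sorted({top + (l * width) // m for l in range(1, m)} - {top})
            probes = [(j, pe1[n - j - 1]) for j in js]   # sent as one message
            # PE2 side: first dividing index j with e1[N-j-1] < e2[j]; by
            # Lemma 3, k then lies between the previous dividing index and j.
            prev = top
            for j, probe in probes:
                if probe < pe2[j]:
                    top, bottom = prev, j
                    break
                prev = j
            else:
                top = prev                  # all probes failed: k in (j_{m-1}, bottom]
        return bottom

    # With m = 2 this reduces to the bisection of Algorithm 1.
    assert algorithm2([1, 4, 7, 9], [2, 3, 8, 10], m=4) == 2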
3 Experimental Results

We use the odd-even transposition sort as an example to show the efficiency gained by adopting the algorithms described in Section 2. Our experimental results were obtained on the Fujitsu AP1000 located at the Australian National University.

The Fujitsu AP1000 is a distributed-memory MIMD machine with up to 1024 independent 25 MHz SPARC processors as processing elements. (Our machine has 128 PEs.) Each PE has 16 MByte of dynamic RAM and a 128 KByte cache. The topology of the machine is a torus, with hardware support for wormhole routing. The communication network (T-net) provides a theoretical bandwidth of 25 MByte/sec between PEs, and in practice about 6 MByte/sec is achievable. For details of the AP1000 architecture and software environment see [4, 5].

The odd-even transposition sort is not optimal for the AP1000 (better methods are given in [7]), but we use it because it only requires nearest-neighbour communication in a one-dimensional array. The methods of [7] take advantage of the wormhole routing and use more general communication patterns (for example, communication along the edges of a hypercube). In our examples the 128 PEs are configured as a one-dimensional array. The original odd-even transposition sorting algorithm described in [3], and the modified algorithms that utilise Algorithms 1 and 2, have been implemented to sort large sets of 32-bit integers on 128 PEs.

The algorithms described in Section 2 work almost equally well on the AP1000 for moderate values of m. This is because the AP1000 has a small message latency (less than 100 microseconds). Thus, the cost of finding k is only a small portion of the total communication cost, and the overall efficiency gained by using m > 2 is small. Some experimental results are given in Table 1. In this table, "Program 1" is the program implemented for the original odd-even transposition sorting algorithm, and "Program 2" is a modified version which incorporates Algorithm 2 (as discussed in Section 2) with m = 10. It is clear that the use of Algorithm 2 approximately doubles the efficiency of the sort.

Table 1: Experimental results.

    Problem size    Program 1     Program 2
    (millions)      time (sec)    time (sec)
        1.28          12.58          6.688
        5.12          58.57         29.96
       10.24         118.8          61.96
       12.8          148.9          77.94
       25.6          298.0         156.1
       38.4          447.4         235.2
       51.2          496.7         313.5
       64.0          746.0         394.8
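For completeness, the overall structure shared by both programs can be sketched sequentially as follows (our illustration; it reuses exact_transfer_count from the sketch in Section 2, and the block sizes and names are ours). Each compare-exchange between neighbouring PEs is a merge-split that exchanges only the k elements that actually need to change sides.

    # Illustrative sketch (not from the paper): odd-even transposition sort over
    # p sorted blocks, standing in for p PEs configured as a 1-D array.
    def merge_split(a, b):
        """Exchange exactly k elements so that max(PE1) < min(PE2) afterwards."""
        n = len(a)
        k = exact_transfer_count(a, b)   # Algorithm 1 or 2 finds this k on the machine
        lo = sorted(a[:n - k] + b[:k])   # PE1 keeps its first N-k elements plus k received
        hi = sorted(a[n - k:] + b[k:])   # PE2 keeps its last N-k elements plus k received
        return lo, hi

    def odd_even_transposition(blocks):
        """Sort p equal-length sorted blocks; p phases suffice [3]."""
        p = len(blocks)
        for phase in range(p):
            for i in range(phase % 2, p - 1, 2):   # even phases pair (0,1), (2,3), ...
                blocks[i], blocks[i + 1] = merge_split(blocks[i], blocks[i + 1])
        return blocks

    assert odd_even_transposition([[9, 12], [1, 14], [2, 7], [3, 5]]) == \
           [[1, 2], [3, 5], [7, 9], [12, 14]]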

4 Conclusions

We have described several algorithms for finding the exact number of elements to be transferred between two PEs when merging two sequences of elements stored separately in the PEs. Although the algorithms require a number of communication steps to send/receive small messages between the PEs, a large gain in merging efficiency can be obtained. This is because the merge operations are performed by two PEs instead of just one PE, the computational load is better balanced, and the cost of transferring large messages between the PEs may be reduced.

When the odd-even sorting method is implemented on the AP1000, Algorithm 1 and Algorithm 2 (with m > 2 but not too large) are almost equally effective in reducing the merge time. This is because the AP1000 has low message latency. The difference between Algorithms 1 and 2 would be more significant on a machine with a high message latency.

References

[1] S. G. Akl, Parallel Sorting Algorithms, Academic Press, Orlando, Florida, 1985.

[2] K. E. Batcher, "Sorting networks and their applications", Proc. AFIPS 1968 Spring Joint Comput. Conf., 1968, 307-314.

[3] G. Baudet and D. Stevenson, "Optimal sorting algorithms for parallel computers", IEEE Trans. on Computers, C-27, 1978, 84-87.

[4] R. P. Brent (editor), Proceedings of the CAP Workshop '91, Australian National University, Canberra, Australia, November 1991.

[5] R. P. Brent and M. Ishii (editors), Proceedings of the First CAP Workshop, Fujitsu Research Laboratories, Kawasaki, Japan, November 1990.

[6] G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon and D. W. Walker, Solving Problems on Concurrent Processors, Volume 1, Prentice-Hall, Englewood Cliffs, New Jersey, 1988.

[7] A. Tridgell and R. P. Brent, An Implementation of a General-Purpose Parallel Sorting Algorithm, Report TR-CS-93-01, Computer Sciences Laboratory, Australian National University, February 1993.