New Parallel Sorting Schemes. Coordinated Science Laboratory, Applied Computation Theory Group. Distribution unlimited.


AD-A056 886. ILLINOIS UNIV AT URBANA-CHAMPAIGN, COORDINATED SCIENCE LAB. F/G 9/2. NEW PARALLEL SORTING SCHEMES. (U) JUL 77. F. P. PREPARATA. DAAB-07-72-C-0259. UNCLASSIFIED.

JULY 1977

COORDINATED SCIENCE LABORATORY
APPLIED COMPUTATION THEORY GROUP

NEW PARALLEL SORTING SCHEMES

Distribution Statement A: Approved for public release; distribution unlimited.

UILU-ENG 77-2229

UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

REPORT DOCUMENTATION PAGE

Title (and Subtitle): NEW PARALLEL SORTING SCHEMES (R-782)
Type of Report: Technical Report
Performing Org. Report Number: UILU-ENG 77-2229
Author(s): F. P. Preparata
Contract or Grant Number(s): DAAB-07-72-C-0259
Performing Organization Name and Address: Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801
Controlling Office Name and Address: Joint Services Electronics Program
Report Date: July 1977
Distribution Statement (of this Report): Approved for public release; distribution unlimited.
Key Words: Parallel Computation; Computational Complexity; Design of Algorithms; Sorting; Enumeration Sorting

Abstract: In this paper we describe a family of parallel sorting algorithms for a multiprocessor system. These algorithms are enumeration sortings and comprise the following phases: (i) count acquisition: the keys are subdivided into subsets and for each key we determine the number of smaller keys (count) in every subset; (ii) rank determination: the rank of a key is the sum of the previously obtained counts; (iii) data rearrangement: each key is placed in the position specified by its rank. The basic novelty of the algorithms is the use of parallel merging to implement count acquisition. By using Valiant's merging scheme, we show that $n$ keys can be sorted in parallel with $n \log_2 n$ processors in time $C \log_2 n + o(\log_2 n)$; in addition, if memory fetch conflicts are not allowed, using a modified version of Batcher's merging algorithm to implement phase (i), we show that $n$ keys can be sorted with $n^{1+\alpha}$ processors in time $(C'/\alpha) \log_2 n + o(\log_2 n)$, thereby matching the performance of Hirschberg's algorithm, which, however, is not free of fetch conflicts.

UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

UILU-ENG 77-2229

NEW PARALLEL SORTING SCHEMES

by

F. P. Preparata

This work was supported in part by the National Science Foundation under Grant NSF MCS 76-17321 and in part by the Joint Services Electronics Program (U.S. Army, U.S. Navy and U.S. Air Force) under Contract DAAB-07-72-C-0259.

Reproduction in whole or in part is permitted for any purpose of the United States Government.

Approved for public release. Distribution unlimited.

NEW PARALLEL SORTING SCHEMES

F. P. Preparata*, Senior Member, IEEE
University of Illinois at Urbana-Champaign

Abstract

In this paper we describe a family of parallel sorting algorithms for a multiprocessor system. These algorithms are enumeration sortings and comprise the following phases: (i) count acquisition: the keys are subdivided into subsets and for each key we determine the number of smaller keys (count) in every subset; (ii) rank determination: the rank of a key is the sum of the previously obtained counts; (iii) data rearrangement: each key is placed in the position specified by its rank. The basic novelty of the algorithms is the use of parallel merging to implement count acquisition. By using Valiant's merging scheme, we show that $n$ keys can be sorted in parallel with $n \log_2 n$ processors in time $C \log_2 n + o(\log_2 n)$; in addition, if memory fetch conflicts are not allowed, using a modified version of Batcher's merging algorithm to implement phase (i), we show that $n$ keys can be sorted with $n^{1+\alpha}$ processors in time $(C'/\alpha) \log_2 n + o(\log_2 n)$, thereby matching the performance of Hirschberg's algorithm, which, however, is not free of fetch conflicts.

* Coordinated Science Laboratory, Department of Electrical Engineering, and Department of Computer Science, University of Illinois, Urbana, IL 61801.

This work was supported in part by the National Science Foundation under Grant MCS 76-17321 and in part by the Joint Services Electronics Program under Contract DAAB-07-72-C-0259.

1. Introduction

The efficient implementation of comparison problems, such as merging, sorting, and selection, by means of multiprocessor computing systems has attracted considerable attention in recent years. One of the earliest fundamental results is due to K. E. Batcher [1], who proposed a sorting network consisting of comparators and based on the principle of iterated merging; as is well known, such a scheme sorts $n$ keys with $O(n(\log n)^2)$ comparators in time $O((\log n)^2)$. Batcher's network is readily interpreted, in a more general framework, as a system of $n/2$ processors with access to a common data memory of $n$ cells: obviously, the network structure induces a nonadaptive schedule of memory accesses.

After the appearance of Batcher's paper, substantial work was aimed at filling the gap between the upper bound $O((\log n)^2)$ on the number of steps which is achievable by a network of comparators and the lower bound $O(\log n)$; the lack of success, however, convinced several workers to look for more flexible forms of parallelism. The first scheme shown to sort $n$ keys in time $O(\log n)$ is due to D. E. Muller and F. P. Preparata [2], but it requires a discouraging number of $O(n^2)$ processors. Subsequently, new results were obtained on parallel merging by F. Gavril [3]. L. G. Valiant [4] must be credited with addressing the fundamental question of the intrinsic parallelism of some ...

... processors, each capable of random-accessing a common memory with no alignment penalty. Store, fetch, and arithmetic operations have unit costs, and fetch conflicts are disallowed when appropriate.

All of the algorithms described in this paper, as well as Hirschberg's [5], are instances of enumeration sorting, in Knuth's terminology ([6], p. 73). In these methods each key is compared with all the others and the number of smaller keys determines the given key's final position. Specifically, three distinct tasks are clearly identifiable in enumeration sorting algorithms:

(i) count acquisition. The set of keys is partitioned into subsets and for each key we determine the number of smaller keys in each subset (this informal description momentarily assumes that all keys are distinct);
(ii) rank computation. For each key the sum of the counts obtained in (i) gives the final position (rank) of that key in the sorted sequence;
(iii) data rearrangement. Each key is placed in its final position according to its rank.

Less informally, an enumeration sorting scheme has the following format, where we assume for simplicity that, for some given integer $r$, $n = kr$. Data structures to be used are arrays of keys. By $A[i:j]$ we denote a sequence $A[i]A[i+1] \ldots A[j]$.

Input: $A[0:n-1]$, the array of the keys to be sorted; integer $r$.
Output: $A[0:n-1]$, the array of the sorted keys.

begin
1. Define $A_i[0:r-1] \equiv A[ir:(i+1)r-1]$, for $i = 0, \ldots, k-1$.
2. $c^{(i,j)}_{\ell} \leftarrow |\{A_j[h] : A_j[h] \le A_i[\ell]\}|$ for $j < i$
   $c^{(i,j)}_{\ell} \leftarrow |\{A_j[h] : A_j[h] < A_i[\ell]\}|$ for $j > i$
   $c^{(i,i)}_{\ell} \leftarrow |\{A_i[h] : A_i[h] \le A_i[\ell],\ h < \ell\} \cup \{A_i[h] : A_i[h] < A_i[\ell],\ h > \ell\}|$
3. $\mathrm{rank}(A_i[\ell]) \leftarrow \sum_{j=0}^{k-1} c^{(i,j)}_{\ell}$
4. $A[\mathrm{rank}(A_i[\ell])] \leftarrow A_i[\ell]$

Note that count acquisition, rank computation, and data rearrangement are performed, respectively, in Steps 2, 3, and 4. Also, the algorithm must ensure that all ranks be distinct, which is a crucial condition for the data rearrangement task (otherwise memory store conflicts would occur). This clearly poses no problem when the keys are all distinct. In the opposite case, some convention must be adopted for the ordering of sets of identical keys. One such convention is that sorting be stable (see [6], p. 4), that is, the initial order of identical keys is preserved in the sorted array. Thus, all of our sorting schemes will be stable. This is reflected in the rules for the computation of the parameters $c^{(i,j)}_{\ell}$ in Step 2 of the above algorithm.

The simple algorithm proposed by Muller and Preparata in [2] is a crude example of enumeration sorting, in which the sets $A_i$ are chosen to be singletons. With this choice, each key is compared with every other key, thereby using $O(n^2)$ processors; similarly, rank computation uses $O(n^2)$ processors, since $O(n)$ processors are assigned to each key. The time bound $O(\log n)$ is due to Step 3 (counting in parallel the number of 1's in a set of $n$ binary digits), whereas Steps 2 and 4 run in constant time in our present model.
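The four-step format above can be sketched sequentially as follows. This is only an illustrative sketch, not the parallel algorithm of the paper: the function name `enumeration_sort` and the `key` parameter are assumptions introduced here, and the loops that a multiprocessor would run in parallel are executed one after another.

```python
def enumeration_sort(records, r, key=lambda x: x):
    """Sequential sketch of the enumeration sorting format (Steps 1-4).

    Assumes len(records) == k * r for some integer k, as in the text.
    `key` is a hypothetical helper that extracts the sort key of a record.
    """
    n = len(records)
    assert n % r == 0, "the format assumes n = k * r"
    k = n // r
    # Step 1: split A into k consecutive blocks A_0, ..., A_{k-1} of r keys.
    blocks = [records[i * r:(i + 1) * r] for i in range(k)]
    out = [None] * n
    for i in range(k):
        for l in range(r):
            x = key(blocks[i][l])
            rank = 0
            for j in range(k):
                ys = [key(y) for y in blocks[j]]
                if j < i:    # earlier block: equal keys count as smaller
                    c = sum(1 for y in ys if y <= x)
                elif j > i:  # later block: only strictly smaller keys count
                    c = sum(1 for y in ys if y < x)
                else:        # same block: ties are broken by position
                    c = sum(1 for h, y in enumerate(ys)
                            if (h < l and y <= x) or (h > l and y < x))
                rank += c    # Step 3: the rank is the sum of the counts
            out[rank] = blocks[i][l]  # Step 4: data rearrangement
    return out


records = [(2, "a"), (1, "b"), (2, "c"), (1, "d")]
print(enumeration_sort(records, 2, key=lambda t: t[0]))
```

Because the tie-breaking rules of Step 2 make all ranks distinct, every output cell is written exactly once, and records with equal keys keep their original relative order.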

In the more complex procedures to be later described, the operations of rank computation and data rearrangement are essentially carried out as in the basic scheme described above. The main difference occurs with regard to count acquisition. In the Muller-Preparata method the counts are acquired by comparing each key with every other. The comparison of two keys $A[i]$ and $A[j]$ could be viewed as merging $A[i]$ and $A[j]$. If rather than dealing with single keys we now deal with sorted sequences of keys $A_i[0:r-1]$ and $A_j[0:r-1]$, where $r > 1$ and, say, $j < i$, then the number of keys in $A_j[0:r-1]$ which are no greater than $A_i[\ell]$ ($\ell = 0, \ldots, r-1$), as well as the number of keys in $A_i[0:r-1]$ which are less than $A_j[h]$ ($h = 0, \ldots, r-1$), can be obtained by merging the two sequences $A_j[0:r-1]$ and $A_i[0:r-1]$.

In fact, let $B[0:2r-1]$ be the array obtained by merging the two sorted arrays $A_j[0:r-1]$ and $A_i[0:r-1]$ with the ordering convention $A_k[s] \le A_k[s+1]$ ($k = i, j$) and $B[s] \le B[s+1]$. Suppose also that the merging be stable, that is, the order of identical keys in the concatenated array $A_j[0:r-1]A_i[0:r-1]$ is preserved in $B[0:2r-1]$. If $B[q] = A_i[\ell]$, then there are $(q-\ell)$ entries of $A_j[0:r-1]$ in $B[0:q-1]$ which are no greater than $A_i[\ell]$; similarly, if $B[q] = A_j[h]$, then there are $(q-h)$ entries of $A_i[0:r-1]$ in $B[0:q-1]$ which are strictly less than $A_j[h]$. This is the central idea of the algorithms to be described.

2. A fast parallel sorting algorithm

In this section we assume that in our computational model memory fetch conflicts are permitted. To provide the feature required by Valiant's merging algorithm, that a key be simultaneously compared with several other keys, we may assume that the processors have broadcast capabilities. The only overhead we shall neglect is the reassignment of processors to the operation of merging pairs of subsequences, as occurs in Valiant's method [4].
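The central idea stated at the end of Section 1, reading the counts off the positions in a stable merge, can be illustrated with a short sequential sketch. The function names and the `"j"`/`"i"` source tags are assumptions made for this illustration; the paper uses parallel merging (Valiant's or Batcher's scheme) rather than the two-pointer merge used here.

```python
def stable_merge_with_labels(a_j, a_i):
    """Stably merge sorted a_j (earlier block) with sorted a_i (later block).

    Each output entry is (key, source, position); on ties the entry of a_j
    is emitted first, which preserves the order of the concatenation a_j a_i.
    """
    merged = []
    h = l = 0
    while h < len(a_j) or l < len(a_i):
        take_j = l == len(a_i) or (h < len(a_j) and a_j[h] <= a_i[l])
        if take_j:
            merged.append((a_j[h], "j", h))
            h += 1
        else:
            merged.append((a_i[l], "i", l))
            l += 1
    return merged


def counts_from_merge(a_j, a_i):
    """Return (no_greater, less) obtained from positions in the merged array B.

    no_greater[l] = number of keys of a_j that are <= a_i[l]  (q - l)
    less[h]       = number of keys of a_i that are <  a_j[h]  (q - h)
    """
    b = stable_merge_with_labels(a_j, a_i)
    no_greater = [0] * len(a_i)
    less = [0] * len(a_j)
    for q, (_, source, pos) in enumerate(b):
        if source == "i":
            no_greater[pos] = q - pos
        else:
            less[pos] = q - pos
    return no_greater, less


print(counts_from_merge([1, 3, 3, 5], [2, 3, 4, 6]))
# prints ([1, 3, 3, 4], [0, 1, 1, 3])
```

A key at position `q` of the merged array that came from position `pos` of its own block is preceded by exactly `q - pos` keys of the other block, which is the count needed for rank computation.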

Notice that this model of parallel computation coincides with that required by Valiant's merging algorithm.

We assume inductively that the following algorithm, SORT1, for $p < n$ uses at most $\lfloor p \log p \rfloor$ processors to sort $p$ keys. Since SORT1 is recursive, the following presentation constitutes a constructive extension of the inductive step to the integer $n$. The induction can be started with [illegible in the source].

Algorithm SORT1
begin
1. $k \leftarrow \lceil \log n \rceil$, $r \leftarrow \lfloor n / \lceil \log n \rceil \rfloor$

2. Define arrays $S[0:k; 0:k; 0:2r-1]$ and $R[0:k; 0:k; 0:r-1]$ (three-dimensional arrays) and $A_i[0:r-1] \equiv A[ir:(i+1)r-1]$ ($i = 0, \ldots, k-1$), $A_k[0:n-kr-1] \equiv A[kr:n-1]$ (for $n > kr$).
Comment: When $n = kr$, array $A_k$ is obviously vacuous. Array $S$ is defined for simplicity as having $(k+1)^2 \cdot 2r$ cells, although the algorithm will only make use of the cells $S[i;j;q]$ for which $i < j$.

3. $A_i[0:r-1] \leftarrow \mathrm{SORT1}(A_i[0:r-1])$ ($i = 0, \ldots, k-1$), $A_k[0:n-kr-1] \leftarrow \mathrm{SORT1}(A_k[0:n-kr-1])$.
Comment: This step is a parallel recursive call of SORT1 and it involves sorting in parallel $k$ sets of $r$ keys each and, possibly, one set of $(n-kr)$ keys. By the inductive hypothesis, it uses at most $N = k \lfloor r \log r \rfloor + \lfloor (n-kr) \log(n-kr) \rfloor$ processors. Since $n - kr < \lceil \log n \rceil$, we have
$N \le \lceil \log n \rceil \cdot \lfloor n / \lceil \log n \rceil \rfloor \cdot \log(\lfloor n / \lceil \log n \rceil \rfloor) + \lfloor \lceil \log n \rceil \log \lceil \log n \rceil \rfloor$,
so the number of processors used is less than
$n \log(n / \lceil \log n \rceil) + \lceil \log n \rceil \log \lceil \log n \rceil = n \log n - \log \lceil \log n \rceil \, (n - \lceil \log n \rceil) \le n \log n - 1 < \lfloor n \log n \rfloor$, for $n \ge 3$.
For the sake of uniformity, array $A_k$ is now extended to size $r$, where each cell of $A_k[n-kr:r-1]$ is filled with a dummy sentinel larger than any key.

4. $S[i;j;0:r-1] \leftarrow A_i[0:r-1]$ ($i = 0, \ldots, k-1$; $j = i+1, \ldots, k$)
   $S[i;j;r:2r-1] \leftarrow A_j[0:r-1]$ ($i = 0, \ldots, j-1$; $j = 1, \ldots, k$)
Comment: This is a copying operation whose objective is to obtain $S[i;j;0:2r-1] = A_i[0:r-1]A_j[0:r-1]$ for all pairs $(i,j)$ with $i < j$. In our model, this operation could be done with maximal parallelism. However, using only $\binom{k+1}{2} r$ processors, the $\binom{k+1}{2} \cdot 2r$ elementary copying operations are completed in two time units. For later convenience we assume that the record associated with key $A_i[\ell]$ contains a LABEL consisting of the pair of integers $(i, \ell)$.

5. $S[i;j;0:2r-1] \leftarrow \mathrm{MERGE}(S[i;j;0:r-1],\ S[i;j;r:2r-1])$ ($i = 0, \ldots, k-1$; $j = i+1, \ldots, k$)
Comment: This step uses Valiant's merging algorithm and runs in time $C_1 \log\log r + O(1)$, for some constant $C_1$, using $\binom{k+1}{2} r$ processors. The original version of Valiant's merging algorithm can be readily modified so that, whenever two keys are identical, the indices of their respective subarrays are compared.

6. Let $(x, \ell) = \mathrm{LABEL}(S[i;j;q])$.
   If $x = i$ then $R[i;j;\ell] \leftarrow q - \ell$ else $R[j;i;\ell] \leftarrow q - \ell$ ($i = 0, \ldots, k-1$; $j = i+1, \ldots, k$; $q = 0, \ldots, 2r-1$)

7. $R[i;i;\ell] \leftarrow \ell$ ($i = 0, \ldots, k$; $\ell = 0, \ldots, r-1$)
Comment: Steps 6 and 7 complete the count acquisition task. In fact, after Step 7 the content of $R[i;j;\ell]$ is $c^{(i,j)}_{\ell}$ in the terminology of Section 1. Step 6 can be executed in two time units using $\binom{k+1}{2} r$ processors, whereas Step 7 uses $(k+1)r$ processors and runs in one time unit.

8. $\mathrm{rank}(A_i[\ell]) \leftarrow \sum_{j=0}^{k} R[i;j;\ell]$ ($i = 0, \ldots, k$; $\ell = 0, \ldots, r-1$)
Comment: This step implements the rank computation. For any pair $(i, \ell)$ the sum can be computed with $\lfloor (k+1)/2 \rfloor$ processors in time $\lceil \log(k+1) \rceil \approx \log\log n$. The total number of processors used is therefore $n \lfloor (k+1)/2 \rfloor$.

9. $A[\mathrm{rank}(A_i[\ell])] \leftarrow A_i[\ell]$ ($i = 0, \ldots, k$; $\ell = 0, \ldots, r-1$)

To complete the analysis of the algorithm, we observe that none of Steps 4-7 uses more than $\binom{k+1}{2} r$ processors. But

$\binom{k+1}{2} r = r \, \frac{k(k+1)}{2} = \lfloor n / \lceil \log n \rceil \rfloor \, \frac{\lceil \log n \rceil (\lceil \log n \rceil + 1)}{2} \le n \, \frac{\lceil \log n \rceil + 1}{2}$,

where the last inequality is due to the removal of the floor sign. Also, Step 8 uses $n \lfloor (k+1)/2 \rfloor \le n(\lceil \log n \rceil + 1)/2$ processors. Since, for all $n \ge 4$, $n(\lceil \log n \rceil + 1)/2 < \lfloor n \log n \rfloor$, the inductive hypothesis on the number of processors is extended.

Finally, let $T(n)$ denote the running time of the algorithm for $n$ keys. Since $r \approx n / \log n$ we obtain

$T(n) = T(n / \log n) + C_2 \log\log n + C_3$

for some constants $C_2$ and $C_3$. It is easily verified that a function of the form $C_2 \log n + o(\log n)$ is a solution of the above recurrence: the recursion has about $\log n / \log\log n$ levels, each contributing about $C_2 \log\log n$. It is worth noting that for the same number of processors, Valiant proposes a sorting scheme of the merge sort type ([4], Corollary 8) which runs in time $2 \log n \log\log n + o(\log n \log\log n)$.

3. Parallel sorting algorithms with no memory fetch conflicts

We shall now consider a family of algorithms for sorting $n$ numbers in parallel with $n^{1+\alpha}$ processors ($0 < \alpha \le 1$) in time $(C'/\alpha) \log n + o(\log n)$,

for some constant $C'$. Each of these algorithms has the same performance as the corresponding algorithm by Hirschberg [5], although no memory fetch conflict occurs in this case. Again, we make the inductive hypothesis that for $p < n$, Algorithm SORT2 uses $p^{1+\alpha}$ processors to sort $p$ keys. The format of SORT2 closely parallels that of SORT1, with a few crucial differences to be noted.

Algorithm SORT2
begin
1. $k \leftarrow \lfloor n^{\alpha} \rfloor$, $r \leftarrow \lfloor n / \lfloor n^{\alpha} \rfloor \rfloor$

2. Define arrays $S[0:k; 0:k; 0:2r-1]$, $R[0:k; 0:k; 0:r-1]$ and $A_i[0:r-1] \equiv A[ir:(i+1)r-1]$ ($i = 0, \ldots, k-1$), $A_k[0:n-kr-1] \equiv A[kr:n-1]$ for $n > kr$.

3. $A_i[0:r-1] \leftarrow \mathrm{SORT2}(A_i[0:r-1])$ ($i = 0, \ldots, k-1$), $A_k[0:n-kr-1] \leftarrow \mathrm{SORT2}(A_k[0:n-kr-1])$
Comment: This parallel recursive call of SORT2 sorts $k$ sets of $r$ keys each and, possibly, one set of $n - kr < k$ keys. By the inductive hypothesis, at most $N = k r^{1+\alpha} + (n-kr)^{1+\alpha}$ processors are used. Since $n - kr < k$, then $N < k r^{1+\alpha} + k^{1+\alpha}$. Also $kr = \lfloor n^{\alpha} \rfloor \lfloor n / \lfloor n^{\alpha} \rfloor \rfloor \le n$, whence [intermediate expressions illegible in the source] $N < n^{1+\alpha}$, where we have used the approximation $r \approx n^{1-\alpha}$.

Steps 1-3 are analogous to the corresponding ones in SORT1; however, the copying operation implemented by Step 4 of SORT1 must be considerably modified, as shown by the following Steps 4-6, to avoid fetch conflicts. Here again, $A_k$ is extended to size $r$ as in SORT1.

4. $S[i;k;0:r-1] \leftarrow A_i[0:r-1]$ ($i = 0, \ldots, k-1$)
   $S[0;j;r:2r-1] \leftarrow A_j[0:r-1]$ ($j = 1, \ldots, k$)

5. for $m = 0$ until $\lceil \log(k+1) \rceil - 2$ do
   $S[i; j-2^m; 0:r-1] \leftarrow S[i;j;0:r-1]$ ($j = k-2^m+1, \ldots, k$; $i = 0, \ldots, j-2^m-1$)
   $S[i+2^m; j; r:2r-1] \leftarrow S[i;j;r:2r-1]$ ($i = 0, \ldots, 2^m-1$; $j = i+2^m+1, \ldots, k$)

6. Let $\lceil \log(k+1) \rceil - 1 = v$.
   $S[i; j-2^v; 0:r-1] \leftarrow S[i;j;0:r-1]$ ($j = 2^v+1, \ldots, k$; $i = 0, \ldots, j-2^v-1$)
   $S[i+2^v; j; r:2r-1] \leftarrow S[i;j;r:2r-1]$ ($i = 0, \ldots, k-2^v-1$; $j = i+2^v+1, \ldots, k$)
Comment: Steps 4-6 jointly replicate each $A_i[0:r-1]$ the required number $k$ of times. Step 4 is an initial copy; Step 5 consists of $(\lceil \log(k+1) \rceil - 1)$ stages, each of which doubles the ranges of the indices; Step 6 accounts for the fact that $k$ may not be a power of 2 and completes filling the array $S$. Clearly this copying operation is implemented in $\lceil \log(k+1) \rceil + 1 \approx \alpha \log n + 1$ time units. A straightforward analysis shows that the largest number of processors used in any of these stages is at most 5/16 of the total number $\binom{k+1}{2} \cdot 2r$ of cells of $S$ to be filled. It is also easily shown that $(5/16) \binom{k+1}{2} \cdot 2r \le (5/16)(n^{\alpha}+1) n^{\alpha} \cdot n^{1-\alpha} < n^{1+\alpha}$ for any $n \ge 1$ and $\alpha > 0$.

7. $S[i;j;0:2r-1] \leftarrow \mathrm{MERGE}(S[i;j;0:r-1],\ S[i;j;r:2r-1])$ ($i = 0, \ldots, k-1$; $j = i+1, \ldots, k$)
Comment: This step uses a stable version of Batcher's merging algorithm [1], which is easily obtained by requiring that whenever two identical keys are encountered their array indices be compared (see Appendix). The following facts about Batcher's merging algorithm are well known:

(i) no fetch conflict occurs, because at any stage (or time unit) each key is compared with exactly one key;
(ii) $\binom{k+1}{2} r \le [(n^{\alpha}+1) n^{\alpha} / 2] \cdot n^{1-\alpha} < n^{1+\alpha}$ processors are used;
(iii) merging is completed in $\log r \approx (1-\alpha) \log n$ time units.

8. Steps 8, 9, 10, and 11 of this algorithm are respectively identical to Steps 6, 7, 8, and 9 of SORT1 and are therefore omitted.

The latter are clearly free of memory fetch conflicts. The analysis of SORT1 showed that at most $\max\left(\binom{k+1}{2} r,\ n \lfloor (k+1)/2 \rfloor\right)$ processors were used in any of those steps. In the present case, we have already shown that $\binom{k+1}{2} r < n^{1+\alpha}$; similarly we conclude $n \lfloor (k+1)/2 \rfloor \le n(n^{\alpha}+1)/2 < n^{1+\alpha}$.

From the performance viewpoint, all steps of the algorithm require at most $n^{1+\alpha}$ processors, as postulated. This extends the inductive hypothesis on the number of processors used by the algorithm. As to the running time $T(n)$, we note the following: Steps 4-6 jointly require $\alpha \log n + 1$ time units; Step 7 requires $(1-\alpha) \log n$ time units; Step 10 requires $\alpha \log n$ time units; Steps 8, 9, and 11 run in constant time. Since Step 3 is a recursive call of SORT2 on sets of $r \approx n^{1-\alpha}$ elements, we obtain for $T(n)$ the recurrence equation

$T(n) = T(n^{1-\alpha}) + (C'_1 \alpha + C'_2) \log n + C'_3$

for some constants $C'_1$, $C'_2$, and $C'_3$. It is easily verified that a function of the form $[(C'_1 \alpha + C'_2)/\alpha] \log n + o(\log n)$ is a solution of this equation, whence $T(n) \approx (C'/\alpha) \log n + o(\log n)$.
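The conflict-free replication performed by Steps 4-6 of SORT2 can be checked on block indices alone. The sketch below is an illustration under stated assumptions: the index ranges are those read from the damaged scan (first halves are seeded in column $j = k$ and shifted towards smaller $j$; second halves are seeded in row $i = 0$ and shifted towards larger $i$), the names `replicate` and `copy_stage` are hypothetical, and each parallel stage is simulated sequentially while asserting that no cell is fetched twice within a stage.

```python
def copy_stage(first, second, k, shift, j_lo, i_hi):
    """One parallel copying stage; asserts that every source cell is read once."""
    reads = []
    for j in range(j_lo, k + 1):          # first halves move to column j - shift
        for i in range(0, j - shift):
            assert first[i][j] is not None, "source cell not yet filled"
            first[i][j - shift] = first[i][j]
            reads.append(("first", i, j))
    for i in range(0, i_hi + 1):          # second halves move to row i + shift
        for j in range(i + shift + 1, k + 1):
            assert second[i][j] is not None, "source cell not yet filled"
            second[i + shift][j] = second[i][j]
            reads.append(("second", i, j))
    assert len(reads) == len(set(reads)), "fetch conflict within a stage"


def replicate(k):
    """Simulate Steps 4-6; cell (i, j) stores which block occupies each half."""
    first = [[None] * (k + 1) for _ in range(k + 1)]
    second = [[None] * (k + 1) for _ in range(k + 1)]
    for i in range(k):                    # Step 4: initial copy
        first[i][k] = i
    for j in range(1, k + 1):
        second[0][j] = j
    stages = 1
    v = k.bit_length() - 1                # equals ceil(log2(k + 1)) - 1
    for m in range(v):                    # Step 5: doubling stages
        copy_stage(first, second, k, 2 ** m, k - 2 ** m + 1, 2 ** m - 1)
        stages += 1
    copy_stage(first, second, k, 2 ** v, 2 ** v + 1, k - 2 ** v - 1)  # Step 6
    stages += 1
    return first, second, stages


first, second, stages = replicate(5)
print(stages)  # ceil(log2(6)) + 1 = 4
```

For every pair $i < j$ the simulation ends with block $i$ in the first half and block $j$ in the second half of cell $(i, j)$, after $\lceil \log(k+1) \rceil + 1$ stages, in agreement with the time bound quoted in the text.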

References

1. K. E. Batcher, "Sorting networks and their applications," Proc. AFIPS Spring Joint Computer Conference, Vol. 32, pp. 307-314, April 1968.
2. D. E. Muller and F. P. Preparata, "Bounds to Complexities of Networks for Sorting and for Switching," Journal of the ACM, Vol. 22, No. 2, pp. 195-201, April 1975.
3. F. Gavril, "Merging with parallel processors," Comm. ACM, Vol. 18, No. 10, pp. 588-591, October 1975.
4. L. G. Valiant, "Parallelism in Comparison Problems," SIAM Journal on Computing, Vol. 4, No. 3, pp. 348-355, September 1975.
5. D. S. Hirschberg, "Fast Parallel Sorting Algorithms," Tech. Rep., Department of Electr. Eng., Rice University, Houston, Texas, January 1977.
6. D. E. Knuth, The Art of Computer Programming, Vol. III: Sorting and Searching, Addison-Wesley, Reading, Mass., 1972.

Appendix

A stable version of Batcher's merging algorithm.

The original version of Batcher's odd-even merging algorithm runs as follows (here, for simplicity, we assume that the common length of the sequences to be merged is a power of 2):

MERGE($A[0:2^{k-1}-1]$, $A[2^{k-1}:2^k-1]$)
1. $A'[j] \leftarrow A[2j]$, $A'[2^{k-1}+j] \leftarrow A[2j+1]$ ($j = 0, 1, \ldots, 2^{k-1}-1$)
2. $B[0:2^{k-1}-1] \leftarrow \mathrm{MERGE}(A'[0:2^{k-2}-1],\ A'[2^{k-2}:2^{k-1}-1])$
   $B[2^{k-1}:2^k-1] \leftarrow \mathrm{MERGE}(A'[2^{k-1}:3 \cdot 2^{k-2}-1],\ A'[3 \cdot 2^{k-2}:2^k-1])$
3. $A[2j-1] \leftarrow \min(B[j],\ B[2^{k-1}+j-1])$, $A[2j] \leftarrow \max(B[j],\ B[2^{k-1}+j-1])$ ($j = 1, \ldots, 2^{k-1}-1$)

This algorithm is not stable (see [6], p. 135, exercise 13), because, in compliance with the rules of the comparator module, whenever $B[j] = B[2^{k-1}+j-1]$ in Step 3, the algorithm assigns $A[2j-1] \leftarrow B[j]$, $A[2j] \leftarrow B[2^{k-1}+j-1]$. Fortunately, however, with a simple modification, stability can be attained. Specifically, we associate with each key of the initial array $A[0:2^k-1]$ a label, and set $\mathrm{LABEL}(A[j]) \leftarrow j$ (notice that all labels are distinct). We then replace Step 3 above with the following step:

3'. If $B[j] = B[2^{k-1}+j-1]$
    then $A[2j-1] \leftarrow$ key with smaller label, $A[2j] \leftarrow$ key with larger label
    else $A[2j-1] \leftarrow \min(B[j],\ B[2^{k-1}+j-1])$, $A[2j] \leftarrow \max(B[j],\ B[2^{k-1}+j-1])$
    ($j = 1, \ldots, 2^{k-1}-1$)

We now prove that the new version of the algorithm is stable. Assume that in the original array $A[0:2^{k-1}-1]$ the subarrays $A[0:t_1-1]$, $A[t_1:s_1-1]$, and $A[s_1:2^{k-1}-1]$ contain keys which are respectively less than, equal to, and larger than some fixed value $a$; similarly for $A[2^{k-1}:2^{k-1}+t_2-1]$,

$A[2^{k-1}+t_2:2^{k-1}+s_2-1]$, and $A[2^{k-1}+s_2:2^k-1]$. Assume inductively that the merged sequences obtained in Step 2 are stably sorted and consider a key $A[p]$ for $p \in [t_1, s_1-1]$. Assume at first $p = 2j$; then, by Step 1, $A'[j] \leftarrow A[2j]$. Moreover, there are $\lceil t_2/2 \rceil$ keys in $A'[2^{k-2}:2^{k-1}-1]$ strictly smaller than $A[2j]$; whence $A[2j] = B[j + \lceil t_2/2 \rceil]$. According to Step 3', $B[j + \lceil t_2/2 \rceil]$ is compared with $B[2^{k-1} + j - 1 + \lceil t_2/2 \rceil]$.

Suppose $t_2$ is even: then if $(2j-1) \in [t_1, s_1-1]$, $B[2^{k-1} + j - 1 + \lceil t_2/2 \rceil] = A[2j-1]$ and we compare (the labels of) $A[2j]$ and $A[2j-1]$; otherwise, i.e., when $2j-1 = t_1-1$ or, equivalently, $A[2j] = A[t_1]$, the latter is compared with a key less than $a$.

Suppose now that $t_2$ is odd: then if $(2j+1) \in [t_1, s_1-1]$, we have $B[2^{k-1} + j - 1 + \lceil t_2/2 \rceil] = A[2j+1]$ and we compare (the labels of) $A[2j]$ and $A[2j+1]$; otherwise $A[2j] = A[s_1-1]$ and $A[s_1-1]$ is compared with a key which is either larger or has a larger label.

Clearly, in the given case, stability is ensured, and by analogous arguments we can treat all other cases.
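The label-based modification of the Appendix can be sketched recursively as follows. This is a sequential illustration, not the comparator network itself: the function names are assumptions, each key is paired with its label so that Python's tuple comparison falls back on the label exactly when two keys are equal, and the two inputs are assumed to be sorted lists of the same power-of-two length.

```python
def odd_even_merge(a, b):
    """Batcher-style odd-even merge of two sorted lists of equal power-of-2 length.

    Elements are (key, label) pairs; comparing the pairs orders equal keys by
    label, which plays the role of the modified Step 3'.
    """
    if len(a) == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    evens = odd_even_merge(a[0::2], b[0::2])   # merge even-indexed entries
    odds = odd_even_merge(a[1::2], b[1::2])    # merge odd-indexed entries
    out = [evens[0]]
    for j in range(1, len(evens)):             # compare B[j] with B[2^(k-1) + j - 1]
        x, y = evens[j], odds[j - 1]
        out.append(min(x, y))
        out.append(max(x, y))
    out.append(odds[-1])
    return out


def stable_merge(a, b):
    """Attach labels 0 .. 2^k - 1 in concatenation order, then merge."""
    labelled_a = [(key, j) for j, key in enumerate(a)]
    labelled_b = [(key, len(a) + j) for j, key in enumerate(b)]
    return odd_even_merge(labelled_a, labelled_b)


print(stable_merge([1, 2, 2, 3], [2, 2, 3, 4]))
```

In the printed result, keys equal to 2 appear with labels 1, 2, 4, 5 in that order, so entries of the first sequence precede equal entries of the second and the order within each sequence is kept.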