10. Parallel Methods for Data Sorting


Contents:
10.1. Parallelizing Principles
10.2. Scaling Parallel Computations
10.3. Bubble Sort
  10.3.1. Sequential Algorithm
  10.3.2. Odd-Even Transposition Algorithm
  10.3.3. Computation Decomposition and Analysis of Information Dependencies
  10.3.4. Scaling and Distributing the Subtasks among the Processors
  10.3.5. Efficiency Analysis
  10.3.6. Computational Experiment Results
10.4. Shell Sort
  10.4.1. Sequential Algorithm
  10.4.2. Parallel Algorithm
  10.4.3. Efficiency Analysis
  10.4.4. Computational Experiment Results
10.5. Quick Sort
  10.5.1. Sequential Algorithm
  10.5.2. The Parallel Quick Sort Algorithm
    Parallel Computational Scheme
    Efficiency Analysis
    Computational Experiment Results
  10.5.3. The Parallel HyperQuickSort Algorithm
    Software Implementation
    Computational Experiment Results
10.6. The Parallel Sorting by Regular Sampling
    Parallel Computational Scheme
    Efficiency Analysis
    Computational Experiment Results
Summary
References
Discussions
Exercises

Data sorting is one of the typical problems of data processing and is usually considered as the problem of rearranging the elements of a given sequence of values S = {a1, a2, ..., an} into monotonically increasing or decreasing order,

  S ~ S' = (a'1, a'2, ..., a'n):  a'1 <= a'2 <= ... <= a'n

(hereinafter only the example of sorting the data in increasing order is discussed). The possible methods of solving this problem have been discussed widely: the work by Knuth (1997) gives a complete survey of data sorting algorithms, and among the more recent editions the work by Cormen et al. (2001) may be recommended. The computational complexity of sorting is considerable: for a number of well-known methods (bubble sort, insertion sort, etc.) the number of necessary operations grows as the square of the number of values being sorted, T1 ~ n^2.

For more efficient algorithms (merge sort, Shell sort, quick sort) the complexity is of the order T1 ~ n*log2(n). This relation also gives the lower estimate of the number of operations necessary for sorting a set of n values; algorithms of lower complexity may be obtained only for particular variants of the problem.

Data sorting may be speeded up by using several (p > 1) processors. In this case the data are distributed among the processors; in the course of the computations the data are transmitted between the processors and compared with one another. The resulting (sorted) data are, as a rule, also distributed among the processors. To regulate this distribution, a scheme of consecutive enumeration of the processors is introduced: it is usually required that, after the termination of sorting, the values located on the processors with smaller numbers do not exceed the values on the processors with greater numbers.

An extensive analysis of the data sorting problem is left for further consideration. In this Section the main attention is devoted to parallel versions of a number of well-known methods of internal sorting, i.e. to the case when the data being sorted can be placed completely in the main memory of each processor. This Section is based essentially on the teaching materials given in Kumar et al. (1994) and Quinn (2004).

10.1. Parallelizing Principles

A closer look at the operations applied in sorting algorithms shows that many methods are based on the same basic compare-exchange operation: a pair of values of the data set being sorted is compared and the values are exchanged if their order does not correspond to the sorting conditions.

// The basic compare-exchange operation
if (A[i] > A[j]) {
  temp = A[i];
  A[i] = A[j];
  A[j] = temp;
}

Example 10.1. The basic compare-exchange operation of many sorting procedures

The successive application of this operation makes it possible to sort the data; in many cases it is precisely the way of choosing the pairs of values for this operation that constitutes the main difference between the sorting algorithms.

Let us first consider the situation when the number of processors coincides with the number of values being sorted (i.e. p = n), so that each processor holds exactly one value of the initial data. Then the comparison of the values ai and aj located, correspondingly, on the processors Pi and Pj may be organized in the following way (a parallel generalization of the basic sorting operation):
- the processors Pi and Pj exchange their values (the initial elements are kept on the processors);
- each of the processors Pi and Pj compares the obtained identical pair of values (ai, aj); the results of the comparison are used to distribute the data between the processors: the smaller element remains on one of the processors (for instance, Pi), while the other processor (i.e. Pj) keeps the greater value of the pair for further processing,

  a'i = min(ai, aj),  a'j = max(ai, aj).

This pairwise operation is illustrated in terms of MPI operations below.
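The following sketch is only an illustration and is not part of the original text: the function name, its signature and the use of MPI_Sendrecv are assumptions (it is also assumed that mpi.h is included and MPI has been initialized). Each of the two partner processes sends its single value to the other and then keeps either the smaller or the greater value of the pair.

// Illustrative sketch: the parallel compare-exchange of single values
// held by two MPI processes
void ParallelCompareExchange(double *a, int PartnerRank, int KeepMin) {
  double b;
  MPI_Status status;
  // exchange the values (the initial elements are kept on the processes)
  MPI_Sendrecv(a, 1, MPI_DOUBLE, PartnerRank, 0,
               &b, 1, MPI_DOUBLE, PartnerRank, 0,
               MPI_COMM_WORLD, &status);
  // both processes now hold the pair (a, b); keep the required element
  if (KeepMin)
    *a = (*a < b) ? *a : b;   // keep the smaller value of the pair
  else
    *a = (*a > b) ? *a : b;   // keep the greater value of the pair
}

In the scheme above the processor Pi would call the function with KeepMin set to a nonzero value, while Pj would call it with KeepMin equal to zero.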
10.2. Scaling Parallel Computations

This parallel generalization of the basic sorting operation can also be adapted to the case p < n, i.e. when the number of processors is smaller than the number of values being sorted. In this situation each processor holds a part (a block of size n/p) of the data being sorted.

Let us define the result of the parallel sorting algorithm as the situation in which the data on each processor are sorted and the order of the block distribution among the processors corresponds to the order of the linear enumeration, i.e. the value of the last element on the processor Pi does not exceed the value of the first element on the processor Pi+1 for 0 <= i < p-1.

At the very beginning of sorting the blocks are usually sorted on each processor separately by means of some fast sequential algorithm (the initial stage of parallel sorting). Then, in accordance with the single-value comparison scheme described above, the interaction of the processors Pi and Pi+1 for sorting the pair of blocks Ai and Ai+1 can be implemented as follows:
- the processors Pi and Pi+1 exchange their blocks;
- on each processor the blocks Ai and Ai+1 are united into a sorted block of double size (since the blocks Ai and Ai+1 have been sorted beforehand, this procedure reduces to the fast merging of sorted data);
- the obtained double block is subdivided into two equal parts; one of the parts (for instance, the one with the smaller values) is left on the processor Pi, while the other part (with the greater values, correspondingly) is placed on the processor Pi+1:

  [Ai U Ai+1]sorted = A'i U A'i+1,  where a'i <= a'j for any a'i from A'i and any a'j from A'i+1.

This procedure is usually called the compare-split operation. The blocks formed on the processors Pi and Pi+1 as a result of the procedure have the same size as the initial blocks Ai and Ai+1, and all the values located on the processor Pi do not exceed the values on the processor Pi+1.

The compare-split operation may be taken as the basic computational subtask for organizing parallel computations. Since the number of such subtasks depends parametrically on the number of available processors, the problem of scaling the computations practically does not arise for parallel data sorting algorithms. It should be noted, however, that the data blocks of the subtasks change in the course of sorting. In simple cases the size of the data blocks remains the same; in more complicated situations (as, for instance, in the quick sort algorithms, see Subsection 10.5) the amounts of data located on the processors may differ, which may lead to a violation of the equal computational loading of the processors.

10.3. Bubble Sort

10.3.1. Sequential Algorithm

The sequential bubble sort algorithm (see, for instance, Knuth (1997), Cormen et al. (2001)) compares and exchanges the neighboring elements of the sequence being sorted. For the sequence (a1, a2, ..., an) the algorithm first executes n-1 basic compare-exchange operations for the successive pairs of elements (a1, a2), (a2, a3), ..., (an-1, an). As a result, the biggest element is moved to the end of the sequence after the first iteration of the algorithm. The last element of the transformed sequence may then be omitted, and the same procedure is applied to the remaining part of the sequence (a'1, a'2, ..., a'n-1). The sequence is thus sorted after n-1 iterations. The efficiency of bubble sorting may be improved if the algorithm is terminated as soon as a sorting iteration causes no changes of the data sequence being sorted.

// Algorithm 10.1
// Sequential bubble sort algorithm
void BubbleSort(double A[], int n) {
  for (int i = 0; i < n - 1; i++)
    for (int j = 0; j < n - i - 1; j++)
      compare_exchange(A[j], A[j+1]);
}

Algorithm 10.1. The sequential bubble sort algorithm

10.3.2. Odd-Even Transposition Algorithm

The bubble sort algorithm in its original form is rather difficult to parallelize: the comparison of the pairs of values is strictly sequential. For this reason a modification of the algorithm known as the odd-even transposition method is used in parallel applications (see, for instance, Kumar et al. (2003)). The essence of the modification is that two different rules of executing the iterations are introduced into the algorithm: depending on whether the number of the sorting iteration is odd or even, the elements with odd or even indices, correspondingly, are selected for processing, and these values are compared with their right neighbors.
Thus, at all odd iterations the pairs

  (a1, a2), (a3, a4), ..., (an-1, an)    (for even n)

are compared, while at even iterations the pairs

  (a2, a3), (a4, a5), ..., (an-2, an-1)

are processed. After n iterations of this kind the initial data turn out to be sorted.

// Algorithm 10.2
// Sequential odd-even transposition algorithm
void OddEvenSort(double A[], int n) {
  for (int i = 1; i <= n; i++) {
    if (i % 2 == 1) {                  // odd iteration: pairs (a1,a2), (a3,a4), ...
      for (int j = 0; j + 1 < n; j += 2)
        compare_exchange(A[j], A[j+1]);
    }
    else {                             // even iteration: pairs (a2,a3), (a4,a5), ...
      for (int j = 1; j + 1 < n; j += 2)
        compare_exchange(A[j], A[j+1]);
    }
  }
}

Algorithm 10.2. The sequential odd-even transposition algorithm

10.3.3. Computation Decomposition and Analysis of Information Dependencies

Obtaining a parallel variant of the odd-even transposition method does not cause any problems: the pairs of values may be compared at each sorting iteration independently and in parallel. In the case p < n, when the number of processors is smaller than the number of values being sorted, each processor holds a data block of size n/p, and the compare-split operation may be used as the basic computational subtask (see Subsection 10.2).

// Algorithm 10.3
// Parallel odd-even transposition algorithm
// (A is the block of the data being sorted on the process, n is the block size)
void ParallelOddEvenSort(double A[], int n) {
  int id = GetProcId();    // process rank
  int np = GetProcNum();   // number of processes
  for (int i = 1; i <= np; i++) {
    if (i % 2 == 1) {             // odd iteration
      if (id % 2 == 1) {          // process with an odd rank
        if (id < np - 1)          // compare-split with the right neighbor
          compare_split_min(id + 1);
      }
      else if (id > 0)            // compare-split with the left neighbor
        compare_split_max(id - 1);
    }
    else {                        // even iteration
      if (id % 2 == 0) {          // process with an even rank
        if (id < np - 1)          // compare-split with the right neighbor
          compare_split_min(id + 1);
      }
      else                        // compare-split with the left neighbor
        compare_split_max(id - 1);
    }
  }
}

Algorithm 10.3. The parallel odd-even transposition algorithm

The operations compare_split_min and compare_split_max retain on the calling process, correspondingly, the smaller and the greater half of the merged blocks; a possible form of these operations is sketched below.
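The following sketch is only an illustration and is not given in the original text; the signature is an assumption (Algorithm 10.3 calls compare_split_min and compare_split_max with only the partner rank, so the local block is evidently available to them in some other way), and it is assumed that mpi.h is included and MPI has been initialized. The partner processes exchange their already sorted blocks, merge them, and keep the lower or the upper half.

// Illustrative sketch of the compare-split operation on blocks of equal size
void compare_split(double Block[], int BlockSize, int PartnerRank, int KeepMin) {
  double *Received = new double[BlockSize];
  double *Merged   = new double[2 * BlockSize];
  MPI_Status status;

  // exchange the (already sorted) blocks with the partner process
  MPI_Sendrecv(Block, BlockSize, MPI_DOUBLE, PartnerRank, 0,
               Received, BlockSize, MPI_DOUBLE, PartnerRank, 0,
               MPI_COMM_WORLD, &status);

  // merge the two sorted blocks into one sorted block of double size
  int i = 0, j = 0;
  for (int k = 0; k < 2 * BlockSize; k++) {
    if (j == BlockSize || (i < BlockSize && Block[i] <= Received[j]))
      Merged[k] = Block[i++];
    else
      Merged[k] = Received[j++];
  }

  // keep the lower half (compare_split_min) or the upper half (compare_split_max)
  int offset = KeepMin ? BlockSize * 0 : BlockSize;
  for (int k = 0; k < BlockSize; k++)
    Block[k] = Merged[offset + k];

  delete [] Received;
  delete [] Merged;
}

In these terms compare_split_min corresponds to calling the function with KeepMin set to a nonzero value, and compare_split_max to calling it with KeepMin equal to zero.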

To illustrate this parallel sorting method, Table 10.1 presents an example of sorting for n = 16 and p = 4 (so that the block on each processor holds n/p = 4 values). The first column of the table gives the number and the type of each iteration together with the pairs of processors that execute the compare-split operation in parallel at this iteration; the full form of the table also shows the state of the data being sorted before and after every iteration.

Table 10.1. An example of sorting data by the parallel odd-even transposition method (n = 16, p = 4): starting from the initial data distribution, iteration 1 (odd) pairs the processors (1,2) and (3,4), iteration 2 (even) pairs the processors (2,3), iteration 3 (odd) again pairs (1,2) and (3,4), and iteration 4 (even) pairs (2,3); after these iterations the data are ordered.

In the general case the execution of the parallel method may be terminated as soon as the state of the data being sorted does not change during two successive sorting iterations; the total number of iterations may thereby be reduced. To implement such a modification, a control processor should be introduced that determines the state of the data after every sorting iteration. However, the complexity of this communication operation (gathering messages from all the processors) may be so significant that the overhead of the data communications exceeds the effect of the possible reduction of the number of iterations.

10.3.4. Scaling and Distributing the Subtasks among the Processors

As it has been mentioned previously, the number of subtasks corresponds to the number of available processors, so there is no need for computation scaling. The initial distribution of the blocks of the data being sorted among the processors may be chosen arbitrarily. For the discussed parallel sorting algorithm to be executed efficiently, it is only necessary that all the processors with neighboring numbers have direct communication lines.

10.3.5. Efficiency Analysis

Let us first estimate the general complexity of the discussed parallel sorting algorithm and then add the complexity characteristics of the performed communications to the obtained relations.

Let us determine the complexity of the sequential computations. The bubble sort algorithm makes it possible to demonstrate a very important aspect of this analysis. As it has already been mentioned, the sorting method taken for parallelizing has a quadratic complexity with respect to the amount of data being sorted, T1 ~ n^2. However, using this non-optimal complexity estimate of the sequential algorithm would distort the meaning of the quality criteria of parallel computations: the resulting characteristics would describe the parallel execution of the given sorting method rather than the effectiveness of using parallelism for solving the data sorting problem as a whole. The difference is that more efficient sequential sorting algorithms exist, and their complexity is of the order

  T1 = n*log2(n).                                            (10.1)

It is this complexity estimate that should be used in order to determine how much faster the data may be sorted by means of parallel computations. As a result, we can formulate the following rule: when the speedup and efficiency characteristics of parallel computations are determined, the complexity estimate of the sequential method of solving the problem under consideration should be that of the best sequential algorithm. Parallel methods of solving problems should be compared to the most efficient fast sequential computational methods!

Let us now determine the complexity of the described parallel sorting algorithm. As it has been mentioned previously, at the initial stage of the method each processor sorts its data block (in the case of an equal data distribution the block size is n/p). Assuming that this initial sorting is performed by means of the best sequential algorithms, the complexity of the initial stage of the computations is

  Tp(1) = (n/p)*log2(n/p).                                   (10.2)

Then, at each iteration of parallel sorting, the interacting pairs of processors exchange their blocks, and the pairs of blocks formed on each processor are united by means of the merge procedure. The total number of iterations does not exceed p, so the total number of operations of this part of the parallel computations is

  Tp(2) = 2*p*(n/p) = 2n.                                    (10.3)

With regard to the obtained relations, the speedup and efficiency characteristics of the parallel sorting method are as follows:

  Sp = n*log2(n) / ((n/p)*log2(n/p) + 2n),
  Ep = n*log2(n) / (p*((n/p)*log2(n/p) + 2n)).                (10.4)

Let us refine these expressions by taking into account the duration of the computational operations and estimating the complexity of the block exchanges between the processors. When the Hockney model is used, the total time of all the block exchanges performed in the course of sorting may be estimated as

  Tp(comm) = p*(alpha + w*(n/p)/beta),                        (10.5)

where alpha is the latency, beta is the network bandwidth, and w is the size of a data element in bytes. With regard to the complexity of the communication operations, the total execution time of the parallel data sorting algorithm is determined by the expression

  Tp = ((n/p)*log2(n/p) + 2n)*tau + p*(alpha + w*(n/p)/beta), (10.6)

where tau is the execution time of the basic sorting operation.

10.3.6. Computational Experiment Results

The computational experiments for estimating the efficiency of the parallel bubble sort algorithm were carried out under the conditions described previously. In brief, these conditions are the following. The experiments were carried out on a computational cluster built on Intel Xeon EM64T 3000 MHz processors and Gigabit Ethernet under OS Microsoft Windows Server 2003 Standard x64 Edition (see 1.2.3). To estimate the duration tau of the basic sorting operation, the sorting problem was solved by means of the sequential algorithm and the obtained computation time was divided by the total number of operations; the value tau = 10.1 nsec was obtained as a result of the experiments. The experiments carried out to determine the network parameters showed the latency alpha and the network bandwidth beta to be 130 usec and 53.9 Mbyte/sec, correspondingly. All the computations were performed over numerical values of the double type, so the value of w is equal to 8 bytes.

The results of the computational experiments are given in Table 10.2. The experiments were carried out with the use of two and four processors.

Table 10.2. The results of the computational experiments for the parallel bubble sort algorithm (for each data size from 10,000 to 50,000 elements the table gives the execution time of the sequential algorithm and the execution time and speedup of the parallel algorithm on 2 and on 4 processors)

Figure 10.1. Speedup of the parallel bubble sort algorithm (experiments with 10,000 to 50,000 elements)

According to the results of the computational experiments, the parallel bubble sort algorithm operates more slowly than the original sequential bubble sort method. The reason is that the amount of data transmitted between the processors is rather large and is comparable to the number of the executed computational operations (and this imbalance between the amount of computations and the complexity of the data communication operations grows as the number of processors increases).

The comparison of the experimental execution time Tp and the theoretical estimate obtained from (10.6) is given in Table 10.3 and Figure 10.2.

Table 10.3. The comparison of the experimental and theoretical execution times for the parallel bubble sort algorithm (for each data size the table gives the experimental time and the model estimate for 2 and for 4 processors)
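The theoretical ("model") values compared with the experiment in Table 10.3 and Figure 10.2 are obtained by evaluating the estimate (10.6) with the experimentally measured parameters quoted above. Purely as an illustration, and not as part of the original program, such an evaluation may be written as follows (the function name is an assumption):

#include <cmath>

// Evaluation of the theoretical estimate (10.6) for the parallel odd-even
// (bubble) sort; the parameter values are those quoted in Subsection 10.3.6
double BubbleSortModelTime(double n, double p) {
  const double tau   = 10.1e-9;   // execution time of the basic operation, sec
  const double alpha = 130e-6;    // latency, sec
  const double beta  = 53.9e6;    // network bandwidth, byte/sec
  const double w     = 8;         // size of a data element (double), bytes
  double calc = ((n / p) * (log(n / p) / log(2.0)) + 2 * n) * tau;
  double comm = p * (alpha + w * (n / p) / beta);
  return calc + comm;
}

For instance, BubbleSortModelTime(50000, 4) yields the estimate of the parallel execution time for 50,000 elements on four processors.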

Figure 10.2. Experimental and theoretical execution times of the parallel bubble sort algorithm

10.4. Shell Sort

10.4.1. Sequential Algorithm

In the Shell sort algorithm (see, for instance, Knuth (1997), Cormen et al. (2001)) the compared pairs of values are, from the very beginning, formed from elements that are located rather far from each other in the data being sorted. This modification of the sorting method makes it possible to permute incorrectly ordered pairs of widely separated values quickly enough (sorting such pairs would require a greater number of permutation operations if only neighboring elements were compared).

The general scheme of the method is the following. At the first step of the algorithm the elements of n/2 pairs (ai, an/2+i) for 1 <= i <= n/2 are sorted. At the second step the elements of n/4 groups of four elements each (ai, an/4+i, an/2+i, a3n/4+i) for 1 <= i <= n/4 are sorted. At the third step the elements of n/8 groups of eight elements each are sorted, and so on; at the last step all the elements of the array (a1, a2, ..., an) are sorted together. At each step the insertion sort method is used for sorting the elements within the groups. The total number of iterations of the Shell algorithm is equal to log2(n). The Shell sort algorithm may be written as follows:

// Algorithm 10.4
// Sequential Shell sort algorithm
void ShellSort(double A[], int n) {
  int incr = n / 2;
  while (incr > 0) {
    // insertion sort of the elements that are incr positions apart
    for (int i = incr; i < n; i++) {
      int j = i - incr;
      while (j >= 0 && A[j] > A[j+incr]) {
        swap(A[j], A[j+incr]);
        j = j - incr;
      }
    }
    incr = incr / 2;
  }
}

Algorithm 10.4. The sequential Shell sort algorithm

10.4.2. Parallel Algorithm

A parallel variant of the Shell sort method may be suggested (see, for instance, Kumar et al. (2003)) when the topology of the communication network can be represented as an N-dimensional hypercube (i.e. when the number of processors is equal to p = 2^N).
In this case the sorting may be subdivided into two successive stages. At the first stage (N iterations) the processors that are neighbors in the hypercube structure interact with each other; in the linear enumeration these processors may turn out to be located rather far from each other. The required mapping of the hypercube topology onto the linear array structure may be implemented by means of the Gray code (see Section 3). The pairs of processors that interact with each other during the compare-split operation are formed according to the following simple rule: at each iteration i, 0 <= i < N, the processors whose bit codes of their numbers differ only in position N-i are paired. At the second stage the usual iterations of the parallel odd-even transposition algorithm are performed; the iterations of this stage are executed until the data being sorted actually stop changing, so the total number L of such iterations may vary from 2 to p.

Figure 10.3 shows an example of sorting an array of 16 elements by means of the discussed method. It should be noted that in this example the data turn out to be sorted already after the completion of the first stage, and there is no need to execute the odd-even transposition iterations.

Figure 10.3. An example of the parallel Shell sort algorithm for four processors (the processors are marked by circles, the processor numbers are given in their binary representation)

With regard to the given description, the same decomposition approach can be applied here as well, with the compare-split operation defined as the basic computational subtask. The number of subtasks then coincides with the number of available processors (the size of the data block of each subtask is equal to n/p), so, again, scaling of the computations is not needed. The distribution of the data being sorted among the processors should be chosen with regard to the efficient implementation of the compare-split operations in the hypercube network topology.

10.4.3. Efficiency Analysis

The relations obtained for the parallel bubble sort method (see Subsection 10.3.5) may be used for estimating the efficiency of the parallel variant of the Shell sort algorithm; it is only necessary to take into account the two stages of the algorithm. With regard to this peculiarity, the total execution time of the new parallel method may be determined by means of the expression

  Tp = (n/p)*log2(n/p)*tau + (log2(p) + L)*(2*(n/p)*tau + (alpha + w*(n/p)/beta)).   (10.7)

As it can be seen, the efficiency of the parallel variant of Shell sorting depends considerably on the value of L: if L is small, the new parallel sorting method is executed faster than the previously described odd-even transposition algorithm.

10.4.4. Computational Experiment Results

The computational experiments for estimating the efficiency of the parallel Shell sort method were carried out under the same conditions as the experiments described previously (see 10.3.6). The results of the computational experiments are given in Table 10.4; the experiments were carried out with the use of 2 and 4 processors, and the time is given in seconds.

Table 10.4. The results of the computational experiments for the parallel Shell sort algorithm (for each data size from 10,000 to 50,000 elements the table gives the execution time of the sequential algorithm and the execution time and speedup of the parallel algorithm on 2 and on 4 processors)

Figure 10.4. Speedup of the parallel Shell sort algorithm

The comparison of the experimental execution time Tp and the theoretical estimate obtained from (10.7) is given in Table 10.5 and Figure 10.5.

Table 10.5. The comparison of the experimental and theoretical execution times for the parallel Shell sort algorithm (for each data size the table gives the experimental time and the model estimate for 2 and for 4 processors)

Figure 10.5. Experimental and theoretical execution times of the parallel Shell sort algorithm

10.5. Quick Sort

10.5.1. Sequential Algorithm

The quick sort algorithm, proposed by C.A.R. Hoare, is based on sequentially subdividing the data being sorted into blocks of smaller size in such a way that an ordering relation holds between the values of different blocks (for any pair of blocks, all the values of one of them do not exceed the values of the other). At the first iteration of the method the initial data set is divided into the first two parts: a certain pivot element is selected, all the values smaller than the pivot are transferred to the first block being formed, and all the remaining values form the second block of the data being sorted. At the second iteration these rules are applied recursively to both of the created blocks, and so on. If the choice of the pivot elements is adequate, the initial data array turns out to be sorted after log2(n) iterations. More detailed information about the method may be found, for instance, in Knuth (1997) or Cormen et al. (2001).

The efficiency of the quick sort method is determined to a great extent by the choice of the pivot elements used when the data are divided into blocks. In the worst case the complexity of the method has the same order as that of the bubble sort (i.e. T1 ~ n^2). With an optimal choice of the pivot elements, when every block is divided into two parts of equal size, the complexity of the algorithm coincides with that of the most efficient sorting methods (T1 ~ n*log2(n)). On average, the number of operations carried out by the quick sort algorithm is (see, for instance, Knuth (1997), Cormen et al. (2001))

  T1 = 1.4*n*log2(n).

The general scheme of the quick sort algorithm may be presented in the following form (the pivot element is taken to be the first element of the data being sorted):

// Algorithm 10.5
// Sequential quick sort algorithm
// (the whole array is sorted by the call QuickSort(A, 0, n))
void QuickSort(double A[], int i1, int i2) {
  if (i1 < i2) {
    double pivot = A[i1];
    int is = i1;
    for (int i = i1 + 1; i < i2; i++)
      if (A[i] < pivot) {
        is = is + 1;
        swap(A[is], A[i]);
      }
    swap(A[i1], A[is]);
    QuickSort(A, i1, is);
    QuickSort(A, is + 1, i2);
  }
}

Algorithm 10.5. The sequential quick sort algorithm

10.5.2. The Parallel Quick Sort Algorithm

Parallel Computational Scheme

A parallel generalization of the quick sort algorithm (see, for instance, Quinn (2004)) is obtained in the simplest way for a computer system whose topology is an N-dimensional hypercube (i.e. p = 2^N). Let the initial data, as before, be distributed among the processors in blocks of the same size n/p, and let the resulting location of the blocks correspond to the enumeration of the hypercube processors. Under these conditions the first iteration of the parallel method may be executed in the following way:
- select the pivot element and broadcast it to all the processors (for instance, the arithmetic mean of the elements located on some leading processor may be chosen as the pivot);
- on each processor, subdivide the available data block into two parts using the pivot element;
- form the pairs of processors whose bit representations of their numbers differ only in position N and exchange the data between these processors; as a result of these data transmissions, the parts of the blocks with values smaller than the pivot element must end up on the processors whose numbers contain 0 in bit position N, while the processors whose numbers contain 1 in bit position N collect, correspondingly, all the values exceeding the pivot.

As a result of this iteration the initial data become subdivided into two parts; one of them (with the values smaller than the pivot) is located on the processors whose numbers contain 0 in bit position N, and there are exactly p/2 such processors. Thus, the initial N-dimensional hypercube is also split into two sub-hypercubes of dimension N-1, to which the described procedure may be applied in turn. After N such iterations it only remains, in order to complete the method, to sort the data blocks formed on each individual processor.

To illustrate the parallel quick sort algorithm, Figure 10.6 shows an example of sorting data for n = 16 and p = 4 (each processor block holds four elements). In the figure the processors are shown as rectangles with the blocks of the data being sorted given inside them; the block values are shown at the beginning and at the completion of each sorting iteration, and the interacting pairs of processors are linked by double-headed arrows. For the data partitioning the optimal values of the pivot elements were chosen: at the first iteration the same pivot value was used for all the processors, while at the second iteration separate pivot values were selected for the pair of processors (0, 1) and for the pair (2, 3).

Figure 10.6. An example of sorting data by the parallel quick sort method (the results of the final local sorting of the blocks are not shown)

As before, the compare-split operation may be taken as the basic computational subtask; the number of subtasks coincides with the number of the processors being used. The distribution of the subtasks among the processors should be performed with regard to the efficient execution of the algorithm on the hypercube network topology. The way the communication partner of each processor and the retained part of its block can be derived from the processor number is sketched below.
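The following fragment is only an illustration and is not part of the original text; the function name and the iteration numbering (i = N, N-1, ..., 1, consistent with the program given in Subsection 10.5.3) are assumptions. It shows how the partner rank and the "keep the smaller part / keep the greater part" decision of one iteration follow from the bit representation of the process rank.

// Illustrative sketch: partner selection for one iteration of the parallel
// quick sort on an N-dimensional hypercube
void GetIterationPartner(int ProcRank, int i, int *PartnerRank, int *KeepsSmaller) {
  int Mask = 1 << (i - 1);                   // the bit in which the partners differ
  *PartnerRank  = ProcRank ^ Mask;           // ranks differ only in this bit position
  *KeepsSmaller = ((ProcRank & Mask) == 0);  // bit value 0: keep the values below the pivot
}

At the first iteration (i = N) the partners differ in the most significant bit, so after the exchange the hypercube splits into two sub-hypercubes of dimension N-1, exactly as described in the scheme above.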

Efficiency Analysis

Let us estimate the complexity of the described parallel method, assuming, as before, that the topology is an N-dimensional hypercube (i.e. p = 2^N) and that p < n. As in the sequential variant, the efficiency of the parallel quick sort method largely depends on how well the pivot elements are chosen. It is rather complicated to work out a general rule for selecting these values, but the choice becomes easier if the processor data blocks are sorted at the very beginning of the method execution; it is also useful to provide a more uniform distribution of the data among the processors.

Let us determine the computational complexity of the parallel algorithm. At each of the log2(p) sorting iterations every processor divides its block with respect to the pivot element, which requires n/p operations (here and below we consider the best possible case, when every block is divided into parts of equal size at every iteration). After the completion of these iterations the processors sort their blocks, which may be done in (n/p)*log2(n/p) operations by means of the quick sort algorithm. Thus, the total computation time of the parallel quick sort algorithm is

  Tp(calc) = ((n/p)*log2(p) + (n/p)*log2(n/p))*tau,            (10.8)

where tau is the execution time of the basic sorting operation.

Let us now consider the complexity of the communication operations. The total number of communication steps needed to broadcast the pivot elements in the N-dimensional hypercube may be estimated as

  sum(i=1..N) i = N*(N+1)/2 = log2(p)*(log2(p)+1)/2 ~ log2^2(p).   (10.9)

With regard to the assumption made above (the choice of the pivot elements is optimal), the number of the algorithm iterations is log2(p) and the amount of data transmitted at every iteration is always equal to half of a block, i.e. (n/p)/2 values. Under these conditions the communication complexity of the parallel quick sort algorithm is determined by the relation

  Tp(comm) = log2^2(p)*(alpha + w/beta) + log2(p)*(alpha + w*(n/(2p))/beta),   (10.10)

where alpha is the latency, beta is the network bandwidth, and w is the size of a data element in bytes. Finally, the total time complexity of the algorithm is

  Tp = ((n/p)*log2(p) + (n/p)*log2(n/p))*tau + log2^2(p)*(alpha + w/beta) + log2(p)*(alpha + w*(n/(2p))/beta).   (10.11)

Computational Experiment Results

The computational experiments for estimating the efficiency of the parallel quick sort method were carried out under the same conditions as the experiments described previously (see 10.3.6). The results of the computational experiments are given in Table 10.6; the experiments were carried out with the use of 2 and 4 processors, and the time is given in seconds.

Table 10.6. The results of the computational experiments for the parallel quick sort algorithm (for each data size from 10,000 to 50,000 elements the table gives the execution time of the sequential algorithm and the execution time and speedup of the parallel algorithm on 2 and on 4 processors)

Figure 10.7. Speedup of the parallel quick sort algorithm

According to the results of the computational experiments, the parallel quick sort algorithm provides a speedup in solving the data sorting problem.

The comparison of the experimental execution time Tp and the theoretical estimate obtained from (10.11) is given in Table 10.7 and Figure 10.8.

Table 10.7. The comparison of the experimental and theoretical execution times for the parallel quick sort algorithm (for each data size the table gives the experimental time and the model estimate for 2 and for 4 processors)

Figure 10.8. Experimental and theoretical execution times of the parallel quick sort algorithm

10.5.3. The Parallel HyperQuickSort Algorithm

In addition to the quick sort method described above, there exists a generalized technique, the HyperQuickSort algorithm, which suggests a specific scheme for choosing the pivot elements. According to this scheme, the data blocks located on the processors are sorted at the very beginning of the computations and, afterwards, the processors maintain the ordering of their data by merging the parts of the blocks obtained after every partitioning. As a result, due to the ordering of the blocks, it is reasonable to choose as the pivot element at each iteration the middle element of some block (for instance, of the block on the first processor of the corresponding sub-hypercube). A pivot element selected in this way may in many cases turn out to be much closer to the actual mean value of the data being sorted than a randomly chosen value. All the other operations of the algorithm are executed according to the original parallel quick sort method; in detail the HyperQuickSort algorithm is described, for instance, in Quinn (2004).

Relation (10.11) may be used for analyzing the efficiency of the HyperQuickSort algorithm as well. It should be taken into account that the operation of merging the block parts is carried out at every iteration of the method (as before, we assume that the block parts are of equal size, i.e. (n/p)/2 elements each) and that, due to the ordering of the blocks, the partitioning procedure may be simplified: it is sufficient to perform a binary search for the position of the pivot element in a block instead of an exhaustive linear search through all the block elements. With regard to this, the complexity of the HyperQuickSort algorithm may be expressed as

  Tp = ((n/p)*log2(n/p) + (log2(n/p) + (n/p))*log2(p))*tau + log2^2(p)*(alpha + w/beta) + log2(p)*(alpha + w*(n/(2p))/beta).   (10.12)

Software Implementation

Let us discuss a possible variant of the software implementation of the HyperQuickSort algorithm. It should be noted that the program code of several modules is not given, as its absence does not influence the understanding of the general scheme of the parallel computations.

1. The main function. The main function implements the computational scheme of the method by sequentially calling the necessary subprograms.

// The HyperQuickSort method
#include <mpi.h>
#include <cmath>
#include <cstdlib>

int ProcRank;   // rank of the current process
int ProcNum;    // number of processes

int main(int argc, char *argv[]) {
  double *ProcData;    // data block of the process
  int ProcDataSize;    // size of the data block

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &ProcRank);
  MPI_Comm_size(MPI_COMM_WORLD, &ProcNum);

  // data initialization and distribution among the processes
  ProcessInitialization(&ProcData, &ProcDataSize);

  // parallel sorting
  ParallelHyperQuickSort(ProcData, ProcDataSize);

  // output of the sorted data and termination of the computations
  ProcessTermination(ProcData, ProcDataSize);

  MPI_Finalize();
  return 0;
}

The function ProcessInitialization determines the initial data of the problem being solved (the size of the data to be sorted), allocates the memory for data storage, generates the data being sorted (for instance, by means of a random number generator) and distributes the data among the processes. The function ProcessTermination performs the necessary output of the sorted data and releases all the memory allocated previously for storing the data.
The implementation of all the functions mentioned above may be carried out by analogy with the examples discussed earlier and is left to the reader as a training exercise. A purely illustrative sketch of one of these functions is given below.
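The following sketch of ProcessInitialization is only an illustration and is not the implementation intended by the text (the data size, the random generation scheme and the use of MPI_Scatter are assumptions; an equal distribution of the data is assumed as well).

// Illustrative sketch of ProcessInitialization
void ProcessInitialization(double **pProcData, int *pProcDataSize) {
  int DataSize = 0;
  double *Data = NULL;
  if (ProcRank == 0) {
    DataSize = 100000;                      // size of the data to be sorted (illustrative)
    Data = new double[DataSize];
    for (int i = 0; i < DataSize; i++)      // random generation of the data
      Data[i] = double(rand()) / RAND_MAX;
  }
  MPI_Bcast(&DataSize, 1, MPI_INT, 0, MPI_COMM_WORLD);

  *pProcDataSize = DataSize / ProcNum;      // DataSize is assumed to be divisible by ProcNum
  *pProcData = new double[*pProcDataSize];
  MPI_Scatter(Data, *pProcDataSize, MPI_DOUBLE,
              *pProcData, *pProcDataSize, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  if (ProcRank == 0) delete [] Data;
}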

2. The function ParallelHyperQuickSort. This function performs the parallel quick sorting according to the algorithm described above.

// The parallel HyperQuickSort method
void ParallelHyperQuickSort(double* &ProcData, int &ProcDataSize) {
  MPI_Status status;
  int CommProcRank;      // rank of the process involved in the communications
  double *MergeData,     // block obtained after merging the block parts
         *Data,          // block part that remains on the process
         *SendData,      // block part sent to the process CommProcRank
         *RecvData;      // block part received from the process CommProcRank
  int DataSize, SendDataSize, RecvDataSize, MergeDataSize;
  int HypercubeDim = (int)ceil(log(ProcNum) / log(2));  // hypercube dimension
  int Mask = ProcNum;
  double Pivot;

  // local sorting of the data block
  LocalDataSort(ProcData, ProcDataSize);

  // HyperQuickSort iterations
  for (int i = HypercubeDim; i > 0; i--) {
    // determination of the pivot value and its broadcast to the processes
    PivotDistribution(ProcData, ProcDataSize, HypercubeDim, Mask, i, &Pivot);
    Mask = Mask >> 1;

    // determination of the position dividing the block
    int pos = GetProcDataDivisionPos(ProcData, ProcDataSize, Pivot);

    // division of the block
    if (((ProcRank & Mask) >> (i - 1)) == 0) {   // the considered bit of the rank is 0
      SendData = &ProcData[pos + 1];
      SendDataSize = ProcDataSize - pos - 1;
      if (SendDataSize < 0) SendDataSize = 0;
      CommProcRank = ProcRank + Mask;
      Data = &ProcData[0];
      DataSize = pos + 1;
    }
    else {                                        // the considered bit of the rank is 1
      SendData = &ProcData[0];
      SendDataSize = pos + 1;
      if (SendDataSize > ProcDataSize) SendDataSize = pos;
      CommProcRank = ProcRank - Mask;
      Data = &ProcData[pos + 1];
      DataSize = ProcDataSize - pos - 1;
      if (DataSize < 0) DataSize = 0;
    }

    // exchanging the sizes of the block parts
    MPI_Sendrecv(&SendDataSize, 1, MPI_INT, CommProcRank, 0,
                 &RecvDataSize, 1, MPI_INT, CommProcRank, 0,
                 MPI_COMM_WORLD, &status);

    // exchanging the block parts
    RecvData = new double[RecvDataSize];
    MPI_Sendrecv(SendData, SendDataSize, MPI_DOUBLE, CommProcRank, 0,
                 RecvData, RecvDataSize, MPI_DOUBLE, CommProcRank, 0,
                 MPI_COMM_WORLD, &status);

    // merging the retained and the received parts of the block
    MergeDataSize = DataSize + RecvDataSize;
    MergeData = new double[MergeDataSize];
    DataMerge(MergeData, Data, DataSize, RecvData, RecvDataSize);
    delete [] ProcData;
    delete [] RecvData;
    ProcData = MergeData;
    ProcDataSize = MergeDataSize;
  }
}

The function LocalDataSort sorts the data block of each process by means of the sequential quick sort algorithm. The function PivotDistribution determines the pivot element and sends its value to all the processes. The function GetProcDataDivisionPos computes the position at which the data block is partitioned with respect to the pivot element; the result of the function is an integer number determining the position of the element on the border of the two block parts (a possible implementation is sketched after the next listing). The function DataMerge merges the block parts into a single sorted data block.

3. The function PivotDistribution. This function selects the pivot element and sends it to all the processes of the corresponding sub-hypercube. Since the data located on the processes have already been sorted, the pivot element is selected as the middle element of the data block.

// Determination of the pivot value and its broadcast
// to all the processes of the sub-hypercube
void PivotDistribution(double *ProcData, int ProcDataSize, int Dim,
                       int Mask, int Iter, double *Pivot) {
  MPI_Group WorldGroup;
  MPI_Group SubcubeGroup;   // group of the processes of the sub-hypercube
  MPI_Comm  SubcubeComm;    // communicator of the sub-hypercube
  int j = 0;
  int GroupNum = ProcNum / (int)pow(2, Dim - Iter);
  int *ProcRanks = new int[GroupNum];

  // forming the list of ranks of the sub-hypercube processes
  int StartProc = ProcRank - GroupNum;
  if (StartProc < 0) StartProc = 0;
  int EndProc = ProcRank + GroupNum;
  if (EndProc > ProcNum) EndProc = ProcNum;
  for (int proc = StartProc; proc < EndProc; proc++) {
    if ((ProcRank & Mask) >> Iter == (proc & Mask) >> Iter)
      ProcRanks[j++] = proc;
  }

  // creating the communicator for the sub-hypercube processes
  MPI_Comm_group(MPI_COMM_WORLD, &WorldGroup);
  MPI_Group_incl(WorldGroup, GroupNum, ProcRanks, &SubcubeGroup);
  MPI_Comm_create(MPI_COMM_WORLD, SubcubeGroup, &SubcubeComm);

  // selecting the pivot element and sending it to the sub-hypercube processes
  if (ProcRank == ProcRanks[0])
    *Pivot = ProcData[ProcDataSize / 2];
  MPI_Bcast(Pivot, 1, MPI_DOUBLE, 0, SubcubeComm);

  MPI_Group_free(&SubcubeGroup);
  MPI_Comm_free(&SubcubeComm);
  delete [] ProcRanks;
}
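The function GetProcDataDivisionPos is described above but its listing is not given in the text. The following binary search is only one possible implementation (an illustration consistent with the way pos is used in ParallelHyperQuickSort and with the log2(n/p) search assumed in (10.12)); the exact form of the original function is an assumption.

// Illustrative sketch: position dividing a sorted block with respect to the pivot.
// The returned index is that of the last element not exceeding the pivot
// (-1 if all the elements are greater than the pivot).
int GetProcDataDivisionPos(double *ProcData, int ProcDataSize, double Pivot) {
  int Low = 0, High = ProcDataSize - 1, Result = -1;
  while (Low <= High) {
    int Mid = (Low + High) / 2;
    if (ProcData[Mid] <= Pivot) {   // the elements ProcData[0..Mid] do not exceed the pivot
      Result = Mid;
      Low = Mid + 1;
    }
    else
      High = Mid - 1;
  }
  return Result;
}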

Computational Experiment Results

The computational experiments for estimating the efficiency of the parallel HyperQuickSort method were carried out under the same conditions as the experiments described previously (see 10.3.6). The results are given in Table 10.8; the experiments were carried out with the use of 2 and 4 processors, and the time is given in seconds.

Table 10.8. The results of the computational experiments for the parallel HyperQuickSort algorithm (for each data size from 10,000 to 50,000 elements the table gives the execution time of the sequential algorithm and the execution time and speedup of the parallel algorithm on 2 and on 4 processors)

Figure 10.9. Speedup of the parallel HyperQuickSort algorithm

The comparison of the experimental execution time Tp and the theoretical estimate obtained from (10.12) is given in Table 10.9 and Figure 10.10.

Table 10.9. The comparison of the experimental and theoretical execution times for the parallel HyperQuickSort algorithm (for each data size the table gives the experimental time and the model estimate for 2 and for 4 processors)

Figure 10.10. Experimental and theoretical execution times of the parallel HyperQuickSort algorithm

10.6. The Parallel Sorting by Regular Sampling

Parallel Computational Scheme

The parallel sorting by regular sampling algorithm is yet another generalization of the quick sort method (see, for instance, Quinn (2004)). Sorting the data according to this variant of the quick sort algorithm involves four stages:
- at the first stage each processor sorts its own block independently of the other processors by means of the sequential quick sort algorithm, and then forms from its block the set of elements with the indices 0, m, 2m, ..., (p-1)*m, where m = n/p^2 (this set may be regarded as a regular sample of the processor data block);
- at the second stage the sample sets formed on all the processors are gathered on one of the processors and merged into a single sorted set; from this set of p^2 values, p-1 elements selected with the regular step p form the new set of pivot elements, which is then transmitted to all the processors being used; at the end of the stage each processor partitions its own block into p parts using the obtained pivot values;
- at the third stage every processor sends the selected parts of its block to all the other processors according to the enumeration order: the part j, 0 <= j < p, of each block is transmitted to the processor j;
- at the fourth stage each processor merges the p obtained parts into a single sorted block.

After the completion of the fourth stage the initial data are sorted. The first two stages of this scheme are sketched below.
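The following fragment is only an illustration and is not part of the original text: it reuses the names ProcRank, ProcNum and LocalDataSort from the program of Subsection 10.5.3, while the function name, the pivot indices and the use of MPI_Gather and MPI_Bcast are assumptions. It shows the selection of the regular samples (stage one) and of the p-1 pivots from the gathered sample set (stage two); the caller is assumed to provide a Pivots array of at least ProcNum-1 elements.

// Illustrative sketch of the first two stages of sorting by regular sampling
void SelectRegularSamplingPivots(double *Block, int BlockSize, double *Pivots) {
  double *Samples = new double[ProcNum];
  double *AllSamples = NULL;

  LocalDataSort(Block, BlockSize);                 // stage 1: local sorting
  for (int i = 0; i < ProcNum; i++)                // stage 1: regular samples of the block
    Samples[i] = Block[i * (BlockSize / ProcNum)];

  if (ProcRank == 0)
    AllSamples = new double[ProcNum * ProcNum];
  MPI_Gather(Samples, ProcNum, MPI_DOUBLE,         // stage 2: gather the sample sets
             AllSamples, ProcNum, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  if (ProcRank == 0) {
    LocalDataSort(AllSamples, ProcNum * ProcNum);  // stage 2: order the gathered set
    for (int i = 1; i < ProcNum; i++)              // stage 2: regularly spaced pivots
      Pivots[i - 1] = AllSamples[i * ProcNum];
    delete [] AllSamples;
  }
  MPI_Bcast(Pivots, ProcNum - 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  delete [] Samples;
}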

Figure 10.11 shows an example of sorting data by means of the described algorithm. It should be noted that the number of processors for this algorithm may be arbitrary; in the example it is equal to 3.

Figure 10.11. An example of executing the parallel sorting by regular sampling for 3 processors

Efficiency Analysis

Let us estimate the complexity of this parallel method. Let n be the amount of data being sorted, let p, p < n, denote the number of processors being used and, correspondingly, let n/p be the size of the data block on each processor.

During the first stage of the algorithm each processor sorts its data block by means of the quick sort method, so the duration of the performed operations is

  Tp(1) = (n/p)*log2(n/p)*tau,

where tau is the execution time of the basic sorting operation.

During the second stage of the algorithm one of the processors gathers the sample sets from all the other processors, merges the obtained data (the total number of elements is p^2), forms the set of p-1 pivot elements and transmits this set to the other processors. Taking into account these operations, the duration of the second stage is

  Tp(2) = (alpha*log2(p) + w*p*(p-1)/beta) + (p^2*log2(p))*tau + p*tau + log2(p)*(alpha + w*p/beta)

(the terms correspond, in order, to the four operations listed above); here, as before, alpha is the latency, beta is the network bandwidth and w is the size of a data element in bytes.

During the third stage of the algorithm each processor divides its block into p parts with respect to the pivot elements (the total number of operations needed for this may be bounded by n/p), and then all the processors exchange the formed parts of their blocks with one another. The complexity of this communication operation for the hypercube network topology was considered in Section 3, where it was shown that it may be carried out in log2(p) steps, with every processor transmitting and receiving a message of (n/p)/2 elements at each step. As a result, the complexity of the third stage may be estimated as

  Tp(3) = (n/p)*tau + log2(p)*(alpha + w*(n/(2p))/beta).

During the fourth stage every processor merges the p sorted parts into a single sorted block; the complexity estimate of such a merge was already discussed for the second stage, and the duration of the merge procedure is

  Tp(4) = (n/p)*log2(p)*tau.

Summing the durations of all the stages, the total execution time of the parallel sorting by regular sampling may be estimated as

  Tp = Tp(1) + Tp(2) + Tp(3) + Tp(4).


More information

521493S Computer Graphics Exercise 3 (Chapters 6-8)

521493S Computer Graphics Exercise 3 (Chapters 6-8) 521493S Comuter Grahics Exercise 3 (Chaters 6-8) 1 Most grahics systems and APIs use the simle lighting and reflection models that we introduced for olygon rendering Describe the ways in which each of

More information

Cross products. p 2 p. p p1 p2. p 1. Line segments The convex combination of two distinct points p1 ( x1, such that for some real number with 0 1,

Cross products. p 2 p. p p1 p2. p 1. Line segments The convex combination of two distinct points p1 ( x1, such that for some real number with 0 1, CHAPTER 33 Comutational Geometry Is the branch of comuter science that studies algorithms for solving geometric roblems. Has alications in many fields, including comuter grahics robotics, VLSI design comuter

More information

Patterned Wafer Segmentation

Patterned Wafer Segmentation atterned Wafer Segmentation ierrick Bourgeat ab, Fabrice Meriaudeau b, Kenneth W. Tobin a, atrick Gorria b a Oak Ridge National Laboratory,.O.Box 2008, Oak Ridge, TN 37831-6011, USA b Le2i Laboratory Univ.of

More information

Topics. Lecture 4. IT Group Cluster2 (1/2) What is a cluster? IT Group Cluster2 (2/2) Important Commands / Queuing.

Topics. Lecture 4. IT Group Cluster2 (1/2) What is a cluster? IT Group Cluster2 (2/2) Important Commands / Queuing. Toics Our Cluster Lecture 4 MPI Programming (I) MPI Introduction Information inquery Broadcast / Reduce 1 2 What is a cluster? A cluster is a dedicated resource for running comutational tasks. A collection

More information

Distributed Estimation from Relative Measurements in Sensor Networks

Distributed Estimation from Relative Measurements in Sensor Networks Distributed Estimation from Relative Measurements in Sensor Networks #Prabir Barooah and João P. Hesanha Abstract We consider the roblem of estimating vectorvalued variables from noisy relative measurements.

More information

Simulating Ocean Currents. Simulating Galaxy Evolution

Simulating Ocean Currents. Simulating Galaxy Evolution Simulating Ocean Currents (a) Cross sections (b) Satial discretization of a cross section Model as two-dimensional grids Discretize in sace and time finer satial and temoral resolution => greater accuracy

More information

Visualization, Estimation and User-Modeling for Interactive Browsing of Image Libraries

Visualization, Estimation and User-Modeling for Interactive Browsing of Image Libraries Visualization, Estimation and User-Modeling for Interactive Browsing of Image Libraries Qi Tian, Baback Moghaddam 2 and Thomas S. Huang Beckman Institute, University of Illinois, Urbana-Chamaign, IL 680,

More information

PREDICTING LINKS IN LARGE COAUTHORSHIP NETWORKS

PREDICTING LINKS IN LARGE COAUTHORSHIP NETWORKS PREDICTING LINKS IN LARGE COAUTHORSHIP NETWORKS Kevin Miller, Vivian Lin, and Rui Zhang Grou ID: 5 1. INTRODUCTION The roblem we are trying to solve is redicting future links or recovering missing links

More information

Randomized Selection on the Hypercube 1

Randomized Selection on the Hypercube 1 Randomized Selection on the Hyercube 1 Sanguthevar Rajasekaran Det. of Com. and Info. Science and Engg. University of Florida Gainesville, FL 32611 ABSTRACT In this aer we resent randomized algorithms

More information

Randomized algorithms: Two examples and Yao s Minimax Principle

Randomized algorithms: Two examples and Yao s Minimax Principle Randomized algorithms: Two examles and Yao s Minimax Princile Maximum Satisfiability Consider the roblem Maximum Satisfiability (MAX-SAT). Bring your knowledge u-to-date on the Satisfiability roblem. Maximum

More information

Building Better Nurse Scheduling Algorithms

Building Better Nurse Scheduling Algorithms Building Better Nurse Scheduling Algorithms Annals of Oerations Research, 128, 159-177, 2004. Dr Uwe Aickelin Dr Paul White School of Comuter Science University of the West of England University of Nottingham

More information

Efficient stereo vision for obstacle detection and AGV Navigation

Efficient stereo vision for obstacle detection and AGV Navigation Efficient stereo vision for obstacle detection and AGV Navigation Rita Cucchiara, Emanuele Perini, Giuliano Pistoni Diartimento di Ingegneria dell informazione, University of Modena and Reggio Emilia,

More information

Source Coding and express these numbers in a binary system using M log

Source Coding and express these numbers in a binary system using M log Source Coding 30.1 Source Coding Introduction We have studied how to transmit digital bits over a radio channel. We also saw ways that we could code those bits to achieve error correction. Bandwidth is

More information

Communication-Avoiding Parallel Algorithms for Solving Triangular Matrix Equations

Communication-Avoiding Parallel Algorithms for Solving Triangular Matrix Equations Research Collection Bachelor Thesis Communication-Avoiding Parallel Algorithms for Solving Triangular Matrix Equations Author(s): Wicky, Tobias Publication Date: 2015 Permanent Link: htts://doi.org/10.3929/ethz-a-010686133

More information

Chapter 8: Adaptive Networks

Chapter 8: Adaptive Networks Chater : Adative Networks Introduction (.1) Architecture (.2) Backroagation for Feedforward Networks (.3) Jyh-Shing Roger Jang et al., Neuro-Fuzzy and Soft Comuting: A Comutational Aroach to Learning and

More information

[9] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker, \A Proposal for a User-Level,

[9] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker, \A Proposal for a User-Level, [9] J. J. Dongarra, R. Hemel, A. J. G. Hey, and D. W. Walker, \A Proosal for a User-Level, Message Passing Interface in a Distributed-Memory Environment," Tech. Re. TM-3, Oak Ridge National Laboratory,

More information

Pivot Selection for Dimension Reduction Using Annealing by Increasing Resampling *

Pivot Selection for Dimension Reduction Using Annealing by Increasing Resampling * ivot Selection for Dimension Reduction Using Annealing by Increasing Resamling * Yasunobu Imamura 1, Naoya Higuchi 1, Tetsuji Kuboyama 2, Kouichi Hirata 1 and Takeshi Shinohara 1 1 Kyushu Institute of

More information

TOPP Probing of Network Links with Large Independent Latencies

TOPP Probing of Network Links with Large Independent Latencies TOPP Probing of Network Links with Large Indeendent Latencies M. Hosseinour, M. J. Tunnicliffe Faculty of Comuting, Information ystems and Mathematics, Kingston University, Kingston-on-Thames, urrey, KT1

More information

IMS Network Deployment Cost Optimization Based on Flow-Based Traffic Model

IMS Network Deployment Cost Optimization Based on Flow-Based Traffic Model IMS Network Deloyment Cost Otimization Based on Flow-Based Traffic Model Jie Xiao, Changcheng Huang and James Yan Deartment of Systems and Comuter Engineering, Carleton University, Ottawa, Canada {jiexiao,

More information

CASCH - a Scheduling Algorithm for "High Level"-Synthesis

CASCH - a Scheduling Algorithm for High Level-Synthesis CASCH a Scheduling Algorithm for "High Level"Synthesis P. Gutberlet H. Krämer W. Rosenstiel Comuter Science Research Center at the University of Karlsruhe (FZI) HaidundNeuStr. 1014, 7500 Karlsruhe, F.R.G.

More information

AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY

AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY Andrew Lam 1, Steven J.E. Wilton 1, Phili Leong 2, Wayne Luk 3 1 Elec. and Com. Engineering 2 Comuter Science

More information

Efficient Processing of Top-k Dominating Queries on Multi-Dimensional Data

Efficient Processing of Top-k Dominating Queries on Multi-Dimensional Data Efficient Processing of To-k Dominating Queries on Multi-Dimensional Data Man Lung Yiu Deartment of Comuter Science Aalborg University DK-922 Aalborg, Denmark mly@cs.aau.dk Nikos Mamoulis Deartment of

More information

CMSC 425: Lecture 16 Motion Planning: Basic Concepts

CMSC 425: Lecture 16 Motion Planning: Basic Concepts : Lecture 16 Motion lanning: Basic Concets eading: Today s material comes from various sources, including AI Game rogramming Wisdom 2 by S. abin and lanning Algorithms by S. M. LaValle (Chats. 4 and 5).

More information

A GPU Heterogeneous Cluster Scheduling Model for Preventing Temperature Heat Island

A GPU Heterogeneous Cluster Scheduling Model for Preventing Temperature Heat Island A GPU Heterogeneous Cluster Scheduling Model for Preventing Temerature Heat Island Yun-Peng CAO 1,2,a and Hai-Feng WANG 1,2 1 School of Information Science and Engineering, Linyi University, Linyi Shandong,

More information

Learning Motion Patterns in Crowded Scenes Using Motion Flow Field

Learning Motion Patterns in Crowded Scenes Using Motion Flow Field Learning Motion Patterns in Crowded Scenes Using Motion Flow Field Min Hu, Saad Ali and Mubarak Shah Comuter Vision Lab, University of Central Florida {mhu,sali,shah}@eecs.ucf.edu Abstract Learning tyical

More information

RESEARCH ARTICLE. Simple Memory Machine Models for GPUs

RESEARCH ARTICLE. Simple Memory Machine Models for GPUs The International Journal of Parallel, Emergent and Distributed Systems Vol. 00, No. 00, Month 2011, 1 22 RESEARCH ARTICLE Simle Memory Machine Models for GPUs Koji Nakano a a Deartment of Information

More information

MATHEMATICAL MODELING OF COMPLEX MULTI-COMPONENT MOVEMENTS AND OPTICAL METHOD OF MEASUREMENT

MATHEMATICAL MODELING OF COMPLEX MULTI-COMPONENT MOVEMENTS AND OPTICAL METHOD OF MEASUREMENT MATHEMATICAL MODELING OF COMPLE MULTI-COMPONENT MOVEMENTS AND OPTICAL METHOD OF MEASUREMENT V.N. Nesterov JSC Samara Electromechanical Plant, Samara, Russia Abstract. The rovisions of the concet of a multi-comonent

More information

A Method to Determine End-Points ofstraight Lines Detected Using the Hough Transform

A Method to Determine End-Points ofstraight Lines Detected Using the Hough Transform RESEARCH ARTICLE OPEN ACCESS A Method to Detere End-Points ofstraight Lines Detected Using the Hough Transform Gideon Kanji Damaryam Federal University, Lokoja, PMB 1154, Lokoja, Nigeria. Abstract The

More information

An Efficient VLSI Architecture for Adaptive Rank Order Filter for Image Noise Removal

An Efficient VLSI Architecture for Adaptive Rank Order Filter for Image Noise Removal International Journal of Information and Electronics Engineering, Vol. 1, No. 1, July 011 An Efficient VLSI Architecture for Adative Rank Order Filter for Image Noise Removal M. C Hanumantharaju, M. Ravishankar,

More information

Ad Hoc Networks. Latency-minimizing data aggregation in wireless sensor networks under physical interference model

Ad Hoc Networks. Latency-minimizing data aggregation in wireless sensor networks under physical interference model Ad Hoc Networks (4) 5 68 Contents lists available at SciVerse ScienceDirect Ad Hoc Networks journal homeage: www.elsevier.com/locate/adhoc Latency-minimizing data aggregation in wireless sensor networks

More information

A Morphological LiDAR Points Cloud Filtering Method based on GPGPU

A Morphological LiDAR Points Cloud Filtering Method based on GPGPU A Morhological LiDAR Points Cloud Filtering Method based on GPGPU Shuo Li 1, Hui Wang 1, Qiuhe Ma 1 and Xuan Zha 2 1 Zhengzhou Institute of Surveying & Maing, No.66, Longhai Middle Road, Zhengzhou, China

More information

Interactive Image Segmentation

Interactive Image Segmentation Interactive Image Segmentation Fahim Mannan (260 266 294) Abstract This reort resents the roject work done based on Boykov and Jolly s interactive grah cuts based N-D image segmentation algorithm([1]).

More information

Source-to-Source Code Generation Based on Pattern Matching and Dynamic Programming

Source-to-Source Code Generation Based on Pattern Matching and Dynamic Programming Source-to-Source Code Generation Based on Pattern Matching and Dynamic Programming Weimin Chen, Volker Turau TR-93-047 August, 1993 Abstract This aer introduces a new technique for source-to-source code

More information

A DEA-bases Approach for Multi-objective Design of Attribute Acceptance Sampling Plans

A DEA-bases Approach for Multi-objective Design of Attribute Acceptance Sampling Plans Available online at htt://ijdea.srbiau.ac.ir Int. J. Data Enveloment Analysis (ISSN 2345-458X) Vol.5, No.2, Year 2017 Article ID IJDEA-00422, 12 ages Research Article International Journal of Data Enveloment

More information

The VEGA Moderately Parallel MIMD, Moderately Parallel SIMD, Architecture for High Performance Array Signal Processing

The VEGA Moderately Parallel MIMD, Moderately Parallel SIMD, Architecture for High Performance Array Signal Processing The VEGA Moderately Parallel MIMD, Moderately Parallel SIMD, Architecture for High Performance Array Signal Processing Mikael Taveniku 2,3, Anders Åhlander 1,3, Magnus Jonsson 1 and Bertil Svensson 1,2

More information

Directed File Transfer Scheduling

Directed File Transfer Scheduling Directed File Transfer Scheduling Weizhen Mao Deartment of Comuter Science The College of William and Mary Williamsburg, Virginia 387-8795 wm@cs.wm.edu Abstract The file transfer scheduling roblem was

More information

Learning Lab 3: Parallel Methods of Solving the Linear Equation Systems

Learning Lab 3: Parallel Methods of Solving the Linear Equation Systems Learning Lab 3: Parallel Methods of Solving the Linear Equation Systems Lab Objective... Eercise State the Problem of Solving the Linear Equation Systems... 2 Eercise 2 - Studying the Gauss Algorithm for

More information

Support Vector Machines for Face Authentication

Support Vector Machines for Face Authentication Suort Vector Machines for Face Authentication K Jonsson 1 2, J Kittler 1,YPLi 1 and J Matas 1 2 1 CVSSP, University of Surrey Guildford, Surrey GU2 5XH, United Kingdom 2 CMP, Czech Technical University

More information

Convex Hulls. Helen Cameron. Helen Cameron Convex Hulls 1/101

Convex Hulls. Helen Cameron. Helen Cameron Convex Hulls 1/101 Convex Hulls Helen Cameron Helen Cameron Convex Hulls 1/101 What Is a Convex Hull? Starting Point: Points in 2D y x Helen Cameron Convex Hulls 3/101 Convex Hull: Informally Imagine that the x, y-lane is

More information

Phase Transitions in Interconnection Networks with Finite Buffers

Phase Transitions in Interconnection Networks with Finite Buffers Abstract Phase Transitions in Interconnection Networks with Finite Buffers Yelena Rykalova Boston University rykalova@bu.edu Lev Levitin Boston University levitin@bu.edu This aer resents theoretical models

More information

level 0 level 1 level 2 level 3

level 0 level 1 level 2 level 3 Communication-Ecient Deterministic Parallel Algorithms for Planar Point Location and 2d Voronoi Diagram? Mohamadou Diallo 1, Afonso Ferreira 2 and Andrew Rau-Chalin 3 1 LIMOS, IFMA, Camus des C zeaux,

More information

Assignment #3. Assignment #3. Assignment #3. What is a cluster? IT Group Cluster2 (1/2) IT Group Cluster2

Assignment #3. Assignment #3. Assignment #3. What is a cluster? IT Group Cluster2 (1/2) IT Group Cluster2 Assignment #3 Assignment #3 How to count FLOP? A = A + b * c 2 floating oint oerations for(int i=0;i

More information

Submission. Verifying Properties Using Sequential ATPG

Submission. Verifying Properties Using Sequential ATPG Verifying Proerties Using Sequential ATPG Jacob A. Abraham and Vivekananda M. Vedula Comuter Engineering Research Center The University of Texas at Austin Austin, TX 78712 jaa, vivek @cerc.utexas.edu Daniel

More information

A BICRITERION STEINER TREE PROBLEM ON GRAPH. Mirko VUJO[EVI], Milan STANOJEVI] 1. INTRODUCTION

A BICRITERION STEINER TREE PROBLEM ON GRAPH. Mirko VUJO[EVI], Milan STANOJEVI] 1. INTRODUCTION Yugoslav Journal of Oerations Research (00), umber, 5- A BICRITERIO STEIER TREE PROBLEM O GRAPH Mirko VUJO[EVI], Milan STAOJEVI] Laboratory for Oerational Research, Faculty of Organizational Sciences University

More information

Face Recognition Based on Wavelet Transform and Adaptive Local Binary Pattern

Face Recognition Based on Wavelet Transform and Adaptive Local Binary Pattern Face Recognition Based on Wavelet Transform and Adative Local Binary Pattern Abdallah Mohamed 1,2, and Roman Yamolskiy 1 1 Comuter Engineering and Comuter Science, University of Louisville, Louisville,

More information

Matlab Virtual Reality Simulations for optimizations and rapid prototyping of flexible lines systems

Matlab Virtual Reality Simulations for optimizations and rapid prototyping of flexible lines systems Matlab Virtual Reality Simulations for otimizations and raid rototying of flexible lines systems VAMVU PETRE, BARBU CAMELIA, POP MARIA Deartment of Automation, Comuters, Electrical Engineering and Energetics

More information

Tiling for Performance Tuning on Different Models of GPUs

Tiling for Performance Tuning on Different Models of GPUs Tiling for Performance Tuning on Different Models of GPUs Chang Xu Deartment of Information Engineering Zhejiang Business Technology Institute Ningbo, China colin.xu198@gmail.com Steven R. Kirk, Samantha

More information

split split (a) (b) split split (c) (d)

split split (a) (b) split split (c) (d) International Journal of Foundations of Comuter Science c World Scientic Publishing Comany ON COST-OPTIMAL MERGE OF TWO INTRANSITIVE SORTED SEQUENCES JIE WU Deartment of Comuter Science and Engineering

More information

CENTRAL AND PARALLEL PROJECTIONS OF REGULAR SURFACES: GEOMETRIC CONSTRUCTIONS USING 3D MODELING SOFTWARE

CENTRAL AND PARALLEL PROJECTIONS OF REGULAR SURFACES: GEOMETRIC CONSTRUCTIONS USING 3D MODELING SOFTWARE CENTRAL AND PARALLEL PROJECTIONS OF REGULAR SURFACES: GEOMETRIC CONSTRUCTIONS USING 3D MODELING SOFTWARE Petra Surynková Charles University in Prague, Faculty of Mathematics and Physics, Sokolovská 83,

More information

A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH

A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH Jin Lu, José M. F. Moura, and Urs Niesen Deartment of Electrical and Comuter Engineering Carnegie Mellon University, Pittsburgh, PA 15213 jinlu, moura@ece.cmu.edu

More information

Using Rational Numbers and Parallel Computing to Efficiently Avoid Round-off Errors on Map Simplification

Using Rational Numbers and Parallel Computing to Efficiently Avoid Round-off Errors on Map Simplification Using Rational Numbers and Parallel Comuting to Efficiently Avoid Round-off Errors on Ma Simlification Maurício G. Grui 1, Salles V. G. de Magalhães 1,2, Marcus V. A. Andrade 1, W. Randolh Franklin 2,

More information

Skip List Based Authenticated Data Structure in DAS Paradigm

Skip List Based Authenticated Data Structure in DAS Paradigm 009 Eighth International Conference on Grid and Cooerative Comuting Ski List Based Authenticated Data Structure in DAS Paradigm Jieing Wang,, Xiaoyong Du,. Key Laboratory of Data Engineering and Knowledge

More information

Truth Trees. Truth Tree Fundamentals

Truth Trees. Truth Tree Fundamentals Truth Trees 1 True Tree Fundamentals 2 Testing Grous of Statements for Consistency 3 Testing Arguments in Proositional Logic 4 Proving Invalidity in Predicate Logic Answers to Selected Exercises Truth

More information

Face Recognition Using Legendre Moments

Face Recognition Using Legendre Moments Face Recognition Using Legendre Moments Dr.S.Annadurai 1 A.Saradha Professor & Head of CSE & IT Research scholar in CSE Government College of Technology, Government College of Technology, Coimbatore, Tamilnadu,

More information

Equality-Based Translation Validator for LLVM

Equality-Based Translation Validator for LLVM Equality-Based Translation Validator for LLVM Michael Ste, Ross Tate, and Sorin Lerner University of California, San Diego {mste,rtate,lerner@cs.ucsd.edu Abstract. We udated our Peggy tool, reviously resented

More information

Complexity analysis of matrix product on multicore architectures

Complexity analysis of matrix product on multicore architectures Comlexity analysis of matrix roduct on multicore architectures Mathias Jacquelin, Loris Marchal and Yves Robert École Normale Suérieure de Lyon, France {Mathias.Jacquelin Loris.Marchal Yves.Robert}@ens-lyon.fr

More information

GEOMETRIC CONSTRAINT SOLVING IN < 2 AND < 3. Department of Computer Sciences, Purdue University. and PAMELA J. VERMEER

GEOMETRIC CONSTRAINT SOLVING IN < 2 AND < 3. Department of Computer Sciences, Purdue University. and PAMELA J. VERMEER GEOMETRIC CONSTRAINT SOLVING IN < AND < 3 CHRISTOPH M. HOFFMANN Deartment of Comuter Sciences, Purdue University West Lafayette, Indiana 47907-1398, USA and PAMELA J. VERMEER Deartment of Comuter Sciences,

More information

12) United States Patent 10) Patent No.: US 6,321,328 B1

12) United States Patent 10) Patent No.: US 6,321,328 B1 USOO6321328B1 12) United States Patent 10) Patent No.: 9 9 Kar et al. (45) Date of Patent: Nov. 20, 2001 (54) PROCESSOR HAVING DATA FOR 5,961,615 10/1999 Zaid... 710/54 SPECULATIVE LOADS 6,006,317 * 12/1999

More information

Figure 8.1: Home age taken from the examle health education site (htt:// Setember 14, 2001). 201

Figure 8.1: Home age taken from the examle health education site (htt://  Setember 14, 2001). 201 200 Chater 8 Alying the Web Interface Profiles: Examle Web Site Assessment 8.1 Introduction This chater describes the use of the rofiles develoed in Chater 6 to assess and imrove the quality of an examle

More information

Auto-Tuning Distributed-Memory 3-Dimensional Fast Fourier Transforms on the Cray XT4

Auto-Tuning Distributed-Memory 3-Dimensional Fast Fourier Transforms on the Cray XT4 Auto-Tuning Distributed-Memory 3-Dimensional Fast Fourier Transforms on the Cray XT4 M. Gajbe a A. Canning, b L-W. Wang, b J. Shalf, b H. Wasserman, b and R. Vuduc, a a Georgia Institute of Technology,

More information

Fast Distributed Process Creation with the XMOS XS1 Architecture

Fast Distributed Process Creation with the XMOS XS1 Architecture Communicating Process Architectures 20 P.H. Welch et al. (Eds.) IOS Press, 20 c 20 The authors and IOS Press. All rights reserved. Fast Distributed Process Creation with the XMOS XS Architecture James

More information

Statistical Detection for Network Flooding Attacks

Statistical Detection for Network Flooding Attacks Statistical Detection for Network Flooding Attacks C. S. Chao, Y. S. Chen, and A.C. Liu Det. of Information Engineering, Feng Chia Univ., Taiwan 407, OC. Email: cschao@fcu.edu.tw Abstract In order to meet

More information

Contents 1 Introduction 2 2 Outline of the SAT Aroach Performance View Abstraction View

Contents 1 Introduction 2 2 Outline of the SAT Aroach Performance View Abstraction View Abstraction and Performance in the Design of Parallel Programs Der Fakultat fur Mathematik und Informatik der Universitat Passau vorgelegte Zusammenfassung der Veroentlichungen zur Erlangung der venia

More information

A Scalable Parallel Approach for Peptide Identification from Large-scale Mass Spectrometry Data

A Scalable Parallel Approach for Peptide Identification from Large-scale Mass Spectrometry Data 2009 International Conference on Parallel Processing Workshos A Scalable Parallel Aroach for Petide Identification from Large-scale Mass Sectrometry Data Gaurav Kulkarni, Ananth Kalyanaraman School of

More information

A NOVEL GEOMETRIC ALGORITHM FOR FAST WIRE-OPTIMIZED FLOORPLANNING

A NOVEL GEOMETRIC ALGORITHM FOR FAST WIRE-OPTIMIZED FLOORPLANNING A OVEL GEOMETRIC ALGORITHM FOR FAST WIRE-OPTIMIZED FLOORPLAIG Peter G. Sassone, Sung K. Lim School of Electrical and Comuter Engineering Georgia Institute of Technology Atlanta, Georgia 30332, U ABSTRACT

More information