CHAPTER 5 Pipelined Computations

In the pipeline technique, the problem is divided into a series of tasks that have to be completed one after the other. In fact, this is the basis of sequential programming. Each task is executed by a separate process or processor.

Figure 5.1 Pipelined processes (P0, P1, ..., P5 connected in a line).

Example
Add all the elements of array a to an accumulating sum:

    for (i = 0; i < n; i++)
        sum = sum + a[i];

The loop could be unfolded to yield

    sum = sum + a[0];
    sum = sum + a[1];
    sum = sum + a[2];
    sum = sum + a[3];
    sum = sum + a[4];
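To make the stage structure of the unfolded loop concrete, the small sketch below (not from the original text) models each pipeline stage as a function that takes the partial sum coming in, adds its own element, and returns the value it would pass to the next stage; a plain loop stands in for the chain of processes.

    #include <stdio.h>

    /* One pipeline stage: receive the partial sum, add this stage's
       element, and return the value passed to the next stage. */
    double stage(double s_in, double a_i) {
        return s_in + a_i;
    }

    int main(void) {
        double a[5] = {1.0, 2.0, 3.0, 4.0, 5.0};
        double sum = 0.0;               /* value entering the first stage */
        for (int i = 0; i < 5; i++)
            sum = stage(sum, a[i]);     /* stage i forwards its result */
        printf("sum = %f\n", sum);      /* prints 15.000000 */
        return 0;
    }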

Figure 5.2 Pipeline for an unfolded loop. (Stage i holds a[i]; the partial sum enters on s_in and leaves on s_out.)

Another Example
A frequency filter - the objective here is to remove specific frequencies (say the frequencies f0, f1, f2, f3, etc.) from a (digitized) signal, f(t). The signal enters the pipeline from the left; after each stage one more frequency has been removed (signal without f0, then without f1, then without f2, and so on), and the fully filtered signal emerges from the last stage.

Figure 5.3 Pipeline for a frequency filter.

Given that the problem can be divided into a series of sequential tasks, the pipelined approach can provide increased speed under the following three types of computations:

1. If more than one instance of the complete problem is to be executed
2. If a series of data items must be processed, each requiring multiple operations
3. If information to start the next process can be passed forward before the process has completed all its internal operations

Type 1 Pipeline Space-Time Diagram
With p stages and m instances of the problem, instance j occupies stage Pi during pipeline cycle i + j, so all instances complete after m + p - 1 pipeline cycles.

Figure 5.4 Space-time diagram of a pipeline.

Figure 5.5 Alternative space-time diagram.

Type 2 Pipeline Space-Time Diagram
An input sequence d0, d1, ..., d9 is fed into a pipeline of processes P0, P1, ..., P9, and each data item passes through every stage in turn.

Figure 5.6 Pipeline processing 10 data elements: (a) pipeline structure, (b) timing diagram.

Type 3 Pipeline Space-Time Diagram
Here a process passes on information sufficient to start the next process before it has completed all of its own internal operations. Figure 5.7 shows this for (a) processes with the same execution time and (b) processes with different execution times.

Figure 5.7 Pipeline processing where information passes to next stage before end of process.

If the number of stages is larger than the number of processors in any pipeline, a group of stages can be assigned to each processor (for example, twelve processes P0 to P11 mapped four apiece onto Processor 0, Processor 1, and Processor 2):

Figure 5.8 Partitioning processes onto processors.

Computing Platform for Pipelined Applications
A multiprocessor system with a line configuration, attached to a host computer, is a natural platform for pipelined applications.

Figure 5.9 Multiprocessor system with a line configuration.

Pipeline Program Examples

Adding Numbers
Each process adds one number to an accumulating sum and passes the sum on, so the partial sum leaving Pi is the sum of the first i + 1 numbers.

Figure 5.10 Pipelined addition.

The basic code for process Pi:

    recv(&accumulation, Pi-1);
    accumulation = accumulation + number;
    send(&accumulation, Pi+1);

except for the first process, P0, which is

    send(&number, P1);

and the last process, Pn-1, which is

    recv(&accumulation, Pn-2);
    accumulation = accumulation + number;

SPMD program

    if (process > 0) {
        recv(&accumulation, Pi-1);
        accumulation = accumulation + number;
    }
    if (process < n-1)
        send(&accumulation, Pi+1);

The final result is in the last process. Instead of addition, other arithmetic operations could be done.
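The message-passing calls above are pseudocode. As a concrete illustration (a minimal sketch, not the book's code), the SPMD version could be written with MPI as follows; here every rank, including rank 0, adds its own value number (set arbitrarily to rank + 1 for the example) to the accumulation before forwarding it.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int number = rank + 1;          /* each process's own value (assumed) */
        int accumulation = 0;

        if (rank > 0)                   /* receive partial sum from the left */
            MPI_Recv(&accumulation, 1, MPI_INT, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        accumulation += number;         /* add this process's contribution */
        if (rank < nprocs - 1)          /* pass partial sum to the right */
            MPI_Send(&accumulation, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
        else                            /* final result is in the last process */
            printf("sum = %d\n", accumulation);

        MPI_Finalize();
        return 0;
    }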

Figure 5.11 Pipelined addition of numbers with a master process and ring configuration. (The master feeds the numbers d0, d1, ..., dn-1 into the slave pipeline P0, P1, ..., Pn-1 and receives the sum back from the last slave.)

Figure 5.12 Pipelined addition of numbers with direct access to slave processes. (The master delivers one number to each slave directly and collects the sum from the last slave.)

Analysis
Our first pipeline example is Type 1. We will assume that each process performs similar actions in each pipeline cycle. Then we will work out the computation and communication required in a pipeline cycle.

Total execution time:

    t_total = (time for one pipeline cycle)(number of cycles)
    t_total = (t_comp + t_comm)(m + p - 1)

where there are m instances of the problem and p pipeline stages (processes). The average time for a computation is given by

    t_a = t_total / m

Single Instance of Problem

    t_comp = 1
    t_comm = 2(t_startup + t_data)
    t_total = (2(t_startup + t_data) + 1)n

Time complexity = O(n).

Multiple Instances of Problem

    t_total = (2(t_startup + t_data) + 1)(m + n - 1)

    t_a = t_total / m ~ 2(t_startup + t_data) + 1

That is, approximately one pipeline cycle per instance.

Data Partitioning with Multiple Instances of Problem

    t_comp = d
    t_comm = 2(t_startup + t_data)
    t_total = (2(t_startup + t_data) + d)(m + n/d - 1)

As we increase d, the data partition, the impact of the communication diminishes. But increasing the data partition decreases the parallelism and often increases the execution time.
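As an illustration (the parameter values below are assumed for the example, not taken from the text): with t_startup = 10, t_data = 1, n = 5 stages, and m = 100 instances, one pipeline cycle takes 2(10 + 1) + 1 = 23 time steps, so t_total = 23(100 + 5 - 1) = 23 x 104 = 2392 time steps and t_a = 2392/100, which is about 23.9, i.e. close to one pipeline cycle per instance, as expected.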

Sorting Numbers
A parallel version of insertion sort. (The sequential version is akin to placing playing cards in order by moving cards over to insert a card in position.)

Figure 5.13 Steps in insertion sort with five numbers (4, 3, 1, 2, 5 entering the pipeline P0, P1, P2, P3, P4, one pipeline cycle per step).

The basic algorithm for process Pi is

    recv(&number, Pi-1);
    if (number > x) {
        send(&x, Pi+1);
        x = number;
    } else send(&number, Pi+1);

With n numbers, how many numbers the ith process is to accept is known; it is given by n - i. How many to pass onward is also known; it is given by n - i - 1, since one of the numbers received is not passed onward. Hence, a simple loop could be used.
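The following is a minimal MPI sketch (an illustration, not the book's code) of the whole sorting pipeline. Rank 0 acts as a master feeding count = size - 1 example values into the compare stages on ranks 1 to size - 1; compare stage i (on rank i + 1) accepts n - i numbers and passes n - i - 1 onward, exactly as counted above.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int count = size - 1;                 /* one number per compare stage */
        if (rank == 0) {                      /* master feeds the pipeline */
            for (int k = 0; k < count; k++) {
                int value = k + 1;            /* example input 1..count (any order works) */
                MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            }
        } else {
            int to_accept = count - (rank - 1);   /* n - i numbers for stage i */
            int x, number;
            MPI_Recv(&x, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            for (int j = 0; j < to_accept - 1; j++) {  /* pass n - i - 1 onward */
                MPI_Recv(&number, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                if (number > x) {             /* keep the larger, forward the smaller */
                    MPI_Send(&x, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
                    x = number;
                } else {
                    MPI_Send(&number, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
                }
            }
            printf("stage %d holds %d\n", rank, x);   /* rank 1 ends with the largest */
        }
        MPI_Finalize();
        return 0;
    }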

Figure 5.14 Pipeline for sorting using insertion sort. (A series of numbers x_{n-1}, ..., x_1, x_0 enters from the left; each stage compares the incoming number with the one it holds and keeps the larger, so P0 ends up with the largest number, P1 with the next largest, and so on, with the smaller numbers passed to the right.)

Figure 5.15 Insertion sort with results returned to the master process using a bidirectional line configuration.

Incorporating results being returned, process i could have the form

    right_procno = n - i - 1;          /* number of processes to the right */
    recv(&x, Pi-1);
    for (j = 0; j < right_procno; j++) {
        recv(&number, Pi-1);
        if (number > x) {
            send(&x, Pi+1);
            x = number;
        } else send(&number, Pi+1);
    }
    send(&x, Pi-1);                    /* send number held */
    for (j = 0; j < right_procno; j++) {   /* pass on other numbers */
        recv(&number, Pi+1);
        send(&number, Pi-1);
    }

Analysis

Sequential

    t_s = (n - 1) + (n - 2) + ... + 1 = n(n - 1)/2

Obviously a very poor sequential sorting algorithm, unsuitable except for very small n.

Parallel
Each pipeline cycle requires at least

    t_comp = 1
    t_comm = 2(t_startup + t_data)

The total execution time, t_total, is given by

    t_total = (t_comp + t_comm)(2n - 1) = (1 + 2(t_startup + t_data))(2n - 1)

Figure 5.16 Insertion sort with results returned (shown for n = 5): the sorting phase takes 2n - 1 pipeline cycles and returning the sorted numbers takes a further n cycles.

Prime Number Generation
Sieve of Eratosthenes
A series of all integers is generated from 2. The first number, 2, is prime and kept. All multiples of this number are deleted as they cannot be prime. The process is repeated with each remaining number. The algorithm removes nonprimes, leaving only primes.

Example
Suppose we want the prime numbers from 2 to 20. We start with all the numbers:

    2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20

After considering 2, the multiples of 2 are struck out and not considered further, leaving

    2, 3, 5, 7, 9, 11, 13, 15, 17, 19

After considering 3, the remaining multiples of 3 are struck out, leaving

    2, 3, 5, 7, 11, 13, 17, 19

Subsequent numbers are considered in a similar fashion. To find the primes up to n, it is only necessary to start at numbers up to sqrt(n). All multiples of numbers greater than sqrt(n) will have been removed, as they are also a multiple of some number equal to or less than sqrt(n).

Sequential Code
Usually employs an array with elements initialized to 1 (TRUE) and set to 0 (FALSE) when the index of the element is not a prime number. Letting the last number be n and the square root of n be sqrt_n, we might have

    for (i = 2; i < n; i++)
        prime[i] = 1;                        /* initialize array */
    for (i = 2; i <= sqrt_n; i++)            /* for each number */
        if (prime[i] == 1)                   /* identified as prime */
            for (j = i + i; j < n; j = j + i)   /* strike out multiples */
                prime[j] = 0;                /* includes those already done */

The elements in the array still set to 1 identify the primes (given by the array indices). Then a simple loop accessing the array can find the primes.

Sequential time
The number of iterations striking out multiples of primes will depend upon the prime. There are n/2 multiples of 2, n/3 multiples of 3, and so on. Hence, the total sequential time is given by

    t_s = n/2 + n/3 + n/5 + n/7 + ...

assuming the computation in each iteration equates to one computational step. The sequential time complexity is O(n^2).
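For completeness, a self-contained version of the sequential sieve could look like the following (the limit N = 20 is chosen only for the example; it is not from the text).

    #include <stdio.h>
    #include <math.h>
    #define N 20

    int main(void) {
        int prime[N];
        int sqrt_n = (int)sqrt((double)N);
        for (int i = 2; i < N; i++)
            prime[i] = 1;                       /* initialize array */
        for (int i = 2; i <= sqrt_n; i++)       /* for each candidate */
            if (prime[i] == 1)                  /* identified as prime */
                for (int j = i + i; j < N; j += i)
                    prime[j] = 0;               /* strike out multiples */
        for (int i = 2; i < N; i++)             /* report the primes */
            if (prime[i] == 1)
                printf("%d ", i);
        printf("\n");                           /* prints 2 3 5 7 11 13 17 19 */
        return 0;
    }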

Pipelined Implementation

Figure 5.17 Pipeline for sieve of Eratosthenes. (A series of numbers x_{n-1}, ..., x_1, x_0 enters from the left; the first stage holds the 1st prime number, the second stage the 2nd prime number, the third stage the 3rd prime number, and so on, and each stage passes on only the numbers that are not multiples of its prime.)

The code for a process, Pi, could be based upon

    recv(&x, Pi-1);
    /* repeat following for each number */
    recv(&number, Pi-1);
    if ((number % x) != 0)
        send(&number, Pi+1);

Each process will not receive the same amount of numbers, and the amount is not known beforehand. Use a terminator message, which is sent at the end of the sequence:

    recv(&x, Pi-1);
    for (i = 0; i < n; i++) {
        recv(&number, Pi-1);
        if (number == terminator) break;
        if ((number % x) != 0)
            send(&number, Pi+1);
    }
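A runnable MPI rendering of this pipeline might look as follows (a sketch, not the book's code). Rank 0 generates the series 2..N followed by a terminator; each later rank keeps the first number it receives as its prime and filters out its multiples. N and TERMINATOR are assumed values, and in this sketch any primes for which no stage remains are simply discarded by the last rank.

    #include <mpi.h>
    #include <stdio.h>
    #define N 50
    #define TERMINATOR (-1)

    int main(int argc, char **argv) {
        int rank, size, number;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int right = rank + 1;                     /* next stage, if any */

        if (rank == 0) {                          /* source of the series */
            for (number = 2; number <= N; number++)
                MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            number = TERMINATOR;
            MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else {
            int x;                                /* this stage's prime */
            MPI_Recv(&x, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (x != TERMINATOR) {
                printf("rank %d holds prime %d\n", rank, x);
                for (;;) {
                    MPI_Recv(&number, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    if (number == TERMINATOR) break;
                    if (number % x != 0 && right < size)  /* forward non-multiples */
                        MPI_Send(&number, 1, MPI_INT, right, 0, MPI_COMM_WORLD);
                    /* with no stage to the right, the number is dropped */
                }
            }
            if (right < size) {                   /* pass the terminator along */
                number = TERMINATOR;
                MPI_Send(&number, 1, MPI_INT, right, 0, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }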

Solving a System of Linear Equations - Special Case
A Type 3 example - a process can continue with useful work after passing on information.

To solve a system of linear equations of the so-called upper-triangular form:

    a_{n-1,0} x0 + a_{n-1,1} x1 + a_{n-1,2} x2 + ... + a_{n-1,n-1} x_{n-1} = b_{n-1}
    ...
    a_{2,0} x0 + a_{2,1} x1 + a_{2,2} x2 = b2
    a_{1,0} x0 + a_{1,1} x1 = b1
    a_{0,0} x0 = b0

where the a's and b's are constants and the x's are unknowns to be found.

Back Substitution
First, the unknown x0 is found from the last equation:

    x0 = b0 / a_{0,0}

The value obtained for x0 is substituted into the next equation to obtain x1:

    x1 = (b1 - a_{1,0} x0) / a_{1,1}

The values obtained for x1 and x0 are substituted into the next equation to obtain x2:

    x2 = (b2 - a_{2,0} x0 - a_{2,1} x1) / a_{2,2}

and so on until all the unknowns are found.
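As a concrete illustration of these steps, the short program below (the 3-equation system is invented for the example) performs the same back substitution directly.

    /* Back substitution on a small invented upper-triangular system:
           2 x0               = 4
           1 x0 + 3 x1        = 11
           2 x0 + 1 x1 + 5 x2 = 22
       Expected solution: x0 = 2, x1 = 3, x2 = 3. */
    #include <stdio.h>

    int main(void) {
        double a[3][3] = { {2, 0, 0},
                           {1, 3, 0},
                           {2, 1, 5} };
        double b[3] = {4, 11, 22};
        double x[3];

        x[0] = b[0] / a[0][0];                      /* x0 from the last equation */
        x[1] = (b[1] - a[1][0] * x[0]) / a[1][1];   /* substitute x0 */
        x[2] = (b[2] - a[2][0] * x[0] - a[2][1] * x[1]) / a[2][2];  /* substitute x0, x1 */

        printf("x0 = %g, x1 = %g, x2 = %g\n", x[0], x[1], x[2]);
        return 0;
    }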

Pipeline Solution
The first pipeline stage computes x0 and passes x0 onto the second stage, which computes x1 from x0 and passes both x0 and x1 onto the next stage, which computes x2 from x0 and x1, and so on.

Figure 5.18 Solving an upper triangular set of linear equations using a pipeline. (P0 computes x0, P1 computes x1, P2 computes x2, P3 computes x3, and each stage forwards all the values it has received or computed.)

The ith process (0 < i < n) receives the values x0, x1, x2, ..., x_{i-1} and computes x_i from the equation:

    x_i = (b_i - sum from j=0 to i-1 of a_{i,j} x_j) / a_{i,i}

Sequential Code
Given the constants a_{i,j} and b_k stored in arrays a[][] and b[], respectively, and the values for the unknowns to be stored in an array, x[], the sequential code could be

    x[0] = b[0]/a[0][0];              /* x[0] computed separately */
    for (i = 1; i < n; i++) {         /* for remaining unknowns */
        sum = 0;
        for (j = 0; j < i; j++)
            sum = sum + a[i][j]*x[j];
        x[i] = (b[i] - sum)/a[i][i];
    }

Parallel Code
The pseudocode of process Pi (0 < i < n) of one pipelined version could be

    for (j = 0; j < i; j++) {
        recv(&x[j], Pi-1);
        send(&x[j], Pi+1);
    }
    sum = 0;
    for (j = 0; j < i; j++)
        sum = sum + a[i][j]*x[j];
    x[i] = (b[i] - sum)/a[i][i];
    send(&x[i], Pi+1);

Now we have additional computations to do after receiving and resending values.
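One way to make the pipelined pseudocode concrete is the MPI sketch below (an illustration, not the book's code). A small invented system is hard coded on every rank, the program assumes exactly N = 4 processes, and rank i computes x[i]; note how each received value is forwarded before the local multiply/add work is done.

    #include <mpi.h>
    #include <stdio.h>
    #define N 4

    int main(int argc, char **argv) {
        /* invented upper-triangular system; exact solution x = {2, 3, 3, 1} */
        double a[N][N] = { {2, 0, 0, 0},
                           {1, 3, 0, 0},
                           {2, 1, 5, 0},
                           {1, 1, 1, 4} };
        double b[N] = {4, 11, 22, 12};
        double x[N];
        int i, n;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &i);     /* rank i computes x[i] */
        MPI_Comm_size(MPI_COMM_WORLD, &n);     /* assumed to equal N */

        /* receive x[0..i-1] from the left and immediately pass them on */
        for (int j = 0; j < i; j++) {
            MPI_Recv(&x[j], 1, MPI_DOUBLE, i - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (i + 1 < n)
                MPI_Send(&x[j], 1, MPI_DOUBLE, i + 1, 0, MPI_COMM_WORLD);
        }
        /* the remaining computation overlaps with the stages downstream */
        double sum = 0.0;
        for (int j = 0; j < i; j++)
            sum += a[i][j] * x[j];
        x[i] = (b[i] - sum) / a[i][i];
        if (i + 1 < n)
            MPI_Send(&x[i], 1, MPI_DOUBLE, i + 1, 0, MPI_COMM_WORLD);

        printf("x[%d] = %g\n", i, x[i]);
        MPI_Finalize();
        return 0;
    }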

Figure 5.19 Pipeline processing using back substitution. (The first value computed is passed onward immediately; the final computed value emerges from the last process.)

Analysis
We cannot assume that the computational effort at each pipeline stage is the same.

The first process, P0, performs one divide and one send().

The ith process (0 < i < n - 1) performs i recv()s, i send()s, i multiply/adds, one divide/subtract, and a final send(), a total of 2i + 1 communication times and 2i + 2 computational steps, assuming that multiply, add, divide, and subtract each take one step.

The last process, Pn-1, performs n - 1 recv()s, n - 1 multiply/adds, and one divide/subtract, a total of n - 1 communication times and 2n computational steps.

