Parallel Processing IMP Questions


Questions

- What is Data Decomposition? Explain Data Decomposition with proper example. OR What is decomposition of task? Explain data decomposition in detail. OR Explain Recursive Decomposition technique to find minimum number from array. Draw task dependency graph for following data. 4, 9, 2, 6, 1, 7, 8, 11, 5, 3, 2, 12
- Explain Bitonic sort with example. OR Discuss mapping of bitonic sort algorithm to a hypercube and a mesh. OR Write two rules for bitonic sequence in bitonic sorting network, explain the same with example. Briefly discuss bitonic sort and trace the following sequence using the same. 3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0
- Explain Prim's algorithm for minimum spanning tree. OR Explain parallel algorithm for Prim's algorithm and compare its complexity with the sequential algorithm for the same.
- Explain Odd-Even Transposition sort Algorithm. OR Write and explain algorithm for Odd-Even Transposition Sort. Also sort following data. 1, 3, 8, 2, 9, 4, 6, 5 OR Explain odd-even sort in parallel environment and comment on its limitations.
- In context to Pthread, explain normal, recursive and error check mutex. OR Write short note on how mutex-locks and condition variables are used for synchronizing shared data in shared-address-space programming in Pthreads. OR Explain mutual exclusion for shared variable in Pthreads.
- Enlist various performance metrics for parallel systems. Explain Speedup, Efficiency and total parallel overhead in brief.
- Explain Matrix-Multiplication using DNS Algorithm.
- Explain thread creation, termination and cancellation in detail in shared-address-space parallel system. OR Explain following functions with respect to Pthreads API. Also discuss arguments of these functions. i. pthread_create() ii. pthread_join()
- Explain parallel formulations of Dijkstra's algorithm.
- Discuss buffered non-blocking and non-buffered non-blocking send/receive message passing operations with suitable diagram.
- Explain Cache Coherence in Multiprocessor Systems. OR Explain invalidate protocol used for cache coherence in multiprocessor system. OR What is the meaning of memory latency? How memory latency can be improved by Cache?
- Define Isoefficiency function and derive equation of it.

1) What is Data Decomposition? Explain Data Decomposition with proper example. OR What is decomposition of task? Explain data decomposition in detail. OR Explain Recursive Decomposition technique to find minimum number from array. Draw task dependency graph for following data. 5, 12, 11, 1, 10, 6, 8, 3, 7, 4, 9, 2.
[Sum 14(14 marks), Win 13(7 marks), Sum 13(7 marks), Win 12(7 marks), Sum 12(7 marks), Total: 42 Marks]

1. Recursive Decomposition:
This decomposition technique uses the divide-and-conquer strategy. A problem is solved by first dividing it into a set of independent subproblems. Each subproblem is solved by recursively applying a similar division into smaller subproblems, and so on.

Ex: Quick-sort
An array A of n elements is sorted using quick-sort. It selects a pivot element x and partitions A into two sub-arrays A0 and A1 such that all the elements in A0 are smaller than x and all the elements in A1 are greater than or equal to x. This partitioning step forms the divide step of the algorithm. Each of the subsequences A0 and A1 is sorted by recursively calling quick-sort, and each of these recursive calls further partitions the sub-arrays.

2. Data Decomposition:
Decomposition of the computation is done in two steps. First, the data on which the computations are performed is partitioned. Second, this partition of the data is used to partition the computation into several tasks.

Partitioning Output Data:
In some computations, each element of the output can be computed independently, so partitioning the output data automatically creates a decomposition of the problem into tasks.

Ex: Matrix Multiplication: Consider the problem of multiplying two n x n matrices A and B to yield a matrix C. The figure shows a decomposition of this problem into four tasks. The decomposition is based on partitioning the output matrix C into four sub matrices, and each of the four tasks computes one of these sub matrices.
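The output-data partitioning described above can be sketched in C. This is a minimal illustration, not the notes' own code: the function names `multiply_block` and `multiply_by_output_partition`, the row-major storage, and the restriction to an even n are all assumptions made for the example. Each call to `multiply_block` plays the role of one of the four tasks.

```c
#include <assert.h>
#include <stddef.h>

/* One task: compute block (bi, bj) of C = A * B. The n x n matrices are
 * stored row-major and C is viewed as a 2 x 2 grid of blocks. A task
 * writes only its own block of C, so the four tasks are independent. */
void multiply_block(const double *A, const double *B, double *C,
                    size_t n, size_t bi, size_t bj)
{
    assert(n % 2 == 0 && bi < 2 && bj < 2);
    size_t h = n / 2;                        /* edge length of one block */
    for (size_t i = bi * h; i < (bi + 1) * h; i++) {
        for (size_t j = bj * h; j < (bj + 1) * h; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)   /* row i of A times column j of B */
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}

/* Runs the four tasks one after another. In a parallel program each call
 * could be handed to a different thread or process. */
void multiply_by_output_partition(const double *A, const double *B,
                                  double *C, size_t n)
{
    for (size_t bi = 0; bi < 2; bi++)
        for (size_t bj = 0; bj < 2; bj++)
            multiply_block(A, B, C, n, bi, bj);
}
```

Because the tasks share only read access to A and B, no synchronization is needed between them; that is the property that makes output partitioning attractive here.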

Partitioning Input Data:
When finding the minimum, maximum or sum of an array, the output is a single value that is not known in advance, so the output cannot be partitioned. In such cases it is possible to partition the input data instead. A task is created for each partition of the input data, which introduces concurrency, and each task performs as much computation as possible using its local data. These tasks do not solve the original problem directly, so a follow-up step is needed to combine their results.

The problem of determining the minimum of the set {4, 9, 1, 7, 8, 11, 12, 2} can be decomposed based on a partitioning of the input data. The figure shows such a decomposition:

   Input partitions:      {4, 9}   {1, 7}   {8, 11}   {12, 2}
   Partial minima paired: {4, 1}   {8, 2}
   Final pair:            {1, 2}

Partitioning Both Input and Output Data:
In some cases in which it is possible to partition the output data, partitioning the input data can offer additional concurrency.
Ex: A relational database of vehicles, processing the following query:
   MODEL="Civic" AND YEAR="2001" AND (COLOR="Green" OR COLOR="White")

Partitioning Intermediate Data:
Algorithms are often structured as multi-stage computations such that the output of one stage is the input to the subsequent stage. A decomposition of such an algorithm can be derived by partitioning the input or the output data of an intermediate stage of the algorithm. Partitioning intermediate data can sometimes lead to higher concurrency than partitioning input or output data.

Let us revisit matrix multiplication to illustrate a decomposition based on partitioning intermediate data. Decompositions induced by a 2 x 2 partitioning of the output matrix C have a maximum degree of concurrency of four. The degree of concurrency can be increased by introducing an intermediate stage in which eight tasks compute their respective product sub matrices and store the results in a temporary three-dimensional matrix D, as shown in the figure. The sub matrix $D_{k,i,j}$ is the product of $A_{i,k}$ and $B_{k,j}$. A partitioning of the intermediate matrix D induces a decomposition into eight tasks. After the multiplication phase, a relatively inexpensive matrix addition step can compute the result matrix C.

2) Explain bitonic sort. OR Discuss mapping of bitonic sort algorithm to a hypercube and a mesh. OR Write two rules for bitonic sequence in bitonic sorting network, explain the same with example. Briefly discuss bitonic sort and trace the following sequence using the same. 3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0
[Sum 14(14 marks), Win 13(7 marks), Sum 13(7 marks), Win 12(7 marks), Sum 12(7 marks), Total: 42 Marks]

Ans 1 - Bitonic sort:
A bitonic sorting network sorts n elements in $O(\log^2 n)$ time. The key operation of the bitonic sorting network is the rearrangement of a bitonic sequence into a sorted sequence. A bitonic sequence is a sequence of elements $\langle a_0, a_1, \ldots, a_{n-1} \rangle$ with the property that either:
(1) there exists an index i, $0 \le i \le n-1$, such that $\langle a_0, \ldots, a_i \rangle$ is monotonically increasing and $\langle a_{i+1}, \ldots, a_{n-1} \rangle$ is monotonically decreasing, or
(2) there exists a cyclic shift of indices so that (1) is satisfied.

The method that rearranges a bitonic sequence to obtain monotonically increasing order is called bitonic sort. Let $s = \langle a_0, a_1, \ldots, a_{n-1} \rangle$ be a bitonic sequence such that $a_0 \le a_1 \le \ldots \le a_{n/2-1}$ and $a_{n/2} \ge a_{n/2+1} \ge \ldots \ge a_{n-1}$. Consider the following subsequences of s:

$s_1 = \langle \min\{a_0, a_{n/2}\}, \min\{a_1, a_{n/2+1}\}, \ldots, \min\{a_{n/2-1}, a_{n-1}\} \rangle$
$s_2 = \langle \max\{a_0, a_{n/2}\}, \max\{a_1, a_{n/2+1}\}, \ldots, \max\{a_{n/2-1}, a_{n-1}\} \rangle$

In sequence $s_1$, there is an element $b_i = \min\{a_i, a_{n/2+i}\}$ such that all the elements before $b_i$ are from the increasing part of the original sequence and all the elements after $b_i$ are from the decreasing part. In sequence $s_2$, the element $b'_i = \max\{a_i, a_{n/2+i}\}$ is such that all the elements before $b'_i$ are from the decreasing part of the original sequence and all the elements after it are from the increasing part. Thus, the sequences $s_1$ and $s_2$ are bitonic sequences. Furthermore, every element of the first sequence is smaller than every element of the second sequence. The reason is that $b_i$ is greater than or equal to all elements of $s_1$, $b'_i$ is less than or equal to all elements of $s_2$, and $b'_i$ is greater than or equal to $b_i$.

Solution:
The problem of rearranging a bitonic sequence of size n is thus reduced to rearranging two smaller bitonic sequences and concatenating the results. We refer to the operation of splitting a bitonic sequence of size n into the two bitonic sequences $s_1$ and $s_2$ as a bitonic split. We can recursively obtain shorter bitonic subsequences using the definitions of $s_1$ and $s_2$ until we obtain subsequences of size one. The number of splits required to rearrange the bitonic sequence into a sorted sequence is log n. This procedure of sorting a bitonic sequence using bitonic splits is called bitonic merge, shown in the figure.

Example:
Merging a bitonic sequence is easy to implement on a network of comparators, called a bitonic merging network. This network contains log n columns. Each column contains n/2 comparators and performs one step of the bitonic merge. The network takes a bitonic sequence as input and gives a sorted sequence as output.
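The bitonic split and the recursive bitonic merge can be sketched in C as follows. This is a minimal sequential sketch under stated assumptions: the function name `bitonic_merge` is illustrative, n is assumed to be a power of two, and the input range is assumed to already be a bitonic sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Rearranges the bitonic sequence a[lo .. lo+n-1] into increasing order.
 * Assumes n is a power of two and the range really is bitonic. */
void bitonic_merge(int *a, size_t lo, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);
    if (n == 1)
        return;
    size_t half = n / 2;
    /* Bitonic split: position i receives min{a_i, a_(n/2+i)} (sequence s1)
     * and position i + n/2 receives max{a_i, a_(n/2+i)} (sequence s2). */
    for (size_t i = lo; i < lo + half; i++) {
        if (a[i] > a[i + half]) {
            int t = a[i];
            a[i] = a[i + half];
            a[i + half] = t;
        }
    }
    /* s1 and s2 are bitonic and every element of s1 <= every element of s2,
     * so the two halves can be merged independently. */
    bitonic_merge(a, lo, half);
    bitonic_merge(a, lo + half, half);
}
```

Applied to the 16-element sequence from the question (3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0), which is already bitonic, a single call `bitonic_merge(a, 0, 16)` performs log 16 = 4 levels of splits and leaves the array in increasing order.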

If we replace the comparators of the same network with decreasing comparators, the input will be sorted in monotonically decreasing order.

Ans 2 - Mapping Bitonic Sort to a Hypercube and Mesh:

1. Hypercube:
Compare-exchange scenario: In this mapping, each of the n processes contains one element of the input sequence. Graphically, each wire of the bitonic sorting network represents a distinct process. During each step of the algorithm, the compare-exchange operations performed by a column of comparators are performed by n/2 pairs of processes. If the mapping is poor, the elements travel a long distance before they can be compared. Ideally, wires that perform a compare-exchange should be mapped onto neighboring processes.

In any step, the compare-exchange operation is performed between two wires only if their labels differ in exactly one bit. Processes are paired for their compare-exchange steps in a d-dimensional hypercube (that is, $p = 2^d$). In the final stage of bitonic sort, the input has been converted into a bitonic sequence.

Two Rules (steps of the final stage):
During the first step of this stage, processes that differ only in the d-th bit of the binary representation of their labels (that is, the most significant bit) compare-exchange their elements. Thus, the compare-exchange operation takes place between processes along the d-th dimension. Similarly, during the second step of the algorithm, the compare-exchange operation takes place among the processes along the (d - 1)-th dimension. The figure shows the last stage of the process.

The bitonic sort algorithm for a hypercube is shown below. The algorithm relies on the functions comp_exchange_max(i) and comp_exchange_min(i). These functions compare the local element with the element on the nearest process along the i-th dimension and retain either the minimum or the maximum of the two elements.

   procedure BITONIC_SORT(label, d)
   begin
      for i := 0 to d - 1 do
         for j := i downto 0 do
            if (i + 1)st bit of label ≠ jth bit of label then
               comp_exchange_max(j);
            else
               comp_exchange_min(j);
   end BITONIC_SORT

2. Mesh:
The connectivity of a mesh is lower than that of a hypercube, so it is impossible to map wires to processes such that each compare-exchange operation occurs only between neighboring processes. There are several ways to map the input wires onto the mesh processes. Some of these are illustrated in the figure. Each process in this figure is labeled by the wire that is mapped onto it.
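Returning to the hypercube procedure BITONIC_SORT given above, its behaviour can be checked with a sequential simulation in C. This is a sketch under assumptions, not a parallel implementation: `a[label]` stands for the element held by the process with that label, bits are numbered from 0 at the least significant end, and the function name `hypercube_bitonic_sort` is illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential simulation of BITONIC_SORT on a d-dimensional hypercube.
 * a has 2^d entries; a[label] is the element of process `label`. */
void hypercube_bitonic_sort(int *a, unsigned d)
{
    assert(a != NULL && d < 31);
    size_t n = (size_t)1 << d;
    for (unsigned i = 0; i < d; i++) {
        for (unsigned j = i + 1; j-- > 0; ) {       /* j := i downto 0 */
            size_t bit = (size_t)1 << j;
            for (size_t label = 0; label < n; label++) {
                if (label & bit)
                    continue;        /* visit each pair once, from its lower label */
                size_t partner = label | bit;       /* neighbor along dimension j */
                /* Bit j of `label` is 0, so by the rule in the procedure it
                 * keeps the maximum exactly when its (i+1)st bit is 1;
                 * the partner then keeps the minimum. */
                int keep_max = (int)((label >> (i + 1)) & 1u);
                int out_of_order = keep_max ? (a[label] < a[partner])
                                            : (a[label] > a[partner]);
                if (out_of_order) {
                    int t = a[label];
                    a[label] = a[partner];
                    a[partner] = t;
                }
            }
        }
    }
}
```

Every compare-exchange pairs two labels that differ in exactly one bit, which is why each step maps onto directly connected hypercube neighbors. On a real machine the innermost loop disappears, because all n/2 pairs act at the same time.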

Figure (a) shows row-major mapping, (b) shows row-major snakelike mapping, and (c) shows row-major shuffled mapping. The compare-exchange steps of the last stage of bitonic sort for the row-major shuffled mapping are shown in the figure below.

3) Explain parallel algorithm for Prim's algorithm and compare its complexity with the sequential algorithm for the same.
[Win 14(7 marks), Sum 14(7 marks), Win 13(7 marks), Sum 13(7 marks), Win 12(7 marks), Sum 12(7 marks), Total: 42 Marks]

A minimum spanning tree (MST) for a weighted undirected graph is a spanning tree with minimum weight. If a graph G is not connected, it cannot have a spanning tree. Prim's algorithm for finding an MST is a greedy algorithm.

The algorithm begins by selecting an arbitrary starting vertex. It then grows the minimum spanning tree by choosing a new vertex and edge that are guaranteed to be in a spanning tree of minimum cost. The algorithm continues until all the vertices have been selected.

Let G = (V, E, w) be the weighted undirected graph for which the minimum spanning tree is to be found, and let $A = (a_{i,j})$ be its weighted adjacency matrix. The algorithm uses the set $V_T$ to hold the vertices of the minimum spanning tree during its construction. It also uses an array d[1..n] in which, for each vertex $v \in (V - V_T)$, d[v] holds the weight of the edge with the least weight from any vertex in $V_T$ to vertex v.

In the parallel version, each process $P_i$ computes $d_i[u] = \min\{ d[v] \mid v \in (V - V_T) \}$, taken over the vertices assigned to it, during each iteration of the while loop. The global minimum is then obtained over all $d_i[u]$ by an all-to-one reduction operation and stored in $P_0$. $P_0$ then inserts the selected vertex u into $V_T$ and broadcasts u to all processes by a one-to-all broadcast operation.
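The sequential algorithm described above (grow $V_T$ from a start vertex while maintaining d[v]) can be sketched in C. This is a minimal sketch under assumptions that are not in the notes: the graph is connected, it is stored as a row-major n x n adjacency matrix with `NO_EDGE` marking a missing edge, n is at most 64, and the function returns only the total MST weight. The names `prim_mst_weight` and `NO_EDGE` are illustrative.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NO_EDGE INT_MAX   /* assumed marker for "no edge between i and j" */

/* Returns the weight of a minimum spanning tree of a connected graph,
 * starting from vertex r. A is a row-major n x n weighted adjacency matrix. */
long prim_mst_weight(const int *A, size_t n, size_t r)
{
    assert(A != NULL && n > 0 && n <= 64 && r < n);
    int d[64];         /* d[v]: least edge weight from any tree vertex to v */
    int in_tree[64];   /* in_tree[v] != 0 once v belongs to V_T */
    for (size_t v = 0; v < n; v++) {
        in_tree[v] = 0;
        d[v] = A[r * n + v];
    }
    in_tree[r] = 1;
    d[r] = 0;
    long total = 0;
    for (size_t step = 1; step < n; step++) {
        /* Select the vertex u outside the tree with the smallest d[u]. */
        size_t u = n;
        for (size_t v = 0; v < n; v++)
            if (!in_tree[v] && d[v] != NO_EDGE && (u == n || d[v] < d[u]))
                u = v;
        assert(u != n);    /* fails only if the graph is not connected */
        in_tree[u] = 1;
        total += d[u];
        /* Update d[] for the vertices that are still outside the tree. */
        for (size_t v = 0; v < n; v++)
            if (!in_tree[v] && A[u * n + v] < d[v])
                d[v] = A[u * n + v];
    }
    return total;
}
```

The two inner loops over v (selecting the minimum and updating d) are the parts that the parallel formulation distributes: each process would run them over its own block of vertices, with a reduction and a broadcast in between.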

Parallel formulation:
The algorithm works in n outer iterations, and it is hard to execute these iterations concurrently. The inner loop is relatively easy to parallelize. Let p be the number of processes, and let n be the number of vertices.

Figure: The partitioning of the distance array d and the adjacency matrix A among p processes.

The adjacency matrix is partitioned in a 1-D block fashion, with the distance vector d partitioned accordingly. In each step, a processor selects the locally closest node, followed by a global reduction to select the globally closest node. This node is inserted into the MST, and the choice is broadcast to all processors. Each processor updates its part of the d vector locally.

Time complexities for various operations:
   cost to select the minimum entry = O(n/p + log p)
   cost of a broadcast = O(log p)

   cost of local update of the d vector = O(n/p)
   parallel time per iteration = O(n/p + log p)
   total parallel time = $O(n^2/p + n \log p)$
   corresponding isoefficiency = $O(p^2 \log^2 p)$

4) Enlist various performance metrics for parallel systems. Explain Speedup, Efficiency and total parallel Overhead in brief.
[Win 14(7 marks), Sum 14(7 marks), Sum 13(7 marks), Win 12(7 marks), Sum 12(7 marks), Total: 35 Marks]

A number of metrics have been used based on the desired outcome of performance analysis.

1. Execution Time:
The serial runtime of a program is the time elapsed between the beginning and the end of its execution on a sequential computer. The parallel runtime is the time that elapses from the moment a parallel computation starts to the moment the last processing element finishes execution. We denote the serial runtime by $T_S$ and the parallel runtime by $T_P$.

2. Total Parallel Overhead:
The overheads incurred by a parallel program are encapsulated into a single expression referred to as the overhead function. We define the overhead function or total overhead of a parallel system as the total time collectively spent by all the processing elements over and above that required by the fastest known sequential algorithm for solving the same problem on a single processing element. We denote the overhead function of a parallel system by the symbol $T_o$.

The total time spent in solving a problem summed over all processing elements is $p T_P$. $T_S$ units of this time are spent performing useful work, and the remainder is overhead. Therefore, the overhead function is given by

   $T_o = p T_P - T_S$

3. Speedup:
When evaluating a parallel system, we are often interested in knowing how much performance gain is achieved by parallelizing a given application over a sequential implementation. Speedup is a measure that captures the relative benefit of solving a problem in parallel. It is defined as the ratio of the time taken to solve a problem on a single processing element to the time required to solve the same problem on a parallel computer with p identical processing elements, that is, $S = T_S / T_P$. We denote speedup by the symbol S.

Only an ideal parallel system containing p processing elements can deliver a speedup equal to p. In practice, ideal behavior is not achieved because, while executing a parallel algorithm, the processing elements cannot devote 100% of their time to the computations of the algorithm.

4. Efficiency:
Efficiency is a measure of the fraction of time for which a processing element is usefully employed; it is defined as the ratio of speedup to the number of processing elements.
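The three definitions above translate directly into arithmetic. A minimal sketch in C, assuming $T_S$ and $T_P$ are measured in the same time unit; the function names are illustrative and not part of any library:

```c
#include <assert.h>

/* Total parallel overhead: To = p * TP - TS. */
double total_overhead(double ts, double tp, int p)
{
    assert(p > 0);
    return p * tp - ts;
}

/* Speedup: S = TS / TP. */
double speedup(double ts, double tp)
{
    assert(tp > 0.0);
    return ts / tp;
}

/* Efficiency: E = S / p. */
double efficiency(double ts, double tp, int p)
{
    assert(p > 0);
    return speedup(ts, tp) / p;
}
```

For example, with a serial runtime of 100 time units, a parallel runtime of 25 units and p = 8 processing elements, the overhead is 8 x 25 - 100 = 100 units, the speedup is 4, and the efficiency is 0.5: half of the total processor time is spent on useful work.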

In an ideal parallel system, speedup is equal to p and efficiency is equal to one. In practice, speedup is less than p and efficiency is between zero and one, depending on the effectiveness with which the processing elements are utilized. We denote efficiency by the symbol E. Mathematically, it is given by

   $E = S / p$

5) Explain odd-even sort in parallel environment and comment on its limitations. OR Discuss Odd-Even Transposition sort. OR Write and explain algorithm for Odd-Even transposition sort. Also sort following data 3, 2, 3, 8, 5, 6, 4, 1.
[Win 14(7 marks), Sum 14(7 marks), Sum 13(7 marks), Win 12(7 marks), Sum 12(7 marks), Total: 35 Marks]

The odd-even transposition algorithm sorts n elements in n phases (n is even), each of which requires n/2 compare-exchange operations. The algorithm alternates between two phases, called the odd and even phases.

Let $\langle a_1, a_2, \ldots, a_n \rangle$ be the sequence to be sorted. During the odd phase, elements with odd indices are compared with their right neighbors, and if they are out of sequence they are exchanged; thus, the pairs $(a_1, a_2), (a_3, a_4), \ldots, (a_{n-1}, a_n)$ are compare-exchanged (assuming n is even). Similarly, during the even phase, elements with even indices are compared with their right neighbors. After n phases of odd-even exchanges, the sequence is sorted. During each phase of the algorithm, compare-exchange operations on pairs of elements are performed simultaneously.

Example:
Consider the one-element-per-process case. Let n be the number of processes (also the number of elements to be sorted). Assume that the processes are arranged in a one-dimensional array. Element $a_i$ initially resides on process $P_i$ for i = 1, 2, ..., n. During the odd phase, each process that has an odd label compare-exchanges its element with the element residing on its right neighbor. Similarly, during the even phase, each process with an even label compare-exchanges its element with the element of its right neighbor. This parallel formulation is presented in the following algorithm.

   procedure ODD-EVEN_PAR(n)
   begin
      id := process's label
      for i := 1 to n do
      begin
         if i is odd then
            if id is odd then
               compare-exchange_min(id + 1);
            else
               compare-exchange_max(id - 1);
         if i is even then
            if id is even then
               compare-exchange_min(id + 1);
            else
               compare-exchange_max(id - 1);
      end for
   end ODD-EVEN_PAR
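The phases of ODD-EVEN_PAR can be simulated sequentially in C to trace the data given in the question. This is a sketch, not a parallel program: the function names are illustrative, the inner loop stands in for the compare-exchanges that all processes would perform at the same time, and the array is 0-based, so the element $a_1$ of the text is `a[0]`.

```c
#include <assert.h>
#include <stddef.h>

/* Compare-exchange of a[i] with its right neighbor: the left position
 * keeps the minimum and the right position keeps the maximum. */
static void compare_exchange(int *a, size_t i)
{
    if (a[i] > a[i + 1]) {
        int t = a[i];
        a[i] = a[i + 1];
        a[i + 1] = t;
    }
}

/* Sequential simulation of odd-even transposition sort: n phases,
 * alternating odd and even, numbered from 1 as in the text. */
void odd_even_transposition_sort(int *a, size_t n)
{
    assert(a != NULL);
    for (size_t phase = 1; phase <= n; phase++) {
        /* Odd phase: pairs (a1,a2), (a3,a4), ... start at array index 0.
         * Even phase: pairs (a2,a3), (a4,a5), ... start at array index 1. */
        size_t first = (phase % 2 == 1) ? 0 : 1;
        for (size_t i = first; i + 1 < n; i += 2)
            compare_exchange(a, i);
    }
}
```

Calling `odd_even_transposition_sort` on 3, 2, 3, 8, 5, 6, 4, 1 with n = 8 runs eight phases and yields 1, 2, 3, 3, 4, 5, 6, 8. All pairs within one phase are disjoint, which is what allows them to run simultaneously on separate processes.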

Odd-even sort in parallel environment:
It is easy to parallelize odd-even transposition sort. During each phase of the algorithm, compare-exchange operations on pairs of elements are performed simultaneously. In the one-element-per-process case described above, the odd or even processes perform a compare-exchange step with their right neighbors during each phase. The odd-even transposition sort is shown in the figure. A total of n such phases are performed; thus, the parallel run time of this formulation is $\Theta(n)$.

6) Explain mutual exclusion for shared variable in Pthreads. OR Explain three types of mutex normal, recursive and error check in context to Pthread.
[Win 14(7 marks), Sum 14(7 marks), Win 13(7 marks), Sum 13(7 marks), Win 12(2 marks), Total: 30 Marks]

If multiple threads try to manipulate the same shared data at the same time, a race condition occurs. To prevent this, threaded APIs provide mutual exclusion locks, also known as mutex-locks, for such shared data. A mutex-lock has two states, locked and unlocked. The code that manipulates a shared variable should have a lock associated with it.
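A minimal sketch in C of protecting a shared variable with a mutex-lock, using the standard Pthreads calls `pthread_mutex_lock`, `pthread_mutex_unlock`, `pthread_create` and `pthread_join`. The names `counter`, `worker` and `run_counter_demo`, and the thread and iteration counts, are hypothetical choices for this example.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 4
#define INCREMENTS  100000

static long counter = 0;                      /* the shared variable */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread increments the shared counter INCREMENTS times. The lock
 * makes "read, add one, write back" a critical section, so no update
 * from another thread can be lost in between. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Starts the threads, waits for all of them, and returns the final count. */
long run_counter_demo(void)
{
    pthread_t threads[NUM_THREADS];
    counter = 0;
    for (int t = 0; t < NUM_THREADS; t++) {
        int rc = pthread_create(&threads[t], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < NUM_THREADS; t++) {
        int rc = pthread_join(threads[t], NULL);
        assert(rc == 0);
        (void)rc;
    }
    return counter;
}
```

With the lock in place the result is always NUM_THREADS x INCREMENTS. Without the lock and unlock calls the increments from different threads could interleave, and the final count could come out smaller. On many systems the program has to be built with the compiler's thread option (for example `-pthread` with GCC or Clang).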

A thread that wants to update the value of a shared variable must first acquire the lock on that variable. As only one lock on the data is allowed at a time, no other thread can lock the same variable at that point in time, so a thread that tries to lock an already locked variable is blocked. Before leaving the critical region, the thread must unlock the variable so that other threads can update its value. Initially all mutex-locks are in the unlocked state.

The Pthreads API provides two functions to lock and unlock a mutex:

   int pthread_mutex_lock(pthread_mutex_t *mutex_lock);
   int pthread_mutex_unlock(pthread_mutex_t *mutex_lock);

If a thread successfully acquires the lock, it enters the critical section. If more than one thread is blocked on the lock, one of them is allowed to enter the critical section based on the scheduling policy. If a thread attempts to unlock a mutex that is already unlocked or that is locked by another thread, the effect is undefined.

Before a mutex is used, it should be initialized to the unlocked state by the function pthread_mutex_init():

   int pthread_mutex_init(pthread_mutex_t *mutex_lock, const pthread_mutexattr_t *lock_attr);

It is possible to reduce the idling overhead of locking with another lock function, pthread_mutex_trylock():

   int pthread_mutex_trylock(pthread_mutex_t *mutex_lock);

If the lock is acquired successfully it returns 0; otherwise it returns EBUSY to indicate that the mutex is busy, without blocking the calling thread.

Types of mutex:

1. Normal:
This is the default type of locking. Only a single thread is allowed to lock a normal mutex once at any point in time. If a thread that holds the lock attempts to lock it again, the second locking call results in deadlock.
Type constant: PTHREAD_MUTEX_NORMAL_NP

2. Recursive:
It allows a single thread to lock a mutex more than once. Each time the thread locks the mutex, a lock counter is incremented. Each unlock decrements the counter. Before another thread can lock an instance of this type of mutex, the locking thread must call the pthread_mutex_unlock() routine the same number of times that it called the pthread_mutex_lock() routine.

When a thread successfully locks a recursive mutex, it owns that mutex and the lock count is set to 1. Any other thread attempting to lock the mutex blocks until the mutex becomes unlocked. If the owner of the mutex attempts to lock the mutex again, the lock count is incremented and the thread continues running. When a recursive mutex's owner unlocks it, the lock count is decremented. The mutex remains locked and owned until the count reaches zero. It is an error for any thread other than the owner to attempt to unlock a recursive mutex.

Type constant (recursive mutex): PTHREAD_MUTEX_RECURSIVE_NP

3. Errorcheck mutex:
This type of mutex is locked exactly once by a thread, like a normal mutex. If a thread tries to lock the mutex again without first unlocking it, or tries to unlock a mutex it does not own, the thread receives an error.
Type constant: PTHREAD_MUTEX_ERRORCHECK_NP

Thus, errorcheck mutexes are more informative than normal mutexes, because normal mutexes deadlock in such a case, leaving the programmer to determine why the thread no longer executes.

The function pthread_mutexattr_settype_np can be used for setting the type of mutex specified by the mutex attributes object:

   int pthread_mutexattr_settype_np(pthread_mutexattr_t *attr, int type);

7) Explain thread creation, termination and cancellation in detail in shared-address-space parallel system. OR Explain following functions with respect to Pthreads API. Also discuss arguments of these functions. i. pthread_create() ii. pthread_join()
[Win 14(7 marks), Sum 14(7 marks), Win 12(7 marks), Sum 12(7 marks), Total: 28 Marks]

In multiuser systems and protected environments, full processes are less suitable for fine-grained parallelism. In such environments, lightweight processes called threads allow faster manipulation of data in global memory. A thread can be defined as an independent sequential flow of control in a program. The header file that declares most of the thread functionality is pthread.h.

For example, each dot product in a matrix multiplication can be considered a thread, as in the following pseudocode:

   C[i][j] = create_thread(dot_product(get_row(a, i), get_col(b, j)));

Each thread in the above code can be executed on a different processor. Each thread needs to access elements of the matrices A, B and C, which are stored in the shared address space.

Figure: The logical machine model of a thread-based programming paradigm (processors P connected to shared memory blocks M).

All memory M in this model is globally accessible to every thread. Because threads are invoked as function calls, the stack of each call is generally treated as local to the thread. As locality of data is important for performance, processors use cache memory to hold these local variables.

The following points compare threaded programming with the message passing paradigm.

Thread creation:
   int pthread_create(pthread_t *thread_id, const pthread_attr_t *attr, void *(*start_function)(void *), void *arg);

Thread join:
   int pthread_join(pthread_t thread_id, void **ptr);

Thread termination:
   void pthread_exit(void *value_ptr);

Major advantages of thread-based programming:

1. Software portability: Threaded applications can be developed on serial machines and run on parallel machines without changes, which eases migration from serial to parallel programming.
2. Scheduling / Load balancing: Threaded programming allows a large number of concurrent tasks to be specified. These tasks are mapped to multiple processors by dynamic mapping techniques to reduce the overheads of communication and idling.
3. Latency hiding: While one thread is waiting on memory latency, another ready thread can use the CPU to execute its task.
4. Ease of programming: The POSIX thread API is a widely available development tool for threaded programs, and it makes programming with threads comparatively easy.

8) Explain parallel formulation of Dijkstra's algorithm for single source shortest path with an example.
[Win 14(7 marks), Sum 14(7 marks), Win 12(7 marks), Sum 12(7 marks), Total: 28 Marks]

For a weighted graph G = (V, E, w), the single-source shortest paths problem is to find the shortest paths from a vertex $v \in V$ to all other vertices in V. A shortest path from u to v is a minimum-weight path. Depending on the application, edge weights may represent time, cost, penalty, loss, or any other quantity that accumulates additively along a path and is to be minimized. Dijkstra's algorithm solves the single-source shortest-paths problem on both directed and undirected graphs with non-negative weights. Dijkstra's algorithm, which finds the shortest paths from a single vertex s, is similar to Prim's minimum spanning tree algorithm.

Dijkstra's single-source shortest path algorithm:
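A minimal sequential sketch of the algorithm in C. The representation is an assumption made for this example rather than something the notes specify: a row-major n x n weighted adjacency matrix with `NO_EDGE` marking a missing edge, n at most 64, and non-negative weights small enough that path sums do not overflow an int. The names `dijkstra` and `NO_EDGE` are illustrative.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NO_EDGE INT_MAX   /* assumed marker for "no edge between i and j" */

/* Fills l[v] with the minimum cost of reaching v from the source s.
 * Vertices that cannot be reached keep the value NO_EDGE. */
void dijkstra(const int *A, size_t n, size_t s, int *l)
{
    assert(A != NULL && l != NULL && n > 0 && n <= 64 && s < n);
    int done[64];                   /* done[v] != 0 once v belongs to V_T */
    for (size_t v = 0; v < n; v++) {
        done[v] = 0;
        l[v] = A[s * n + v];
    }
    done[s] = 1;
    l[s] = 0;
    for (size_t step = 1; step < n; step++) {
        /* Greedy choice: the vertex outside V_T that appears closest to s. */
        size_t u = n;
        for (size_t v = 0; v < n; v++)
            if (!done[v] && l[v] != NO_EDGE && (u == n || l[v] < l[u]))
                u = v;
        if (u == n)
            break;                  /* the remaining vertices are unreachable */
        done[u] = 1;
        /* Relax the edges leaving u: a path through u may be shorter. */
        for (size_t v = 0; v < n; v++) {
            int w = A[u * n + v];
            if (!done[v] && w != NO_EDGE && l[u] + w < l[v])
                l[v] = l[u] + w;
        }
    }
}
```

The structure matches Prim's algorithm except for the update step: here the candidate value is the whole path cost l[u] + w, where Prim's algorithm would use the single edge weight w.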

Like Prim's algorithm, it incrementally finds the shortest paths from s to the other vertices of G. It is also greedy; that is, it always chooses an edge to a vertex that appears closest. Comparing this algorithm with Prim's minimum spanning tree algorithm, we see that the two are almost identical. The main difference is that, for each vertex $u \in (V - V_T)$, Dijkstra's algorithm stores l[u], the minimum cost to reach vertex u from vertex s by means of vertices in $V_T$; Prim's algorithm stores d[u], the cost of the minimum-cost edge connecting a vertex in $V_T$ to u. The run time of Dijkstra's algorithm is $\Theta(n^2)$.

Parallel Formulation:
The parallel formulation of Dijkstra's single-source shortest path algorithm is very similar to the parallel formulation of Prim's algorithm for minimum spanning trees. The weighted adjacency matrix is partitioned using the 1-D block mapping. Each of the p processes is assigned n/p consecutive columns of the weighted adjacency matrix and computes n/p values of the array l. During each iteration, all processes perform computation and communication similar to that performed by the parallel formulation of Prim's algorithm. Consequently, the parallel performance and scalability of Dijkstra's single-source shortest path algorithm is identical to that of Prim's minimum spanning tree algorithm.

9) Differentiate blocking and non-blocking message passing operations. OR Explain the blocking message passing send and receive operation. OR Discuss buffered non-blocking and non-buffered non-blocking send/receive message passing operations.
[Win 14(7 marks), Sum 14(7 marks), Win 12(7 marks), Sum 12(4 marks), Total: 25 Marks]

Interactions among the processes of a parallel computer can be performed by sending and receiving messages. The prototype declarations for send and receive are as follows:

   Send(void *send_buf, int n_elems, int destination)
   Receive(void *recv_buf, int n_elems, int source)

send_buf is the pointer to the buffer that contains the data to be sent. For Send, n_elems is the number of elements from the buffer to be sent and destination is the identifier of the process that receives the data.

19 *recv_buf is the pointer to the buffer that stores received data. For Receive, n_elems is the number of elements from buffer to be received and source is the idenfier of process which sends data. Following example illustrates the how process sends a piece of data to another process. P0 a = 10; send(&a, 1, 1); a = 0; P1 receive(&a, 1, 0); printf( %d\n,a); In above code process p0 sends value of variable a to process p1. After sending value p0 immediately change the value of a to zero. P1 receives value of a from p0 and then prints that value. P1 should receive 10 instead of 0. Message passing platforms have additional hardware to support these operations such as DMA and Network interface. DMA and Network interface allows transfer of message from buffer memory to destination without involvement of CPU. 10) Explain Cache Coherence in Multiprocessor Systems. OR Explain invalidate protocol used for cache coherence in multiprocessor system. OR What is the meaning of memory latency? How memory latency 19 P a g e

20 can be improved by Cache? [Sum 14(7 marks), Win 13(7 marks), Sum 12(7 marks), Total: 21 Marks] Memory latency: At the logical level, a memory system, possibly consisting of multiple levels of caches, takes in a request for a memory word and returns a block of data of size b containing the requested word after l nanoseconds. Here, l is referred to as the latency of the memory. Example: Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with latency 100 ns (no caches). Assume that the processor has two multiply-add units and is capable of executing four instructions in each cycle of 1 ns. The peak processor rating is therefore 4 GFLOPS. Since the memory latency is equal to 100 cycles and block size is one word, every time a memory request is made, the processor must wait 100 cycles before it can process the data. Consider the program of computing the dot-product of two vectors. A dot-product computation performs one multiply-add on a single pair of vector elements, i.e., each floating point operation requires one data fetch. So the peak speed of this computation is limited to one floating point operation at every 100 ns, or a speed of 10 MFLOPS which is very low than peek processor rating. Solution: Handling the mismatch in processor and DRAM speeds has motivated a number of architectural innovations in memory system design. One such innovation addresses the speed mismatch by placing a smaller and faster memory between the processor and the DRAM. This memory, referred to as the cache, acts as low-latency high-bandwidth storage. The data needed by the processor is first fetched into the cache. All subsequent accesses to data items residing in the cache are serviced by the cache. Thus, in principle, if a piece of data is repeatedly used, the effective latency of this memory system can be reduced by the cache. 
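In the above example of a 1 GHz processor with a 100 ns latency DRAM, suppose we introduce a cache of size 32 KB with a latency of 1 ns (one cycle). Suppose further that this setup is used to multiply two 32 x 32 matrices A and B, which fit in the cache together with the result matrix. Fetching the two matrices into the cache means fetching 2K words, which takes about 200 µs. Multiplying two n x n matrices takes 2n^3 operations, here 64K operations, which need 16K cycles (about 16 µs) at four instructions per cycle. The total time is therefore about 200 + 16 = 216 µs. This corresponds to a peak computation rate of 64K operations in 216 µs, or approximately 303 MFLOPS, although it is still less than 10% of the peak processor performance. We can see in this example that by placing a small cache memory, we are able to improve processor utilization.

11) Define Isoefficiency function and derive equation of it. [Win 14(7 marks), Sum 13(7 marks), Sum 12(7 marks), Total: 21 Marks]

Parallel execution time can be expressed as a function of the problem size W, the overhead function T_o, and the number of processing elements p. We can write parallel runtime as:

\[ T_P = \frac{W + T_o(W, p)}{p} \]

The resulting expression for speedup is

\[ S = \frac{W}{T_P} = \frac{W p}{W + T_o(W, p)} \]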

Finally, we write the expression for efficiency as

\[ E = \frac{S}{p} = \frac{W}{W + T_o(W, p)} = \frac{1}{1 + T_o(W, p)/W} \]

In the above equation for E, if the problem size is kept constant and p is increased, the efficiency decreases because the total overhead T_o increases with p. If W is increased keeping the number of processing elements fixed, then for scalable parallel systems, the efficiency increases. This is because T_o grows slower than Θ(W) for a fixed p. For these parallel systems, efficiency can be maintained at a desired value (between 0 and 1) for increasing p, provided W is also increased.

For different parallel systems, W must be increased at different rates with respect to p in order to maintain a fixed efficiency. For instance, in some cases, W might need to grow as an exponential function of p to keep the efficiency from dropping as p increases. Such parallel systems are poorly scalable. The reason is that on these parallel systems it is difficult to obtain good speedups for a large number of processing elements unless the problem size is enormous. On the other hand, if W needs to grow only linearly with respect to p, then the parallel system is highly scalable. That is because it can easily deliver speedups proportional to the number of processing elements for reasonable problem sizes.

For scalable parallel systems, efficiency can be maintained at a fixed value (between 0 and 1) if the ratio T_o/W in the equation for E is maintained at a constant value. For a desired value E of efficiency,

\[ E = \frac{1}{1 + T_o(W, p)/W}, \qquad \frac{T_o(W, p)}{W} = \frac{1 - E}{E}, \qquad W = \frac{E}{1 - E}\, T_o(W, p) \]

Let K = E/(1 - E) be a constant depending on the efficiency to be maintained. Since T_o is a function of W and p, the above equation for W can be rewritten as

\[ W = K\, T_o(W, p) \]

From the above equation, the problem size W can usually be obtained as a function of p by algebraic manipulations.
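As an illustration of this manipulation, consider the standard textbook case of adding n numbers on p processing elements, with W taken as roughly n. The overhead function T_o ≈ 2p log p used below is an assumption brought in for this example; it is not derived in these notes.

```latex
\begin{align*}
T_o(W, p) &\approx 2p \log p \\
W &= K\, T_o(W, p) = 2Kp \log p \\
\Rightarrow\quad W &= \Theta(p \log p)
\end{align*}
```

Under this assumption, going from p = 4 to p = 16 raises p log p from 8 to 64 (logarithm to base 2), so the problem size must grow by a factor of 8 to hold the efficiency at the same value.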

The function of p obtained in this way dictates the growth rate of W required to keep the efficiency fixed as p increases. We call this function the isoefficiency function of the parallel system. The isoefficiency function determines the ease with which a parallel system can maintain a constant efficiency and hence achieve speedups increasing in proportion to the number of processing elements.

12) Explain Matrix-Multiplication using DNS Algorithm. [Win 14(7 marks), Sum 13(7 marks), Sum 12(7 marks), Total: 21 Marks]

The DNS algorithm can use up to n^3 processes and performs matrix multiplication in time Θ(log n) by using Θ(n^3/log n) processes. It is known as the DNS algorithm because it is due to Dekel, Nassimi, and Sahni.

Parallel formulation: The process arrangement can be visualized as n planes of n x n processes each. Each plane corresponds to a different value of k. Initially, as shown in the Figure, the matrices are distributed among the n^2 processes of the plane corresponding to k = 0 at the base of the three-dimensional process array. Process P_{i,j,0} initially owns A[i, j] and B[i, j].

The Figure shows the communication steps in the DNS algorithm while multiplying 4 x 4 matrices A and B on 64 processes. The shaded processes in part (c) store elements of the first row of A and the shaded processes in part (d) store elements of the first column of B.

The vertical column of processes P_{i,j,*} computes the dot product of row A[i, *] and column B[*, j]. Therefore, rows of A and columns of B need to be moved appropriately so that each vertical column of processes P_{i,j,*} has row A[i, *] and column B[*, j]. More precisely, process P_{i,j,k} should have A[i, k] and B[k, j].

For matrix A:

The communication pattern for distributing the elements of matrix A among the processes is shown in Figure (a)-(c). First, each column of A moves to a different plane such that the jth column occupies the same position in the plane corresponding to k = j as it initially did in the plane corresponding to k = 0. The distribution of A after moving A[i, j] from P_{i,j,0} to P_{i,j,j} is shown in Figure (b). Now all the columns of A are replicated n times in their respective planes by a parallel one-to-all broadcast along the j axis. The result of this step is shown in Figure (c), in which the n processes P_{i,0,j}, P_{i,1,j}, ..., P_{i,n-1,j} receive a copy of A[i, j] from P_{i,j,j}. At this point, each vertical column of processes P_{i,j,*} has row A[i, *]. More precisely, process P_{i,j,k} has A[i, k].

For matrix B:
The communication steps are similar, but the roles of i and j in the process subscripts are switched. In the first one-to-one communication step, B[i, j] is moved from P_{i,j,0} to P_{i,j,i}. Then it is broadcast from P_{i,j,i} among P_{0,j,i}, P_{1,j,i}, ..., P_{n-1,j,i}. The distribution of B after this one-to-all broadcast along the i axis is shown in Figure (d). At this point, each vertical column of processes P_{i,j,*} has column B[*, j]. Now process P_{i,j,k} has B[k, j], in addition to A[i, k].

After these communication steps, A[i, k] and B[k, j] are multiplied at P_{i,j,k}. Now each element C[i, j] of the product matrix is obtained by an all-to-one reduction along the k axis. During this step, process P_{i,j,0} accumulates the results of the multiplication from processes P_{i,j,1}, ..., P_{i,j,n-1}. The Figure shows this step for C[0, 0].

The DNS algorithm has three main communication steps:
(1) moving the columns of A and the rows of B to their respective planes,
(2) performing one-to-all broadcast along the j axis for A and along the i axis for B, and
(3) all-to-one reduction along the k axis.
All these operations are performed within groups of n processes and take time Θ(log n).
Thus, the parallel run time for multiplying two n x n matrices using the DNS algorithm on n^3 processes is Θ(log n).
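The data movement described above can be checked with a small sequential simulation. The following is a minimal sketch in C, not a parallel implementation: the function name dns_multiply, the macro AT, and the arrays a3 and b3 (the A and B values held by each hypothetical process P_{i,j,k}) are names made up for this illustration, and the matrices are assumed to be stored in row-major order.

```c
#include <assert.h>
#include <stdlib.h>

/* Index of the value held by hypothetical process P(i,j,k) in a flat array. */
#define AT(i, j, k, n) ((((size_t)(i) * (n)) + (j)) * (n) + (k))

/* Sequentially simulates the DNS data movement for row-major n x n matrices.
 * Returns 0 on success, -1 if memory allocation fails. */
int dns_multiply(int n, const double *A, const double *B, double *C)
{
    assert(n > 0 && A != NULL && B != NULL && C != NULL);

    size_t cells = (size_t)n * n * n;
    double *a3 = calloc(cells, sizeof *a3); /* A value held by each P(i,j,k) */
    double *b3 = calloc(cells, sizeof *b3); /* B value held by each P(i,j,k) */
    if (a3 == NULL || b3 == NULL) {
        free(a3);
        free(b3);
        return -1;
    }

    /* Step 1: one-to-one moves out of plane k = 0.
     * A[i][j] goes to P(i,j,j) and B[i][j] goes to P(i,j,i). */
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            a3[AT(i, j, j, n)] = A[i * n + j];
            b3[AT(i, j, i, n)] = B[i * n + j];
        }
    }

    /* Step 2: one-to-all broadcasts inside each plane k.
     * P(i,k,k) sends A[i][k] along the j axis; P(k,j,k) sends B[k][j]
     * along the i axis. Afterwards P(i,j,k) holds A[i][k] and B[k][j]. */
    for (int k = 0; k < n; k++) {
        for (int i = 0; i < n; i++) {
            double a_src = a3[AT(i, k, k, n)];
            for (int j = 0; j < n; j++)
                a3[AT(i, j, k, n)] = a_src;
        }
        for (int j = 0; j < n; j++) {
            double b_src = b3[AT(k, j, k, n)];
            for (int i = 0; i < n; i++)
                b3[AT(i, j, k, n)] = b_src;
        }
    }

    /* Step 3: each P(i,j,k) multiplies its pair, then an all-to-one
     * reduction along the k axis accumulates C[i][j] at P(i,j,0). */
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a3[AT(i, j, k, n)] * b3[AT(i, j, k, n)];
            C[i * n + j] = sum;
        }
    }

    free(a3);
    free(b3);
    return 0;
}
```

For example, calling dns_multiply(2, A, B, C) with A = {1, 2, 3, 4} and B = {5, 6, 7, 8} fills C with {19, 22, 43, 50}. In a real message-passing program each of the n^3 processes would hold only its own pair of values, and steps 2 and 3 would be collective operations within groups of n processes; the loops here only reproduce where the data ends up, not the Θ(log n) running time.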