# Parallel Processing IMP Questions



1) What is Data Decomposition? Explain Data Decomposition with a proper example. OR What is decomposition of a task? Explain data decomposition in detail. OR Explain the Recursive Decomposition technique to find the minimum number in an array. Draw the task-dependency graph for the following data: 5, 12, 11, 1, 10, 6, 8, 3, 7, 4, 9, 2. [Sum 14 (14 marks), Win 13 (7 marks), Sum 13 (7 marks), Win 12 (7 marks), Sum 12 (7 marks), Total: 42 Marks]

1. Recursive Decomposition: This decomposition technique uses the divide-and-conquer strategy. A problem is solved by first dividing it into a set of independent subproblems; each subproblem is then solved by recursively applying a similar division into still smaller subproblems, and so on.

Example: Quicksort. An array A of n elements is sorted using quicksort. It selects a pivot element x and partitions A into two subarrays A0 and A1 such that all the elements in A0 are smaller than x and all the elements in A1 are greater than or equal to x. This partitioning step forms the divide step of the algorithm. Each of the subsequences A0 and A1 is sorted by recursively calling quicksort, and each of these recursive calls further partitions the subarrays.

2. Data Decomposition: Decomposition of the computation is done in two steps: first, the data on which the computations are performed is partitioned; second, this partitioning of the data is used to partition the computation into tasks.

Partitioning Output Data: In some computations each element of the output can be computed independently, so partitioning the output data automatically decomposes the problem into tasks.

Example: Matrix multiplication. Consider the problem of multiplying two n x n matrices A and B to yield a matrix C. The figure shows a decomposition of this problem into four tasks, based on partitioning the output matrix C into four submatrices; each of the four tasks computes one of these submatrices.
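The recursive-decomposition question above (finding the minimum of an array) can be sketched as follows. This Python sketch is illustrative only and is not part of the original notes; each pair of recursive calls corresponds to two independent tasks in the task-dependency graph.

```python
# Recursive decomposition applied to finding the minimum of an array:
# each call splits the data into two independent halves, mirroring the
# divide-and-conquer task tree described above.
def recursive_min(a):
    if len(a) == 1:
        return a[0]
    mid = len(a) // 2
    # The two recursive calls are independent tasks and could run in parallel.
    left = recursive_min(a[:mid])
    right = recursive_min(a[mid:])
    return min(left, right)  # combine step joins the two subtask results

print(recursive_min([5, 12, 11, 1, 10, 6, 8, 3, 7, 4, 9, 2]))  # -> 1
```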

Partitioning Input Data: When finding the minimum, maximum, or sum of an array, the output is a single unknown value, so the output cannot be partitioned; in such cases it is possible to partition the input data instead. A task is created for each partition of the input data, and each task performs as much computation as possible using its local data. Since the problem is not directly solved by the local computations alone, a follow-up task is needed to combine the partial results.

The problem of determining the minimum of the set of numbers {4, 9, 1, 7, 8, 11, 12, 2} can be decomposed based on a partitioning of the input data. The figure shows such a decomposition: the input is split into the partitions (4, 9), (1, 7), (8, 11), (12, 2); intermediate tasks combine the local minima pairwise ({4, 1} and {8, 2}, then {1, 2}); and a final task produces the overall minimum, 1.

Partitioning Both Input and Output Data: In some cases in which it is possible to partition the output data, partitioning the input data as well can offer additional concurrency. Example: a relational database of vehicles processing the query MODEL="Civic" AND YEAR="2001" AND (COLOR="Green" OR COLOR="White").

Partitioning Intermediate Data: Algorithms are often structured as multi-stage computations in which the output of one stage is the input to the subsequent stage. A decomposition of such an algorithm can be derived by partitioning the input or output data of an intermediate stage, which can sometimes yield higher concurrency than partitioning the input or output data of the whole algorithm. Let us revisit matrix multiplication to illustrate a decomposition based on partitioning intermediate data. The decomposition induced by a 2 x 2 partitioning of the output matrix C has a maximum degree of concurrency of four. The degree of concurrency can be increased by introducing an intermediate stage in which eight tasks
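The input-data partitioning described above can be sketched as follows. This Python sketch is illustrative only and is not from the notes; the chunking scheme and the `parts` parameter are assumptions made for the example.

```python
# Input-data partitioning: each partition gets its own task computing a
# local minimum; a follow-up combine step merges the partial results.
def partitioned_min(data, parts=4):
    size = (len(data) + parts - 1) // parts
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    local_minima = [min(c) for c in chunks]   # independent per-partition tasks
    return min(local_minima)                  # follow-up combine step

print(partitioned_min([4, 9, 1, 7, 8, 11, 12, 2]))  # -> 1
```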

compute their respective product submatrices and store the results in a temporary three-dimensional matrix D, as shown in the figure. The submatrix Dk,i,j is the product of Ai,k and Bk,j. A partitioning of the intermediate matrix D induces a decomposition into eight tasks. After the multiplication phase, a relatively inexpensive matrix-addition step computes the result matrix C.

2) Explain bitonic sort. OR Discuss mapping of the bitonic sort algorithm to a hypercube and a mesh. OR Write the two rules for a bitonic sequence in a bitonic sorting network and explain them with an example. Briefly discuss bitonic sort and trace the following sequence using it: 3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0. [Sum 14 (14 marks), Win 13 (7 marks), Sum 13 (7 marks), Win 12 (7 marks), Sum 12 (7 marks), Total: 42 Marks]

Ans 1 - Bitonic sort: A bitonic sorting network sorts n elements in O(log² n) time. The key operation of the bitonic sorting network is the rearrangement of a bitonic sequence into a sorted sequence. A bitonic sequence is a sequence of elements <a0, a1, ..., an-1> with the property that either:

(1) there exists an index i, 0 ≤ i ≤ n-1, such that <a0, ..., ai> is monotonically increasing and <ai+1, ..., an-1> is monotonically decreasing, or
(2) there exists a cyclic shift of indices so that (1) is satisfied.

The method that rearranges a bitonic sequence into monotonically increasing order is called bitonic sort. Let s = <a0, a1, ..., an-1> be a bitonic sequence such that a0 ≤ a1 ≤ ... ≤ a(n/2-1) and a(n/2) ≥ a(n/2+1) ≥ ... ≥ a(n-1). Consider the following subsequences of s:

s1 = <min{a0, a(n/2)}, min{a1, a(n/2+1)}, ..., min{a(n/2-1), a(n-1)}>
s2 = <max{a0, a(n/2)}, max{a1, a(n/2+1)}, ..., max{a(n/2-1), a(n-1)}>

In s1 there is an element bi = min{ai, a(n/2+i)} such that all the elements before bi are from the increasing part of the original sequence and all the elements after bi are from the decreasing part. Similarly, in s2 there is an element bi' = max{ai, a(n/2+i)} such that all the elements before bi' are from the decreasing part of the original sequence and all the elements after it are from the increasing part. Thus, the sequences s1 and s2 are both bitonic. Furthermore, every element of s1 is smaller than or equal to every element of s2: bi is greater than or equal to all the elements of s1, bi' is less than or equal to all the elements of s2, and bi' is greater than or equal to bi.

Solution: A bitonic sequence of size n can therefore be sorted by rearranging the two smaller bitonic sequences and concatenating them. The operation of splitting a bitonic sequence of size n into the two bitonic sequences s1 and s2 is called a bitonic split. We can recursively obtain shorter bitonic subsequences using the equations for s1 and s2 until we obtain subsequences of size one. The number of splits required to rearrange the bitonic sequence into a sorted sequence is log n. This procedure of sorting a bitonic sequence using bitonic splits is called a bitonic merge, shown in the figure below.

Example: A bitonic merge is easy to implement on a network of comparators, called a bitonic merging network. This network contains log n columns; each column contains n/2 comparators and performs one step of the bitonic merge. The network takes a bitonic sequence as input and gives a sorted sequence as output.
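The bitonic split and recursive merge described above can be sketched in Python as follows. This sketch is illustrative, not from the notes, and assumes n is a power of two.

```python
# Bitonic merge via recursive bitonic splits: s1 holds the element-wise
# minima and s2 the maxima; both halves are bitonic and all of s1 <= all
# of s2, so merging each half recursively and concatenating sorts s.
def bitonic_merge(s):
    n = len(s)
    if n == 1:
        return s
    s1 = [min(s[i], s[i + n // 2]) for i in range(n // 2)]  # bitonic split
    s2 = [max(s[i], s[i + n // 2]) for i in range(n // 2)]
    return bitonic_merge(s1) + bitonic_merge(s2)

# A bitonic input (increasing then decreasing), from the question above.
print(bitonic_merge([3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0]))
```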

If we replace the increasing comparators of the same network with decreasing comparators, the input will be sorted in monotonically decreasing order.

Ans 2 - Mapping Bitonic Sort to a Hypercube and a Mesh:

1. Hypercube: In this mapping, each of the n processes contains one element of the input sequence; graphically, each wire of the bitonic sorting network represents a distinct process. During each step of the algorithm, the compare-exchange operations performed by a column of comparators are performed by n/2 pairs of processes. If the mapping is poor, elements travel a long distance before they can be compared; ideally, wires that perform a compare-exchange should be mapped onto neighboring processes. In any step, a compare-exchange operation is performed between two wires only if their labels differ in exactly one bit, which is exactly how processes are paired in a d-dimensional hypercube (that is, p = 2^d). In the final stage of bitonic sort, the input has been converted into a bitonic sequence.

Two rules: During the first step of this final stage, processes whose labels differ only in the d-th bit of their binary representation (that is, the most significant bit) compare-exchange their elements; thus the compare-exchange operation takes place along the d-th dimension. Similarly, during the second step, the compare-exchange operation takes place among processes along the (d-1)-th dimension. The figure shows the last stage of the process.

The bitonic sort algorithm for a hypercube is shown below. It relies on the functions comp_exchange_max(i) and comp_exchange_min(i), which compare the local element with the element on the nearest process along the i-th dimension and retain either the maximum or the minimum of the two elements.

```
procedure BITONIC_SORT(label, d)
begin
    for i := 0 to d - 1 do
        for j := i downto 0 do
            if (i + 1)st bit of label ≠ jth bit of label then
                comp_exchange_max(j);
            else
                comp_exchange_min(j);
end BITONIC_SORT
```

2. Mesh: The connectivity of a mesh is lower than that of a hypercube, so it is impossible to map wires to processes such that every compare-exchange operation occurs only between neighboring processes. There are several ways to map the input wires onto the mesh processes; some are illustrated in the figure, where each process is labeled by the wire that is mapped onto it.
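The hypercube procedure above can be simulated sequentially in Python: each "process" is an array index (its label), and the partner along dimension j is found by flipping bit j of the label. This sketch is illustrative only and assumes the input length is a power of two.

```python
# Sequential simulation of BITONIC_SORT(label, d) on a hypercube: a
# process keeps the maximum when the (i+1)-st and j-th bits of its label
# differ, and the minimum otherwise, exactly as in the pseudocode above.
def bit(x, k):
    return (x >> k) & 1

def hypercube_bitonic_sort(a):
    a = a[:]
    n = len(a)
    d = n.bit_length() - 1          # number of hypercube dimensions
    for i in range(d):
        for j in range(i, -1, -1):
            for label in range(n):
                partner = label ^ (1 << j)   # neighbor along dimension j
                if partner < label:
                    continue                 # each pair handled once
                lo, hi = min(a[label], a[partner]), max(a[label], a[partner])
                if bit(label, i + 1) != bit(label, j):
                    a[label], a[partner] = hi, lo   # comp_exchange_max
                else:
                    a[label], a[partner] = lo, hi   # comp_exchange_min
    return a

print(hypercube_bitonic_sort([95, 3, 60, 18, 5, 40, 90, 0]))
```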

Figure (a) shows the row-major mapping, (b) the row-major snakelike mapping, and (c) the row-major shuffled mapping. The compare-exchange steps of the last stage of bitonic sort for the row-major shuffled mapping are shown in the figure below.

3) Explain the parallel algorithm for Prim's algorithm and compare its complexity with the sequential algorithm for the same. [Win 14 (7 marks), Sum 14 (7 marks), Win 13 (7 marks), Sum 13 (7 marks), Win 12 (7 marks), Sum 12 (7 marks), Total: 42 Marks]

A minimum spanning tree (MST) of a weighted undirected graph is a spanning tree with minimum weight. If G is not connected, it cannot have a spanning tree. Prim's algorithm for finding an MST is a greedy algorithm.

The algorithm begins by selecting an arbitrary starting vertex. It then grows the minimum spanning tree by choosing a new vertex and edge that are guaranteed to be in a spanning tree of minimum cost, and continues until all the vertices have been selected. Let G = (V, E, w) be the weighted undirected graph for which the minimum spanning tree is to be found, and let A = (a[i,j]) be its weighted adjacency matrix. The algorithm uses the set VT to hold the vertices of the minimum spanning tree during its construction. It also uses an array d[1..n] in which, for each vertex v in (V - VT), d[v] holds the weight of the least-weight edge from any vertex in VT to v. During each iteration of the while loop, each process Pi computes its local minimum di[u] = min { d[v] : v in (V - VT) }. The global minimum u is then obtained over all the di[u] by an all-to-one reduction and stored in P0. P0 then inserts u into VT and broadcasts u to all processes by a one-to-all broadcast.
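The d[] array and per-iteration select/update steps described above can be sketched sequentially in Python. This sketch is illustrative, not from the notes; the example graph is made up, and the inner minimum/update loop is the part that the parallel formulation distributes across processes.

```python
# Prim's algorithm using the d[] array: d[v] holds the weight of the
# least-weight edge from the growing tree to vertex v.
INF = float("inf")

def prim_mst_weight(adj):
    n = len(adj)
    in_tree = [False] * n
    in_tree[0] = True           # arbitrary starting vertex
    d = adj[0][:]
    total = 0
    for _ in range(n - 1):
        # Select the closest vertex not in the tree (in the parallel version:
        # a local minimum per process, then an all-to-one reduction).
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: d[v])
        in_tree[u] = True
        total += d[u]
        # Update d (in the parallel version each process updates its block).
        for v in range(n):
            if not in_tree[v] and adj[u][v] < d[v]:
                d[v] = adj[u][v]
    return total

adj = [[INF, 1, 3, INF],
       [1, INF, 1, 4],
       [3, 1, INF, 2],
       [INF, 4, 2, INF]]
print(prim_mst_weight(adj))  # -> 4  (edges 0-1, 1-2, 2-3)
```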

Parallel formulation: The algorithm works in n outer iterations, and it is hard to execute these iterations concurrently; the inner loop, however, is relatively easy to parallelize. Let p be the number of processes and let n be the number of vertices.

Figure: The partitioning of the distance array d and the adjacency matrix A among p processes.

The adjacency matrix is partitioned in a 1-D block fashion, with the distance vector d partitioned accordingly. In each step, a process selects the locally closest node, followed by a global reduction to select the globally closest node. This node is inserted into the MST, and the choice is broadcast to all processes. Each process then updates its part of the d vector locally. The time complexities of the various operations are:

- cost to select the minimum entry: O(n/p + log p)
- cost of a broadcast: O(log p)

- cost of the local update of the d vector: O(n/p)
- parallel time per iteration: O(n/p + log p)
- total parallel time: O(n²/p + n log p)
- corresponding isoefficiency: O(p² log² p)

4) Enlist various performance metrics for parallel systems. Explain Speedup, Efficiency and Total Parallel Overhead in brief. [Win 14 (7 marks), Sum 14 (7 marks), Sum 13 (7 marks), Win 12 (7 marks), Sum 12 (7 marks), Total: 35 Marks]

A number of metrics have been used based on the desired outcome of performance analysis.

1. Execution Time: The serial runtime of a program is the time elapsed between the beginning and the end of its execution on a sequential computer. The parallel runtime is the time that elapses from the moment a parallel computation starts to the moment the last processing element finishes execution. We denote the serial runtime by TS and the parallel runtime by TP.

2. Total Parallel Overhead: The overheads incurred by a parallel program are encapsulated into a single expression referred to as the overhead function. We define the overhead function, or total overhead, of a parallel system as the total time collectively spent by all the processing elements over and above that required by the fastest known sequential algorithm for solving the same problem on a single processing element. We denote the overhead function by To. The total time spent in solving a problem, summed over all processing elements, is p·TP. TS units of this time are spent performing useful work, and the remainder is overhead. Therefore, the overhead function is given by

To = p·TP - TS

3. Speedup: When evaluating a parallel system, we are often interested in knowing how much performance gain is achieved by parallelizing a given application over a sequential implementation. Speedup is a measure that captures the relative benefit of solving a problem in parallel.
It is defined as the ratio of the time taken to solve a problem on a single processing element to the time required to solve the same problem on a parallel computer with p identical processing elements: S = TS / TP. Only an ideal parallel system containing p processing elements can deliver a speedup equal to p. In practice, ideal behavior is not achieved because, while executing a parallel algorithm, the processing elements cannot devote 100% of their time to the computations of the algorithm.

4. Efficiency: Efficiency is a measure of the fraction of time for which a processing element is usefully employed; it is defined as the ratio of speedup to the number of processing elements.
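The three formulas above can be computed together, as in this illustrative Python sketch (not from the notes; the timing values are made up for the example):

```python
# Performance metrics from sample timings:
#   To = p*TP - TS,  S = TS/TP,  E = S/p
def metrics(t_serial, t_parallel, p):
    overhead = p * t_parallel - t_serial   # total parallel overhead
    speedup = t_serial / t_parallel        # relative benefit of parallelism
    efficiency = speedup / p               # fraction of useful processor time
    return overhead, speedup, efficiency

To, S, E = metrics(t_serial=100.0, t_parallel=30.0, p=4)
print(To, S, E)  # -> 20.0  3.333...  0.833...
```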

In an ideal parallel system, the speedup is equal to p and the efficiency is equal to one. In practice, the speedup is less than p and the efficiency is between zero and one, depending on the effectiveness with which the processing elements are utilized. We denote efficiency by the symbol E. Mathematically, it is given by E = S / p.

5) Explain odd-even sort in a parallel environment and comment on its limitations. OR Discuss Odd-Even Transposition sort. OR Write and explain the algorithm for Odd-Even Transposition sort. Also sort the following data: 3, 2, 3, 8, 5, 6, 4, 1. [Win 14 (7 marks), Sum 14 (7 marks), Sum 13 (7 marks), Win 12 (7 marks), Sum 12 (7 marks), Total: 35 Marks]

The odd-even transposition algorithm sorts n elements in n phases (n is even), each of which requires n/2 compare-exchange operations. The algorithm alternates between two phases, called the odd and even phases. Let <a1, a2, ..., an> be the sequence to be sorted. During the odd phase, elements with odd indices are compared with their right neighbors and exchanged if they are out of order; thus the pairs (a1, a2), (a3, a4), ..., (an-1, an) are compare-exchanged (assuming n is even). Similarly, during the even phase, elements with even indices are compared with their right neighbors. After n phases of odd-even exchanges, the sequence is sorted. During each phase of the algorithm, the compare-exchange operations on pairs of elements are performed simultaneously.

Example: Consider the one-element-per-process case. Let n be the number of processes (also the number of elements to be sorted), and assume that the processes are arranged in a one-dimensional array. Element ai initially resides on process Pi for i = 1, 2, ..., n. During the odd phase, each process with an odd label compare-exchanges its element with the element residing on its right neighbor; similarly, during the even phase, each process with an even label compare-exchanges its element with the element of its right neighbor.
This parallel formulation is presented in the following algorithm:

```
procedure ODD-EVEN_PAR(n)
begin
    id := process's label
    for i := 1 to n do
        if i is odd then
            if id is odd then
                compare-exchange_min(id + 1);
            else
                compare-exchange_max(id - 1);
        if i is even then
            if id is even then
                compare-exchange_min(id + 1);
            else
                compare-exchange_max(id - 1);
    end for
end ODD-EVEN_PAR
```
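The procedure above can be simulated sequentially and applied to the data from the question. This Python sketch is illustrative only and is not part of the original notes:

```python
# Sequential simulation of odd-even transposition sort: odd phases
# compare-exchange the pairs (a1,a2), (a3,a4), ...; even phases the
# pairs (a2,a3), (a4,a5), ... (1-based, as in the text above).
def odd_even_sort(a):
    a = a[:]
    n = len(a)
    for phase in range(n):
        start = 0 if phase % 2 == 0 else 1   # odd phase, then even phase
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:              # compare-exchange
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_sort([3, 2, 3, 8, 5, 6, 4, 1]))  # -> [1, 2, 3, 3, 4, 5, 6, 8]
```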

Odd-even sort in a parallel environment: It is easy to parallelize odd-even transposition sort. During each phase of the algorithm, the compare-exchange operations on pairs of elements are performed simultaneously. Consider the one-element-per-process case. Let n be the number of processes (also the number of elements to be sorted), and assume that the processes are arranged in a one-dimensional array. Element ai initially resides on process Pi for i = 1, 2, ..., n. During the odd phase, each process with an odd label compare-exchanges its element with the element residing on its right neighbor; the odd-even transposition sort is shown in the following figure. Similarly, during the even phase, each process with an even label compare-exchanges its element with the element of its right neighbor. During each phase of the algorithm, the odd or even processes perform a compare-exchange step with their right neighbors. A total of n such phases are performed; thus, the parallel run time of this formulation is Θ(n).

6) Explain mutual exclusion for shared variables in Pthreads. OR Explain the three types of mutex, normal, recursive and error-check, in the context of Pthreads. [Win 14 (7 marks), Sum 14 (7 marks), Win 13 (7 marks), Sum 13 (7 marks), Win 12 (2 marks), Total: 30 Marks]

If multiple threads access the same shared region at the same time, a race condition can occur. Threading APIs provide mutual-exclusion locks, also known as mutex locks, for such shared data. A mutex has two states, locked and unlocked, and the code that manipulates a shared variable should be protected by the lock associated with it.
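The mutex idea above can be illustrated with Python's `threading.Lock`, which plays the same role as a Pthreads mutex (lock, critical section, unlock). This is an analogue in Python rather than actual Pthreads code, and it is not part of the original notes:

```python
# Protecting a shared variable with a mutex: without the lock, the
# concurrent increments below could interleave and lose updates.
import threading

counter = 0
mutex = threading.Lock()

def increment(times):
    global counter
    for _ in range(times):
        with mutex:          # acquire ... critical section ... release
            counter += 1

threads = [threading.Thread(target=increment, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # -> 40000
```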

Like Prim's algorithm, Dijkstra's algorithm finds the shortest paths from s to the other vertices of G. It is also greedy; that is, it always chooses an edge to the vertex that appears closest. Comparing this algorithm with Prim's minimum spanning tree algorithm, we see that the two are almost identical. The main difference is that, for each vertex u in (V - VT), Dijkstra's algorithm stores l[u], the minimum cost to reach vertex u from vertex s by means of vertices in VT, whereas Prim's algorithm stores d[u], the cost of the minimum-cost edge connecting a vertex in VT to u. The run time of Dijkstra's algorithm is Θ(n²).

Parallel Formulation: The parallel formulation of Dijkstra's single-source shortest path algorithm is very similar to the parallel formulation of Prim's algorithm for minimum spanning trees. The weighted adjacency matrix is partitioned using the 1-D block mapping: each of the p processes is assigned n/p consecutive columns of the weighted adjacency matrix and computes n/p values of the array l. During each iteration, all processes perform computation and communication similar to that performed by the parallel formulation of Prim's algorithm. Consequently, the parallel performance and scalability of Dijkstra's single-source shortest path algorithm are identical to those of Prim's minimum spanning tree algorithm.

9) Differentiate blocking and non-blocking message passing operations. OR Explain the blocking message passing send and receive operations. OR Discuss buffered non-blocking and non-buffered non-blocking send/receive message passing operations. [Win 14 (7 marks), Sum 14 (7 marks), Win 12 (7 marks), Sum 12 (4 marks), Total: 25 Marks]

Interactions among the processes of a parallel computer are performed by sending and receiving messages. The prototype declarations for send and receive are as follows:

send(void *send_buf, int n_elems, int destination)
receive(void *recv_buf, int n_elems, int source)

send_buf is a pointer to the buffer that contains the data to be sent.
For send, n_elems is the number of elements to be sent from the buffer, and destination is the identifier of the process that receives the data.

recv_buf is a pointer to the buffer that stores the received data. For receive, n_elems is the number of elements to be received into the buffer, and source is the identifier of the process that sends the data. The following example illustrates how one process sends a piece of data to another:

P0:
a = 10;
send(&a, 1, 1);
a = 0;

P1:
receive(&a, 1, 0);
printf("%d\n", a);

In the code above, process P0 sends the value of variable a to process P1 and, immediately after the send, changes the value of a to zero. P1 receives the value of a from P0 and then prints it. According to the semantics of the program, P1 should receive 10, not 0. Message passing platforms have additional hardware, such as DMA engines and network interfaces, to support these operations: DMA and the network interface allow a message to be transferred from buffer memory to the destination without involving the CPU.

10) Explain Cache Coherence in Multiprocessor Systems. OR Explain the invalidate protocol used for cache coherence in multiprocessor systems. OR What is the meaning of memory latency? How memory latency
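The P0/P1 example above can be simulated with Python threads, using a queue as a stand-in for a blocking send/receive pair. This sketch is illustrative only (Python, not the message-passing API in the notes): because the send copies the value out before P0 overwrites it, the receiver always observes 10.

```python
# Simulating blocking send/receive semantics: channel.put copies the
# value ("send"), channel.get blocks until it arrives ("receive").
import queue
import threading

channel = queue.Queue()
received = []

def p0():
    a = 10
    channel.put(a)     # send(&a, 1, 1): the value leaves P0 before...
    a = 0              # ...P0 overwrites its local copy.

def p1():
    a = channel.get()  # receive(&a, 1, 0): blocks until the value arrives
    received.append(a)

t0, t1 = threading.Thread(target=p0), threading.Thread(target=p1)
t1.start(); t0.start()
t0.join(); t1.join()
print(received[0])  # -> 10, as the program semantics require
```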

can be improved by cache? [Sum 14 (7 marks), Win 13 (7 marks), Sum 12 (7 marks), Total: 21 Marks]

Memory latency: At the logical level, a memory system, possibly consisting of multiple levels of caches, takes in a request for a memory word and returns a block of data of size b containing the requested word after l nanoseconds. Here, l is referred to as the latency of the memory.

Example: Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns (no caches). Assume that the processor has two multiply-add units and is capable of executing four instructions in each cycle of 1 ns; the peak processor rating is therefore 4 GFLOPS. Since the memory latency is 100 cycles and the block size is one word, every time a memory request is made the processor must wait 100 cycles before it can process the data. Consider the problem of computing the dot product of two vectors. A dot-product computation performs one multiply-add on a single pair of vector elements, i.e., each floating point operation requires one data fetch. The peak speed of this computation is therefore limited to one floating point operation every 100 ns, or 10 MFLOPS, which is far below the peak processor rating.

Solution: Handling the mismatch between processor and DRAM speeds has motivated a number of architectural innovations in memory system design. One such innovation addresses the speed mismatch by placing a smaller, faster memory between the processor and the DRAM. This memory, referred to as the cache, acts as low-latency, high-bandwidth storage. The data needed by the processor is first fetched into the cache, and all subsequent accesses to data items residing in the cache are serviced by the cache. Thus, in principle, if a piece of data is repeatedly used, the effective latency of the memory system can be reduced by the cache.
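The 10 MFLOPS bound in the example above follows from simple arithmetic, sketched here in Python (illustrative only, not from the notes):

```python
# With one data fetch per floating point operation and no cache, the
# achievable rate is bounded by the DRAM latency per fetch.
LATENCY_NS = 100       # DRAM latency for one fetch, from the example
FLOPS_PER_FETCH = 1    # dot product: one FLOP per element fetched

def bounded_mflops(latency_ns, flops_per_fetch):
    ops_per_second = flops_per_fetch / (latency_ns * 1e-9)
    return ops_per_second / 1e6

print(round(bounded_mflops(LATENCY_NS, FLOPS_PER_FETCH), 6))  # -> 10.0
```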
In our example of a 1 GHz processor with 100 ns latency DRAM, suppose we introduce a cache of size 32 KB with a latency of 1 ns (one cycle). For computations that repeatedly reuse data resident in the cache, this corresponds to a peak computation rate of approximately 303 MFLOPS; although this is still less than 10% of the peak processor performance, the example shows that by placing a small cache memory we are able to improve processor utilization considerably.

11) Define the Isoefficiency function and derive its equation. [Win 14 (7 marks), Sum 13 (7 marks), Sum 12 (7 marks), Total: 21 Marks]

Parallel execution time can be expressed as a function of the problem size W, the overhead function To, and the number of processing elements p. We can write the parallel runtime as

TP = (W + To(W, p)) / p

The resulting expression for speedup is

S = W / TP = W·p / (W + To(W, p))

Finally, we write the expression for efficiency as

E = S / p = 1 / (1 + To(W, p) / W)

In the equation for E, if the problem size is kept constant and p is increased, the efficiency decreases because the total overhead To increases with p. If W is increased keeping the number of processing elements fixed, then for scalable parallel systems the efficiency increases, because To grows slower than Θ(W) for a fixed p. For these parallel systems, efficiency can be maintained at a desired value (between 0 and 1) for increasing p, provided W is also increased.

For different parallel systems, W must be increased at different rates with respect to p in order to maintain a fixed efficiency. For instance, in some cases W might need to grow as an exponential function of p to keep the efficiency from dropping as p increases. Such parallel systems are poorly scalable: on them it is difficult to obtain good speedups for a large number of processing elements unless the problem size is enormous. On the other hand, if W needs to grow only linearly with respect to p, then the parallel system is highly scalable, because it can easily deliver speedups proportional to the number of processing elements for reasonable problem sizes.

For scalable parallel systems, efficiency can be maintained at a fixed value (between 0 and 1) if the ratio To/W in the equation for E is maintained at a constant value. For a desired efficiency value E, rearranging E = 1 / (1 + To(W, p)/W) gives To(W, p)/W = (1 - E)/E, that is,

W = (E / (1 - E)) · To(W, p)

Let K = E/(1 - E) be a constant depending on the efficiency to be maintained. Since To is a function of W and p, the equation above can be rewritten as

W = K · To(W, p)

The problem size W can usually be obtained as a function of p from this equation by algebraic manipulations.
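The derivation above can be checked numerically, as in this illustrative Python sketch (not from the notes; the overhead function To(W, p) = p·log2(p) is an assumed example, such as arises when adding n numbers on p processes):

```python
# If the problem size grows as W = K * To(W, p) with K = E/(1 - E),
# the efficiency E = 1 / (1 + To/W) stays fixed as p increases.
import math

def efficiency(W, To):
    return 1.0 / (1.0 + To / W)

E_target = 0.8
K = E_target / (1 - E_target)        # K = 4 for E = 0.8

for p in (4, 16, 64, 256):
    To = p * math.log2(p)            # assumed overhead function
    W = K * To                       # grow W per the isoefficiency function
    print(p, round(efficiency(W, To), 3))  # efficiency stays at 0.8
```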

This function dictates the growth rate of W required to keep the efficiency fixed as p increases. We call this function the isoefficiency function of the parallel system. The isoefficiency function determines the ease with which a parallel system can maintain a constant efficiency and hence achieve speedups increasing in proportion to the number of processing elements.

12) Explain Matrix Multiplication using the DNS Algorithm. [Win 14 (7 marks), Sum 13 (7 marks), Sum 12 (7 marks), Total: 21 Marks]

The DNS algorithm can use up to n³ processes to multiply two n x n matrices in time Θ(log n); it can also be adapted to use Θ(n³/log n) processes while retaining this run time. The algorithm is known as the DNS algorithm because it is due to Dekel, Nassimi, and Sahni.

Parallel formulation: The process arrangement can be visualized as n planes of n x n processes each, where each plane corresponds to a different value of k. Initially, as shown in the figure, the matrices are distributed among the n² processes of the plane corresponding to k = 0 at the base of the three-dimensional process array; process Pi,j,0 initially owns A[i, j] and B[i, j]. The figure shows the communication steps in the DNS algorithm while multiplying 4 x 4 matrices A and B on 64 processes: the shaded processes in part (c) store elements of the first row of A, and the shaded processes in part (d) store elements of the first column of B.

The vertical column of processes Pi,j,* computes the dot product of row A[i, *] and column B[*, j]. Therefore, rows of A and columns of B need to be moved appropriately so that each vertical column of processes Pi,j,* has row A[i, *] and column B[*, j]; more precisely, process Pi,j,k should have A[i, k] and B[k, j].

For matrix A:

The communication pattern for distributing the elements of matrix A among the processes is shown in Figures (a)-(c). First, each column of A moves to a different plane, such that the j-th column occupies the same position in the plane corresponding to k = j as it initially did in the plane corresponding to k = 0; the distribution of A after moving A[i, j] from Pi,j,0 to Pi,j,j is shown in Figure (b). Next, all the columns of A are replicated n times in their respective planes by a parallel one-to-all broadcast along the j axis. The result of this step is shown in Figure (c), in which the n processes Pi,0,j, Pi,1,j, ..., Pi,n-1,j receive a copy of A[i, j] from Pi,j,j. At this point, each vertical column of processes Pi,j,* has row A[i, *]; more precisely, process Pi,j,k has A[i, k].

For matrix B: The communication steps are similar, but the roles of i and j in the process subscripts are switched. In the first one-to-one communication step, B[i, j] is moved from Pi,j,0 to Pi,j,i. Then it is broadcast from Pi,j,i among P0,j,i, P1,j,i, ..., Pn-1,j,i. The distribution of B after this one-to-all broadcast along the i axis is shown in Figure (d). At this point, each vertical column of processes Pi,j,* has column B[*, j]; process Pi,j,k now has B[k, j] in addition to A[i, k].

After these communication steps, A[i, k] and B[k, j] are multiplied at Pi,j,k. Each element C[i, j] of the product matrix is then obtained by an all-to-one reduction along the k axis: process Pi,j,0 accumulates the results of the multiplication from processes Pi,j,1, ..., Pi,j,n-1. The figure shows this step for C[0, 0].

The three main communication steps are: (1) moving the columns of A and the rows of B to their respective planes, (2) performing a one-to-all broadcast along the j axis for A and along the i axis for B, and (3) an all-to-one reduction along the k axis. All these operations are performed within groups of n processes and take time Θ(log n).
Thus, the parallel run time for multiplying two n x n matrices using the DNS algorithm on n³ processes is Θ(log n).
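The DNS data placement described above can be simulated sequentially in Python. This sketch is illustrative only and not from the notes: the 3-D list D stands in for the n x n x n process grid, with D[k][i][j] holding the product computed by "process" Pi,j,k after the alignment and broadcast steps.

```python
# Sequential simulation of the DNS algorithm's data placement:
# after alignment and broadcast, P(i,j,k) holds A[i][k] and B[k][j];
# summing along the k axis (the all-to-one reduction) yields C[i][j].
def dns_multiply(A, B):
    n = len(A)
    # Multiplication phase: every "process" computes one scalar product.
    D = [[[A[i][k] * B[k][j] for j in range(n)] for i in range(n)]
         for k in range(n)]
    # All-to-one reduction along k accumulates the result into P(i,j,0).
    return [[sum(D[k][i][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(dns_multiply(A, B))  # -> [[19, 22], [43, 50]]
```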