Parallel Processing IMP Questions


Question index (the marks each question carried in recent Winter and Summer papers are listed with the answers that follow):

- What is Data Decomposition? Explain Data Decomposition with proper example. OR What is decomposition of task? Explain data decomposition in detail. OR Explain Recursive Decomposition technique to find the minimum number from an array. Draw the task dependency graph for the following data: 4, 9, 2, 6, 1, 7, 8, 11, 5, 3, 2, 12
- Explain Bitonic sort with example. OR Discuss mapping of the bitonic sort algorithm to a hypercube and a mesh. OR Write two rules for a bitonic sequence in a bitonic sorting network and explain the same with an example. Briefly discuss bitonic sort and trace the following sequence using the same: 3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0
- Explain Prim's algorithm for minimum spanning tree. OR Explain the parallel algorithm for Prim's algorithm and compare its complexity with the sequential algorithm for the same.
- Explain the Odd-Even Transposition sort algorithm. OR Write and explain the algorithm for Odd-Even Transposition sort. Also sort the following data: 1, 3, 8, 2, 9, 4, 6, 5. OR Explain odd-even sort in a parallel environment and comment on its limitations.
- In context of Pthreads, explain normal, recursive and error-check mutexes. OR Write a short note on how mutex locks and condition variables are used for synchronizing shared data in shared-address-space programming in Pthreads. OR Explain mutual exclusion for shared variables in Pthreads.
- Enlist various performance metrics for parallel systems. Explain Speedup, Efficiency and total parallel overhead in brief.
- Explain Matrix Multiplication using the DNS algorithm.
- Explain thread creation, termination and cancellation in detail in a shared-address-space parallel system. OR Explain the following functions with respect to the Pthreads API and discuss their arguments: i. pthread_create() ii. pthread_join()
- Explain parallel formulations of Dijkstra's algorithm.
- Discuss buffered non-blocking and non-buffered non-blocking send/receive message passing operations with suitable diagrams.

- Explain Cache Coherence in Multiprocessor Systems. OR Explain the invalidate protocol used for cache coherence in a multiprocessor system. OR What is the meaning of memory latency? How can memory latency be improved by a cache?
- Define the Isoefficiency function and derive its equation.

1) What is Data Decomposition? Explain Data Decomposition with proper example. OR What is decomposition of task? Explain data decomposition in detail. OR Explain Recursive Decomposition technique to find the minimum number from an array. Draw the task dependency graph for the following data: 5, 12, 11, 1, 10, 6, 8, 3, 7, 4, 9, 2. [Sum 14 (14 marks), Win 13 (7 marks), Sum 13 (7 marks), Win 12 (7 marks), Sum 12 (7 marks), Total: 42 marks]

1. Recursive Decomposition:
This decomposition technique uses the divide-and-conquer strategy. A problem is solved by first dividing it into a set of independent subproblems; each subproblem is then solved by recursively applying a similar division into smaller subproblems, and so on.

Example: Quicksort. An array A of n elements is sorted using quicksort. Quicksort selects a pivot element x and partitions A into two sub-arrays A0 and A1 such that all the elements in A0 are smaller than x and all the elements in A1 are greater than or equal to x. This partitioning step forms the divide step of the algorithm. Each of the subsequences A0 and A1 is then sorted by recursively calling quicksort, and each of these recursive calls further partitions its sub-array.

2. Data Decomposition:
Decomposition of the computation is done in two steps: first, the data on which the computations are performed is partitioned; second, this partitioning of the data is used to partition the computation into tasks.

Partitioning Output Data:
In some computations each element of the output can be computed independently of the others, so partitioning the output data automatically decomposes the problem into tasks.

Example: Matrix multiplication. Consider the problem of multiplying two n x n matrices A and B to yield a matrix C. The figure shows a decomposition of this problem into four tasks: the output matrix C is partitioned into four sub-matrices, and each of the four tasks computes one of these sub-matrices. A small sketch of this decomposition is given below.
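As a rough illustration of output-data decomposition (a sketch with assumed names, not part of the original answer), the four sub-matrix tasks of a 2 x 2 block partitioning of C can be written as four independent calls; each call reads shared inputs but writes a disjoint part of C:

#include <stdio.h>

#define N 4          /* full matrix dimension (assumed even)               */
#define HALF (N / 2) /* block (sub-matrix) dimension                       */

/* One task: compute the HALF x HALF block of C whose top-left corner is
 * (ci, cj).  It needs row-block ci of A and column-block cj of B.         */
static void block_multiply(const double A[N][N], const double B[N][N],
                           double C[N][N], int ci, int cj)
{
    for (int i = ci; i < ci + HALF; i++)
        for (int j = cj; j < cj + HALF; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

int main(void)
{
    double A[N][N], B[N][N], C[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = i + j; B[i][j] = i - j; }

    /* The four independent tasks induced by partitioning the output C.    */
    block_multiply(A, B, C, 0, 0);        /* task 1: C11 */
    block_multiply(A, B, C, 0, HALF);     /* task 2: C12 */
    block_multiply(A, B, C, HALF, 0);     /* task 3: C21 */
    block_multiply(A, B, C, HALF, HALF);  /* task 4: C22 */

    printf("C[0][0] = %g\n", C[0][0]);
    return 0;
}

In a parallel setting each of the four calls would be assigned to a different process or thread, since they write disjoint sub-matrices of C.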

Partitioning Input Data:
In computations such as finding the minimum, the maximum or the sum of an array, the output is a single, initially unknown value, so the output cannot be partitioned. In such cases it is possible to partition the input data: a task is created for each partition of the input data, and each task performs as much computation as possible using its local data. Since these tasks do not solve the problem directly, a follow-up computation is needed to combine their partial results.

The problem of determining the minimum of the set of items {4, 9, 1, 7, 8, 11, 12, 2} can be decomposed based on a partitioning of the input data. Figure (task-dependency tree): the leaf tasks compute the minima of the pairs (4, 9), (1, 7), (8, 11) and (12, 2); the next level combines the partial results (4, 1) and (8, 2); the root task combines (1, 2) and yields the overall minimum 1.

Partitioning Both Input and Output Data:
In some cases in which it is possible to partition the output data, partitioning the input data as well can offer additional concurrency.

Example: a relational database of vehicles processing the following query:
MODEL="Civic" AND YEAR="2001" AND (COLOR="Green" OR COLOR="White")

Partitioning Intermediate Data:
Algorithms are often structured as multi-stage computations in which the output of one stage is the input to the next. A decomposition of such an algorithm can be derived by partitioning the input or the output data of an intermediate stage, and partitioning intermediate data can sometimes lead to higher concurrency than partitioning the original input or output data.

Let us revisit matrix multiplication to illustrate a decomposition based on partitioning intermediate data. The decomposition induced by a 2 x 2 partitioning of the output matrix C has a maximum degree of concurrency of four. The degree of concurrency can be increased by introducing an intermediate stage in which eight tasks compute their respective product sub-matrices and store the results in a temporary three-dimensional matrix D, as shown in the figure. The sub-matrix D[k, i, j] is the product of A[i, k] and B[k, j], so a partitioning of the intermediate matrix D induces a decomposition into eight tasks. After the multiplication phase, a relatively inexpensive matrix addition step computes the result matrix C.

2) Explain bitonic sort. OR Discuss mapping of the bitonic sort algorithm to a hypercube and a mesh. OR Write two rules for a bitonic sequence in a bitonic sorting network and explain the same with an example. Briefly discuss bitonic sort and trace the following sequence using the same: 3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0. [Sum 14 (14 marks), Win 13 (7 marks), Sum 13 (7 marks), Win 12 (7 marks), Sum 12 (7 marks), Total: 42 marks]

Ans 1 - Bitonic sort:
A bitonic sorting network sorts n elements in O(log^2 n) time. The key operation of the bitonic sorting network is the rearrangement of a bitonic sequence into a sorted sequence. A bitonic sequence is a sequence of elements <a0, a1, ..., a(n-1)> with the property that either:
(1) there exists an index i, 0 <= i <= n-1, such that <a0, ..., ai> is monotonically increasing and <a(i+1), ..., a(n-1)> is monotonically decreasing, or
(2) there exists a cyclic shift of indices so that (1) is satisfied.

The method that rearranges a bitonic sequence into monotonically increasing order is called bitonic sort. Let s = <a0, a1, ..., a(n-1)> be a bitonic sequence such that a0 <= a1 <= ... <= a(n/2 - 1) and a(n/2) >= a(n/2 + 1) >= ... >= a(n-1). Consider the following two subsequences of s:

s1 = <min{a0, a(n/2)}, min{a1, a(n/2 + 1)}, ..., min{a(n/2 - 1), a(n-1)}>
s2 = <max{a0, a(n/2)}, max{a1, a(n/2 + 1)}, ..., max{a(n/2 - 1), a(n-1)}>

In s1 there is an element bi = min{ai, a(n/2 + i)} such that all the elements before bi are from the increasing part of the original sequence and all the elements after bi are from the decreasing part. Similarly, in s2 there is an element bi' = max{ai, a(n/2 + i)} such that all the elements before bi' are from the decreasing part of the original sequence and all the elements after it are from the increasing part. Thus, the sequences s1 and s2 are themselves bitonic. Furthermore, every element of the first sequence is smaller than or equal to every element of the second sequence, because bi is greater than or equal to all elements of s1, bi' is less than or equal to all elements of s2, and bi' is greater than or equal to bi.

We can therefore sort (rearrange) a bitonic sequence of size n by rearranging the two smaller bitonic sequences and concatenating them. The operation of splitting a bitonic sequence of size n into the two bitonic sequences s1 and s2 is called a bitonic split. We can recursively obtain shorter bitonic subsequences using the equations for s1 and s2 until we obtain subsequences of size one. The number of splits required to rearrange a bitonic sequence into a sorted sequence is log n. This procedure of sorting a bitonic sequence using bitonic splits is called the bitonic merge, shown in the figure.

Example: The bitonic merge is easy to implement on a network of comparators, called a bitonic merging network. This network contains log n columns; each column contains n/2 comparators and performs one step of the bitonic merge. The network takes a bitonic sequence as input and produces a sorted sequence as output. A compact serial sketch of this procedure is given below.
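The following is an illustrative serial implementation of the bitonic split and merge described above (an assumption-level sketch, not taken from the question paper; it assumes n is a power of two and sorts in increasing order when dir is 1):

#include <stdio.h>

/* Compare-exchange: keep the smaller value at index i when dir != 0.       */
static void compare_exchange(int a[], int i, int j, int dir)
{
    if ((dir && a[i] > a[j]) || (!dir && a[i] < a[j])) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

/* Bitonic merge: a[lo..lo+n-1] is bitonic; log n split steps sort it.       */
static void bitonic_merge(int a[], int lo, int n, int dir)
{
    if (n > 1) {
        int m = n / 2;
        for (int i = lo; i < lo + m; i++)
            compare_exchange(a, i, i + m, dir);   /* one bitonic split       */
        bitonic_merge(a, lo, m, dir);
        bitonic_merge(a, lo + m, m, dir);
    }
}

/* Bitonic sort: build a bitonic sequence (ascending half, descending half),
 * then merge it into a fully sorted sequence.                               */
static void bitonic_sort(int a[], int lo, int n, int dir)
{
    if (n > 1) {
        int m = n / 2;
        bitonic_sort(a, lo, m, 1);        /* sort first half ascending       */
        bitonic_sort(a, lo + m, m, 0);    /* sort second half descending     */
        bitonic_merge(a, lo, n, dir);
    }
}

int main(void)
{
    int a[] = { 3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0 };
    int n = (int)(sizeof a / sizeof a[0]);
    bitonic_sort(a, 0, n, 1);
    for (int i = 0; i < n; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}

The exam sequence used here is already bitonic (it increases up to 95 and then decreases), so tracing it through bitonic_merge alone with dir = 1 already yields the sorted order 0, 3, 5, ..., 95.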

If we replace the same merging network with decreasing comparators, the input bitonic sequence is sorted in monotonically decreasing order.

Ans 2 - Mapping Bitonic Sort to a Hypercube and a Mesh:

1. Hypercube:
In this mapping, each of the n processes contains one element of the input sequence; graphically, each wire of the bitonic sorting network represents a distinct process. During each step of the algorithm, the compare-exchange operations performed by a column of comparators are performed by n/2 pairs of processes. If the mapping is poor, the elements travel a long distance before they can be compared; ideally, wires that perform a compare-exchange should be mapped onto neighbouring processes. In any step, a compare-exchange operation is performed between two wires only if their labels differ in exactly one bit, so processes are paired for their compare-exchange steps along the dimensions of a d-dimensional hypercube (that is, p = 2^d).

In the final stage of bitonic sort, the input has been converted into a bitonic sequence, and two rules govern this stage. First, processes that differ only in the d-th bit of the binary representation of their labels (that is, the most significant bit) compare-exchange their elements, so the compare-exchange operation takes place between processes along the d-th dimension. Second, in the next step of the stage, the compare-exchange operation takes place among processes along the (d - 1)-th dimension, and so on. The figure shows the last stage of the process.

The bitonic sort algorithm for a hypercube is shown below. The algorithm relies on the functions comp_exchange_max(i) and comp_exchange_min(i), which compare the local element with the element on the nearest process along the i-th dimension and retain either the maximum or the minimum of the two elements, respectively.

procedure BITONIC_SORT(label, d)
begin
    for i := 0 to d - 1 do
        for j := i downto 0 do
            if (i + 1)st bit of label != jth bit of label then
                comp_exchange_max(j);
            else
                comp_exchange_min(j);
end BITONIC_SORT

2. Mesh:
The connectivity of a mesh is lower than that of a hypercube, so it is impossible to map wires to processes such that each compare-exchange operation occurs only between neighbouring processes. There are several ways to map the input wires onto the mesh processes; some of these are illustrated in the figure, in which each process is labelled by the wire that is mapped onto it.

Figure (a) shows the row-major mapping, (b) the row-major snakelike mapping, and (c) the row-major shuffled mapping. The compare-exchange steps of the last stage of bitonic sort for the row-major shuffled mapping are shown in the following figure.

3) Explain parallel algorithm for Prim's algorithm and compare its complexity with the sequential algorithm for the same. [Win 14 (7 marks), Sum 14 (7 marks), Win 13 (7 marks), Sum 13 (7 marks), Win 12 (7 marks), Sum 12 (7 marks), Total: 42 marks]

A minimum spanning tree (MST) of a weighted undirected graph is a spanning tree with minimum weight. If G is not connected, it cannot have a spanning tree. Prim's algorithm for finding an MST is a greedy algorithm.

The algorithm begins by selecting an arbitrary starting vertex. It then grows the minimum spanning tree by choosing a new vertex and edge that are guaranteed to be in a spanning tree of minimum cost, and it continues until all the vertices have been selected.

Let G = (V, E, w) be the weighted undirected graph for which the minimum spanning tree is to be found, and let A = (a[i, j]) be its weighted adjacency matrix. The algorithm uses the set V_T to hold the vertices of the minimum spanning tree during its construction. It also uses an array d[1..n] in which, for each vertex v in (V - V_T), d[v] holds the weight of the least-weight edge from any vertex in V_T to vertex v.

In the parallel formulation, each process Pi computes d_i[u] = min { d_i[v] | v in (V - V_T) assigned to Pi } during each iteration of the while loop. The global minimum is then obtained over all d_i[u] by an all-to-one reduction operation and stored on P0. P0 then inserts the selected vertex u into V_T and broadcasts u to all processes by a one-to-all broadcast operation. A sketch of this reduce-and-broadcast step is given below.
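The reduce-and-broadcast step just described can be sketched in C with MPI; the answer above only speaks of generic all-to-one reduction and one-to-all broadcast, so the use of MPI and the data-layout names below (in_tree, global_index) are assumptions for illustration. MPI_Allreduce with MPI_MINLOC performs the reduction and the broadcast in a single call; the function is meant to be called from within an already initialized MPI program:

#include <mpi.h>
#include <float.h>

/* One iteration of parallel Prim's algorithm (illustrative sketch).
 * Each process owns n/p columns of the adjacency matrix and the matching
 * slice d_local[] of the distance array.                                   */
struct minloc { double weight; int vertex; };

int select_next_vertex(const double *d_local, const int *global_index,
                       const int *in_tree, int local_n)
{
    struct minloc local = { DBL_MAX, -1 }, global;

    /* Locally closest vertex not yet in V_T.                               */
    for (int v = 0; v < local_n; v++)
        if (!in_tree[v] && d_local[v] < local.weight) {
            local.weight = d_local[v];
            local.vertex = global_index[v];
        }

    /* All-to-one reduction plus one-to-all broadcast in one step.          */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MINLOC,
                  MPI_COMM_WORLD);

    return global.vertex;   /* vertex u inserted into V_T this iteration    */
}

After this call every process knows u and updates its own part of the d vector using its local columns of the adjacency matrix, exactly as the parallel formulation below describes.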

Parallel formulation:
The algorithm works in n outer iterations, and it is hard to execute these iterations concurrently; the inner loop, however, is relatively easy to parallelize. Let p be the number of processes and let n be the number of vertices.

Figure: the partitioning of the distance array d and the adjacency matrix A among p processes. The adjacency matrix is partitioned in a 1-D block fashion, with the distance vector d partitioned accordingly. In each step, a process selects the locally closest vertex, followed by a global reduction to select the globally closest vertex. This vertex is inserted into the MST, and the choice is broadcast to all processes. Each process then updates its part of the d vector locally.

The time complexities of the operations in one iteration, and of the whole algorithm, are:
- cost to select the minimum entry: O(n/p + log p)
- cost of a broadcast: O(log p)
- cost of the local update of the d vector: O(n/p)
- parallel time per iteration: O(n/p + log p)
- total parallel time: O(n^2/p + n log p)
- corresponding isoefficiency: O(p^2 log^2 p)

4) Enlist various performance metrics for parallel systems. Explain Speedup, Efficiency and total parallel Overhead in brief. [Win 14 (7 marks), Sum 14 (7 marks), Sum 13 (7 marks), Win 12 (7 marks), Sum 12 (7 marks), Total: 35 marks]

A number of metrics are used, based on the desired outcome of performance analysis.

1. Execution Time: The serial runtime of a program is the time elapsed between the beginning and the end of its execution on a sequential computer. The parallel runtime is the time that elapses from the moment a parallel computation starts to the moment the last processing element finishes execution. We denote the serial runtime by T_S and the parallel runtime by T_P.

2. Total Parallel Overhead: The overheads incurred by a parallel program are encapsulated into a single expression referred to as the overhead function. We define the overhead function, or total overhead, of a parallel system as the total time collectively spent by all the processing elements over and above that required by the fastest known sequential algorithm for solving the same problem on a single processing element. We denote the overhead function by T_o. The total time spent in solving a problem, summed over all processing elements, is p T_P; T_S units of this time are spent performing useful work, and the remainder is overhead. Therefore, the overhead function is given by

    T_o = p T_P - T_S

3. Speedup: When evaluating a parallel system, we are often interested in how much performance gain is achieved by parallelizing a given application over a sequential implementation. Speedup is a measure that captures the relative benefit of solving a problem in parallel. It is defined as the ratio of the time taken to solve a problem on a single processing element to the time required to solve the same problem on a parallel computer with p identical processing elements, i.e. S = T_S / T_P. Only an ideal parallel system containing p processing elements can deliver a speedup equal to p; in practice, ideal behaviour is not achieved because, while executing a parallel algorithm, the processing elements cannot devote 100% of their time to the computations of the algorithm.

4. Efficiency: Efficiency is a measure of the fraction of time for which a processing element is usefully employed; it is defined as the ratio of speedup to the number of processing elements.
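A small worked example of these three metrics (the numbers are illustrative assumptions, not taken from any paper): suppose the best serial runtime is T_S = 100 ms and a run on p = 4 processing elements takes T_P = 30 ms. Then

    S   = T_S / T_P   = 100 / 30        ≈ 3.33
    E   = S / p       = 3.33 / 4        ≈ 0.83
    T_o = p T_P - T_S = 4 x 30 - 100    = 20 ms

so the four processing elements together spend 20 ms on communication, idling and other overhead beyond the useful work.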

In an ideal parallel system, speedup is equal to p and efficiency is equal to one. In practice, speedup is less than p and efficiency is between zero and one, depending on the effectiveness with which the processing elements are utilized. We denote efficiency by the symbol E; mathematically, E = S / p.

5) Explain odd-even sort in parallel environment and comment on its limitations. OR Discuss Odd-Even Transposition sort. OR Write and explain algorithm for Odd-Even Transposition sort. Also sort the following data: 3, 2, 3, 8, 5, 6, 4, 1. [Win 14 (7 marks), Sum 14 (7 marks), Sum 13 (7 marks), Win 12 (7 marks), Sum 12 (7 marks), Total: 35 marks]

The odd-even transposition algorithm sorts n elements in n phases (n is even), each of which requires n/2 compare-exchange operations. The algorithm alternates between two kinds of phases, called odd and even phases. Let <a1, a2, ..., an> be the sequence to be sorted. During an odd phase, elements with odd indices are compared with their right neighbours and exchanged if they are out of order; thus the pairs (a1, a2), (a3, a4), ..., (a(n-1), an) are compare-exchanged (assuming n is even). Similarly, during an even phase, elements with even indices are compared with their right neighbours. After n phases of odd-even exchanges, the sequence is sorted. During each phase of the algorithm, the compare-exchange operations on pairs of elements are performed simultaneously.

Example: Consider the one-element-per-process case. Let n be the number of processes (which is also the number of elements to be sorted), and assume that the processes are arranged in a one-dimensional array. Element ai initially resides on process Pi for i = 1, 2, ..., n. During an odd phase, each process with an odd label compare-exchanges its element with the element residing on its right neighbour; similarly, during an even phase, each process with an even label compare-exchanges its element with the element of its right neighbour. This parallel formulation is presented in the following algorithm.

procedure ODD-EVEN_PAR(n)
begin
    id := process's label
    for i := 1 to n do
        if i is odd then
            if id is odd then compare-exchange_min(id + 1);
            else compare-exchange_max(id - 1);
        if i is even then
            if id is even then compare-exchange_min(id + 1);
            else compare-exchange_max(id - 1);
    end for
end ODD-EVEN_PAR
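A serial sketch of the same compare-exchange pattern (an illustration only; in the parallel formulation above each element lives on its own process and the swap becomes a compare-exchange message step). Indices here are 0-based, so an odd phase handles the pairs (a[0], a[1]), (a[2], a[3]), ..., which correspond to (a1, a2), (a3, a4), ... in the text:

#include <stdio.h>

static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* Odd-even transposition sort: n phases, alternating odd and even phases.  */
static void odd_even_sort(int a[], int n)
{
    for (int phase = 1; phase <= n; phase++) {
        int start = (phase % 2 == 1) ? 0 : 1;
        for (int i = start; i + 1 < n; i += 2)
            if (a[i] > a[i + 1])
                swap(&a[i], &a[i + 1]);      /* compare-exchange of a pair  */
    }
}

int main(void)
{
    int a[] = { 3, 2, 3, 8, 5, 6, 4, 1 };    /* data from the question      */
    int n = (int)(sizeof a / sizeof a[0]);
    odd_even_sort(a, n);
    for (int i = 0; i < n; i++) printf("%d ", a[i]);
    printf("\n");                             /* prints 1 2 3 3 4 5 6 8      */
    return 0;
}

Within one phase all the compare-exchanges of the inner loop are independent of each other; this is exactly what the parallel formulation above performs simultaneously, one element per process.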

Odd-even sort in a parallel environment:
It is easy to parallelize odd-even transposition sort, because during each phase of the algorithm the compare-exchange operations on pairs of elements are performed simultaneously. Consider the one-element-per-process case: let n be the number of processes (also the number of elements to be sorted) and assume that the processes are arranged in a one-dimensional array. Element ai initially resides on process Pi for i = 1, 2, ..., n. During an odd phase, each process with an odd label compare-exchanges its element with the element residing on its right neighbour, as shown in the figure; similarly, during an even phase, each process with an even label compare-exchanges its element with the element of its right neighbour. During each phase, the odd or the even processes perform a compare-exchange step with their right neighbours. A total of n such phases are performed; thus, the parallel run time of this formulation is Θ(n). Since the best sequential sorting time is Θ(n log n), this one-element-per-process formulation is not cost-optimal, which is its main limitation.

6) Explain mutual exclusion for shared variable in Pthreads. OR Explain the three types of mutex, normal, recursive and error check, in context of Pthreads. [Win 14 (7 marks), Sum 14 (7 marks), Win 13 (7 marks), Sum 13 (7 marks), Win 12 (2 marks), Total: 30 marks]

If multiple threads update the same shared data at the same time, a race condition occurs. Threading APIs provide mutual-exclusion locks, also known as mutex locks, to protect such shared data. A mutex has two states, locked and unlocked, and the code that manipulates a shared variable should have a lock associated with it.

A thread that wants to update the value of a shared variable should first acquire the lock on that variable. Since only one lock on the data is allowed at a time, no other thread can lock the same variable at that point in time, and a thread that tries to lock an already locked variable is blocked. Before leaving the critical region, the thread should unlock the variable so that other threads can update its value. Initially, all mutex locks are in the unlocked state.

The Pthreads API provides two functions to lock and unlock a shared variable:

int pthread_mutex_lock(pthread_mutex_t *mutex_lock);
int pthread_mutex_unlock(pthread_mutex_t *mutex_lock);

If a thread successfully locks the mutex, it enters the critical section. If more than one thread is blocked on the mutex, one of them is allowed to enter the critical section when it is unlocked, based on the scheduling policy. The effect of a thread attempting to unlock a mutex that is already unlocked, or that is locked by another thread, is undefined. Before use, a mutex must be initialized to the unlocked state with pthread_mutex_init():

int pthread_mutex_init(pthread_mutex_t *mutex_lock, const pthread_mutexattr_t *lock_attr);

It is possible to reduce idling due to locking with another lock function, pthread_mutex_trylock():

int pthread_mutex_trylock(pthread_mutex_t *mutex_lock);

If the mutex is successfully locked it returns 0; otherwise it returns EBUSY to indicate that the mutex is already locked.

Types of mutex:

1. Normal: This is the default type of locking. Only a single thread is allowed to lock a normal mutex at any point in time. If the thread holding the lock attempts to lock it again, the second locking call results in deadlock. Type constant: PTHREAD_MUTEX_NORMAL_NP.

2. Recursive: A recursive mutex allows a single thread to lock the mutex more than once. Each time the thread locks the mutex, a lock counter is incremented; each unlock decrements the counter. Before another thread can lock this type of mutex, the locking thread must call pthread_mutex_unlock() the same number of times that it called pthread_mutex_lock(). When a thread successfully locks a recursive mutex, it owns that mutex and the lock count is set to 1. Any other thread attempting to lock the mutex blocks until the mutex becomes unlocked. If the owner of the mutex attempts to lock the mutex again, the lock count is incremented and the thread continues running. When the owner unlocks a recursive mutex, the lock count is decremented; the mutex remains locked and owned until the count reaches zero. It is an error for any thread other than the owner to attempt to unlock a recursive mutex. Type constant: PTHREAD_MUTEX_RECURSIVE_NP.

3. Errorcheck: An errorcheck mutex is locked exactly once by a thread, like a normal mutex. If a thread tries to lock the mutex again without first unlocking it, or tries to unlock a mutex it does not own, the thread receives an error. Type constant: PTHREAD_MUTEX_ERRORCHECK_NP. Errorcheck mutexes are therefore more informative than normal mutexes, because a normal mutex simply deadlocks in such a case, leaving the programmer to work out why the thread no longer executes.

The function pthread_mutexattr_settype_np() can be used to set the type of mutex specified by a mutex attributes object:

int pthread_mutexattr_settype_np(pthread_mutexattr_t *attr, int type);

A small usage example of these locking functions is given below.
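The following is a minimal usage sketch of the locking functions described above (an illustrative example, not from the question paper): two threads increment a shared counter, and the mutex guarantees mutual exclusion on the update.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                       /* shared variable            */
static pthread_mutex_t counter_lock;           /* protects counter           */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&counter_lock);     /* enter critical section     */
        counter++;
        pthread_mutex_unlock(&counter_lock);   /* leave critical section     */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_mutex_init(&counter_lock, NULL);   /* default (normal) mutex     */

    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    pthread_mutex_destroy(&counter_lock);
    printf("counter = %ld\n", counter);        /* always 200000 with the lock */
    return 0;
}

Passing a non-NULL attributes object whose type has been set with pthread_mutexattr_settype_np() to pthread_mutex_init() selects the recursive or errorcheck behaviour instead of the default normal mutex.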

7) Explain thread creation, termination and cancellation in detail in shared-address-space parallel system. OR Explain the following functions with respect to the Pthreads API and discuss their arguments: i. pthread_create() ii. pthread_join(). [Win 14 (7 marks), Sum 14 (7 marks), Win 12 (7 marks), Sum 12 (7 marks), Total: 28 marks]

In a multiuser system with a protected environment, full processes are less suitable for fine-grained parallelism. In such an environment, lightweight processes called threads perform faster manipulation of global memory. A thread can be defined as an independent sequential flow of control within a program. The most popular header file for most of the thread functionality is pthread.h. For example, each iteration of a dot-product-based matrix multiplication can be considered a thread, based on the following syntax:

C[i][j] = create_thread(dot_product(get_row(a, i), get_col(b, j)));

Each thread in the above code may execute on a different processor, and each requires access to elements of the matrices A, B and C stored in the shared address space.

Figure: the logical machine model of a thread-based programming paradigm.

Since threads execute small functions, the local variables of threads are treated as global data and are stored in the memory blocks M shown in the figure. As locality of data is important for performance, processors use cache memory to hold local variables. Below we discuss some advantages of this model compared with the message-passing paradigm.

Thread creation:
int pthread_create(pthread_t *thread_id, const pthread_attr_t *attr, void *(*start_function)(void *), void *arg);

Thread join:
int pthread_join(pthread_t thread_id, void **ptr);

Thread termination:
void pthread_exit(void *value_ptr);

A short usage sketch of pthread_create() and pthread_join() is given after the list of advantages below.

Major advantages of thread-based programming:
1. Software portability: threaded programs can be migrated from serial to parallel environments with little change.
2. Scheduling / load balancing: threaded programming exposes a large number of concurrent tasks which can be scheduled explicitly; these tasks are mapped onto multiple processors by dynamic mapping techniques to reduce the overheads of communication and idling.
3. Latency hiding: when one thread is stalled on memory latency, another ready thread can use the CPU.
4. Ease of programming: the POSIX thread (Pthreads) API is a standard development tool for threaded programs, which makes programming with threads comparatively easy.
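The following sketch (an illustrative example with assumed names, not taken from the paper) creates two threads that each compute a partial dot product and then joins them to collect the results:

#include <pthread.h>
#include <stdio.h>

#define N       8
#define THREADS 2

static double a[N], b[N];

struct range { int lo, hi; double sum; };      /* argument and result of one thread */

static void *partial_dot(void *arg)
{
    struct range *r = (struct range *)arg;     /* the last argument of pthread_create() */
    r->sum = 0.0;
    for (int i = r->lo; i < r->hi; i++)
        r->sum += a[i] * b[i];
    return arg;                                /* return value is delivered to join     */
}

int main(void)
{
    pthread_t tid[THREADS];
    struct range part[THREADS];

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 1.0; }

    for (int t = 0; t < THREADS; t++) {
        part[t].lo = t * (N / THREADS);
        part[t].hi = (t + 1) * (N / THREADS);
        /* arguments: thread handle, attributes (NULL = default), start routine, arg */
        pthread_create(&tid[t], NULL, partial_dot, &part[t]);
    }

    double dot = 0.0;
    for (int t = 0; t < THREADS; t++) {
        void *ret;
        pthread_join(tid[t], &ret);            /* block until thread t terminates       */
        dot += ((struct range *)ret)->sum;
    }
    printf("dot = %g\n", dot);                 /* 0 + 1 + ... + 7 = 28                  */
    return 0;
}

pthread_create() takes four arguments: the address where the new thread's identifier is stored, an attributes object (NULL selects the defaults), the start routine, and a single void * argument passed to that routine. pthread_join() blocks until the named thread terminates, either by returning from its start routine or by calling pthread_exit(), and optionally receives the thread's return value through its second argument.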

8) Explain parallel formulation of Dijkstra's algorithm for single source shortest path with an example. [Win 14 (7 marks), Sum 14 (7 marks), Win 12 (7 marks), Sum 12 (7 marks), Total: 28 marks]

For a weighted graph G = (V, E, w), the single-source shortest paths problem is to find the shortest paths from a vertex v in V to all other vertices in V, where a shortest path from u to v is a minimum-weight path. Depending on the application, edge weights may represent time, cost, penalty, loss, or any other quantity that accumulates additively along a path and is to be minimized. Dijkstra's algorithm solves the single-source shortest-paths problem on both directed and undirected graphs with non-negative weights. It finds the shortest paths from a single vertex s and is similar to Prim's minimum spanning tree algorithm.

Dijkstra's single-source shortest path algorithm:
Like Prim's algorithm, it finds the shortest paths from s to the other vertices of G. It is also greedy; that is, it always chooses an edge to the vertex that appears closest. Comparing this algorithm with Prim's minimum spanning tree algorithm, the two are almost identical. The main difference is that, for each vertex u in (V - V_T), Dijkstra's algorithm stores l[u], the minimum cost to reach vertex u from vertex s by means of vertices in V_T, whereas Prim's algorithm stores d[u], the cost of the minimum-cost edge connecting a vertex in V_T to u. The run time of Dijkstra's algorithm is Θ(n^2).

Parallel formulation:
The parallel formulation of Dijkstra's single-source shortest path algorithm is very similar to the parallel formulation of Prim's algorithm for minimum spanning trees. The weighted adjacency matrix is partitioned using a 1-D block mapping: each of the p processes is assigned n/p consecutive columns of the weighted adjacency matrix and computes n/p values of the array l. During each iteration, all processes perform computation and communication similar to that performed by the parallel formulation of Prim's algorithm. Consequently, the parallel performance and scalability of Dijkstra's single-source shortest path algorithm are identical to those of Prim's minimum spanning tree algorithm.

9) Differentiate blocking and non-blocking message passing operations. OR Explain the blocking message passing send and receive operations. OR Discuss buffered non-blocking and non-buffered non-blocking send/receive message passing operations. [Win 14 (7 marks), Sum 14 (7 marks), Win 12 (7 marks), Sum 12 (4 marks), Total: 25 marks]

Interaction among the processes of a parallel computer can be performed by sending and receiving messages. The prototype declarations for send and receive are as follows:

send(void *send_buf, int n_elems, int destination)
receive(void *recv_buf, int n_elems, int source)

send_buf is a pointer to the buffer that contains the data to be sent; for send, n_elems is the number of elements from the buffer to be sent and destination is the identifier of the process which receives the data.

recv_buf is a pointer to the buffer that stores the received data; for receive, n_elems is the number of elements to be received into the buffer and source is the identifier of the process which sends the data.

The following example illustrates how one process sends a piece of data to another process.

P0:
    a = 10;
    send(&a, 1, 1);
    a = 0;

P1:
    receive(&a, 1, 0);
    printf("%d\n", a);

In the above code, process P0 sends the value of variable a to process P1 and, immediately after sending it, changes the value of a to zero. P1 receives the value of a from P0 and then prints it; the semantics of the send require that P1 receive 10, not 0.

Message-passing platforms have additional hardware to support these operations, such as DMA engines and network interfaces. DMA and the network interface allow the transfer of a message from buffer memory to the destination without the involvement of the CPU.
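The question also names non-blocking operations. As an illustration only (MPI and the calls below are an assumption, not part of the answer above), a non-blocking, non-buffered send returns immediately, and the sender must not reuse the buffer until the operation is known to have completed:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, a = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Request req;
        a = 10;
        /* Non-blocking, non-buffered send: returns at once.                */
        MPI_Isend(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* Overwriting a here would be unsafe; wait for completion first.   */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        a = 0;                       /* safe only after MPI_Wait            */
    } else if (rank == 1) {
        MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("%d\n", a);           /* prints 10                           */
    }

    MPI_Finalize();
    return 0;
}

A buffered variant (for example MPI_Bsend or MPI_Ibsend) copies the message into a communication buffer so that the sender may reuse its variable sooner, at the cost of the extra copy.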

10) Explain Cache Coherence in Multiprocessor Systems. OR Explain the invalidate protocol used for cache coherence in a multiprocessor system. OR What is the meaning of memory latency? How can memory latency be improved by a cache? [Sum 14 (7 marks), Win 13 (7 marks), Sum 12 (7 marks), Total: 21 marks]

Memory latency: At the logical level, a memory system, possibly consisting of multiple levels of caches, takes in a request for a memory word and returns a block of data of size b containing the requested word after l nanoseconds. Here, l is referred to as the latency of the memory.

Example: Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns (no caches). Assume that the processor has two multiply-add units and is capable of executing four instructions in each cycle of 1 ns; the peak processor rating is therefore 4 GFLOPS. Since the memory latency is equal to 100 cycles and the block size is one word, every time a memory request is made the processor must wait 100 cycles before it can process the data. Consider the problem of computing the dot product of two vectors: a dot-product computation performs one multiply-add on a single pair of vector elements, i.e. each floating-point operation requires one data fetch. The peak speed of this computation is therefore limited to one floating-point operation every 100 ns, or 10 MFLOPS, which is far below the peak processor rating.

Solution: Handling the mismatch between processor and DRAM speeds has motivated a number of architectural innovations in memory system design. One such innovation addresses the speed mismatch by placing a smaller and faster memory between the processor and the DRAM. This memory, referred to as the cache, acts as low-latency, high-bandwidth storage. The data needed by the processor is first fetched into the cache, and all subsequent accesses to data items residing in the cache are serviced by the cache. Thus, in principle, if a piece of data is repeatedly used, the effective latency of the memory system can be reduced by the cache. If, in the example of the 1 GHz processor with 100 ns DRAM latency, we introduce a cache of size 32 KB with a latency of 1 ns (one cycle), a computation that reuses cached data achieves a peak rate of approximately 303 MFLOPS, which is still less than 10% of the peak processor performance. This example shows that by placing a small cache memory we are able to improve processor utilization considerably.
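The 303 MFLOPS figure can be reproduced under the standard textbook assumption (an assumption added here, since the answer above does not spell it out) that the cached computation is the multiplication of two 32 x 32 matrices, which fit in the 32 KB cache:

    data fetched from DRAM: 2 x 32 x 32 = 2K words, 2K x 100 ns ≈ 200 µs
    computation: 2 x 32^3 = 64K floating-point operations at 4 per ns ≈ 16 µs
    rate ≈ 64K FLOP / 216 µs ≈ 303 MFLOPS

Once the operands are in the cache, the computation is limited by the one-time fetch cost rather than by a DRAM access per operation, which is why the rate rises from 10 MFLOPS to roughly 303 MFLOPS.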

11) Define Isoefficiency function and derive equation of it. [Win 14 (7 marks), Sum 13 (7 marks), Sum 12 (7 marks), Total: 21 marks]

Parallel execution time can be expressed as a function of the problem size W, the overhead function T_o, and the number of processing elements p. We can write the parallel runtime as

    T_P = (W + T_o(W, p)) / p

The resulting expression for speedup is

    S = W / T_P = W p / (W + T_o(W, p))

Finally, we write the expression for efficiency as

    E = S / p = W / (W + T_o(W, p)) = 1 / (1 + T_o(W, p) / W)

In the above equation for E, if the problem size is kept constant and p is increased, the efficiency decreases because the total overhead T_o increases with p. If W is increased keeping the number of processing elements fixed, then for scalable parallel systems the efficiency increases, because T_o grows slower than Θ(W) for a fixed p. For such parallel systems, efficiency can be maintained at a desired value (between 0 and 1) for increasing p, provided W is also increased.

For different parallel systems, W must be increased at different rates with respect to p in order to maintain a fixed efficiency. For instance, in some cases W might need to grow as an exponential function of p to keep the efficiency from dropping as p increases; such parallel systems are poorly scalable, because it is difficult to obtain good speedups on them for a large number of processing elements unless the problem size is enormous. On the other hand, if W needs to grow only linearly with respect to p, then the parallel system is highly scalable, because it can easily deliver speedups proportional to the number of processing elements for reasonable problem sizes.

For scalable parallel systems, efficiency can be maintained at a fixed value (between 0 and 1) if the ratio T_o / W in the equation for E is maintained at a constant value. For a desired value E of efficiency,

    E = 1 / (1 + T_o(W, p) / W)   =>   T_o(W, p) / W = (1 - E) / E

Let K = E / (1 - E) be a constant depending on the efficiency to be maintained. Since T_o is a function of W and p, the above relation can be rewritten as

    W = K T_o(W, p)

From this equation the problem size W can usually be obtained as a function of p by algebraic manipulations.
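As a worked illustration (this example is an added assumption, not part of the answer above), consider adding n numbers on p processing elements. Here W = n and the parallel time is approximately T_P = n/p + 2 log p, so

    T_o = p T_P - W = p (n/p + 2 log p) - n = 2 p log p

Setting W = K T_o gives n = 2 K p log p, i.e. an isoefficiency function of Θ(p log p): to keep the efficiency constant, the problem size must grow in proportion to p log p as p increases.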

The function W = K T_o(W, p) dictates the growth rate of W required to keep the efficiency fixed as p increases. We call this function the isoefficiency function of the parallel system. The isoefficiency function determines the ease with which a parallel system can maintain a constant efficiency and hence achieve speedups increasing in proportion to the number of processing elements.

12) Explain Matrix-Multiplication using DNS Algorithm. [Win 14 (7 marks), Sum 13 (7 marks), Sum 12 (7 marks), Total: 21 marks]

The DNS algorithm can use up to n^3 processes and performs matrix multiplication in time Θ(log n); it can also be made cost-optimal by using Θ(n^3 / log n) processes. It is known as the DNS algorithm because it is due to Dekel, Nassimi and Sahni.

Parallel formulation:
The process arrangement can be visualized as n planes of n x n processes each. Each plane corresponds to a different value of k. Initially, as shown in the figure, the matrices are distributed among the n^2 processes of the plane corresponding to k = 0 at the base of the three-dimensional process array; process P(i,j,0) initially owns A[i, j] and B[i, j]. The figure shows the communication steps in the DNS algorithm while multiplying 4 x 4 matrices A and B on 64 processes: the shaded processes in part (c) store elements of the first row of A, and the shaded processes in part (d) store elements of the first column of B.

Each vertical column of processes P(i,j,*) computes the dot product of row A[i, *] and column B[*, j]. Therefore, rows of A and columns of B need to be moved appropriately so that each vertical column of processes P(i,j,*) has row A[i, *] and column B[*, j]; more precisely, process P(i,j,k) should have A[i, k] and B[k, j].

For matrix A: The communication pattern for distributing the elements of matrix A among the processes is shown in Figure (a)-(c). First, each column of A moves to a different plane such that the j-th column occupies the same position in the plane corresponding to k = j as it initially did in the plane corresponding to k = 0. The distribution of A after moving A[i, j] from P(i,j,0) to P(i,j,j) is shown in Figure (b). Next, all the columns of A are replicated n times in their respective planes by a parallel one-to-all broadcast along the j axis. The result of this step is shown in Figure (c), in which the n processes P(i,0,j), P(i,1,j), ..., P(i,n-1,j) receive a copy of A[i, j] from P(i,j,j). At this point, each vertical column of processes P(i,j,*) has row A[i, *]; more precisely, process P(i,j,k) has A[i, k].

For matrix B: The communication steps are similar, but the roles of i and j in the process subscripts are switched. In the first one-to-one communication step, B[i, j] is moved from P(i,j,0) to P(i,j,i). Then it is broadcast from P(i,j,i) among P(0,j,i), P(1,j,i), ..., P(n-1,j,i). The distribution of B after this one-to-all broadcast along the i axis is shown in Figure (d). At this point, each vertical column of processes P(i,j,*) has column B[*, j]; process P(i,j,k) now has B[k, j] in addition to A[i, k].

After these communication steps, A[i, k] and B[k, j] are multiplied at P(i,j,k). Each element C[i, j] of the product matrix is then obtained by an all-to-one reduction along the k axis; during this step, process P(i,j,0) accumulates the results of the multiplications from processes P(i,j,1), ..., P(i,j,n-1). The figure shows this step for C[0, 0].

The three main communication steps are therefore: (1) moving the columns of A and the rows of B to their respective planes, (2) performing a one-to-all broadcast along the j axis for A and along the i axis for B, and (3) an all-to-one reduction along the k axis. All these operations are performed within groups of n processes and take time Θ(log n). Thus, the parallel run time for multiplying two n x n matrices using the DNS algorithm on n^3 processes is Θ(log n).
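A serial sketch that emulates the n^3 one-element tasks of the DNS decomposition (an illustration only; on a real machine each (i, j, k) entry would be computed by a separate process, and the final loop over k would be performed as the all-to-one reduction along the k axis):

#include <stdio.h>

#define N 4

int main(void)
{
    double A[N][N], B[N][N], C[N][N], D[N][N][N];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = i + 1; B[i][j] = j + 1; }

    /* Multiplication phase: process P(i,j,k) computes one scalar product.   */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                D[i][j][k] = A[i][k] * B[k][j];

    /* Reduction phase: all-to-one reduction along the k axis yields C[i][j]. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                C[i][j] += D[i][j][k];
        }

    printf("C[0][0] = %g\n", C[0][0]);
    return 0;
}

In the parallel algorithm, the first loop nest corresponds to a single multiplication per process after the alignment and broadcast steps, and the k-loop of the second nest is replaced by a Θ(log n) reduction, which gives the Θ(log n) run time quoted above.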


More information

Multiprocessors 2007/2008

Multiprocessors 2007/2008 Multiprocessors 2007/2008 Abstractions of parallel machines Johan Lukkien 1 Overview Problem context Abstraction Operating system support Language / middleware support 2 Parallel processing Scope: several

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Concurrency and Synchronization. ECE 650 Systems Programming & Engineering Duke University, Spring 2018

Concurrency and Synchronization. ECE 650 Systems Programming & Engineering Duke University, Spring 2018 Concurrency and Synchronization ECE 650 Systems Programming & Engineering Duke University, Spring 2018 Concurrency Multiprogramming Supported by most all current operating systems More than one unit of

More information

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture An Introduction to Parallel Programming Peter Pacheco Chapter 2 Parallel Hardware and Parallel Software 1 The Von Neuman Architecture Control unit: responsible for deciding which instruction in a program

More information

Thread. Disclaimer: some slides are adopted from the book authors slides with permission 1

Thread. Disclaimer: some slides are adopted from the book authors slides with permission 1 Thread Disclaimer: some slides are adopted from the book authors slides with permission 1 IPC Shared memory Recap share a memory region between processes read or write to the shared memory region fast

More information

COSC 6374 Parallel Computation. Shared memory programming with POSIX Threads. Edgar Gabriel. Fall References

COSC 6374 Parallel Computation. Shared memory programming with POSIX Threads. Edgar Gabriel. Fall References COSC 6374 Parallel Computation Shared memory programming with POSIX Threads Fall 2012 References Some of the slides in this lecture is based on the following references: http://www.cobweb.ecn.purdue.edu/~eigenman/ece563/h

More information

Shared-Memory Programming

Shared-Memory Programming Shared-Memory Programming 1. Threads 2. Mutual Exclusion 3. Thread Scheduling 4. Thread Interfaces 4.1. POSIX Threads 4.2. C++ Threads 4.3. OpenMP 4.4. Threading Building Blocks 5. Side Effects of Hardware

More information

Multicore and Multiprocessor Systems: Part I

Multicore and Multiprocessor Systems: Part I Chapter 3 Multicore and Multiprocessor Systems: Part I Max Planck Institute Magdeburg Jens Saak, Scientific Computing II 44/337 Symmetric Multiprocessing Definition (Symmetric Multiprocessing (SMP)) The

More information

Design of Parallel Algorithms. Models of Parallel Computation

Design of Parallel Algorithms. Models of Parallel Computation + Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes

More information

Principles of Parallel Algorithm Design: Concurrency and Mapping

Principles of Parallel Algorithm Design: Concurrency and Mapping Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 17 January 2017 Last Thursday

More information

Concurrent/Parallel Processing

Concurrent/Parallel Processing Concurrent/Parallel Processing David May: April 9, 2014 Introduction The idea of using a collection of interconnected processing devices is not new. Before the emergence of the modern stored program computer,

More information

Parallel Programming with OpenMP. CS240A, T. Yang

Parallel Programming with OpenMP. CS240A, T. Yang Parallel Programming with OpenMP CS240A, T. Yang 1 A Programmer s View of OpenMP What is OpenMP? Open specification for Multi-Processing Standard API for defining multi-threaded shared-memory programs

More information

Theorem 2.9: nearest addition algorithm

Theorem 2.9: nearest addition algorithm There are severe limits on our ability to compute near-optimal tours It is NP-complete to decide whether a given undirected =(,)has a Hamiltonian cycle An approximation algorithm for the TSP can be used

More information

Lecture 10 Midterm review

Lecture 10 Midterm review Lecture 10 Midterm review Announcements The midterm is on Tue Feb 9 th in class 4Bring photo ID 4You may bring a single sheet of notebook sized paper 8x10 inches with notes on both sides (A4 OK) 4You may

More information

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David Scalable Algorithmic Techniques Decompositions & Mapping Alexandre David 1.2.05 adavid@cs.aau.dk Introduction Focus on data parallelism, scale with size. Task parallelism limited. Notion of scalability

More information

Matrix Multiplication

Matrix Multiplication Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

More information

High Performance Computing Programming Paradigms and Scalability Part 6: Examples of Parallel Algorithms

High Performance Computing Programming Paradigms and Scalability Part 6: Examples of Parallel Algorithms High Performance Computing Programming Paradigms and Scalability Part 6: Examples of Parallel Algorithms PD Dr. rer. nat. habil. Ralf-Peter Mundani Computation in Engineering (CiE) Scientific Computing

More information

6.1 Multiprocessor Computing Environment

6.1 Multiprocessor Computing Environment 6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,

More information

1. Define algorithm complexity 2. What is called out of order in detail? 3. Define Hardware prefetching. 4. Define software prefetching. 5. Define wor

1. Define algorithm complexity 2. What is called out of order in detail? 3. Define Hardware prefetching. 4. Define software prefetching. 5. Define wor CS6801-MULTICORE ARCHECTURES AND PROGRAMMING UN I 1. Difference between Symmetric Memory Architecture and Distributed Memory Architecture. 2. What is Vector Instruction? 3. What are the factor to increasing

More information

Copyright 2013 Thomas W. Doeppner. IX 1

Copyright 2013 Thomas W. Doeppner. IX 1 Copyright 2013 Thomas W. Doeppner. IX 1 If we have only one thread, then, no matter how many processors we have, we can do only one thing at a time. Thus multiple threads allow us to multiplex the handling

More information

COMMUNICATION IN HYPERCUBES

COMMUNICATION IN HYPERCUBES PARALLEL AND DISTRIBUTED ALGORITHMS BY DEBDEEP MUKHOPADHYAY AND ABHISHEK SOMANI http://cse.iitkgp.ac.in/~debdeep/courses_iitkgp/palgo/index.htm COMMUNICATION IN HYPERCUBES 2 1 OVERVIEW Parallel Sum (Reduction)

More information

Hypercubes. (Chapter Nine)

Hypercubes. (Chapter Nine) Hypercubes (Chapter Nine) Mesh Shortcomings: Due to its simplicity and regular structure, the mesh is attractive, both theoretically and practically. A problem with the mesh is that movement of data is

More information

Parallel Programming. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Parallel Programming. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Parallel Programming Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Challenges Difficult to write parallel programs Most programmers think sequentially

More information

End-Term Examination Second Semester [MCA] MAY-JUNE 2006

End-Term Examination Second Semester [MCA] MAY-JUNE 2006 (Please write your Roll No. immediately) Roll No. Paper Code: MCA-102 End-Term Examination Second Semester [MCA] MAY-JUNE 2006 Subject: Data Structure Time: 3 Hours Maximum Marks: 60 Note: Question 1.

More information

10th August Part One: Introduction to Parallel Computing

10th August Part One: Introduction to Parallel Computing Part One: Introduction to Parallel Computing 10th August 2007 Part 1 - Contents Reasons for parallel computing Goals and limitations Criteria for High Performance Computing Overview of parallel computer

More information

Latency Hiding on COMA Multiprocessors

Latency Hiding on COMA Multiprocessors Latency Hiding on COMA Multiprocessors Tarek S. Abdelrahman Department of Electrical and Computer Engineering The University of Toronto Toronto, Ontario, Canada M5S 1A4 Abstract Cache Only Memory Access

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

pthreads CS449 Fall 2017

pthreads CS449 Fall 2017 pthreads CS449 Fall 2017 POSIX Portable Operating System Interface Standard interface between OS and program UNIX-derived OSes mostly follow POSIX Linux, macos, Android, etc. Windows requires separate

More information

Mapping Algorithms to Hardware By Prawat Nagvajara

Mapping Algorithms to Hardware By Prawat Nagvajara Electrical and Computer Engineering Mapping Algorithms to Hardware By Prawat Nagvajara Synopsis This note covers theory, design and implementation of the bit-vector multiplication algorithm. It presents

More information

Shared Memory Programming. Parallel Programming Overview

Shared Memory Programming. Parallel Programming Overview Shared Memory Programming Arvind Krishnamurthy Fall 2004 Parallel Programming Overview Basic parallel programming problems: 1. Creating parallelism & managing parallelism Scheduling to guarantee parallelism

More information

Sorting Algorithms. Slides used during lecture of 8/11/2013 (D. Roose) Adapted from slides by

Sorting Algorithms. Slides used during lecture of 8/11/2013 (D. Roose) Adapted from slides by Sorting Algorithms Slides used during lecture of 8/11/2013 (D. Roose) Adapted from slides by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel

More information

7 Parallel Programming and Parallel Algorithms

7 Parallel Programming and Parallel Algorithms 7 Parallel Programming and Parallel Algorithms 7.1 INTRODUCTION Algorithms in which operations must be executed step by step are called serial or sequential algorithms. Algorithms in which several operations

More information

Outline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency

Outline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency 1 2 Parallel Algorithms for Linear Algebra Richard P. Brent Computer Sciences Laboratory Australian National University Outline Basic concepts Parallel architectures Practical design issues Programming

More information

Pipeline and Vector Processing 1. Parallel Processing SISD SIMD MISD & MIMD

Pipeline and Vector Processing 1. Parallel Processing SISD SIMD MISD & MIMD Pipeline and Vector Processing 1. Parallel Processing Parallel processing is a term used to denote a large class of techniques that are used to provide simultaneous data-processing tasks for the purpose

More information

The University of Texas at Arlington

The University of Texas at Arlington The University of Texas at Arlington Lecture 10: Threading and Parallel Programming Constraints CSE 5343/4342 Embedded d Systems II Objectives: Lab 3: Windows Threads (win32 threading API) Convert serial

More information