EE/CSCI 451 Spring 2017 Homework 3 solution Total Points: 100



1 [10 points]

1. Task parallelism: The computations in a parallel algorithm can be split into a set of tasks for concurrent execution. Task parallelism exploits this parallelism by distributing the execution of different tasks across different parallel processing elements.
2. Race condition: When the output of a parallel program for a given input is nondeterministic because it depends on the relative rates at which the various threads execute, the program has a race condition.
3. PRAM: PRAM is a shared memory programming model consisting of p (p > 1) processors connected to a shared memory and executing in a synchronous manner (using a common clock). Each computation and each access to memory takes 1 unit of time.
4. Shared memory programming model: The shared memory programming model provides a globally shared data space that is accessible to all threads. Threads can also have their own private data. The programmer is responsible for synchronizing access to globally shared data to ensure correctness of the program.
5. Asynchronous execution: Asynchronous execution has no global clock to coordinate execution. The order of execution of instructions depends on the input data, the scheduling algorithm, the speed of the processors, and the speed of the communication network.
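As an illustration of definitions 2 and 4, the following is a minimal Python sketch, not taken from the course material, in which several threads update one shared counter. The function name `count_with_lock` and its parameters are hypothetical. The lock is the programmer-supplied synchronization; without it, the read-modify-write on the shared counter could interleave across threads and the result could depend on thread timing.

```python
import threading

def count_with_lock(num_threads=4, increments=1000):
    """Each thread increments one shared counter; a lock serializes the updates."""
    counter = [0]            # globally shared data, visible to every thread
    lock = threading.Lock()  # synchronization supplied by the programmer

    def worker():
        for _ in range(increments):  # the loop variable is private to each thread
            with lock:               # protects the read-modify-write below
                counter[0] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

With the lock in place the returned value is always `num_threads * increments`, independent of how the threads are scheduled.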

2 [25 points]

1. For simplicity, assume the number of threads $w$ is a power of 2, written $w = 2^m$ with $m \le k$. We evenly divide the input vector $p$ (of length $2^k$) into $w$ sub-vectors, each of length $2^{k-m}$. These sub-vectors are denoted $p_{sub_0}, \ldots, p_{sub_{w-1}}$. Similarly, we obtain $q_{sub_0}, \ldots, q_{sub_{w-1}}$. Then we use Thread $i$ ($0 \le i < w$) to compute the dot product of $p_{sub_i}$ and $q_{sub_i}$. After each thread obtains its partial dot product, we sum the partial dot products following the algorithm in Lecture 7 (Title: Adding in PRAM) to obtain the final result.

The time complexity of the serial execution is $O(2^k) + O(2^k - 1) = O(2^k)$ ($2^k$ multiplications plus $2^k - 1$ additions). The time complexity of the parallel execution is $O(2^{k-m}) + O(\log w) = O(2^{k-m}) + O(m) = O(2^{k-m}) = O(2^k / w)$, where the third step holds as long as the reduction depth $m$ does not dominate the per-thread work, i.e. $m = O(2^{k-m})$. Therefore, the speedup is $O(2^k) / O(2^k / w) = O(w)$, so the solution is scalable.

2.

    /* Pseudo code executed by the thread with index id */
    // Partial_dot_product is a shared array that stores the partial dot products;
    // the final result is output as Partial_dot_product[0] by the thread with index 0
    Partial_dot_product[id] = 0;
    for (i = id*2^(k-m); i < (id+1)*2^(k-m); i++)
        Partial_dot_product[id] += p_i * q_i;
    end for
    barrier;  // A barrier is needed here to synchronize threads
    for (i = 0; i < m; i++)
        if (id mod 2^(i+1) = 0) then
            Partial_dot_product[id] += Partial_dot_product[id + 2^i];
        end if
        barrier;
    end for
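The scheme above can be sketched in Python with `threading.Barrier`. This is an illustrative sketch under stated assumptions rather than the course's reference code: the function name `parallel_dot` is hypothetical, the vectors are plain lists whose length is a multiple of w = 2^m, and Python threads are used only to mirror the structure of the pseudo code (they do not give real speedup under the interpreter lock).

```python
import threading

def parallel_dot(p, q, m):
    """Dot product of p and q using w = 2**m threads and a tree reduction."""
    w = 2 ** m
    n = len(p)
    if n != len(q) or n % w != 0:
        raise ValueError("vectors must have equal length divisible by 2**m")
    chunk = n // w                  # sub-vector length, 2^(k-m) in the text
    partial = [0] * w               # shared array of partial dot products
    barrier = threading.Barrier(w)

    def worker(tid):
        # Phase 1: each thread computes the dot product of its own sub-vectors.
        for i in range(tid * chunk, (tid + 1) * chunk):
            partial[tid] += p[i] * q[i]
        barrier.wait()
        # Phase 2: m reduction steps, halving the number of active threads.
        for step in range(m):
            if tid % (2 ** (step + 1)) == 0:
                partial[tid] += partial[tid + 2 ** step]
            barrier.wait()          # all threads reach the barrier, active or not

    threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(w)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return partial[0]
```

For example, `parallel_dot(list(range(16)), [2] * 16, 2)` runs four threads on sub-vectors of length 4 and then performs two reduction steps.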

3 [30 points]

1. The shared variables include At_least_one_vertex_has_update and the s array, which records the shortest path lengths.

    /* Pseudo code executed by Thread(i,j) */
    /* At_least_one_vertex_has_update is assumed to be initialized to true */
    for (k = 0; k < # of vertices; k++)
        updated = At_least_one_vertex_has_update;  // private copy of the shared flag
        barrier;  // every thread reads the flag before any thread clears it
        if updated = false then
            Return;
        end if
        At_least_one_vertex_has_update = false;
        barrier;
        Lock(s(i), s(j));
        if s(i)+w(i,j) < s(j) then
            s(j) = s(i)+w(i,j);
            At_least_one_vertex_has_update = true;
        end if
        Unlock(s(i), s(j));
        barrier;
    end for
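A minimal Python sketch of the edge-parallel relaxation follows, with one thread per edge (i, j). Several details are assumptions made for illustration: the function name `edge_parallel_sssp`, the edge-list input format, a single coarse lock standing in for Lock(s(i), s(j)), and each thread taking a private copy of the shared flag before it is cleared so that all threads make the same exit decision and none is left waiting at a barrier.

```python
import threading

def edge_parallel_sssp(num_vertices, edges, source):
    """edges: list of (i, j, weight). Returns shortest path lengths from source."""
    s = [float("inf")] * num_vertices
    s[source] = 0
    if not edges:
        return s
    state = {"updated": True}              # shared flag, initially true
    barrier = threading.Barrier(len(edges))
    lock = threading.Lock()                # coarse stand-in for Lock(s(i), s(j))

    def worker(i, j, weight):
        for _ in range(num_vertices):
            seen = state["updated"]        # private copy of the shared flag
            barrier.wait()                 # all threads read before any thread clears
            if not seen:
                return                     # every thread sees the same value and exits
            state["updated"] = False
            barrier.wait()                 # all clears finish before any relaxation
            with lock:
                if s[i] + weight < s[j]:   # relax edge (i, j)
                    s[j] = s[i] + weight
                    state["updated"] = True
            barrier.wait()                 # all relaxations finish before the next read

    threads = [threading.Thread(target=worker, args=e) for e in edges]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return s
```

A finer-grained design would lock only s(i) and s(j), as the pseudo code suggests; the single lock here keeps the sketch short at the cost of serializing all relaxations.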

4 [25 points]

The Pthreads program discussed in class, when converted to PRAM, causes multiple writes to the same location C(i, k) in the k-th iteration by threads (i, 1 : n). We can add another for loop within each of the k iterations to serialize the accesses by threads (i, 1 : n); the resulting program takes $n^2$ clocks. A better approach is to rearrange the accesses by the threads to the elements of C. For a fixed k, the n threads (i, 1 : n) then obtain n distinct values of index, so no two of them write the same element in the same clock. The program can be implemented in n clocks, as shown by the program below.

    Thread(i,j) {
        Do k from 1 to n
            index = (k + j - 1) % (n+1) + floor((k + j - 1)/(n+1))
            C(i, index) = C(i, index) + A(i,j)*B(j, index)
        End
    }
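The conflict-free property of the skewed index can be checked with a small sequential Python simulation. This is a sketch for verification only: the function name `skewed_matmul` is hypothetical, matrices are lists of lists, and the inner loop over j stands for the n threads (i, 1 : n) that would act in the same PRAM clock.

```python
def skewed_matmul(A, B):
    """Simulate the n-clock PRAM program; raise if two threads hit one C(i, index)."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for k in range(1, n + 1):              # one PRAM clock per value of k
        for i in range(1, n + 1):
            written = set()                # columns of row i written in this clock
            for j in range(1, n + 1):      # threads (i, 1:n) act simultaneously
                t = k + j - 1
                index = t % (n + 1) + t // (n + 1)
                if index in written:
                    raise RuntimeError("concurrent write to C(%d, %d)" % (i, index))
                written.add(index)
                # Shift from the 1-based indices of the text to 0-based lists.
                C[i - 1][index - 1] += A[i - 1][j - 1] * B[j - 1][index - 1]
    return C
```

For each thread (i, j), index takes the values j, j+1, ..., n, 1, ..., j-1 as k runs from 1 to n, so every product A(i,j)*B(j,index) is accumulated exactly once.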

5 [10 points]

No, the parallel execution will not produce the same output A. This is because the loop we are parallelizing has a dependence within the loop (i.e., a loop dependence). For example, in the serial version, A[i][j-1] is always computed before A[i][j]; in the parallel version, however, A[i][j-1] may be computed after A[i][j] because of the scheduling of threads. This is problematic because the computation of A[i][j] depends on A[i][j-1].
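The effect of the dependence can be illustrated with a small Python sketch. The recurrence used here, A[i][j] = A[i][j] + A[i][j-1], is a hypothetical stand-in for the loop body in the problem statement, and the function name `run_row` is likewise an assumption; the point is only that changing the order in which the j iterations execute changes the result.

```python
def run_row(row, order):
    """Apply the hypothetical update out[j] += out[j-1] for each j in the given order."""
    out = list(row)
    for j in order:
        out[j] = out[j] + out[j - 1]   # reads the value produced for j-1
    return out

row = [1, 1, 1, 1]
serial = run_row(row, range(1, 4))         # j = 1, 2, 3, as in the serial loop
reordered = run_row(row, range(3, 0, -1))  # j = 3, 2, 1, one possible thread schedule
```

Here `serial` is `[1, 2, 3, 4]` while `reordered` is `[1, 2, 2, 2]`, because in the second schedule each A[i][j] is computed from a value of A[i][j-1] that has not been updated yet.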
