EE/CSCI 451 Midterm 1


EE/CSCI 451 Midterm 1
Spring 2018
Instructor: Xuehai Qian
Friday, 02/26/2018

Problem #  Topic                            Points  Score
1          Definitions                      20
2          Memory System Performance        10
3          Cache Performance                10
4          Shared Memory Programming Model  15
5          Interconnection Networks         10
6          Interconnection Networks         10
7          Analytical Modeling              12
8          Program and Data Mapping         13
Total                                       100

Student Name:
Student USC-ID:

Problem 1 (10 × 2 = 20 Points) Define/Explain the following terms

a. Work-optimal parallel algorithm
The cost of solving a problem on a single processing element is the execution time of the fastest known sequential algorithm. A parallel algorithm is work-optimal if the cost of solving a problem on a parallel computer has the same asymptotic growth, as a function of the input size, as the fastest known sequential algorithm on a single processing element. A work-optimal parallel algorithm has an efficiency of Θ(1).

b. Store-and-forward routing
In store-and-forward routing, when a message traverses a path with multiple links, each intermediate node on the path forwards the message to the next node only after it has received and stored the entire message.

c. Spatial locality
Spatial locality implies that if a location i is referenced at time t, then locations near i are likely to be referenced in a small window of time following t.

d. Data dependency
A data dependency is a situation in which a program statement (i.e., instruction) refers to the data output by a previous statement.

e. Instruction-level parallelism
Instruction-level parallelism (ILP) is a measure of the number of operations in a computer program that can be performed simultaneously. ILP is exploited by executing multiple operations from a program in a single cycle.

f. Bisection width of a network
The bisection width of a network is defined as the minimum number of communication links that must be removed to partition the network into two equal halves.

g. Non-blocking network
In a non-blocking network, any connection request from an input to an output can be routed without rearranging the existing set of connections.

h. Asynchronous execution
Asynchronous execution has no global clock to coordinate execution among the processors. The order in which instructions execute depends on the input data, the scheduling algorithm, the speed of the processors, and the speed of the communication network.

i. Shuffle-exchange network
A shuffle-exchange network performs shuffle and exchange operations to route from a source x = x_{n-1} ... x_0 to a destination y = y_{n-1} ... y_0. The shuffle operation circularly shifts the bits of x left by one position:
x -> x_{n-2} ... x_0 x_{n-1}
The exchange operation complements the least significant bit of x:
x -> x_{n-1} ... x_1 (complement of x_0)

j. Cache pollution
Cache pollution describes the scenario in which a program loads unnecessary data into the cache, causing the eviction of useful data to lower levels of the memory hierarchy (e.g., main memory).

Problem 2 (10 Points) Memory System Performance

Consider a memory system with 100-cycle-latency DRAM connected to a processor that operates at 1 GHz. The processor-memory bus can support one word per cycle (streaming bandwidth = 1 word/cycle). Assume the cache has been disabled. The processor has two floating-point multiply-add units; each multiply-add unit is capable of executing one multiplication and one addition per processor cycle. Thus, the processor can execute four floating-point operations (two multiplications and two additions) in each processor cycle.

a. What is the peak floating-point performance of the processor? State any assumption(s) you may make. (2 points)

Peak performance = 1 GHz × 4 FLOPs/cycle = 4 GFLOPS

Consider the following program:

result = 0; // The result is stored in a local register
for (i = 0; i < ...; i++)
    result = result * C[i] + A[i] * B[i];

b. Assume each element of A, B, and C is one word stored in DRAM and the memory system supports streaming. What is the sustained performance in the best case (in FLOPS)? State any assumption(s) you may make. (4 points)

In the best case, the three operands (A[i], B[i], C[i]) are streamed from memory every three cycles and used to perform 3 FLOPs (2 multiplications and 1 addition). The computation can be completely overlapped with the memory accesses.
Sustained performance (best case) = 3 FLOPs over 3 processor cycles = 1 GFLOPS

c. Assume the memory system does not support streaming. What is the sustained performance in the worst case (in FLOPS)? State any assumption(s) you may make. (4 points)

Without streaming, each access pays the full 100-cycle DRAM latency, so the three operands (A[i], B[i], C[i]) are read from memory every 300 cycles and used to perform 3 FLOPs. The computation can be completely overlapped with the memory accesses.
Sustained performance (worst case) = 3 FLOPs over 300 cycles = 0.01 GFLOPS
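As a sanity check, the rates above can be recomputed with a short Python sketch. The constants come from the problem statement; the variable names are my own:

```python
CLOCK_HZ = 1e9        # 1 GHz processor clock
DRAM_LATENCY = 100    # cycles per non-streamed memory access
FLOPS_PER_CYCLE = 4   # two multiply-add units, each doing 1 mul + 1 add

# a. Peak performance: every cycle performs 4 FLOPs.
peak = FLOPS_PER_CYCLE * CLOCK_HZ             # 4 GFLOPS

# b. Best case with streaming: 3 operands arrive in 3 cycles, feeding 3 FLOPs.
best = (3 / 3) * CLOCK_HZ                     # 1 GFLOPS

# c. Worst case without streaming: 3 loads x 100 cycles = 300 cycles per 3 FLOPs.
worst = (3 / (3 * DRAM_LATENCY)) * CLOCK_HZ   # 0.01 GFLOPS

print(peak / 1e9, best / 1e9, worst / 1e9)
```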

Problem 3 (10 Points) Cache Performance

Suppose we want to double the value of each element of a matrix A on a uni-processor. The size of each element is 1 word. The matrix is stored in row-major order. The processor has a direct-mapped cache with 8 cache lines; the size of each cache line is 4 words.

a. Compute the cache hit ratio for read operations when executing the following code (2 points). Explain (2 points).

for (i = 0; i < 128; i++)
    for (j = 0; j < 128; j++)
        A[i][j] = 2 * A[i][j];

The cache hit ratio for read operations is 75%. When a cache miss occurs on A[i][j], the elements A[i][j+1], A[i][j+2], A[i][j+3] are brought into the cache in the same cache line, so the next three accesses hit. In general, every 4 elements accessed result in 1 cache miss and 3 cache hits.

b. Compute the cache hit ratio for read operations when executing the following code (2 points). Explain (2 points).

for (j = 0; j < 128; j++)
    for (i = 0; i < 128; i++)
        A[i][j] = 2 * A[i][j];

The cache hit ratio for read operations is 0%. A miss on A[i][j] still brings A[i][j+1], A[i][j+2], A[i][j+3] into the same cache line, but the program accesses A[i+1][j] next because it traverses the matrix in column-major order, so a miss occurs on every access: A[i+1][j], A[i+2][j], A[i+3][j], and so on. (In fact, since each 128-word row spans 32 cache-line-sized blocks and 32 is a multiple of 8, every element of a given column maps to the same cache line; each access evicts the previous block, so none of the fetched neighbors survive until the next column.)

c. Repeat parts a and b if the size of each cache line is 8 words. (2 points)

Part a: 7/8 = 87.5%; Part b: 0%
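The three hit ratios can be verified with a small direct-mapped cache simulator. This is a Python sketch written for this problem; `hit_ratio` and its parameters are not part of the exam:

```python
def hit_ratio(addresses, num_lines=8, line_size=4):
    """Simulate a direct-mapped cache and return the read hit ratio."""
    cache = [None] * num_lines        # block number currently held by each line
    hits = 0
    for addr in addresses:
        block = addr // line_size     # which memory block this word belongs to
        line = block % num_lines      # direct-mapped placement
        if cache[line] == block:
            hits += 1
        else:
            cache[line] = block       # fill the line on a miss
    return hits / len(addresses)

N = 128  # word address of A[i][j] in row-major order is i*N + j
row_major = [i * N + j for i in range(N) for j in range(N)]
col_major = [i * N + j for j in range(N) for i in range(N)]

print(hit_ratio(row_major))                # 0.75
print(hit_ratio(col_major))                # 0.0
print(hit_ratio(row_major, line_size=8))   # 0.875 (part c)
```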

Problem 4 (15 Points) Shared Memory Programming Model

Given an undirected graph G(V, E), V = {0, ..., n-1}, a connected component is defined as a sub-graph such that any two vertices of the component are connected by a path in G. The root of a connected component is the smallest vertex in the component, and the label of a connected component is its root vertex. The algorithm to find the connected component (the label of the root vertex) to which each vertex belongs is illustrated in Figure 1. For each vertex i (0 ≤ i < n), we use c(i) to keep track of the label of the connected component that the vertex belongs to. The algorithm is iterative; in each iteration, all the edges are traversed to update c(0), ..., c(n-1). The algorithm terminates when no c(i) is updated in an iteration. At that point, c(i) is the label of the connected component that vertex i belongs to.

Figure 1: Finding connected components (example input graph and its labels omitted)

Suppose we want to parallelize the algorithm using p (p > 1) threads, with each thread executing the computation for |E|/p edges (|E| = total # of edges in G) in each iteration. We define an iteration for a thread as the work done to traverse its own edges once. For this problem, we assume that all threads take a similar amount of time to traverse their edges in each iteration.

a. What are the shared variables for each thread? (3 points)

The shared variables are the flag At_least_one_vertex_has_update and the labels c(0), ..., c(n-1); a thread processing edge (i, j) reads and writes c(i) and c(j).

b. Write the pseudo code of the function executed by Thread w (0 ≤ w < p). Note that your code must ensure that at the end of the k-th iteration, all vertices in a connected component at a distance less than or equal to k from the root have the correct label of that component. (5 points)

/* Pseudo code executed by thread with index w */
Let edge[] denote the array that stores the |E| edges
while (At_least_one_vertex_has_update == true)
    Lock(At_least_one_vertex_has_update);
    At_least_one_vertex_has_update = false;
    Unlock(At_least_one_vertex_has_update);
    for (g = w*|E|/p; g < (w+1)*|E|/p; g++)
        Lock(c(edge[g].i), c(edge[g].j));
        m = min(c(edge[g].i), c(edge[g].j));
        if c(edge[g].i) > m then
            c(edge[g].i) = m;
            At_least_one_vertex_has_update = true;
        end if
        if c(edge[g].j) > m then
            c(edge[g].j) = m;
            At_least_one_vertex_has_update = true;
        end if
        Unlock(c(edge[g].i), c(edge[g].j));
    end for
    barrier;
end while

c. If your code in part b does not use any locks, will the execution ever terminate? If yes, will it be able to produce the correct output? Explain. If no, explain why the execution will never terminate. (3 points)

Even if the locks are not used, the execution will still terminate and produce the correct output. This follows from the fact that if, in the correct program, c(i) has the value l after some iteration k, then in the program without locks c(i) will have the value l by iteration k+1: due to race conditions, vertex i might miss an update from some neighbor j, but that update will be propagated in the next iteration.

d. Given the input graph shown below, if your code in part b does not use any lock, for p = 3, what is the total number of iterations that the algorithm executes in the best case? What is the total number of iterations in the worst case? Explain. (4 points)

(Input graph figure omitted)

Best case: 2 iterations
Initial setup: c(0) = 0, c(1) = 1, c(2) = 2

Iteration 1: c(0) = 0, c(1) = 0, c(2) = 0
Iteration 2: c(0) = 0, c(1) = 0, c(2) = 0; no update, so the algorithm terminates

Worst case: 3 iterations
Initial setup: c(0) = 0, c(1) = 1, c(2) = 2
Iteration 1: c(0) = 0, c(1) = 0, c(2) = 1 (c(2) is incorrectly updated due to a race condition)
Iteration 2: c(0) = 0, c(1) = 0, c(2) = 0
Iteration 3: c(0) = 0, c(1) = 0, c(2) = 0; no update, so the algorithm terminates
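A single-threaded Python sketch of the same label-propagation loop (race-free, so it matches the "correct program" of part c) shows the terminating behavior. The triangle graph used here is only an illustrative input, not necessarily the exam's figure:

```python
def connected_components(n, edges):
    """Iteratively propagate the minimum label along every edge until stable."""
    c = list(range(n))           # c[i] starts as vertex i's own index
    iterations = 0
    changed = True
    while changed:               # one pass over all edges per iteration
        changed = False
        iterations += 1
        for i, j in edges:
            m = min(c[i], c[j])
            if c[i] > m:
                c[i], changed = m, True
            if c[j] > m:
                c[j], changed = m, True
    return c, iterations

# Triangle on vertices {0, 1, 2}: every label collapses to root 0 in one pass,
# plus one more pass to detect that nothing changed.
labels, iters = connected_components(3, [(0, 1), (1, 2), (0, 2)])
print(labels, iters)   # [0, 0, 0] 2
```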

Problem 5 (10 Points) Interconnection Networks

Definition 5.1: A p-input, p-output CLOS network can be defined as a 3-stage network in which Stage 0 and Stage 2 each consist of two (p/2) × (p/2) switches and Stage 1 consists of p/2 switches of size 2 × 2.

a. Draw such a network for p = 8. (2 points)

Figure 2: CLOS network for p = 8 (Stage 0 and Stage 2 each contain two 4 × 4 switches, Stage 1 contains four 2 × 2 switches; drawing omitted)

b. Apply Definition 5.1 recursively to decompose all the switches in Stage 0 and Stage 2 until the network consists only of 2 × 2 switches. Draw such a network for p = 8. (3 points)

Figure 3: Recursively decomposed CLOS network for p = 8 (drawing omitted)

c. In general, derive an expression for the total number of switches and the total delay from an input to an output for an n-input, n-output CLOS network obtained by recursively applying Definition 5.1 to the switches in Stage 0 and Stage 2, as in part (b). Use order notation. Note that the final network consists only of 2 × 2 switches. (Assume the delay of each 2 × 2 switch is 1 unit.) (5 points)

The recurrence relation for the total number of switches S(n) in a network of size n is

S(n) = 4 S(n/2) + n/2,    S(2) = 1
     = 4 [4 S(n/4) + n/4] + n/2 = 4^2 S(n/4) + n + n/2
     = 4^2 [4 S(n/8) + n/8] + n + n/2 = 4^3 S(n/8) + 2n + n + n/2
     = ...
     = 4^k S(2) + 2^(k-2) n + ... + n + n/2
     = 4^k S(2) + (n/2)(2^k - 1),    k = log2(n) - 1
     = n^2/4 + (n/2)(n/2 - 1)

so S(n) = n^2/2 - n/2 = Θ(n^2).

The recurrence relation for the total delay D(n) in a network of size n is

D(n) = 2 D(n/2) + 1,    D(2) = 1
     = 2 [2 D(n/4) + 1] + 1 = 2^2 D(n/4) + 2 + 1
     = ...
     = 2^k D(2) + 2^(k-1) + ... + 2 + 1 = 2^k + (2^k - 1) = 2^(k+1) - 1,    k = log2(n) - 1

so D(n) = n - 1 = Θ(n).
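The two closed forms can be cross-checked by evaluating the recurrences directly. This quick Python sketch is not part of the exam solution:

```python
def S(n):
    """Switch-count recurrence: S(n) = 4 S(n/2) + n/2, S(2) = 1."""
    return 1 if n == 2 else 4 * S(n // 2) + n // 2

def D(n):
    """Delay recurrence: D(n) = 2 D(n/2) + 1, D(2) = 1."""
    return 1 if n == 2 else 2 * D(n // 2) + 1

for n in (2, 4, 8, 16, 64, 1024):
    assert S(n) == n * n // 2 - n // 2   # closed form S(n) = n^2/2 - n/2
    assert D(n) == n - 1                 # closed form D(n) = n - 1
print(S(8), D(8))   # 28 7
```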

Problem 6 (10 Points) Interconnection Networks

In HW #2, we designed a 2^k-node Shuffle-Exchange (SE) network along 2 dimensions. In this problem, we generalize the design to support l dimensions (l ≤ k). We evenly divide the k address bits into l chunks; each chunk of k/l bits is used to perform SE routing in one of the l dimensions. We assume k is divisible by l.

a. Draw the network for k = 4 and l = 2. (2 points)

(Figure omitted)

b. Show all the intermediate nodes while routing from source s = 1110 to destination d = 1001 in the network. (2 points)

Intermediate nodes:

c. Assume k and l are even numbers. Suppose we route from s = 00...0 (all 0s) to d = 1010...10 (the pattern 10 repeated); what is the total path length (each shuffle or exchange operation takes 1 unit) in terms of k and l? (4 points)

Along each dimension, we need to perform k/l shuffle operations and k/(2l) exchange operations (one exchange for each 1-bit in the destination chunk). Thus, over the l dimensions the total path length is l(k/l + k/(2l)) = 3k/2.

d. Suppose k = l; comment on the resulting network. (2 points)

When k = l, each chunk is a single bit and the network becomes a k-dimensional hypercube network.
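The count in part c can be simulated per dimension with the shuffle and exchange operations defined in Problem 1. This is a Python sketch; `se_ops` is my own formulation of standard SE routing (shuffle every step, exchange when the destination bit differs), not code from the exam:

```python
def shuffle(x, m):
    """Circular left shift of the m-bit address x."""
    return ((x << 1) | (x >> (m - 1))) & ((1 << m) - 1)

def exchange(x):
    """Complement the least significant bit."""
    return x ^ 1

def se_ops(src, dst, m):
    """Route src -> dst on an m-bit SE network, counting unit operations."""
    x, ops = src, 0
    for t in range(m):
        x = shuffle(x, m)                 # always shuffle: m shuffles total
        ops += 1
        if (x & 1) != ((dst >> (m - 1 - t)) & 1):
            x = exchange(x)               # fix one destination bit
            ops += 1
    assert x == dst
    return ops

# k = 4, l = 2: each 2-bit chunk independently routes 00 -> 10.
k, l = 4, 2
total = l * se_ops(0b00, 0b10, k // l)
print(total)   # 6, which matches 3k/2
```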

Problem 7 (12 Points) Analytical Modeling

In this problem, we analytically model the algorithm to compute π that you implemented in PHW 1. Consider the following algorithm:

#define N 1000

CalculatePI
1    num_of_points_in_circle = 0;
2    num_of_points_not_in_circle = 0;
     for (i = 0; i < N; i++) {
3.1      Generate a random point in the unit square
3.2      Check whether the point is inside the unit circle or not
         if yes
3.3          num_of_points_in_circle++;
         else
3.4          num_of_points_not_in_circle++;
     }
4    PI = 4*num_of_points_in_circle/N;

Assume each statement with a line number on its left takes 1 cycle to execute. Note that only the for loop is parallelizable. Assume that the code runs on a perfect parallel architecture with no overheads such as communication, coordination, or thread-creation overheads. Also ignore any concurrency issues; i.e., you may assume that even if multiple threads write to the same location, any race conditions are taken care of by the architecture without additional overhead and the intended result is produced.

a. Calculate S and P. (3 points)
S: time taken by the portion of the code which cannot be parallelized.
P: time taken, in a serial program, by the portion of the code which can be parallelized.

S = 3 (statements 1, 2, and 4)
P = 3N = 3000 (each iteration executes 3 numbered statements: 3.1, 3.2, and one of 3.3/3.4; the iterations run 1000 times in a serial program)

b. Now, if we use p ≤ N threads to parallelize the for loop, derive an expression for the overall speedup achieved in terms of p. What is the maximum speedup that can be achieved? What is the value of p that achieves the maximum speedup? (3 points)

Speedup = (S + P) / (S + P/p) = 3003 / (3 + 3000/p)
Maximum speedup = 3003/6 = 500.5, achieved when p = 1000.

c. Derive an expression for Efficiency in terms of p. (3 points)

Efficiency = Speedup / p = 3003 / (3p + 3000)
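Parts a through c reduce to a couple of lines of arithmetic, sketched here in Python (the function and variable names are mine):

```python
S, P = 3, 3000           # serial cycles and parallelizable cycles (N = 1000)

def speedup(p):
    """Speedup with p threads sharing the parallelizable portion."""
    return (S + P) / (S + P / p)

def efficiency(p):
    return speedup(p) / p

print(speedup(1000))     # 500.5: the maximum, reached at p = N = 1000
print(efficiency(1000))  # 0.5005
```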

d. Now consider the following algorithm:

CalculatePI(N)
1    num_of_points_in_circle = 0;
2    num_of_points_not_in_circle = 0;
     for (i = 0; i < N; i++) {
3.1      Generate a random point in the unit square
3.2      Check whether the point is inside the unit circle or not
         if yes
3.3          num_of_points_in_circle++;
         else
3.4          num_of_points_not_in_circle++;
     }
4    PI = 4*num_of_points_in_circle/N;

Note that in this case, N is a parameter to the function and is not fixed as in parts a, b, and c. In this part, we use p = N threads to parallelize the for loop. Derive an expression for the scaled speedup achieved in terms of p. (3 points)

Scaled speedup = (3 + 3p) / (3 + 3) = (3 + 3p)/6 ≈ 0.5p
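The scaled-speedup expression can likewise be checked numerically (a Python sketch; names are mine):

```python
def scaled_speedup(p):
    """Scaled speedup when the problem grows with the thread count (N = p)."""
    serial_time = 3 + 3 * p    # 3 serial statements + 3 cycles per iteration
    parallel_time = 3 + 3      # serial part + one iteration per thread
    return serial_time / parallel_time

print(scaled_speedup(1000))   # 500.5, i.e. roughly 0.5 * p for large p
```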

Problem 8 (13 Points) Program and Data Mapping

Assume a k-node fully connected network G is embedded in a k-node ring G' as follows: (a) node i in the fully connected network is mapped to node i in the ring (0 ≤ i < k); (b) let e_ij be the edge from node i to node j in G; then the edge mapping function maps e_ij to the path in G' starting from i and moving in a clockwise manner to reach j. E.g., e_03 is mapped to {(0, 1), (1, 2), (2, 3)}. Note that e_ij and e_ji may be mapped to different paths in G'. However, assume that the k-node ring is undirected.

Figure 4: Embedding a fully connected network into a ring for k = 6 (drawing omitted)

a. Figure 4 shows the mapping for k = 6. Derive the values of dilation and congestion for k = 6. (3 points)

Mapping function f: i → i
Dilation = 5: e.g., edge (1, 0) maps to the path {(1,2), (2,3), (3,4), (4,5), (5,0)} of length 5.
Congestion = 15: consider any ring edge, e.g., (1, 2). There are 30 directed edges in G; out of each edge pair e_ij and e_ji, exactly one maps to a path crossing any given ring edge. Hence 15 paths cross each edge of G'.

b. Derive the exact expressions for dilation and congestion in terms of k (assume k is even). (5 points)

Dilation = k - 1: an edge between adjacent nodes, routed the "long way" clockwise, travels through the entire ring.
Congestion = (k-1) + (k-2) + ... + 1 = k(k-1)/2: out of each pair e_ij, e_ji, exactly one crosses any given ring edge, so the congestion equals the number of undirected edges in G.

c. Now assume that the k-node fully connected network is mapped to a k-node linear array (without wraparound). In this problem, an edge e_ij in G is mapped to the shortest path in G'. Derive the exact expressions for dilation and congestion in terms of k. (Assume k is even. Also assume e_ij and e_ji are distinct edges in G.) (5 points)

Dilation = k - 1 (the edge between nodes 0 and k-1).
Congestion = 2 · (k/2) · (k/2) = k^2/2, occurring at the middle edge connecting vertices k/2 - 1 and k/2: each of the (k/2)(k/2) node pairs with one endpoint on each side contributes both e_ij and e_ji to that edge.
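The dilation and congestion values of the ring embedding in parts a and b can be verified by brute force. This Python sketch (my own, not exam code) walks every clockwise path and counts the load on each undirected ring edge:

```python
def ring_embedding_stats(k):
    """Dilation and congestion of embedding the complete graph K_k into a
    k-node ring, mapping each directed edge e_ij to the clockwise path i -> j."""
    load = {}        # undirected ring edge -> number of mapped paths using it
    dilation = 0
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            node, length = i, 0
            while node != j:                    # walk clockwise from i to j
                nxt = (node + 1) % k
                e = frozenset((node, nxt))
                load[e] = load.get(e, 0) + 1
                node, length = nxt, length + 1
            dilation = max(dilation, length)
    return dilation, max(load.values())

print(ring_embedding_stats(6))   # (5, 15): matches part a
```

By symmetry every ring edge carries the same load, so the maximum equals k(k-1)/2, matching the closed form in part b.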


More information

How to Write Fast Numerical Code

How to Write Fast Numerical Code How to Write Fast Numerical Code Lecture: Memory hierarchy, locality, caches Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh Organization Temporal and spatial locality Memory

More information

Lecture 3: Sorting 1

Lecture 3: Sorting 1 Lecture 3: Sorting 1 Sorting Arranging an unordered collection of elements into monotonically increasing (or decreasing) order. S = a sequence of n elements in arbitrary order After sorting:

More information

Lecture 9: Group Communication Operations. Shantanu Dutt ECE Dept. UIC

Lecture 9: Group Communication Operations. Shantanu Dutt ECE Dept. UIC Lecture 9: Group Communication Operations Shantanu Dutt ECE Dept. UIC Acknowledgement Adapted from Chapter 4 slides of the text, by A. Grama w/ a few changes, augmentations and corrections Topic Overview

More information

Computer organization by G. Naveen kumar, Asst Prof, C.S.E Department 1

Computer organization by G. Naveen kumar, Asst Prof, C.S.E Department 1 Pipelining and Vector Processing Parallel Processing: The term parallel processing indicates that the system is able to perform several operations in a single time. Now we will elaborate the scenario,

More information

Lecture 12: Instruction Execution and Pipelining. William Gropp

Lecture 12: Instruction Execution and Pipelining. William Gropp Lecture 12: Instruction Execution and Pipelining William Gropp www.cs.illinois.edu/~wgropp Yet More To Consider in Understanding Performance We have implicitly assumed that an operation takes one clock

More information

Basic Communication Operations Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

Basic Communication Operations Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar Basic Communication Operations Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003 Topic Overview One-to-All Broadcast

More information

Exam Sample Questions CS3212: Algorithms and Data Structures

Exam Sample Questions CS3212: Algorithms and Data Structures Exam Sample Questions CS31: Algorithms and Data Structures NOTE: the actual exam will contain around 0-5 questions. 1. Consider a LinkedList that has a get(int i) method to return the i-th element in the

More information

Introduction to Parallel & Distributed Computing Parallel Graph Algorithms

Introduction to Parallel & Distributed Computing Parallel Graph Algorithms Introduction to Parallel & Distributed Computing Parallel Graph Algorithms Lecture 16, Spring 2014 Instructor: 罗国杰 gluo@pku.edu.cn In This Lecture Parallel formulations of some important and fundamental

More information

a. Assuming a perfect balance of FMUL and FADD instructions and no pipeline stalls, what would be the FLOPS rate of the FPU?

a. Assuming a perfect balance of FMUL and FADD instructions and no pipeline stalls, what would be the FLOPS rate of the FPU? CPS 540 Fall 204 Shirley Moore, Instructor Test November 9, 204 Answers Please show all your work.. Draw a sketch of the extended von Neumann architecture for a 4-core multicore processor with three levels

More information

Graph Theory. Part of Texas Counties.

Graph Theory. Part of Texas Counties. Graph Theory Part of Texas Counties. We would like to visit each of the above counties, crossing each county only once, starting from Harris county. Is this possible? This problem can be modeled as a graph.

More information

Data Communication and Parallel Computing on Twisted Hypercubes

Data Communication and Parallel Computing on Twisted Hypercubes Data Communication and Parallel Computing on Twisted Hypercubes E. Abuelrub, Department of Computer Science, Zarqa Private University, Jordan Abstract- Massively parallel distributed-memory architectures

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

The complexity of Sorting and sorting in linear-time. Median and Order Statistics. Chapter 8 and Chapter 9

The complexity of Sorting and sorting in linear-time. Median and Order Statistics. Chapter 8 and Chapter 9 Subject 6 Spring 2017 The complexity of Sorting and sorting in linear-time Median and Order Statistics Chapter 8 and Chapter 9 Disclaimer: These abbreviated notes DO NOT substitute the textbook for this

More information

How to Write Fast Numerical Code

How to Write Fast Numerical Code How to Write Fast Numerical Code Lecture: Cost analysis and performance Instructor: Markus Püschel TA: Gagandeep Singh, Daniele Spampinato & Alen Stojanov Technicalities Research project: Let us know (fastcode@lists.inf.ethz.ch)

More information

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

Memory Systems and Performance Engineering

Memory Systems and Performance Engineering SPEED LIMIT PER ORDER OF 6.172 Memory Systems and Performance Engineering Fall 2010 Basic Caching Idea A. Smaller memory faster to access B. Use smaller memory to cache contents of larger memory C. Provide

More information

DIVIDE & CONQUER. Problem of size n. Solution to sub problem 1

DIVIDE & CONQUER. Problem of size n. Solution to sub problem 1 DIVIDE & CONQUER Definition: Divide & conquer is a general algorithm design strategy with a general plan as follows: 1. DIVIDE: A problem s instance is divided into several smaller instances of the same

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #12 2/21/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Last class Outline

More information

Data Structure and Algorithm, Spring 2013 Midterm Examination 120 points Time: 2:20pm-5:20pm (180 minutes), Tuesday, April 16, 2013

Data Structure and Algorithm, Spring 2013 Midterm Examination 120 points Time: 2:20pm-5:20pm (180 minutes), Tuesday, April 16, 2013 Data Structure and Algorithm, Spring 2013 Midterm Examination 120 points Time: 2:20pm-5:20pm (180 minutes), Tuesday, April 16, 2013 Problem 1. In each of the following question, please specify if the statement

More information

Parallel Architecture. Sathish Vadhiyar

Parallel Architecture. Sathish Vadhiyar Parallel Architecture Sathish Vadhiyar Motivations of Parallel Computing Faster execution times From days or months to hours or seconds E.g., climate modelling, bioinformatics Large amount of data dictate

More information

Parallel Numerics, WT 2013/ Introduction

Parallel Numerics, WT 2013/ Introduction Parallel Numerics, WT 2013/2014 1 Introduction page 1 of 122 Scope Revise standard numerical methods considering parallel computations! Required knowledge Numerics Parallel Programming Graphs Literature

More information

Memory Hierarchy. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Memory Hierarchy. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Memory Hierarchy Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Time (ns) The CPU-Memory Gap The gap widens between DRAM, disk, and CPU speeds

More information

Lecture 2: Getting Started

Lecture 2: Getting Started Lecture 2: Getting Started Insertion Sort Our first algorithm is Insertion Sort Solves the sorting problem Input: A sequence of n numbers a 1, a 2,..., a n. Output: A permutation (reordering) a 1, a 2,...,

More information

Analysis of Algorithms. Unit 4 - Analysis of well known Algorithms

Analysis of Algorithms. Unit 4 - Analysis of well known Algorithms Analysis of Algorithms Unit 4 - Analysis of well known Algorithms 1 Analysis of well known Algorithms Brute Force Algorithms Greedy Algorithms Divide and Conquer Algorithms Decrease and Conquer Algorithms

More information

Brute Force: Selection Sort

Brute Force: Selection Sort Brute Force: Intro Brute force means straightforward approach Usually based directly on problem s specs Force refers to computational power Usually not as efficient as elegant solutions Advantages: Applicable

More information

cs/ee 143 Fall

cs/ee 143 Fall cs/ee 143 Fall 2018 5 2 Ethernet 2.1 W&P, P3.2 3 Points. Consider the Slotted ALOHA MAC protocol. There are N nodes sharing a medium, and time is divided into slots. Each packet takes up a single slot.

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Introduction to Parallel Computing George Karypis Sorting Outline Background Sorting Networks Quicksort Bucket-Sort & Sample-Sort Background Input Specification Each processor has n/p elements A ordering

More information

Advanced optimizations of cache performance ( 2.2)

Advanced optimizations of cache performance ( 2.2) Advanced optimizations of cache performance ( 2.2) 30 1. Small and Simple Caches to reduce hit time Critical timing path: address tag memory, then compare tags, then select set Lower associativity Direct-mapped

More information

Computer Science 385 Design and Analysis of Algorithms Siena College Spring Topic Notes: Brute-Force Algorithms

Computer Science 385 Design and Analysis of Algorithms Siena College Spring Topic Notes: Brute-Force Algorithms Computer Science 385 Design and Analysis of Algorithms Siena College Spring 2019 Topic Notes: Brute-Force Algorithms Our first category of algorithms are called brute-force algorithms. Levitin defines

More information

Writing Parallel Programs; Cost Model.

Writing Parallel Programs; Cost Model. CSE341T 08/30/2017 Lecture 2 Writing Parallel Programs; Cost Model. Due to physical and economical constraints, a typical machine we can buy now has 4 to 8 computing cores, and soon this number will be

More information

Load Balancing and Termination Detection

Load Balancing and Termination Detection Chapter 7 Load Balancing and Termination Detection 1 Load balancing used to distribute computations fairly across processors in order to obtain the highest possible execution speed. Termination detection

More information

Parallel Exact Inference on the Cell Broadband Engine Processor

Parallel Exact Inference on the Cell Broadband Engine Processor Parallel Exact Inference on the Cell Broadband Engine Processor Yinglong Xia and Viktor K. Prasanna {yinglonx, prasanna}@usc.edu University of Southern California http://ceng.usc.edu/~prasanna/ SC 08 Overview

More information

Topologies. Maurizio Palesi. Maurizio Palesi 1

Topologies. Maurizio Palesi. Maurizio Palesi 1 Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and

More information

CS575 Parallel Processing

CS575 Parallel Processing CS575 Parallel Processing Lecture three: Interconnection Networks Wim Bohm, CSU Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 license.

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 1: Course Overview; Matrix Multiplication Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis I 1 / 21 Outline 1 Course

More information

Parallel Graph Algorithms

Parallel Graph Algorithms Parallel Graph Algorithms Design and Analysis of Parallel Algorithms 5DV050 Spring 202 Part I Introduction Overview Graphsdenitions, properties, representation Minimal spanning tree Prim's algorithm Shortest

More information

The PRAM (Parallel Random Access Memory) model. All processors operate synchronously under the control of a common CPU.

The PRAM (Parallel Random Access Memory) model. All processors operate synchronously under the control of a common CPU. The PRAM (Parallel Random Access Memory) model All processors operate synchronously under the control of a common CPU. The PRAM (Parallel Random Access Memory) model All processors operate synchronously

More information

COMP Analysis of Algorithms & Data Structures

COMP Analysis of Algorithms & Data Structures COMP 3170 - Analysis of Algorithms & Data Structures Shahin Kamali Lecture 6 - Jan. 15, 2018 CLRS 7.1, 7-4, 9.1, 9.3 University of Manitoba COMP 3170 - Analysis of Algorithms & Data Structures 1 / 12 Quick-sort

More information

Parallel Processing IMP Questions

Parallel Processing IMP Questions Winter 14 Summer 14 Winter 13 Summer 13 180702 Parallel Processing IMP Questions Sr Chapter Questions Total 1 3 2 9 3 10 4 9 5 7 What is Data Decomposition? Explain Data Decomposition with proper example.

More information

CS 6143 COMPUTER ARCHITECTURE II SPRING 2014

CS 6143 COMPUTER ARCHITECTURE II SPRING 2014 CS 6143 COMPUTER ARCHITECTURE II SPRING 2014 DUE : April 9, 2014 HOMEWORK IV READ : - Related portions of Chapter 5 and Appendces F and I of the Hennessy book - Related portions of Chapter 1, 4 and 6 of

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

2. True or false: even though BFS and DFS have the same space complexity, they do not always have the same worst case asymptotic time complexity.

2. True or false: even though BFS and DFS have the same space complexity, they do not always have the same worst case asymptotic time complexity. 1. T F: Consider a directed graph G = (V, E) and a vertex s V. Suppose that for all v V, there exists a directed path in G from s to v. Suppose that a DFS is run on G, starting from s. Then, true or false:

More information

Practice Problems for the Final

Practice Problems for the Final ECE-250 Algorithms and Data Structures (Winter 2012) Practice Problems for the Final Disclaimer: Please do keep in mind that this problem set does not reflect the exact topics or the fractions of each

More information

Parallel Computing. Hwansoo Han (SKKU)

Parallel Computing. Hwansoo Han (SKKU) Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo

More information

Problem Score Maximum MC 34 (25/17) = 50 Total 100

Problem Score Maximum MC 34 (25/17) = 50 Total 100 Stony Brook University Midterm 2 CSE 373 Analysis of Algorithms November 22, 2016 Midterm Exam Name: ID #: Signature: Circle one: GRAD / UNDERGRAD INSTRUCTIONS: This is a closed book, closed mouth exam.

More information

Week 7: Assignment Solutions

Week 7: Assignment Solutions Week 7: Assignment Solutions 1. In 6-bit 2 s complement representation, when we subtract the decimal number +6 from +3, the result (in binary) will be: a. 111101 b. 000011 c. 100011 d. 111110 Correct answer

More information

7 Distributed Data Management II Caching

7 Distributed Data Management II Caching 7 Distributed Data Management II Caching In this section we will study the approach of using caching for the management of data in distributed systems. Caching always tries to keep data at the place where

More information

Dynamic Programming II

Dynamic Programming II Lecture 11 Dynamic Programming II 11.1 Overview In this lecture we continue our discussion of dynamic programming, focusing on using it for a variety of path-finding problems in graphs. Topics in this

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK UNIT-III. SUB NAME: DESIGN AND ANALYSIS OF ALGORITHMS SEM/YEAR: III/ II PART A (2 Marks)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK UNIT-III. SUB NAME: DESIGN AND ANALYSIS OF ALGORITHMS SEM/YEAR: III/ II PART A (2 Marks) DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK UNIT-III SUB CODE: CS2251 DEPT: CSE SUB NAME: DESIGN AND ANALYSIS OF ALGORITHMS SEM/YEAR: III/ II PART A (2 Marks) 1. Write any four examples

More information

Total Score /15 /20 /30 /10 /5 /20 Grader

Total Score /15 /20 /30 /10 /5 /20 Grader NAME: NETID: CS2110 Fall 2009 Prelim 2 November 17, 2009 Write your name and Cornell netid. There are 6 questions on 8 numbered pages. Check now that you have all the pages. Write your answers in the boxes

More information

9 Distributed Data Management II Caching

9 Distributed Data Management II Caching 9 Distributed Data Management II Caching In this section we will study the approach of using caching for the management of data in distributed systems. Caching always tries to keep data at the place where

More information