University of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2.
Solutions

Problem 1 (Problem 3.12 in CSG)

(a) Clearly the partitioning can be done statically. The main questions that must be addressed are: (i) whether to decompose in terms of rows, columns, or rectangular blocks/tiles, and (ii) whether to assign these units in contiguous chunks or in an interleaved manner. Let's assume that we partition the source and destination matrices in the same way, which is often what we have to do in many applications (because this is probably just a phase of a more complex application), and that a processor writes only the data in its partition of the destination matrix. The necessary elements of the source matrix must be read by each processor. Suppose we partition the destination matrix into chunks of contiguous rows, each partition holding n/p rows for an n-by-n matrix and p processors. To obtain the data to be transposed into its n/p rows, a processor must read a set of n/p contiguous columns of the source matrix. Since the source matrix is partitioned into the same sets of contiguous rows as the destination matrix, the processor will obtain square chunks of size n/p-by-n/p from each other processor (and a single such square from itself). These chunks must be transposed in the process. There is little reason not to use contiguous partitions. For example, in an interleaved row-wise partition, when a process reads a single column from the source matrix, it does not exploit spatial locality effectively (assuming a row-major language like C); spatial locality is exploited only in the row being written in the destination. This is true for both programming models, though the problem is more acute in a shared address space, since the reads from the source matrix are mostly remote reads and many unnecessary data transfers occur when reading cache blocks.
In message-passing environments, interleaved partitions can make it somewhat more complex to transfer the data that one processor needs from another in a single message (more scattering and gathering of data are required). So contiguous partitions are better (note that the same reasoning would apply if the destination matrix were partitioned into chunks of columns, with a similar effect in a column-major language such as Fortran). Other than the above spatial-locality and scatter/gather considerations, it does not matter too much whether the programming model is a shared address space or message passing. One difference is that in a shared address space the transposition of the n/p-by-n/p subgrids is done during the remote reading and local writing process, with no extra copy of the data, while in message passing the subgrid is usually transposed locally before being transferred, which may incur some extra overhead. Another difference is that in a shared address space the communication is receiver-initiated (via remote reads), while in message passing it is usually sender-initiated (the process owning a source subgrid sends it to the destination process, which then writes it locally), so some latency is hidden from the receiver. Also, a single message transfers all the data, rather than transferring it cache line by cache line in a request-response fashion. In the message-passing case, the data may
end up in the destination cache or only in its local memory, depending on how the system implements it.

(b) It is called all-to-all personalized communication since each process communicates with every other process, but communicates different data with every other process (here a different n/p-by-n/p subgrid). This is unlike a p-way broadcast, in which each process also sends data to every other process, but sends the same data to every one of them.

(c) Shared memory:

    int mymin = pid * (n/nprocs);
    int mymax = mymin + n/nprocs - 1;
    for (k = mymin; k <= mymax; k++)
        for (i = 0; i < n; i++)
            B[k][i] = A[i][k];

Message passing: we use two arrays, STR and RTR, to send and receive blocks of matrix A. Let m = n/nprocs be the block dimension.

    mya <- malloc(a 2-D array of size m by n);  /* contiguous row partition of A */
    myb <- malloc(a 2-D array of size m by n);  /* contiguous row partition of B */
    STR <- malloc(a 2-D array of size m by m);  /* send buffer */
    RTR <- malloc(a 2-D array of size m by m);  /* receive buffer */
    initialize(mya);
    for (k = 0; k < nprocs; k++) {
        if (k == pid) {
            /* the diagonal submatrix is transposed locally */
            for (j = 0; j < m; j++)
                for (i = 0; i < m; i++)
                    myb[i][pid*m + j] = mya[j][pid*m + i];
        } else {
            /* transpose the block destined for process k into STR */
            for (j = 0; j < m; j++)
                for (i = 0; i < m; i++)
                    STR[i][j] = mya[j][k*m + i];
            SEND(&STR[0][0], m*m*sizeof(float), k);
            RECEIVE(&RTR[0][0], m*m*sizeof(float), k);
            /* copy the received, already-transposed block into B's partition */
            for (j = 0; j < m; j++)
                for (i = 0; i < m; i++)
                    myb[i][k*m + j] = RTR[i][j];
        }
    }

Note that for the message-passing program to work, the message-passing primitives must be nonblocking. Besides the issues raised above in (a), the order in which the all-to-all personalized communication is done is critical. A naive way is for a given process (say process i) to start communicating with process 0 and go on to process p-1 in a for loop. This causes a lot of contention, first at process 0 when all processes try to communicate with it, then at process
1, and so on. The solution is for process i to start with process i+1, then i+2 up to p-1, then process 0, and so on until process i-1.

(d) Blocking is usually done for temporal locality, and works if the data in a block are reused. In the matrix transpose there is no reuse of data elements. However, blocking can still be useful, since cache blocks are reused when different elements within them are accessed. Consider a row-major language, to be concrete. If the source matrix is traversed down a column, each consecutive read brings a new block into the cache. If the cache is too small, we may start evicting blocks as we go down the same column. This is bad because those blocks will be read again when accessing the next column. With blocking, we go down a column only up to a point such that the accessed blocks comfortably fit in the cache; then we start at the beginning of the next column. When we have traversed the first elements of B consecutive columns, we go back to the first column and process the following elements.

Problem 2 (Problem 4.11 in CSG)

The target problem size is 8K x 8K x 8 bytes = 512MB. The per-process subarray size is 512MB/256 = 2MB. The cache of the target system is smaller than that (it is 256KB, one eighth of 2MB). Thus consider the next working set: three rows of the subarray in each process. Its size is 8K x 3 x 8/256 bytes < 1KB, which is very small with respect to the cache size (less than 1/256 of it). The simulated problem is 512 x 512 x 8 bytes = 2MB, so the per-processor working set is 2MB/64 = 32KB, and the size of three rows is 512 x 3 x 8/64 bytes < 256 bytes.

(a) We should pick cache sizes in proportion to the working-set size per processor in both situations. Consider the larger working set: to scale the cache with respect to it we should pick a cache size of 256KB x 32KB/2MB = 4KB. A 4KB cache will contain the smaller working set 16 times over in the simulated system, whereas the 256KB cache contains the smaller working set 256 times over in the target system. We could consider reducing the cache size further, to say 1KB, in the simulated system.
But then the cache becomes dangerously small (see (b)). So we will stick to 4KB and compensate by decreasing the block size.

(b) When the cache is scaled down it can become so small that the number of blocks in the cache is very small. In this case, a 4KB cache contains 4K/128 = 32 blocks. This exacerbates the problems due to spatial locality. Looking at capacity misses, if the cache size is smaller, then the best block size must be smaller. Looking at coherence misses, we see that the cost of communication will increase in the smaller cache. Moreover, cache conflicts may also become a problem, because the smaller the cache, the more opportunities for conflicts.

(c) The selected cache is not big enough to contain the partition of the grid, but it is big enough to contain a few rows. Moreover, because the array is organized as a 4-D array, false sharing is not a problem. Because we need to reload blocks as we scan the grid, one possible problem is the number of misses when reloading a row of the subarray. In the target system the size of a row of a grid partition is 512 x 8/128 = 32 blocks, giving 32 misses when reloading a row. On the other hand, a row of the subgrid is 64 x 8/128 = 4 blocks in the simulated system, giving only 4 misses to reload a row! To get the same number of misses we should reduce the block size by a factor of 8 in the simulated system and use 16 bytes instead of 128 bytes.
This also maintains the same ratio of communication (boundary) to computation (internal) block transfers, since two blocks are communicated at the boundary and 32 internal blocks are reloaded for each row of the subgrid. As a result of adopting a 16-byte block size in the simulation, the number of blocks in the simulated cache is 4K/16 = 256. This compares with 256K/128 = 2K blocks in the target system. It would possibly be useful to use a two-way set-associative cache in the simulated system, because conflicts may be much higher in a direct-mapped simulated cache than in the direct-mapped target cache.

(d) I would not try to approximate the results of the simulated run with a target run using the real workload, and the speedup numbers will probably be totally off. I would use the reduced, simulated system to evaluate possible improvements to the target machine.

Problem 3 (Problem 5.1 in CSG)

Register allocation is a problem in a multiprocessor system. Consider the case of a processor spinning on a flag variable, waiting for another processor to write that variable. If the former processor allocates the variable in a register, then it will never see the writes done to the variable by the latter and will continue to spin-wait forever. This applies to several other cases of writes to shared data not being seen by other processors because they have register-allocated these data. A possible solution is to never register-allocate shared data, but this is very restrictive. The solution that current systems use is to require that variables that should not be allocated in registers be declared volatile, which ensures that the compiler will not register-allocate them.

Problem 4 (Problem 5.5 in CSG)

a. Update protocols become MORE preferable, because communication is implemented by a sequence of an invalidation followed by a miss in an invalidate protocol, whereas a single update suffices in a write-update protocol.

b.
If we get an update in the second-level cache, then we must either invalidate or update the first-level cache. If we invalidate the FLC, then the block must be reloaded on an access to any word in the block. So, from a performance point of view, updating the first-level cache is better than invalidating it.

c. Because there is no sharing at the user level. Update protocols are only good for cases of intense read/write sharing.

d. The good thing about page-level adaptation of the protocol is that it is easy to implement. For example, one of the state bits in the page table and TLB may be an update-vs-invalidate bit. A memory access first reaches the TLB, where the bit is looked up and the access is treated accordingly. On the other hand, the special opcodes are a hassle, because they require an extension of the instruction set. However, the same effect could be obtained by using one address bit to specify the mode. If the compiler can effectively discriminate between writes that should update and writes that should invalidate, the opcode approach is much more precise and effective, and the page-based approach is too coarse.
Problem 5 (Problem 5.6 in CSG)

(a) There are many possible SC interleavings here, so we will just take each possible output of the program and verify that there is at least one SC interleaving producing it. In the notation below, SiX(v) denotes Pi's store of value v to location X, and LiX(v) denotes Pi's load of X returning value v. There are 8 possible program outcomes:

1. (u,v,w)=(0,0,0). One SC interleaving: L2A(0) => L3B(0) => L3A(0) => S1A(1) => S2B(1).
2. (u,v,w)=(0,0,1). One SC interleaving: L2A(0) => L3B(0) => S1A(1) => L3A(1) => S2B(1).
3. (u,v,w)=(0,1,0). L2A(0) => S2B(1) => L3B(1) => L3A(0) => S1A(1).
4. (u,v,w)=(0,1,1). L2A(0) => S2B(1) => L3B(1) => S1A(1) => L3A(1).
5. (u,v,w)=(1,0,0). L3B(0) => L3A(0) => S1A(1) => L2A(1) => S2B(1).
6. (u,v,w)=(1,0,1). L3B(0) => S1A(1) => L2A(1) => S2B(1) => L3A(1).
7. (u,v,w)=(1,1,1). S1A(1) => L2A(1) => S2B(1) => L3B(1) => L3A(1).
8. (u,v,w)=(1,1,0). This is impossible, which can be proven as follows. An SC execution must conform to both coherence order (co) and program order (po). Because of co, we must have S1A(1) => L2A(1) (since u=1), S2B(1) => L3B(1) (since v=1), and L3A(0) => S1A(1) (since w=0). Because of po, we must have L3B(1) => L3A(0) and L2A(1) => S2B(1). So S1A(1) => L2A(1) => S2B(1) => L3B(1) => L3A(0) => S1A(1): a cycle in the execution graph, so the execution cannot be SC. Basically, P2 writes the new value of B after observing the new value of A. If P3 observes the new value of B written by P2 and then the old value of A, P2 and P3 have observed the two writes in different orders, a violation of SC.

We could have discarded off the bat all outcomes ending in 00 or 11, because in both cases P3 only reads values and cannot distinguish between new and old values (its accesses can be inserted at the beginning or the end of the sequence). This would have removed 4 cases.

(b) This case has 16 possible execution outcomes. Let's try to weed out some obvious ones.
To do that, we look at interleavings in which the two accesses of P2, and likewise the two accesses of P4, directly follow each other. So:

1. P1,P2,P3,P4 yields (1,0,1,1)
2. P1,P2,P4,P3 yields (1,0,0,1)
3. P1,P3,P2,P4 yields (1,1,1,1)
4. P1,P4,P3,P2 yields (1,1,0,1)
5. P2,P1,P3,P4 yields (0,0,1,1)
6. P2,P1,P4,P3 yields (0,0,0,1)
7. P2,P4,P1,P3 yields (0,0,0,0)
8. P3,P2,P1,P4 yields (0,1,1,1)
9. P3,P2,P4,P1 yields (0,1,1,0)
10. P4,P3,P1,P2 yields (1,1,0,0)

Now we are down to 6 outcomes, which we explore systematically:

11. Outcome (0,0,1,0). One possible SC execution is L2A(0) => L2B(0) => S3B(1) => L4B(1) => L4A(0) => S1A(1).
12. Outcome (0,1,0,0). One possible SC execution is L2A(0) => L4B(0) => S3B(1) => L2B(1) => L4A(0) => S1A(1).
13. Outcome (0,1,0,1). One possible SC execution is L2A(0) => L4B(0) => S3B(1) => L2B(1) => S1A(1) => L4A(1).
14. Outcome (1,1,1,0). One possible SC execution is S3B(1) => L4B(1) => L4A(0) => S1A(1) => L2A(1) => L2B(1).
15. Outcome (1,0,0,0). One possible SC execution is L4B(0) => L4A(0) => S1A(1) => L2A(1) => L2B(0) => S3B(1).
16. Outcome (1,0,1,0). This is impossible, which can be proven as follows. An SC execution must conform to both coherence order (co) and program order (po). Because of co, we must have S1A(1) => L2A(1), L4A(0) => S1A(1), S3B(1) => L4B(1), and L2B(0) => S3B(1). Because of po, we must have L2A(1) => L2B(0) and L4B(1) => L4A(0). So S1A(1) => L2A(1) => L2B(0) => S3B(1) => L4B(1) => L4A(0) => S1A(1). There is a cycle in the execution graph, so the execution cannot be SC. Again, what happens here is that P2 and P4 observe the two stores in different orders.

We could have solved this problem much faster by noticing that violations of SC could only be detected by P2 and P4, and that the only stores, by P1 and P3, are unordered. Thus, unless both P2 and P4 can distinguish between the old and the new values, SC cannot be violated. The following outcomes could have been discarded off the bat: 1) any outcome starting with 0,0 is SC, since it implies that P2 did not observe either of the two stores and so could not distinguish between old and new values; the same holds for any outcome ending with 0,0.
2) Any outcome starting with 1,1 is also SC, because P2 observed both new values and therefore cannot tell the difference between old and new; the same holds for any outcome ending with 1,1. This strategy would have eliminated 12 outcomes and left us with only 4 to explore further: (0,1,0,1), (1,0,1,0), (0,1,1,0), and (1,0,0,1).

(c) First case: the codes for P1 and P2 are atomic (F&A). The only possible SC executions are P1,P2 or P2,P1, which means A=2 and (u,v)=(0,1) or (1,0). The other outcomes, (u,v)=(1,1), (0,0), and any outcome with a value of u or v of 2, are impossible.

Second case: the codes are not atomic. (0,1) and (1,0) are still possible SC outcomes. (0,0) is also possible, via L1A(0) => L2A(0) => S1A(1) => S2A(1). Any outcome in which neither u nor v is 0 is impossible, because at least one of the two loads must be performed before either store can be performed in any SC interleaving, due to program order.
Problem 6 (Problem 5.20 in CSG)

The problem with this code is that one process could be delayed in the while loop after executing the fetch&add. During this delay it is possible that all other processes would go through the barrier and one of them would reach the barrier again. As a result, the delayed process would end up one iteration behind all the other processes, an unintended outcome. A number of solutions are possible. One is the use of a toggle flag to differentiate between consecutive iterations:

    global boolean flag := true;

    BARRIER(B: BarVariable, N: Integer) {
        boolean local_flag := not flag;
        if (F&A(B,1) == N-1) {
            B := 0;               /* last arriver resets the counter ... */
            flag := local_flag;   /* ... and flips the global flag */
        }
        while (flag != local_flag) do {};
    }

Here again a process could get delayed in the while loop. However, now no process can update the global flag before all processes have passed the barrier completely and finished their next iteration.

Problem 7 (Problem 5.26 in CSG)

    Ref   P1      P2      P3      Miss classification
    1     st w0           st w7   P1 and P3 miss
    2     ld w6   ld w2           P1 & P2 miss; P1@1: pure cold
    3             ld w7           P2 miss; P2@2: cold false sharing
    4     ld w2   ld w0           P1 & P2 miss; P1@2: cold false sharing; P2@3: cold true sharing
    5             st w2           P2 upgrade; P1@4: pure capacity
    6     ld w2                   P1 miss
    7     st w2   ld w5   ld w5   P2 miss; P1@6: pure true sharing; P2@4: capacity true sharing
    8     st w5                   P1 miss
    9             ld w3   ld w7   P2 & P3 miss; P2@7: pure capacity; P3@1: pure cold
    10            ld w6   ld w2   P2 & P3 miss; P2@9: pure capacity; P3@9: pure false sharing
    11            ld w2   st w7   P2 & P3 miss; P2@10: pure capacity; P3@10: cold true sharing
    12    ld w7                   P1 misses
    13    ld w2                   P1 misses; P1@12: pure true sharing
    14            ld w5           P2 misses; P2@11: pure capacity
    15                    ld w2   P3 misses; P3@11: pure capacity
Parallel Poisson Solver in Fortran Nilas Mandrup Hansen, Ask Hjorth Larsen January 19, 1 1 Introduction In this assignment the D Poisson problem (Eq.1) is to be solved in either C/C++ or FORTRAN, first
More information1. Memory technology & Hierarchy
1 Memory technology & Hierarchy Caching and Virtual Memory Parallel System Architectures Andy D Pimentel Caches and their design cf Henessy & Patterson, Chap 5 Caching - summary Caches are small fast memories
More informationAnnouncements. ! Previous lecture. Caches. Inf3 Computer Architecture
Announcements! Previous lecture Caches Inf3 Computer Architecture - 2016-2017 1 Recap: Memory Hierarchy Issues! Block size: smallest unit that is managed at each level E.g., 64B for cache lines, 4KB for
More informationLecture 1: Introduction
Lecture 1: Introduction ourse organization: 13 lectures on parallel architectures ~5 lectures on cache coherence, consistency ~3 lectures on TM ~2 lectures on interconnection networks ~2 lectures on large
More informationLecture: Coherence Protocols. Topics: wrap-up of memory systems, multi-thread programming models, snooping-based protocols
Lecture: Coherence Protocols Topics: wrap-up of memory systems, multi-thread programming models, snooping-based protocols 1 Future Memory Trends pin count is not increasing High memory bandwidth requires
More informationCS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck
Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find
More informationwrite-through v. write-back write-through v. write-back write-through v. write-back option 1: write-through write 10 to 0xABCD CPU RAM Cache ABCD: FF
write-through v. write-back option 1: write-through 1 write 10 to 0xABCD CPU Cache ABCD: FF RAM 11CD: 42 ABCD: FF 1 2 write-through v. write-back option 1: write-through write-through v. write-back option
More information2-Level Page Tables. Virtual Address Space: 2 32 bytes. Offset or Displacement field in VA: 12 bits
-Level Page Tables Virtual Address (VA): bits Offset or Displacement field in VA: bits Virtual Address Space: bytes Page Size: bytes = KB Virtual Page Number field in VA: - = bits Number of Virtual Pages:
More informationOptimising for the p690 memory system
Optimising for the p690 memory Introduction As with all performance optimisation it is important to understand what is limiting the performance of a code. The Power4 is a very powerful micro-processor
More informationMultiprocessors and Locking
Types of Multiprocessors (MPs) Uniform memory-access (UMA) MP Access to all memory occurs at the same speed for all processors. Multiprocessors and Locking COMP9242 2008/S2 Week 12 Part 1 Non-uniform memory-access
More informationLecture 11: Snooping Cache Coherence: Part II. CMU : Parallel Computer Architecture and Programming (Spring 2012)
Lecture 11: Snooping Cache Coherence: Part II CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Announcements Assignment 2 due tonight 11:59 PM - Recall 3-late day policy Assignment
More informationMemory Management. Goals of this Lecture. Motivation for Memory Hierarchy
Memory Management Goals of this Lecture Help you learn about: The memory hierarchy Spatial and temporal locality of reference Caching, at multiple levels Virtual memory and thereby How the hardware and
More informationCS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14
CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2014 Lecture 14 LAST TIME! Examined several memory technologies: SRAM volatile memory cells built from transistors! Fast to use, larger memory cells (6+ transistors
More informationProgramming as Successive Refinement. Partitioning for Performance
Programming as Successive Refinement Not all issues dealt with up front Partitioning often independent of architecture, and done first View machine as a collection of communicating processors balancing
More informationENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013
ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 Professor: Sherief Reda School of Engineering, Brown University 1. [from Debois et al. 30 points] Consider the non-pipelined implementation of
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationEmbedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi
Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 13 Virtual memory and memory management unit In the last class, we had discussed
More informationCache and Virtual Memory Simulations
Cache and Virtual Memory Simulations Does it really matter if you pull a USB out before it safely ejects? Data structure: Cache struct Cache { }; Set *sets; int set_count; int line_count; int block_size;
More informationCS 433 Homework 5. Assigned on 11/7/2017 Due in class on 11/30/2017
CS 433 Homework 5 Assigned on 11/7/2017 Due in class on 11/30/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration.
More informationRecall: Address Space Map. 13: Memory Management. Let s be reasonable. Processes Address Space. Send it to disk. Freeing up System Memory
Recall: Address Space Map 13: Memory Management Biggest Virtual Address Stack (Space for local variables etc. For each nested procedure call) Sometimes Reserved for OS Stack Pointer Last Modified: 6/21/2004
More informationTutorial 11. Final Exam Review
Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache
More informationReducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip
Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off
More informationLecture: Memory, Coherence Protocols. Topics: wrap-up of memory systems, intro to multi-thread programming models
Lecture: Memory, Coherence Protocols Topics: wrap-up of memory systems, intro to multi-thread programming models 1 Refresh Every DRAM cell must be refreshed within a 64 ms window A row read/write automatically
More informationA Comparison of Unified Parallel C, Titanium and Co-Array Fortran. The purpose of this paper is to compare Unified Parallel C, Titanium and Co-
Shaun Lindsay CS425 A Comparison of Unified Parallel C, Titanium and Co-Array Fortran The purpose of this paper is to compare Unified Parallel C, Titanium and Co- Array Fortran s methods of parallelism
More informationCS3350B Computer Architecture
CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &
More informationLecture: Memory Technology Innovations
Lecture: Memory Technology Innovations Topics: state-of-the-art and upcoming changes: buffer chips, 3D stacking, non-volatile cells, photonics Multiprocessor intro 1 Modern Memory System...... PROC.. 4
More informationCISC 360. Cache Memories Nov 25, 2008
CISC 36 Topics Cache Memories Nov 25, 28 Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance Cache Memories Cache memories are small, fast SRAM-based
More information: How to Write Fast Numerical Code ETH Computer Science, Spring 2016 Midterm Exam Wednesday, April 20, 2016
ETH login ID: (Please print in capital letters) Full name: 263-2300: How to Write Fast Numerical Code ETH Computer Science, Spring 2016 Midterm Exam Wednesday, April 20, 2016 Instructions Make sure that
More informationDesign of Parallel Algorithms. Models of Parallel Computation
+ Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes
More informationSystems Programming and Computer Architecture ( ) Timothy Roscoe
Systems Group Department of Computer Science ETH Zürich Systems Programming and Computer Architecture (252-0061-00) Timothy Roscoe Herbstsemester 2016 AS 2016 Caches 1 16: Caches Computer Architecture
More informationMemory Consistency. Challenges. Program order Memory access order
Memory Consistency Memory Consistency Memory Consistency Reads and writes of the shared memory face consistency problem Need to achieve controlled consistency in memory events Shared memory behavior determined
More informationCIS Operating Systems Memory Management Cache and Demand Paging. Professor Qiang Zeng Spring 2018
CIS 3207 - Operating Systems Memory Management Cache and Demand Paging Professor Qiang Zeng Spring 2018 Process switch Upon process switch what is updated in order to assist address translation? Contiguous
More informationMemory Hierarchy. Bojian Zheng CSCD70 Spring 2018
Memory Hierarchy Bojian Zheng CSCD70 Spring 2018 bojian@cs.toronto.edu 1 Memory Hierarchy From programmer s point of view, memory has infinite capacity (i.e. can store infinite amount of data) has zero
More informationChapter 8 & Chapter 9 Main Memory & Virtual Memory
Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array
More informationThe MESI State Transition Graph
Small-scale shared memory multiprocessors Semantics of the shared address space model (Ch. 5.3-5.5) Design of the M(O)ESI snoopy protocol Design of the Dragon snoopy protocol Performance issues Synchronization
More informationCaching and Buffering in HDF5
Caching and Buffering in HDF5 September 9, 2008 SPEEDUP Workshop - HDF5 Tutorial 1 Software stack Life cycle: What happens to data when it is transferred from application buffer to HDF5 file and from HDF5
More informationChapter 2: Memory Hierarchy Design Part 2
Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental
More informationCache Memories October 8, 2007
15-213 Topics Cache Memories October 8, 27 Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance The memory mountain class12.ppt Cache Memories Cache
More informationAgenda. Cache-Memory Consistency? (1/2) 7/14/2011. New-School Machine Structures (It s a bit more complicated!)
7/4/ CS 6C: Great Ideas in Computer Architecture (Machine Structures) Caches II Instructor: Michael Greenbaum New-School Machine Structures (It s a bit more complicated!) Parallel Requests Assigned to
More informationCSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1
CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson
More informationLecture 12. Memory Design & Caches, part 2. Christos Kozyrakis Stanford University
Lecture 12 Memory Design & Caches, part 2 Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee108b 1 Announcements HW3 is due today PA2 is available on-line today Part 1 is due on 2/27
More informationCIS Operating Systems Memory Management Cache. Professor Qiang Zeng Fall 2017
CIS 5512 - Operating Systems Memory Management Cache Professor Qiang Zeng Fall 2017 Previous class What is logical address? Who use it? Describes a location in the logical memory address space Compiler
More informationCache Memory: Instruction Cache, HW/SW Interaction. Admin
Cache Memory Instruction Cache, HW/SW Interaction Computer Science 104 Admin Project Due Dec 7 Homework #5 Due November 19, in class What s Ahead Finish Caches Virtual Memory Input/Output (1 homework)
More informationCSE-160 (Winter 2017, Kesden) Practice Midterm Exam. volatile int count = 0; // volatile just keeps count in mem vs register
Full Name: @ucsd.edu PID: CSE-160 (Winter 2017, Kesden) Practice Midterm Exam 1. Threads, Concurrency Consider the code below: volatile int count = 0; // volatile just keeps count in mem vs register void
More informationThe course that gives CMU its Zip! Memory System Performance. March 22, 2001
15-213 The course that gives CMU its Zip! Memory System Performance March 22, 2001 Topics Impact of cache parameters Impact of memory reference patterns memory mountain range matrix multiply Basic Cache
More informationSE-292 High Performance Computing. Memory Hierarchy. R. Govindarajan Memory Hierarchy
SE-292 High Performance Computing Memory Hierarchy R. Govindarajan govind@serc Memory Hierarchy 2 1 Memory Organization Memory hierarchy CPU registers few in number (typically 16/32/128) subcycle access
More informationParallel Programming in C with MPI and OpenMP
Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 8 Matrix-vector Multiplication Chapter Objectives Review matrix-vector multiplication Propose replication of vectors Develop three
More informationCSC266 Introduction to Parallel Computing using GPUs Optimizing for Caches
CSC266 Introduction to Parallel Computing using GPUs Optimizing for Caches Sreepathi Pai October 4, 2017 URCS Outline Cache Performance Recap Data Layout Reuse Distance Besides the Cache Outline Cache
More information