University of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2.
Solutions

Problem 1 (Problem 3.12 in CSG)

(a) Clearly the partitioning can be done statically. The main questions that must be addressed are: (i) whether to decompose in terms of rows, columns, or rectangular blocks/tiles, and (ii) whether to assign these units in contiguous chunks or in an interleaved manner. Let's assume that we partition the source and destination matrices in the same way, which is often what we have to do in many applications (because this is probably just a phase of a more complex application), and that a processor writes only the data in its partition of the destination matrix. The necessary elements of the source matrix must be read by each processor. Suppose we partition the destination matrix into chunks of contiguous rows, each partition holding n/p rows for an n-by-n matrix and p processors. To obtain the data to be transposed into its n/p rows, a processor must read a set of n/p contiguous columns of the source matrix. Since the source matrix is partitioned into the same sets of contiguous rows as the destination matrix, the processor will obtain square chunks of size n/p-by-n/p from each other processor (and a single such square from itself). These chunks must be transposed in the process. There is little reason not to use contiguous partitions. For example, in an interleaved row-wise partition, when a process reads a single column from the source matrix, it does not exploit spatial locality effectively (assuming a row-major language like C); spatial locality is exploited only in the row being written in the destination. This is true for both programming models, though the problem is more acute in a shared address space, since the reads from the source matrix are mostly remote reads and many unnecessary data transfers occur when reading cache blocks.
In message-passing environments, interleaved partitions can make it somewhat more complex to transfer the data that one processor needs from another in a single message (more scattering and gathering of data are required). So contiguous partitions are better (note that the same reasoning would apply if the destination matrix were partitioned into chunks of columns, with a similar effect in a column-major language such as Fortran). Other than the above spatial-locality and scatter/gather considerations, it does not matter too much whether the programming model is a shared address space or message passing. One difference is that in a shared address space the transposition of the n/p-by-n/p subgrids is done during the remote reading and local writing process, with no extra copy of the data, while in message passing the subgrid is usually transposed locally before being transferred, which may incur some extra overhead. Another difference is that in a shared address space the communication is receiver-initiated (via remote reads), while in message passing it is usually sender-initiated (the process owning a source subgrid sends it to the destination process, which then writes it locally), so some latency is hidden from the receiver. Also, a single message transfers all the data, rather than transferring it cache line by cache line in a request-response fashion. In the message-passing case, the data may
end up in the destination cache or only in its local memory, depending on how the system implements it.

(b) It is called all-to-all personalized communication since each process communicates with every other process, but communicates different data with every other process (here a different n/p-by-n/p subgrid). This is unlike a p-way broadcast, in which each process also sends data to every other process, but sends the same data to every one of them.

(c) Shared memory:

    int mymin = pid * (n/nprocs);
    int mymax = mymin + n/nprocs - 1;
    for (k = mymin; k <= mymax; k++)
        for (i = 0; i < n; i++)
            B[k][i] = A[i][k];

Message passing: we use two arrays, STR and RTR, to send and receive blocks of matrix A. Let m = n/nprocs be the block dimension.

    mya <- malloc(a 2-D array of size m by n);  /* contiguous row partition of A */
    myb <- malloc(a 2-D array of size m by n);  /* contiguous row partition of B */
    STR <- malloc(a 2-D array of size m by m);  /* send buffer */
    RTR <- malloc(a 2-D array of size m by m);  /* receive buffer */
    initialize(mya);
    for (k = 0; k < nprocs; k++) {
        if (k == pid) {
            /* the diagonal submatrix is transposed locally */
            for (j = 0; j < m; j++)
                for (i = 0; i < m; i++)
                    myb[i][pid*m + j] = mya[j][pid*m + i];
        } else {
            /* transpose the block destined for process k into STR */
            for (j = 0; j < m; j++)
                for (i = 0; i < m; i++)
                    STR[i][j] = mya[j][k*m + i];
            SEND(&STR[0][0], m*m*sizeof(float), k);
            RECEIVE(&RTR[0][0], m*m*sizeof(float), k);
            /* copy the received, already-transposed block into B's partition */
            for (j = 0; j < m; j++)
                for (i = 0; i < m; i++)
                    myb[i][k*m + j] = RTR[i][j];
        }
    }

Note that for the message-passing program to work, the message-passing primitives must be nonblocking. Besides the issues raised above in (a), the order in which the all-to-all personalized communication is done is critical. A naive way is for a given process (say process i) to start communicating with process 0 and go on to process p-1 in a for loop. This causes a lot of contention, first at process 0 when all processes try to communicate with it, then at process
1, and so on. The solution is for process i to start with process i+1, then i+2 up to p-1, then process 0, and so on until process i-1.

(d) Blocking is usually done for temporal locality, and works if the data in a block are reused. In the matrix transpose there is no reuse of data elements. However, blocking can still be useful, since cache blocks are reused when different elements within them are accessed. Consider a row-major language, to be concrete. If the source matrix is traversed down a column, each consecutive read brings a new block into the cache. If the cache is too small, we may start evicting blocks as we go down the same column. This is bad because those blocks will be read again when accessing the next column. With blocking, we go down a column only up to a point such that the accessed blocks comfortably fit in the cache; then we start at the beginning of the next column. When we have traversed the first elements of B consecutive columns, we go back to the first column and process the following elements.

Problem 2 (Problem 4.11 in CSG)

The target problem size is 8K x 8K x 8 bytes = 512MB. The per-process subarray size is 512MB/256 = 2MB. The cache of the target system is smaller than that (it is 256KB, one eighth of 2MB). Thus consider the next working set: three rows of the subarray in each process. Its size is 8K x 3 x 8/256 bytes < 1KB, which is very small with respect to the cache size (less than 1/256 of it). The simulated problem is 512 x 512 x 8 bytes = 2MB, so the per-processor working set is 2MB/64 = 32KB, and the size of three rows is 512 x 3 x 8/64 bytes < 256 bytes.

(a) We should pick cache sizes in proportion to the working-set size per processor in both situations. Consider the larger working set: to scale the cache with respect to it we should pick a cache size of 256KB x 32KB/2MB = 4KB. A 4KB cache will contain the smaller working set 16 times over in the simulated system, whereas the 256KB cache contains the smaller working set 256 times over in the target system. We could consider reducing the cache size further, to say 1KB, in the simulated system.
But then the cache becomes dangerously small (see (b)). So we will stick to 4KB and compensate by decreasing the block size.

(b) When the cache is scaled down it can become so small that the number of blocks in the cache is very small. In this case, a 4KB cache contains 4K/128 = 32 blocks. This exacerbates the problems due to spatial locality. Looking at capacity misses, if the cache size is smaller, then the best block size must be smaller. Looking at coherence misses, we see that the cost of communication will increase in the smaller cache. Moreover, cache conflicts may also become a problem, because the smaller the cache, the more opportunities for conflicts.

(c) The selected cache is not big enough to contain the partition of the grid, but it is big enough to contain a few rows. Moreover, because the array is organized as a 4-D array, false sharing is not a problem. Because we need to reload blocks as we scan the grid, one possible problem is the number of misses when reloading a row of the subarray. In the target system the size of a row of a grid partition is 512 x 8/128 = 32 blocks, giving 32 misses when reloading a row. On the other hand, a row of the subgrid is 64 x 8/128 = 4 blocks in the simulated system, giving only 4 misses to reload a row! To get the same number of misses we should reduce the block size by a factor of 8 in the simulated system and use 16 bytes instead of 128 bytes.
This also maintains the same ratio of communication (boundary) to computation (internal) block transfers, since two blocks are communicated at the boundary and 32 internal blocks are reloaded for each row of the subgrid. As a result of adopting a 16-byte block size in the simulation, the number of blocks in the simulated cache is 4K/16 = 256. This compares with 256K/128 = 2K blocks in the target system. It would possibly be useful to use a two-way set-associative cache in the simulated system, because conflicts may be much higher in a direct-mapped simulated cache than in the direct-mapped target cache.

(d) I would not try to approximate the results of the simulated run with a target run using the real workload, and the speedup numbers will probably be totally off. I would use the reduced, simulated system to evaluate possible improvements to the target machine.

Problem 3 (Problem 5.1 in CSG)

Register allocation is a problem in a multiprocessor system. Consider the case of a processor spinning on a flag variable, waiting for another processor to write that variable. If the former processor allocates the variable in a register, then it will never see the writes done to the variable by the latter and will continue to spin-wait forever. This applies to several other cases of writes to shared data not being seen by other processors because they have register-allocated these data. A possible solution is to never register-allocate shared data, but this is very restrictive. The solution that current systems use is to require that variables that should not be allocated in registers be declared volatile, which ensures that the compiler will not register-allocate them.

Problem 4 (Problem 5.5 in CSG)

a. Update protocols become MORE preferable, because communication is implemented by a sequence of an invalidation followed by a miss in an invalidate protocol, whereas a single update suffices in a write-update protocol.

b.
If we get an update in the second-level cache, then we must either invalidate or update the first-level cache. If we invalidate the FLC, then the block must be reloaded on an access to any word in the block. So, from a performance point of view, updating the first-level cache is better than invalidating it.

c. Because there is no sharing at the user level. Update protocols are only good for cases of intense read/write sharing.

d. The good thing about page-level adaptation of the protocol is that it is easy to implement. For example, one of the state bits in the page table and TLB may be an update-vs-invalidate bit. A memory access first reaches the TLB, where the bit is looked up and the access is treated accordingly. On the other hand, the special opcodes are a hassle, because they require an extension of the instruction set. However, the same effect could be obtained by using one address bit to specify the mode. If the compiler can effectively discriminate between writes that should update and writes that should invalidate, the opcode approach is much more precise and effective, and the page-based approach is too coarse.
Problem 5 (Problem 5.6 in CSG)

(a) There are many possible SC interleavings here, so we will just take each possible output of the program and verify that there is at least one SC interleaving producing it. In the notation below, SiX(v) denotes Pi's store of value v to location X, and LiX(v) denotes Pi's load of X returning value v. There are 8 possible program outcomes:

1. (u,v,w)=(0,0,0). One SC interleaving: L2A(0) => L3B(0) => L3A(0) => S1A(1) => S2B(1).
2. (u,v,w)=(0,0,1). One SC interleaving: L2A(0) => L3B(0) => S1A(1) => L3A(1) => S2B(1).
3. (u,v,w)=(0,1,0). L2A(0) => S2B(1) => L3B(1) => L3A(0) => S1A(1).
4. (u,v,w)=(0,1,1). L2A(0) => S2B(1) => L3B(1) => S1A(1) => L3A(1).
5. (u,v,w)=(1,0,0). L3B(0) => L3A(0) => S1A(1) => L2A(1) => S2B(1).
6. (u,v,w)=(1,0,1). L3B(0) => S1A(1) => L2A(1) => S2B(1) => L3A(1).
7. (u,v,w)=(1,1,1). S1A(1) => L2A(1) => S2B(1) => L3B(1) => L3A(1).
8. (u,v,w)=(1,1,0). This is impossible, which can be proven as follows. An SC execution must conform to both coherence order (co) and program order (po). Because of co, we must have S1A(1) => L2A(1) (since u=1), S2B(1) => L3B(1) (since v=1), and L3A(0) => S1A(1) (since w=0). Because of po, we must have L3B(1) => L3A(0) and L2A(1) => S2B(1). So S1A(1) => L2A(1) => S2B(1) => L3B(1) => L3A(0) => S1A(1): a cycle in the execution graph, so the execution cannot be SC. Basically, P2 writes the new value of B after observing the new value of A. If P3 observes the new value of B written by P2 and then the old value of A, P2 and P3 have observed the two writes in different orders, a violation of SC.

We could have discarded off the bat all outcomes ending in 00 or 11, because in both cases P3 only reads values and cannot distinguish between new and old values (its accesses can be inserted at the beginning or the end of the sequence). This would have removed 4 cases.

(b) This case has 16 possible execution outcomes. Let's try to weed out some obvious ones.
To do that, we look at interleavings in which the two accesses of P2, and likewise the two accesses of P4, directly follow each other. So:

1. P1,P2,P3,P4 yields (1,0,1,1)
2. P1,P2,P4,P3 yields (1,0,0,1)
3. P1,P3,P2,P4 yields (1,1,1,1)
4. P1,P4,P3,P2 yields (1,1,0,1)
5. P2,P1,P3,P4 yields (0,0,1,1)
6. P2,P1,P4,P3 yields (0,0,0,1)
7. P2,P4,P1,P3 yields (0,0,0,0)
8. P3,P2,P1,P4 yields (0,1,1,1)
9. P3,P2,P4,P1 yields (0,1,1,0)
10. P4,P3,P1,P2 yields (1,1,0,0)

Now we are down to 6 outcomes, which we explore systematically:

11. Outcome (0,0,1,0). One possible SC execution is L2A(0) => L2B(0) => S3B(1) => L4B(1) => L4A(0) => S1A(1).
12. Outcome (0,1,0,0). One possible SC execution is L2A(0) => L4B(0) => S3B(1) => L2B(1) => L4A(0) => S1A(1).
13. Outcome (0,1,0,1). One possible SC execution is L2A(0) => L4B(0) => S3B(1) => L2B(1) => S1A(1) => L4A(1).
14. Outcome (1,1,1,0). One possible SC execution is S3B(1) => L4B(1) => L4A(0) => S1A(1) => L2A(1) => L2B(1).
15. Outcome (1,0,0,0). One possible SC execution is L4B(0) => L4A(0) => S1A(1) => L2A(1) => L2B(0) => S3B(1).
16. Outcome (1,0,1,0). This is impossible, which can be proven as follows. An SC execution must conform to both coherence order (co) and program order (po). Because of co, we must have S1A(1) => L2A(1), L4A(0) => S1A(1), S3B(1) => L4B(1), and L2B(0) => S3B(1). Because of po, we must have L2A(1) => L2B(0) and L4B(1) => L4A(0). So S1A(1) => L2A(1) => L2B(0) => S3B(1) => L4B(1) => L4A(0) => S1A(1). There is a cycle in the execution graph, so the execution cannot be SC. Again, what happens here is that P2 and P4 observe the two stores in different orders.

We could have solved this problem much faster by noticing that violations of SC could only be detected by P2 and P4, and that the only stores, by P1 and P3, are unordered. Thus, unless both P2 and P4 can distinguish between the old and the new values, SC cannot be violated. The following outcomes could have been discarded off the bat: 1) any outcome starting with 0,0 is SC, since it implies that P2 did not observe either of the two stores and so could not distinguish between old and new values; the same holds for any outcome ending with 0,0.
2) Any outcome starting with 1,1 is also SC, because P2 observed both new values and therefore cannot tell the difference between old and new; the same holds for any outcome ending with 1,1. This strategy would have eliminated 12 outcomes and left us with only 4 to explore further: (0,1,0,1), (1,0,1,0), (0,1,1,0), and (1,0,0,1).

(c) First case: the codes for P1 and P2 are atomic (F&A). The only possible SC executions are P1,P2 or P2,P1, which means A=2 and (u,v)=(0,1) or (1,0). The other outcomes, (u,v)=(1,1), (0,0), and any outcome with a value of u or v of 2, are impossible.

Second case: the codes are not atomic. (0,1) and (1,0) are still possible SC outcomes. (0,0) is also possible, via L1A(0) => L2A(0) => S1A(1) => S2A(1). Any outcome in which neither u nor v is 0 is impossible, because at least one of the two loads must be performed before either store can be performed in any SC interleaving, due to program order.
Problem 6 (Problem 5.20 in CSG)

The problem with this code is that one process could be delayed in the while loop after executing the fetch&add. During this delay it is possible that all other processes would go through the barrier and one of them would reach the barrier again. As a result, the delayed process would end up one iteration behind all the other processes, an unintended outcome. A number of solutions are possible. One is the use of a toggle flag to differentiate between consecutive iterations:

    global boolean flag := true;

    BARRIER(B: BarVariable, N: Integer) {
        boolean local_flag := not flag;
        if (F&A(B,1) == N-1) {
            B := 0;               /* last arriver resets the counter ... */
            flag := local_flag;   /* ... and flips the global flag */
        }
        while (flag != local_flag) do {};
    }

Here again a process could get delayed in the while loop. However, now no process can update the global flag before all processes have passed the barrier completely and finished their next iteration.

Problem 7 (Problem 5.26 in CSG)

    Ref   P1      P2      P3      Miss classification
    1     st w0           st w7   P1 and P3 miss
    2     ld w6   ld w2           P1 & P2 miss; P1@1: pure cold
    3             ld w7           P2 miss; P2@2: cold false sharing
    4     ld w2   ld w0           P1 & P2 miss; P1@2: cold false sharing; P2@3: cold true sharing
    5             st w2           P2 upgrade; P1@4: pure capacity
    6     ld w2                   P1 miss
    7     st w2   ld w5   ld w5   P2 miss; P1@6: pure true sharing; P2@4: capacity true sharing
    8     st w5                   P1 miss
    9             ld w3   ld w7   P2 & P3 miss; P2@7: pure capacity; P3@1: pure cold
    10            ld w6   ld w2   P2 & P3 miss; P2@9: pure capacity; P3@9: pure false sharing
    11            ld w2   st w7   P2 & P3 miss; P2@10: pure capacity; P3@10: cold true sharing
    12    ld w7                   P1 misses
    13    ld w2                   P1 misses; P1@12: pure true sharing
    14            ld w5           P2 misses; P2@11: pure capacity
    15                    ld w2   P3 misses; P3@11: pure capacity
Parallel Poisson Solver in Fortran Nilas Mandrup Hansen, Ask Hjorth Larsen January 19, 1 1 Introduction In this assignment the D Poisson problem (Eq.1) is to be solved in either C/C++ or FORTRAN, first
More information1. Memory technology & Hierarchy
1 Memory technology & Hierarchy Caching and Virtual Memory Parallel System Architectures Andy D Pimentel Caches and their design cf Henessy & Patterson, Chap 5 Caching - summary Caches are small fast memories
More informationAnnouncements. ! Previous lecture. Caches. Inf3 Computer Architecture
Announcements! Previous lecture Caches Inf3 Computer Architecture - 2016-2017 1 Recap: Memory Hierarchy Issues! Block size: smallest unit that is managed at each level E.g., 64B for cache lines, 4KB for
More informationLecture 1: Introduction
Lecture 1: Introduction ourse organization: 13 lectures on parallel architectures ~5 lectures on cache coherence, consistency ~3 lectures on TM ~2 lectures on interconnection networks ~2 lectures on large
More informationLecture: Coherence Protocols. Topics: wrap-up of memory systems, multi-thread programming models, snooping-based protocols
Lecture: Coherence Protocols Topics: wrap-up of memory systems, multi-thread programming models, snooping-based protocols 1 Future Memory Trends pin count is not increasing High memory bandwidth requires
More informationCS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck
Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find
More informationwrite-through v. write-back write-through v. write-back write-through v. write-back option 1: write-through write 10 to 0xABCD CPU RAM Cache ABCD: FF
write-through v. write-back option 1: write-through 1 write 10 to 0xABCD CPU Cache ABCD: FF RAM 11CD: 42 ABCD: FF 1 2 write-through v. write-back option 1: write-through write-through v. write-back option
More information2-Level Page Tables. Virtual Address Space: 2 32 bytes. Offset or Displacement field in VA: 12 bits
-Level Page Tables Virtual Address (VA): bits Offset or Displacement field in VA: bits Virtual Address Space: bytes Page Size: bytes = KB Virtual Page Number field in VA: - = bits Number of Virtual Pages:
More informationOptimising for the p690 memory system
Optimising for the p690 memory Introduction As with all performance optimisation it is important to understand what is limiting the performance of a code. The Power4 is a very powerful micro-processor
More informationMultiprocessors and Locking
Types of Multiprocessors (MPs) Uniform memory-access (UMA) MP Access to all memory occurs at the same speed for all processors. Multiprocessors and Locking COMP9242 2008/S2 Week 12 Part 1 Non-uniform memory-access
More informationLecture 11: Snooping Cache Coherence: Part II. CMU : Parallel Computer Architecture and Programming (Spring 2012)
Lecture 11: Snooping Cache Coherence: Part II CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Announcements Assignment 2 due tonight 11:59 PM - Recall 3-late day policy Assignment
More informationMemory Management. Goals of this Lecture. Motivation for Memory Hierarchy
Memory Management Goals of this Lecture Help you learn about: The memory hierarchy Spatial and temporal locality of reference Caching, at multiple levels Virtual memory and thereby How the hardware and
More informationCS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14
CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2014 Lecture 14 LAST TIME! Examined several memory technologies: SRAM volatile memory cells built from transistors! Fast to use, larger memory cells (6+ transistors
More informationProgramming as Successive Refinement. Partitioning for Performance
Programming as Successive Refinement Not all issues dealt with up front Partitioning often independent of architecture, and done first View machine as a collection of communicating processors balancing
More informationENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013
ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 Professor: Sherief Reda School of Engineering, Brown University 1. [from Debois et al. 30 points] Consider the non-pipelined implementation of
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationEmbedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi
Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 13 Virtual memory and memory management unit In the last class, we had discussed
More informationCache and Virtual Memory Simulations
Cache and Virtual Memory Simulations Does it really matter if you pull a USB out before it safely ejects? Data structure: Cache struct Cache { }; Set *sets; int set_count; int line_count; int block_size;
More informationCS 433 Homework 5. Assigned on 11/7/2017 Due in class on 11/30/2017
CS 433 Homework 5 Assigned on 11/7/2017 Due in class on 11/30/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration.
More informationRecall: Address Space Map. 13: Memory Management. Let s be reasonable. Processes Address Space. Send it to disk. Freeing up System Memory
Recall: Address Space Map 13: Memory Management Biggest Virtual Address Stack (Space for local variables etc. For each nested procedure call) Sometimes Reserved for OS Stack Pointer Last Modified: 6/21/2004
More informationTutorial 11. Final Exam Review
Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache
More informationReducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip
Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off
More informationLecture: Memory, Coherence Protocols. Topics: wrap-up of memory systems, intro to multi-thread programming models
Lecture: Memory, Coherence Protocols Topics: wrap-up of memory systems, intro to multi-thread programming models 1 Refresh Every DRAM cell must be refreshed within a 64 ms window A row read/write automatically
More informationA Comparison of Unified Parallel C, Titanium and Co-Array Fortran. The purpose of this paper is to compare Unified Parallel C, Titanium and Co-
Shaun Lindsay CS425 A Comparison of Unified Parallel C, Titanium and Co-Array Fortran The purpose of this paper is to compare Unified Parallel C, Titanium and Co- Array Fortran s methods of parallelism
More informationCS3350B Computer Architecture
CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &
More informationLecture: Memory Technology Innovations
Lecture: Memory Technology Innovations Topics: state-of-the-art and upcoming changes: buffer chips, 3D stacking, non-volatile cells, photonics Multiprocessor intro 1 Modern Memory System...... PROC.. 4
More informationCISC 360. Cache Memories Nov 25, 2008
CISC 36 Topics Cache Memories Nov 25, 28 Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance Cache Memories Cache memories are small, fast SRAM-based
More information: How to Write Fast Numerical Code ETH Computer Science, Spring 2016 Midterm Exam Wednesday, April 20, 2016
ETH login ID: (Please print in capital letters) Full name: 263-2300: How to Write Fast Numerical Code ETH Computer Science, Spring 2016 Midterm Exam Wednesday, April 20, 2016 Instructions Make sure that
More informationDesign of Parallel Algorithms. Models of Parallel Computation
+ Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes
More informationSystems Programming and Computer Architecture ( ) Timothy Roscoe
Systems Group Department of Computer Science ETH Zürich Systems Programming and Computer Architecture (252-0061-00) Timothy Roscoe Herbstsemester 2016 AS 2016 Caches 1 16: Caches Computer Architecture
More informationMemory Consistency. Challenges. Program order Memory access order
Memory Consistency Memory Consistency Memory Consistency Reads and writes of the shared memory face consistency problem Need to achieve controlled consistency in memory events Shared memory behavior determined
More informationCIS Operating Systems Memory Management Cache and Demand Paging. Professor Qiang Zeng Spring 2018
CIS 3207 - Operating Systems Memory Management Cache and Demand Paging Professor Qiang Zeng Spring 2018 Process switch Upon process switch what is updated in order to assist address translation? Contiguous
More informationMemory Hierarchy. Bojian Zheng CSCD70 Spring 2018
Memory Hierarchy Bojian Zheng CSCD70 Spring 2018 bojian@cs.toronto.edu 1 Memory Hierarchy From programmer s point of view, memory has infinite capacity (i.e. can store infinite amount of data) has zero
More informationChapter 8 & Chapter 9 Main Memory & Virtual Memory
Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array
More informationThe MESI State Transition Graph
Small-scale shared memory multiprocessors Semantics of the shared address space model (Ch. 5.3-5.5) Design of the M(O)ESI snoopy protocol Design of the Dragon snoopy protocol Performance issues Synchronization
More informationCaching and Buffering in HDF5
Caching and Buffering in HDF5 September 9, 2008 SPEEDUP Workshop - HDF5 Tutorial 1 Software stack Life cycle: What happens to data when it is transferred from application buffer to HDF5 file and from HDF5
More informationChapter 2: Memory Hierarchy Design Part 2
Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental
More informationCache Memories October 8, 2007
15-213 Topics Cache Memories October 8, 27 Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance The memory mountain class12.ppt Cache Memories Cache
More informationAgenda. Cache-Memory Consistency? (1/2) 7/14/2011. New-School Machine Structures (It s a bit more complicated!)
7/4/ CS 6C: Great Ideas in Computer Architecture (Machine Structures) Caches II Instructor: Michael Greenbaum New-School Machine Structures (It s a bit more complicated!) Parallel Requests Assigned to
More informationCSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1
CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson
More informationLecture 12. Memory Design & Caches, part 2. Christos Kozyrakis Stanford University
Lecture 12 Memory Design & Caches, part 2 Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee108b 1 Announcements HW3 is due today PA2 is available on-line today Part 1 is due on 2/27
More informationCIS Operating Systems Memory Management Cache. Professor Qiang Zeng Fall 2017
CIS 5512 - Operating Systems Memory Management Cache Professor Qiang Zeng Fall 2017 Previous class What is logical address? Who use it? Describes a location in the logical memory address space Compiler
More informationCache Memory: Instruction Cache, HW/SW Interaction. Admin
Cache Memory Instruction Cache, HW/SW Interaction Computer Science 104 Admin Project Due Dec 7 Homework #5 Due November 19, in class What s Ahead Finish Caches Virtual Memory Input/Output (1 homework)
More informationCSE-160 (Winter 2017, Kesden) Practice Midterm Exam. volatile int count = 0; // volatile just keeps count in mem vs register
Full Name: @ucsd.edu PID: CSE-160 (Winter 2017, Kesden) Practice Midterm Exam 1. Threads, Concurrency Consider the code below: volatile int count = 0; // volatile just keeps count in mem vs register void
More informationThe course that gives CMU its Zip! Memory System Performance. March 22, 2001
15-213 The course that gives CMU its Zip! Memory System Performance March 22, 2001 Topics Impact of cache parameters Impact of memory reference patterns memory mountain range matrix multiply Basic Cache
More informationSE-292 High Performance Computing. Memory Hierarchy. R. Govindarajan Memory Hierarchy
SE-292 High Performance Computing Memory Hierarchy R. Govindarajan govind@serc Memory Hierarchy 2 1 Memory Organization Memory hierarchy CPU registers few in number (typically 16/32/128) subcycle access
More informationParallel Programming in C with MPI and OpenMP
Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 8 Matrix-vector Multiplication Chapter Objectives Review matrix-vector multiplication Propose replication of vectors Develop three
More informationCSC266 Introduction to Parallel Computing using GPUs Optimizing for Caches
CSC266 Introduction to Parallel Computing using GPUs Optimizing for Caches Sreepathi Pai October 4, 2017 URCS Outline Cache Performance Recap Data Layout Reuse Distance Besides the Cache Outline Cache
More information