University of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2.


Solutions

Problem 1 (Problem 3.12 in CSG)

(a) Clearly the partitioning can be done statically. The main questions that must be addressed are: (i) whether to decompose in terms of rows, columns or rectangular blocks/tiles, and (ii) whether to assign these units in contiguous chunks or in an interleaved manner. Let's assume that we partition the source and destination matrices in the same way, which is often what we have to do in many applications (because this is probably just a phase of a more complex application), and that a processor writes only the data in its partition of the destination matrix. The necessary elements of the source matrix must be read by each processor.

Suppose we partition the destination matrix into chunks of contiguous rows, each partition consisting of n/p rows for an n-by-n matrix and p processors. To obtain the data to be transposed into its n/p rows, a processor must obtain data from a set of n/p contiguous columns of the source matrix. Since the source matrix is partitioned into the same sets of contiguous rows as the destination matrix, the processor obtains square chunks of size n/p-by-n/p from each other processor (and a single such square from itself). These chunks have to be transposed in the process.

There is little reason not to use contiguous partitions. For example, in an interleaved row-wise partition, when a process reads a single column from the source matrix it does not exploit spatial locality effectively (assuming a row-major language like C); spatial locality is exploited only in the row being written in the destination. This is true for both programming models, though the problem is more acute in a shared address space, since the reads from the source matrix are mostly remote reads and a lot of unnecessary data transfers occur when reading cache blocks. In message-passing environments, interleaved partitions make it somewhat more complex to transfer the data that one processor needs from another in a single message (more scattering and gathering of data are required). So contiguous partitions are better (note that the same reasoning would apply if the destination matrix were partitioned into chunks of columns, and a similar effect would occur in a column-major language such as Fortran).

Other than the above spatial locality and scatter/gather considerations, it does not matter too much whether the programming model is a shared address space or message passing. One difference is that in a shared address space the transposition of the n/p-by-n/p subgrids is done during the remote reading and local writing process, with no extra copy of data, while in message passing the subgrid is usually transposed locally before transferring it, which may incur some extra overhead. Another difference is that in a shared address space the communication is receiver-initiated (via remote reads) while in message passing it is usually sender-initiated (the process owning a source subgrid sends it to the destination process, which then writes it locally), so some latency is hidden from the receiver. Also, a single message transfers all the data rather than transferring it cache line by cache line in a request-response fashion. In the message-passing case, the data may end up in the destination cache or only in its local memory, depending on how the system implements it.

(b) It is called all-to-all personalized communication since each process communicates with every other process, but communicates different data with every other process (here a different n/p-by-n/p subgrid). This is unlike a p-way broadcast in which each process also sends data to every other process, but sends the same data to every other process.

(c) Shared memory:

    int mymin = pid*n/nprocs;
    int mymax = mymin + n/nprocs - 1;
    for (k = mymin to mymax)
        for (i = 0 to n-1)
            B[k,i] = A[i,k];

Message passing: we use two arrays, STR and RTR, to send and receive blocks of matrix A. Let b = n/nprocs be the number of rows in each partition.

    mya <- malloc (a 2-D array of size n/nprocs by n)
    myb <- malloc (a 2-D array of size n/nprocs by n)
    STR <- malloc (a 2-D array of size n/nprocs by n/nprocs)
    RTR <- malloc (a 2-D array of size n/nprocs by n/nprocs)
    initialize(mya);                       /* mya and myb are contiguous row partitions of A and B */
    b = n/nprocs;
    for (k = 0; k < nprocs; k++)
        if (k == pid)                      /* this takes care of the diagonal submatrix */
            for (j = 0; j < b; j++)
                for (i = 0; i < b; i++)
                    myb[i, pid*b + j] = mya[j, pid*b + i];
        else
            for (j = 0; j < b; j++)
                for (i = 0; i < b; i++)
                    STR[i,j] = mya[j, k*b + i];     /* pack the block destined for process k, transposed */
            SEND(&STR[0,0], b*b*sizeof(float), k);
            RECEIVE(&RTR[0,0], b*b*sizeof(float), k);
            for (j = 0; j < b; j++)
                for (i = 0; i < b; i++)
                    myb[i, k*b + j] = RTR[i,j];     /* unpack the block received from process k */
        endif
    endfor

Note that for this message-passing program to work, the message-passing primitives must be non-blocking; otherwise every process would block in its SEND and the program would deadlock.

Besides the issues raised above in (a), the order in which the all-to-all personalized communication is done is critical. A naive way to do it is for every process to start communicating with process 0 and go on up to process p-1 in a for (k = 0; k < p; k++) loop. This causes a lot of contention, first at process 0 when all processes try to communicate with it, then at process 1, and so on. The solution is for process i to start with process i+1, then i+2, up to p-1, then process 0, and so on until process i-1.
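To make the staggered ordering concrete, here is a minimal sketch (not part of the original solution) assuming MPI is used in place of the generic SEND/RECEIVE primitives; the packing and unpacking of STR and RTR, and the handling of the diagonal block, are done exactly as in (c). At step s, process i sends to process (i+s) mod p and receives from process (i-s) mod p, so no single process is swamped with messages.

    /* Hypothetical MPI version of the staggered exchange; b = n/nprocs. */
    #include <mpi.h>

    void staggered_exchange(float *STR, float *RTR, int b, int pid, int nprocs)
    {
        for (int step = 1; step < nprocs; step++) {
            int dest = (pid + step) % nprocs;            /* partner we send to at this step */
            int src  = (pid - step + nprocs) % nprocs;   /* partner we receive from at this step */
            /* pack STR here with the transposed block destined for 'dest' (as in part (c)) */
            MPI_Sendrecv(STR, b*b, MPI_FLOAT, dest, 0,
                         RTR, b*b, MPI_FLOAT, src,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* unpack RTR here into columns src*b .. (src+1)*b - 1 of myb (as in part (c)) */
        }
    }

MPI_Sendrecv pairs each send with a posted receive in the same call, so this schedule is deadlock-free even with blocking semantics.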

(d) Blocking is usually done for temporal locality: it works if the data in a block are reused. In the matrix transpose there is no reuse of data elements. However, blocking can still be useful, since cache blocks are reused when different elements in them are accessed. Consider a row-major language, to be concrete. If the source matrix is traversed down a column, each consecutive read brings a new block into the cache. If the cache is too small we may start evicting blocks as we go down the same column. This is not good, because the blocks will be read again when accessing the next column. With blocking we go down the column only up to a point such that the accessed blocks comfortably fit in the cache. Then we start at the beginning of the next column. When we have traversed the first elements of a block of consecutive columns, we go back to the first column and process the following elements. A tiled version of the transpose is sketched below.
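To make the blocking scheme concrete, here is a minimal sequential C sketch (not part of the original solution); the tile size T is a hypothetical tuning parameter chosen so that the cache blocks touched by one tile fit comfortably in the cache.

    /* Minimal tiled transpose sketch: B = A^T for an n-by-n row-major matrix.
       T is a tile size chosen so the working set of one tile fits in the cache. */
    void transpose_tiled(int n, int T, const float *A, float *B)
    {
        for (int ii = 0; ii < n; ii += T)
            for (int jj = 0; jj < n; jj += T)
                for (int i = ii; i < ii + T && i < n; i++)
                    for (int j = jj; j < jj + T && j < n; j++)
                        B[j*n + i] = A[i*n + j];   /* within a tile, the cache blocks of A and B
                                                      are reused before being evicted */
    }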

Problem 2 (Problem 4.11 in CSG)

The target problem's size is 8K x 8K x 8 = 512MB. The per-process subarray size is 512MB/256 = 2MB. The cache of the target system is smaller than that (it is 256K, one eighth of 2MB). Thus consider the next working set: three rows of the subarray in each process. Its size is 8K x 3 x 8 / 256 < 1K, which is very small with respect to the cache size (less than 1/256 of it). The simulated problem is 512 x 512 x 8 = 2MB. The per-processor working set is 2MB/64 = 32K. The size of three rows is 512 x 3 x 8 / 64 < 256 bytes.

(a) We should pick cache sizes in proportion to the working set size per processor in both situations. Consider the larger working set: to scale the cache with respect to this working set we should pick a cache size of 256K x 32K / 2MB = 4K. A 4K cache will contain 16 times the smaller working set in the simulated system, whereas the 256K cache contains the smaller working set 256 times in the target system. We could consider reducing the cache size further, to say 1K, in the simulated system. But then the cache becomes dangerously small (see (b)). So we'll stick to 4K and we'll compensate by decreasing the block size.

(b) When the cache is scaled down it could become so small that the number of blocks in the cache is very small. In this case, a 4KB cache contains 4K/128 = 32 blocks. This will exacerbate the problems due to spatial locality. Looking at capacity misses, if the cache size is smaller then the best block size must be smaller. Looking at coherence misses, we see that the cost of communication will increase in the smaller cache. Moreover, cache conflicts may also become a problem, because the smaller the cache, the more opportunities for conflicts.

(c) The selected cache is not big enough to contain the partition of the grid. However, it is big enough to contain a few rows. Moreover, because the array is organized as a 4-D array, false sharing is not a problem. Because we need to reload blocks as we scan the grid, one possible problem is the number of misses when reloading a row of the subarray. In the target system the size of a row of a grid partition is 512 x 8 / 128 = 32 blocks, which gives 32 misses when reloading a row. On the other hand, a row of the subgrid is 64 x 8 / 128 = 4 blocks in the simulated system, which gives only 4 misses to reload a row. To get the same number of misses we should reduce the block size by a factor of 8 in the simulated system and use 16 bytes instead of 128 bytes. Reducing the block size also maintains the same ratio of communication (boundary) to computation (internal) block transfers, since two blocks are communicated at the boundary and 32 internal blocks are reloaded for each row of the subgrid in both systems. As a result of adopting a block size of 16 bytes in the simulation, the number of blocks in the simulated cache is 4K/16 = 256. This compares with 256K/128 = 2K blocks in the target system. It might be useful to use a two-way set-associative cache in the simulated system, because conflicts may be much higher in the direct-mapped simulated cache than in the direct-mapped target cache.

(d) I would not try to approximate the results of a target run with the real workload by using the simulated run; the speedup numbers would probably be totally off. I would use the reduced, simulated system to evaluate possible improvements to the target machine.

Problem 3 (Problem 5.1 in CSG)

Register allocation is a problem in a multiprocessor system. Consider the case of a processor spinning on a flag variable, waiting for another processor to write that variable. If the former processor allocates the variable in a register, then it will never see the writes done to the variable by the latter and will continue to spin-wait forever. This applies to several other cases of writes to shared data not being seen by other processors because they have register-allocated these data. A possible solution is to never register-allocate shared data, but this is very restrictive. The solution that current systems use is to require that variables that should not be allocated in registers be declared volatile, which ensures that the compiler will not register-allocate them (a minimal example is sketched after Problem 4 below).

Problem 4 (Problem 5.5 in CSG)

a. Update protocols become MORE preferable because, in an invalidate protocol, communication is implemented by a sequence of an invalidation followed by a miss, whereas a single update suffices in a write-update protocol.

b. If we get an update in the second-level cache then we must either invalidate or update the first-level cache. If we invalidate the FLC then the block must be reloaded on an access to any word in the block. It seems that updating is better than invalidating, from a performance point of view.

c. Because there is no sharing at the user level. Update protocols are only good for cases of intense read/write sharing.

d. The good thing about page-level adaptation of the protocol is that it is easy to implement. For example, one of the state bits in the page table and TLB may be an update-vs-invalidate bit. The memory access first reaches the TLB, where the bit is looked up and the access is treated accordingly. On the other hand, the special opcodes are a hassle because they require an extension of the instruction set. However, the same effect could be obtained by using one address bit to specify the mode. If the compiler can effectively discriminate between writes that should update and writes that should invalidate, the opcode approach is much more precise and effective, and the page-based approach is too coarse.
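To illustrate the spin-wait scenario of Problem 3, here is a minimal C/pthreads sketch (not part of the original solution). The volatile qualifier only prevents register allocation of the flag; on modern systems one would use atomics to also obtain the required memory ordering for the data.

    /* Minimal sketch of the Problem 3 scenario: one thread spins on a shared flag
       that another thread sets. Without volatile, the compiler may keep the flag
       in a register and the loop may never terminate. */
    #include <pthread.h>
    #include <stdio.h>

    volatile int flag = 0;   /* volatile: force a memory access on every read of the flag */
    int data = 0;

    void *producer(void *arg) {
        data = 42;
        flag = 1;            /* signal the spinning thread */
        return NULL;
    }

    void *consumer(void *arg) {
        while (flag == 0)    /* without volatile, this load could be register-allocated */
            ;                /* spin-wait */
        printf("data = %d\n", data);
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }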

Problem 5 (Problem 5.6 in CSG)

(Notation: Sp X(v) denotes a store of value v to location X by processor Pp, and Lp X(v) denotes a load of X by Pp that returns v.)

(a) There are many SC interleavings possible here, so we will just take each possible output of the program and check whether there is at least one SC interleaving producing that result. There are 8 possible program outcomes:

1. (u,v,w)=(0,0,0): obtained by the SC interleaving L2 A(0) => L3 B(0) => L3 A(0) => S1 A(1) => S2 B(1)
2. (u,v,w)=(0,0,1): obtained by L2 A(0) => L3 B(0) => S1 A(1) => L3 A(1) => S2 B(1)
3. (u,v,w)=(0,1,0): L2 A(0) => S2 B(1) => L3 B(1) => L3 A(0) => S1 A(1)
4. (u,v,w)=(0,1,1): L2 A(0) => S2 B(1) => L3 B(1) => S1 A(1) => L3 A(1)
5. (u,v,w)=(1,0,0): L3 B(0) => L3 A(0) => S1 A(1) => L2 A(1) => S2 B(1)
6. (u,v,w)=(1,0,1): L3 B(0) => S1 A(1) => L2 A(1) => S2 B(1) => L3 A(1)
7. (u,v,w)=(1,1,1): S1 A(1) => L2 A(1) => S2 B(1) => L3 B(1) => L3 A(1)
8. (u,v,w)=(1,1,0): impossible. This can be proven as follows. An SC execution must conform to both coherence order (co) and program order (po). Because of co, u=1 requires S1 A(1) => L2 A(1), v=1 requires S2 B(1) => L3 B(1), and w=0 requires L3 A(0) => S1 A(1). Because of po, we must have L2 A(1) => S2 B(1) and L3 B(1) => L3 A(0). So: S1 A(1) => L2 A(1) => S2 B(1) => L3 B(1) => L3 A(0) => S1 A(1). This is a cycle in the execution graph, so the execution cannot be SC. Basically, P2 writes the new value of B after observing the new value of A. If P3 observes the new value of B written by P2 and then the old value of A, P2 and P3 have observed the two writes in different orders, a violation of SC.

We could have discarded off the bat all outcomes ending in 00 or 11, because in both cases P3 only reads values and cannot distinguish between new and old values (its accesses can be inserted at the end or at the beginning of the sequence). This would have removed 4 cases. (A brute-force enumeration that confirms this analysis is sketched after this problem.)

(b) This case has 16 possible execution outcomes. Let's first weed out some obvious ones by looking at interleavings in which the two accesses of P2 and the two accesses of P4 directly follow each other, i.e., the four processes execute one after another:

1. P1,P2,P3,P4 yields (1,0,1,1)
2. P1,P2,P4,P3 yields (1,0,0,1)
3. P1,P3,P2,P4 yields (1,1,1,1)
4. P1,P4,P3,P2 yields (1,1,0,1)
5. P2,P1,P3,P4 yields (0,0,1,1)
6. P2,P1,P4,P3 yields (0,0,0,1)
7. P2,P4,P1,P3 yields (0,0,0,0)
8. P3,P2,P1,P4 yields (0,1,1,1)
9. P3,P2,P4,P1 yields (0,1,1,0)
10. P4,P3,P1,P2 yields (1,1,0,0)

Now we are down to 6 outcomes. We systematically look for the remaining possible outcomes:

11. Outcome (0,0,1,0): one possible SC execution is L2 A(0) => L2 B(0) => S3 B(1) => L4 B(1) => L4 A(0) => S1 A(1)
12. Outcome (0,1,0,0): one possible SC execution is L2 A(0) => L4 B(0) => S3 B(1) => L2 B(1) => L4 A(0) => S1 A(1)
13. Outcome (0,1,0,1): one possible SC execution is L2 A(0) => L4 B(0) => S3 B(1) => L2 B(1) => S1 A(1) => L4 A(1)
14. Outcome (1,1,1,0): one possible SC execution is S3 B(1) => L4 B(1) => L4 A(0) => S1 A(1) => L2 A(1) => L2 B(1)
15. Outcome (1,0,0,0): one possible SC execution is L4 B(0) => L4 A(0) => S1 A(1) => L2 A(1) => L2 B(0) => S3 B(1)
16. Outcome (1,0,1,0): impossible. This can be proven as follows. Because of co, we must have S1 A(1) => L2 A(1), L4 A(0) => S1 A(1), S3 B(1) => L4 B(1) and L2 B(0) => S3 B(1). Because of po, we must have L2 A(1) => L2 B(0) and L4 B(1) => L4 A(0). So: S1 A(1) => L2 A(1) => L2 B(0) => S3 B(1) => L4 B(1) => L4 A(0) => S1 A(1). There is a cycle in the execution graph, so the execution cannot be SC. Again, what happens here is that P2 and P4 observe the two stores in different orders.

We could have solved this problem much faster by noticing that violations of SC can only be detected by P2 and P4, and that the only stores, by P1 and P3, are unordered. Thus, unless P2 or P4 can distinguish between the old and the new values, SC is not violated. The following outcomes could have been discarded off the bat: 1) any outcome starting with 0,0 is SC, since it implies that P2 did not observe either of the two stores and so could not distinguish between old and new values; the same holds for any outcome ending with 0,0; 2) any outcome starting with 1,1 is also SC, because P2 observed both new values and therefore cannot tell old from new; the same holds for any outcome ending with 1,1. This strategy would have eliminated 12 outcomes and left us with only 4 to explore further: (0101), (1010), (0110) and (1001).

(c) First case: the code sequences of P1 and P2 are atomic (F&A). The only possible SC executions are P1,P2 or P2,P1, which means A=2 and (u,v)=(0,1) or (1,0). Other outcomes, namely (u,v)=(1,1), (0,0) and any outcome in which u or v equals 2, are impossible.

Second case: the code sequences are not atomic. (0,1) and (1,0) are still possible SC outcomes. (0,0) is also possible, for example with L1 A(0) => L2 A(0) => S1 A(1) => S2 A(1). Any outcome in which neither u nor v is 0 is impossible, because at least one of the two loads must be performed before either of the stores in any SC interleaving, due to program order.
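The following small brute-force program (not part of the original solution) enumerates every interleaving of the three code sequences in part (a) that respects program order and reports which (u,v,w) outcomes are reachable; the encoding of the three sequences as explicit operations is an assumption made here for the illustration.

    /* Brute-force check for Problem 5(a): enumerate all SC interleavings of
         P1: A=1        P2: u=A; B=1        P3: v=B; w=A
       and print the set of reachable (u,v,w) outcomes. */
    #include <stdio.h>

    static int seen[2][2][2];                 /* which (u,v,w) outcomes were reached */

    /* i1,i2,i3: operations of P1,P2,P3 already executed; A,B: memory; u,v,w: registers */
    static void explore(int i1, int i2, int i3, int A, int B, int u, int v, int w)
    {
        if (i1 == 1 && i2 == 2 && i3 == 2) {  /* all operations done: record the outcome */
            seen[u][v][w] = 1;
            return;
        }
        if (i1 < 1) explore(1, i2, i3, 1, B, u, v, w);          /* P1: A = 1 */
        if (i2 < 2) {
            if (i2 == 0) explore(i1, 1, i3, A, B, A, v, w);     /* P2: u = A */
            else         explore(i1, 2, i3, A, 1, u, v, w);     /* P2: B = 1 */
        }
        if (i3 < 2) {
            if (i3 == 0) explore(i1, i2, 1, A, B, u, B, w);     /* P3: v = B */
            else         explore(i1, i2, 2, A, B, u, v, A);     /* P3: w = A */
        }
    }

    int main(void)
    {
        explore(0, 0, 0, 0, 0, 0, 0, 0);
        for (int u = 0; u < 2; u++)
            for (int v = 0; v < 2; v++)
                for (int w = 0; w < 2; w++)
                    printf("(u,v,w)=(%d,%d,%d): %s\n", u, v, w,
                           seen[u][v][w] ? "reachable under SC" : "not reachable");
        return 0;
    }

Running it should report every outcome except (1,1,0) as reachable, matching the analysis above.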

Problem 6 (Problem 5.20 in CSG)

The problem with this code is that one process could be delayed in the while loop after executing the fetch-and-add. During this delay it is possible that all other processes go through the barrier and one of them reaches the barrier again. As a result, the delayed process would end up one iteration behind all other processes, an unintended outcome. A number of solutions are possible. One is the use of a toggle flag to differentiate between consecutive iterations:

    global boolean flag := true;

    BARRIER(B: BarVariable, N: integer)
    {
        boolean local_flag := not flag;       /* toggle the local sense */
        if (F&A(B,1) == N-1) {                /* last process to arrive */
            B := 0;                           /* reset the count for the next barrier */
            flag := local_flag;               /* release the waiting processes */
        }
        while (flag != local_flag) do {};     /* spin until the global flag is toggled */
    }

Here again a process could get delayed in the while loop. However, now no process can update the global flag again before all processes have passed the barrier completely and reached the barrier again in their next iteration. (A runnable sketch of this sense-reversing barrier is given after Problem 7.)

Problem 7 (Problem 5.26 in CSG)

    Time  P1      P2      P3      Misses and classification
    1     st w0           st w7   P1 and P3 miss
    2     ld w6   ld w2           P1 & P2 miss; P1@1: pure cold
    3             ld w7           P2 miss; P2@2: cold false sharing
    4     ld w2   ld w0           P1 & P2 miss; P1@2: cold false sharing; P2@3: cold true sharing
    5             st w2           P2 upgrade; P1@4: pure capacity
    6     ld w2                   P1 miss
    7     st w2   ld w5   ld w5   P2 miss; P1@6: pure true sharing; P2@4: capacity true sharing
    8     st w5                   P1 miss
    9             ld w3   ld w7   P2 & P3 miss; P2@7: pure capacity; P3@1: pure cold
    10            ld w6   ld w2   P2 & P3 miss; P2@9: pure capacity; P3@9: pure false sharing
    11            ld w2   st w7   P2 & P3 miss; P2@10: pure capacity; P3@10: cold true sharing
    12    ld w7                   P1 misses
    13    ld w2                   P1 misses; P1@12: pure true sharing
    14            ld w5           P2 misses; P2@11: pure capacity
    15                    ld w2   P3 misses; P3@11: pure capacity

(The notation Pi@t gives the classification of processor Pi's miss from time step t, assigned once the classifying reference has occurred.)
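As a companion to the barrier of Problem 6 (not part of the original solution), here is a minimal C11 sketch of the sense-reversing barrier using atomic_fetch_add in place of the F&A primitive; the type and function names (sr_barrier_t, sr_barrier_wait) are made up for the illustration.

    /* Sense-reversing (toggle-flag) barrier corresponding to the pseudocode above. */
    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct {
        atomic_int  count;    /* number of processes that have arrived (the B variable) */
        atomic_bool flag;     /* global sense flag */
        int         nprocs;   /* N: number of participating processes */
    } sr_barrier_t;

    void sr_barrier_init(sr_barrier_t *b, int nprocs)
    {
        atomic_init(&b->count, 0);
        atomic_init(&b->flag, true);
        b->nprocs = nprocs;
    }

    void sr_barrier_wait(sr_barrier_t *b)
    {
        bool local_flag = !atomic_load(&b->flag);        /* toggle the local sense */
        if (atomic_fetch_add(&b->count, 1) == b->nprocs - 1) {
            atomic_store(&b->count, 0);                  /* last arriver resets the counter... */
            atomic_store(&b->flag, local_flag);          /* ...and releases everyone */
        } else {
            while (atomic_load(&b->flag) != local_flag)  /* spin until the flag is toggled */
                ;
        }
    }

As in the pseudocode, the counter is reset before the flag is toggled, which is safe because the other processes have already incremented it and are only waiting on the flag.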
