A Comparison of MPI, SHMEM and Cache-coherent Shared Address Space Programming Models on the SGI Origin2000

Hongzhang Shan and Jaswinder Pal Singh
Department of Computer Science, Princeton University

Abstract

We compare the performance of three major programming models, a load-store cache-coherent shared address space (CC-SAS), message passing (MP), and the segmented SHMEM model, on a modern, 64-processor hardware cache-coherent machine, one of the two major types of platforms upon which high-performance computing is converging. We focus on applications that are either regular and predictable or at least do not require fine-grained dynamic replication of irregularly accessed data. Within this class, we use programs with a range of important communication patterns. We examine whether the basic parallel algorithm and communication structuring approaches needed for best performance are similar or different among the models, whether some models have substantial performance advantages over others as problem size and number of processors change, what the sources of these performance differences are, where the programs spend their time, and whether substantial improvements can be obtained by modifying either the application programming interfaces or the implementations of the programming models on this type of platform.

1 Introduction

Architectural convergence has made it common for different programming models to be supported on the same platform, either directly in hardware or via software. Three common programming models in use today are (i) explicit message passing (MP, exemplified by the Message Passing Interface or MPI standard [7]), in which both communication and replication are explicit, (ii) a cache-coherent shared address space (CC-SAS), in which both communication and replication are implicit, and (iii) the SHMEM programming model. SHMEM is like MPI in that communication and replication are explicit and usually made coarse-grained for good performance; however, unlike the send-receive pair in MPI, communication in SHMEM requires processor involvement on only one side (using put or get primitives), and SHMEM allows a process to name or specify remote data via a local name and a process identifier.

On the platform side, high-performance computing is converging to mainly two types of platforms: (i) tightly-coupled multiprocessors, which increasingly support a cache-coherent shared address space in hardware, and in which the hardware support is leveraged to implement the MP and SHMEM models efficiently as well, and (ii) less tightly-coupled clusters of either uniprocessors or such tightly-coupled multiprocessors, in which all the programming models are implemented in software across nodes. From both a user's and a system designer's perspective, this state of affairs makes it important to understand the relative advantages and disadvantages of these three models, both in programmability and in performance, when implemented on both these types of platforms. Our focus in this paper is on the former, tightly-coupled multiprocessor platform. In particular, we examine an SGI Origin2000 machine, a cache-coherent distributed shared memory (DSM) machine, as an aggressive representative that is widely used in high-performance computing. The tradeoffs between models depend on the nature of the applications as well.
For certain classes of irregular, dynamically changing applications, it has been argued that a CC-SAS model has substantial algorithmic and ease-of-programming advantages over message passing that often translate to advantages in performance as well [12, 13]. The best implementations of such applications in the CC-SAS and MP models often look very different. While it is very important to examine the programming model question for such applications, we leave this more complex and subjective question to future work. In this paper, we restrict ourselves to applications that are either regular in their data access and communication patterns or that perform irregular accesses but do not require fine-grained dynamic replication of irregularly communicated remote data. We use applications or kernels for which the basic parallel algorithm structures are very similar across models and the amount of useful data communicated is about the same, so that differences in performance can be attributed to differences in how communication is actually performed. Within this class, we choose programs that cover many of the most interesting communication patterns, including near-neighbor and multigrid (exemplified by Ocean, a computational fluid dynamics application), regular all-to-all personalized (FFT), multicast-oriented (LU), and irregular all-to-all personalized (radix sorting).

In particular, we are interested in the following questions, for which our results will be summarized in Section 6. For these types of fairly regular applications, is it indeed the case that parallel algorithms can be structured in the same way for good performance in all three models? Or do we need to restructure the algorithms to match a programming model? Where are the main differences in high-level or low-level program orchestration? Are there substantial differences in performance under the three models? If so, where are the key bottlenecks in each case? Are they similar or different aspects of performance across models? Can these bottlenecks be alleviated by changing the implementation of the programming model, or do we need to change the algorithms or data structures substantially? If the former, does this require changes in the programming model or interface visible to the application programmer as well?

The rest of the paper is organized as follows. Section 2 briefly examines some related work in comparing the message passing and shared memory programming models.

Section 3 describes the Origin2000 platform and the three programming models. Section 4 describes the applications we used and the programming differences for them among the three models. Performance is analyzed in Section 5, which also examines methods for addressing performance bottlenecks in either the model or the application. Finally, Section 6 summarizes our key conclusions and discusses future work.

2 Related Work

Previous research in comparing models has focused on the CC-SAS and MP models, but not on SHMEM. It can be divided into three groups: research related to hardware-coherent shared address space systems, research related to clusters or other systems in which the CC-SAS model is implemented in software, and research related to irregular applications with naturally fine-grained, dynamic and unpredictable communication and replication needs. For the latter, which are increasingly important, it has been argued that CC-SAS, when implemented efficiently in hardware, has substantial ease-of-programming and likely performance advantages compared to MP [12, 13]. However, a proper evaluation for this class of programs requires a much more involved study of programming issues and is not our focus here. Let us examine the first two groups.

For hardware-coherent systems, Ngo and Snyder [14] compared several CC-SAS programs against MP versions running on the same platform. The CC-SAS programs they used were not written well to take locality into account (i.e. were written somewhat "naively"), and they found such programs to perform worse than the message passing ones. We start in this study with well-written and tuned programs for all models. Chandra et al. [3] compared MP with CC-SAS using simulators of the two programming models and examined where the programs spent their time. They found that the CC-SAS programs can perform as well as message passing programs. Important differences in their study from ours are that they examined only a single problem and machine size for each program, that their study used simulation, which has limitations in accuracy (especially with regard to modeling contention) and in the ability to run large problem and machine sizes, that the hardware platform they simulated (the Thinking Machines CM-5) is now quite dated, and that they used different programs with somewhat less challenging communication patterns than we do (e.g. none so challenging as FFT or Radix sorting). Another simulation study, by Woo et al. [], examined the impact of using a block transfer (message-passing) facility to accelerate hardware-coherent shared memory on a system that provides integrated support for block transfer. They found that block transfer did not improve performance as greatly as had been expected. Both these studies examined differences in traffic generated as well. Kranz et al. showed that message passing can improve the performance of certain primitive communication and synchronization operations over using cache-coherent shared memory [5]. Finally, Klaiber and Levy used both simulation and direct execution to compare the message traffic (not performance) of C* data-parallel programs from which a compiler automatically generates SAS and MP versions [1].

In the second group of related work, researchers have compared the performance of message passing with the CC-SAS model implemented in software at page granularity, on either older message-passing multiprocessors or on very small-scale networks of workstations [, 11].
They found that the CC-SAS model generally performs a little worse. In contrast with these two groups of related work, our study uses well-written programs to compare modern implementations of all three major programming models on a modern hardware-coherent multiprocessor platform, at a variety of problem and machine scales.

3 Platforms and Programming Models

3.1 Platform: SGI Origin2000

The SGI Origin2000 is a scalable, hardware-supported, cache-coherent, non-uniform memory access machine, with perhaps the most aggressive communication architecture among such machines today. The machine we use has 64 processors, organized in nodes with two 195 MHz MIPS R10000 microprocessors each. Each processor has separate 32 KB first-level instruction and data caches, and a unified 4 MB second-level cache with 2-way associativity and a 128-byte block size. The machine has 16 GB of main memory (512 MB per node) with a page size of 16 Kbytes. Each pair of nodes (i.e. 4 processors) is connected to a network router. The interconnect topology across the node pairs (routers) is a hypercube. The peak point-to-point bandwidth between nodes is 1.6 GB/sec (total in both directions). The average uncontended read latencies to access the first word of a cache line are 313 ns for local memory and 796 ns averaged over local and all remote memories on a machine this size, with the furthest remote memory higher still [4]; the latency grows with each additional router hop.

3.2 Parallel Programming Models

The Origin2000 provides full hardware support for a cache-coherent shared address space. Other programming models like MP (here using the Message Passing Interface (MPI) standard primitives) and SHMEM are built in software but leverage the hardware support for a shared address space and efficient communication, for both ease of implementation and performance, as is increasingly the case in high-end tightly-coupled multiprocessors.

3.2.1 CC-SAS

In this model, remotely allocated data are accessed just like locally allocated data or data in a sequential program, using ordinary loads and stores. A load or store that misses in the cache and must be satisfied remotely communicates the data in hardware at cache-block granularity, and automatically replicates it in the local cache. The transparent naming and replication provide programming simplicity, especially for dynamic, fine-grained applications. In all our parallel programs, the initial or parent process spawns off a number of child processes, one for each additional processor. These cooperating processes are assigned chunks of work using static assignment. The synchronization structures used are locks and barriers. Processes are spawned once near the beginning of the program, do their work, and then terminate at the end of the parallel part of the program.

3.2.2 MP

In the message passing model, each process has only a private address space, and must communicate explicitly with other processes to access their (also private) data. Communication is done via explicit send-receive pairs, so the processes on both sides are involved. The sender specifies to whom to send the data but does not specify the destination addresses; these are specified by the matching receiver, in whose address space they are. The data may have to be packed and unpacked at each end to make the transferred data contiguous and hence increase communication performance.
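As a concrete illustration of this packing step (a minimal sketch, not code from the paper; Section 4.2 describes this kind of exchange for Ocean's left and right borders), a noncontiguous grid column is gathered into a contiguous buffer, sent as a single message, and scattered into a ghost column at the receiver. An MPI derived datatype such as MPI_Type_vector could describe the same strided transfer, at some cost in overhead.

```c
/* Illustrative sketch of the pack/send and receive/unpack pattern mentioned
 * above (not the paper's code): a noncontiguous grid column is packed into a
 * contiguous buffer, sent as one message, and unpacked into a ghost column. */
#include <mpi.h>

/* grid is stored row-major with leading dimension ld */
void send_column(const double *grid, int ld, int nrows, int col,
                 int dest, double *packbuf, MPI_Comm comm)
{
    for (int i = 0; i < nrows; i++)          /* pack: gather the strided column */
        packbuf[i] = grid[(size_t)i * ld + col];
    MPI_Send(packbuf, nrows, MPI_DOUBLE, dest, 0, comm);
}

void recv_ghost_column(double *grid, int ld, int nrows, int ghost_col,
                       int src, double *packbuf, MPI_Comm comm)
{
    MPI_Recv(packbuf, nrows, MPI_DOUBLE, src, 0, comm, MPI_STATUS_IGNORE);
    for (int i = 0; i < nrows; i++)          /* unpack into the ghost column */
        grid[(size_t)i * ld + ghost_col] = packbuf[i];
}
```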
While the MP model can be more difficult to program, more so for irregular applications, its potential advantages are better performance for coarse-grained communication and the fact that, once communication is explicitly coordinated with sends and receives, synchronization is implicit in the send-receive pairs in some blocking message passing models.

We began by using the vendor-optimized native MPI implementation (Message-Passing Toolkit 1.2), which was developed starting from the publicly available MPICH [9]. Both use the hardware shared address space and fast communication support to accelerate message passing. We found that the performance of the native SGI implementation and MPICH are quite comparable for our applications, especially for larger numbers of processors. We therefore selected MPICH, since its source code is available. Let us examine how it works at a high level.

The MPICH implementation (like the native SGI one) is faithful to the message passing model in that application data structures are allocated only in private per-process address spaces. Only the buffers and other data structures used by the MPI library itself, to implement send and receive operations, are allocated in the shared address space. The MPI buffers are allocated during the initialization process; they include a shared packet pool for exchanging control information for all messages as well as data for short messages (each packet has header and flag information as well as space for some data), and data buffer space for the data in large messages.

There are three data exchange mechanisms: short, eager and rendezvous. Which mechanism is used in a particular instance is determined by the library and depends on the size of the exchanged data. All copying of data to and from packet queues and data buffers is done with the memcpy function; note that while the hardware support for load-store communication is very useful, an invalidation-based coherence protocol can make such producer-consumer communication inefficient compared to an update protocol or a hardware-supported but non-coherent shared address space.

Short mode. If the message size is smaller than a certain threshold, the sender first requests a packet from the preallocated shared packet pool. The sender copies the data into the packet body itself (using memcpy), fills in the control information and then adds this packet to the incoming queue of the destination process. A receive operation checks the incoming queue and, if the corresponding packet is there, copies the data from the packet into its application data structure and releases the packet. Two other incoming queues per process, called a posted queue and an unexpected-messages queue, are also used by receives to manage the flow of packets and to handle the cases where a receive is posted before the data arrive. If a nonblocking or asynchronous receive is used, the wait function that is called later, before the data are actually needed, performs similar queue management.

Eager mode. If the data length is larger than the short-mode threshold but smaller than another threshold, the transfer uses eager mode. Message data are not kept in the packet queue in this case, only control information is. A send operation first requests a data buffer from the shared memory space and (if successful) copies the data into the buffer using memcpy. It then requests and uses packet queues for control in much the same way as short mode does. When the receiving side receives the packet, it obtains the buffer address from the packet and then copies the data from the buffer to its own application data structure. It then frees the packet and the buffer. Eager mode often offers the highest performance per byte transferred.

Rendezvous mode. If the message is beyond the threshold size for eager mode, or if a buffer large enough cannot be obtained from the shared buffer space for an eager-mode message, rendezvous mode is used. It is similar to eager mode, except that the data are transferred into the shared buffer not when the send operation is called but only when the send-receive match occurs (this means that a sender using nonblocking sends has to be careful not to overwrite the application data too early). A large message may be partitioned by the library into many smaller messages, each of which is managed in this manner. This mode is the most robust, but it may be less efficient than the eager protocol and is not used in our applications.
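To make the buffer-based protocol concrete, here is a minimal sketch of an eager-mode style exchange over the shared segment. It is not the MPICH source: packet_t, shared_alloc and shared_free are hypothetical names, flow control and the posted and unexpected-message queues are reduced to a spin-wait, and memory barriers are omitted. The point to notice is that the payload is copied twice, once by the sender into the shared buffer and once by the receiver out of it; this is exactly the extra copy examined in Section 5.1.

```c
/* A simplified, illustrative sketch of the eager-mode exchange described
 * above (not the actual MPICH source). packet_t, shared_alloc and
 * shared_free are hypothetical names; queue management, flow control and
 * memory barriers are omitted. */
#include <stddef.h>
#include <string.h>

typedef struct {
    volatile int ready;      /* set by the sender, cleared by the receiver  */
    int          src_rank;   /* control information carried in the packet   */
    int          tag;
    size_t       len;
    void        *buf;        /* shared data buffer holding the message body */
} packet_t;

void *shared_alloc(size_t n);    /* hypothetical allocator over the shared  */
void  shared_free(void *p);      /* segment set up at initialization time   */

/* eager send: copy #1, from the application data into a shared buffer */
void eager_send(packet_t *slot, int my_rank, int tag,
                const void *app_data, size_t len)
{
    void *buf = shared_alloc(len);
    memcpy(buf, app_data, len);          /* sender-side copy                */
    slot->src_rank = my_rank;
    slot->tag      = tag;
    slot->len      = len;
    slot->buf      = buf;
    slot->ready    = 1;                  /* post the control packet         */
}

/* eager receive: copy #2, from the shared buffer into the application data */
void eager_recv(packet_t *slot, void *app_data)
{
    while (!slot->ready)
        ;                                /* stand-in for real queue checks  */
    memcpy(app_data, slot->buf, slot->len);
    shared_free(slot->buf);
    slot->ready = 0;                     /* release the packet              */
}
```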
3.2.3 SHMEM

The SHMEM library provides the fastest interprocessor communication for large messages, using data passing and one-sided communication techniques. The two major primitives are put and get. A get is similar to a read in the CC-SAS model. In CC-SAS, an ordinary load instruction is used to fetch a cache block of remote data, and data replication is automatically supported by hardware. In SHMEM, an explicit get operation is used to copy a variable amount of data from another process (using bcopy, which does the same thing as the memcpy used in MP) and explicitly replicate it locally. The get operation specifies the address space (process number) from which to get (copy) the data, the local source address in that (private) address space, the size of the data to fetch, and the local destination address at which to place the fetched data.

In SHMEM, there is no flat, uniformly addressable shared address space or data structures that all processes can load/store to. However, the portions of the private address spaces of processes that hold the logically shared data structures are identical in their data allocation. Thus, a process refers to data in a remote process's partition of a distributed data structure by using an address as if it were referring to the corresponding location in its own partition of that data structure (and by also specifying which process's address space it is referring to), not by using a global address in the larger, logically shared data structure. Unlike in send-receive message passing, a process can refer to local variables in another process's address space when explicitly specifying communication, but unlike in CC-SAS it cannot load/store directly to those variables. A put is the dual of a get; however, each is an independent and complete way of performing data transfer. Only one of them is used per communication, and they are not used as pairs to orchestrate a data transfer as send and receive are. By providing a global segmented address space and by avoiding the need for matching send and receive operations to supply the full naming, the SHMEM model delivers significant programming simplicity over MP, even though it too does not provide fully transparent naming or replication. Table 1 summarizes the properties of the three models, both in general and as implemented on the Origin2000.

4 Applications and Algorithms

We examine applications whose CC-SAS versions are from the SPLASH-2 suite and that are within the class of applications on which we focus, choosing within this class a range of communication patterns and communication-to-computation ratios. The first application, FFT, uses a non-localized but regular all-to-all personalized communication pattern to perform a matrix transposition; i.e. every process communicates with every other, but the data sent to the different processes are different. The communication-to-computation ratio is quite high and diminishes only logarithmically with problem size. The second application, Ocean, exhibits primarily nearest-neighbor patterns, which are very important in practice, but in a multigrid formulation rather than on a single grid. The communication-to-computation ratio is large for small problem sizes but diminishes rapidly with increasing problem size. The third application, Radix sorting, also uses all-to-all personalized communication but in an irregular and scattered fashion, and has a very high communication-to-computation ratio that is independent of problem size and number of processors. The final application, blocked LU factorization of a dense matrix, uses one-to-many non-personalized communication: the pivot block and the pivot row blocks are communicated to √p processors each. However, the communication needs are relatively small compared to load imbalance. The CC-SAS programs for these applications are taken from the SPLASH-2 suite, using the best versions of each application with proper data placement.
Only Radix is modified, to use a prefix tree to accumulate local histograms into global histograms. The CC-SAS implementations are described in [15, 2, 6]. In the following, we discuss only the differences in communication orchestration and implementation across models.

For mostly regular applications such as these, the basic partitioning method and parallel algorithm are usually the same for the CC-SAS and MP programming models. The main difference is that communication is usually sender-based in MP for better performance, and it is structured to communicate in larger messages, as described below. We examined some of the best implementations of MP programs for these applications and kernels obtained from other scientists at a variety of sites, but our transformed SPLASH-2 programs were as good as or better than any of those under message passing. We therefore retained the programs we produced (they also have the benefit of being directly comparable, in a node performance sense, with the CC-SAS programs). When noncontiguous data have to be transferred, we pack/unpack them in the application programs themselves to avoid the buffer malloc/free overhead incurred by the corresponding MPI functions. The MPI functions used are MPI_Send, MPI_Irecv, MPI_Waitall, MPI_Allgather and MPI_Reduce.

Finally, for the SHMEM versions we restructured the MP versions to use put or get rather than send-receive pairs, and to synchronize appropriately. Packing and unpacking regularly structured data is left to the strided get and put operations, which do not have performance problems here. The choice of get or put is based on performance first and ease of programming second, experimenting with both options in various ways to determine which one to use. Using put generally transfers the data earlier (as soon as they are produced, as with a send) and reduces the latency seen by the destination; however, using get brings data into the cache, while put does not push the data into the destination cache (it cannot do so on this and many modern machines), and using get can obtain better reuse of buffers at the destination of the data.

Table 1: Summary of the properties of the three models, both in general and as implemented on the Origin2000.

Naming model for remote data: CC-SAS has a shared address space; MP has none (explicit messages between private address spaces); SHMEM has a segmented, symmetric global address space with explicit operations.
Replication and coherence: implicit and hardware-supported in caches for CC-SAS; explicit, with no hardware support, for MP and SHMEM.
Hardware support: CC-SAS leverages the hardware shared address space, cache coherence, and low-latency communication; MP uses the SAS and low latency for communication through shared buffers and does not need coherence; SHMEM uses the SAS and low latency for direct communication and does not need coherence.
Primitives used for data transfer on the Origin2000: load/store for CC-SAS; memcpy (*) for MP; bcopy (*) for SHMEM.
Communication overhead: CC-SAS is efficient for fixed-size, fine-grain transfers; MP is inefficient for fine grain and efficient for coarse grain; SHMEM is more efficient than MP for both, due to one-sided communication.
Synchronization: explicit and separate from communication in CC-SAS and SHMEM; can be implicit in the explicit communication in MP.
Performance predictability: more difficult for CC-SAS because communication is implicit; easier for MP and SHMEM because communication is explicit.
(*) The memcpy and bcopy routines used by MP and SHMEM differ only in the parameters used, and ultimately call exactly the same underlying data transfer routine.

No prefetching is used in the CC-SAS programs, although we have found that software-controlled prefetching of only remote data improves the performance of FFT by 1-15% and does little for the other applications [1]. The dynamically scheduled processor hides some memory latency, and in the SHMEM and MP cases we use asynchronous (nonblocking) operations to try to hide their latency, with wait function calls used after these operations when necessary to wait for data to leave or arrive. Let us discuss the differences of the MP and SHMEM versions from CC-SAS for the individual applications. The partitioning of work is the same across models in all cases.

4.1 FFT

In the MP implementation, the communication in the transpose phase is sender-initiated for higher performance. Each processor still communicates √n/p subrows of size √n/p to each other processor, but these subrows are disjoint in the local address space; they are therefore packed into a buffer before sending and unpacked implicitly when transposing locally at the destination. Another change we make from the CC-SAS version, based on observed performance, is that we do not use the linear, staggered way of communicating to avoid algorithmic hot spots in the transpose. Rather, the all-to-all personalized communication is performed in p - 1 loop iterations. In each iteration, each processor chooses a unique partner with which to exchange data bidirectionally, as sketched below. After the p - 1 iterations, each processor has exchanged data with every other processor. We experimented with other methods, including using smaller messages (a few subrows at a time) to take advantage of the overlap between communication in the transpose and computation in the local row-wise FFTs before or after it. However, the high cost of messages and the low amount of work between them end up hurting performance.
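As an illustration of this exchange structure (a sketch, not the code used in the paper), the loop below pairs processors with a rank-XOR-step schedule, one common way to enumerate each partner exactly once when p is a power of two; the element type is simplified to double, and the packing and unpacking of each patch are omitted.

```c
/* Illustrative sketch of a pairwise all-to-all exchange schedule for the
 * transpose (not the authors' exact code). Assumes p is a power of two so
 * that rank XOR step enumerates each partner exactly once; packing of the
 * patch into sendbuf and the local unpack/transpose are omitted. */
#include <mpi.h>

void transpose_exchange(double *sendbuf, double *recvbuf,
                        int patch_doubles, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    for (int step = 1; step < p; step++) {       /* p - 1 iterations        */
        int partner = rank ^ step;               /* unique partner per step */
        MPI_Request req;

        /* post the receive first (nonblocking), then send, then wait, so
         * the bidirectional exchange with the partner cannot deadlock */
        MPI_Irecv(recvbuf + (size_t)partner * patch_doubles, patch_doubles,
                  MPI_DOUBLE, partner, 0, comm, &req);
        MPI_Send(sendbuf + (size_t)partner * patch_doubles, patch_doubles,
                 MPI_DOUBLE, partner, 0, comm);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
}
```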
The SHMEM implementation is very similar to the MP implementation, except that it uses put operations rather than send and receive (the sender-initiated put is more efficient than get here due to latency hiding).

4.2 Ocean

In the MP implementation, the grids in this mostly near-neighbor application are partitioned into subgrids in the same way as in the CC-SAS program. A processor sends its upper and lower border data to its neighbors in one message each. When it communicates with its left or right neighbors, the (sub)column of data is noncontiguous; it is therefore first packed locally in the application and then sent in one message to reduce communication overhead, and unpacked into the ghost subcolumn at the other end. Unlike in FFT, the SHMEM implementation uses get operations to receive border data in a receiver-initiated way, due to the advantages of get here, but it uses the SHMEM strided get functions instead of packing the data itself since, unlike in MPI, there is no performance difference here.

4.3 Radix

Our MP implementation follows the same overall structure as the SPLASH-2 CC-SAS program. The first major difference is in how the global histogram is generated from local histograms. In the CC-SAS implementation, this is done using a binary prefix tree. In MPI, the fine-grained communication needed for this turns out to be very expensive. We therefore use an MPI_Allgather to collect the local histograms from all processes and give each process a local copy of all of them. Then, each process computes the global histograms locally. The performance of this phase does not affect overall performance much, which is dominated by the permutation itself. However, having all the histogram information locally greatly simplifies the later computation of parameters for the send/receive functions in the permutation phase.

Another difference is that in the MPI implementation it is extremely expensive to send/receive a message for each permuted key. While the writes to contiguous locations in the destination array in the permutation phase are temporally scattered, the keys that processor i permutes into processor j's partition of the output array will end up falling into that partition in several contiguous chunks, one chunk for each radix digit. We therefore buffer the data locally to compose larger messages before sending them out, which amounts to a local permutation of the data (using the now-local histograms) followed by communication. An interesting question is how to buffer and send the data. One possibility is for processor i to send only one message to each other processor j, containing all of i's keys that are destined for j. Processor j then reorganizes the data to their correct positions in its array. Alternatively, i can send each contiguously-destined chunk of keys separately to j, which can receive them directly into the correct position in its array, leading to multiple messages from each i to each j but no local data reorganization. Our experiments show that the latter performs better than the former on this machine, and we use the latter, though this bears further experimentation that machine access prevented us from performing. A similar local buffering method can be used to reduce the temporal scatteredness of remote writes in the CC-SAS version, but due to the local permutation cost this does not help significantly and we do not use it. Our SHMEM Radix is created from the MP program.
Since all processors know all the histogram information, due to the all-gather communication, get is used instead of put, since it performs better by bringing the data directly into the cache. The symmetric arrangement of each processor's partition of the output array makes this very easy to program.
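To make the symmetric-address naming used here concrete, below is a small, hedged sketch (not the authors' code, and written against the OpenSHMEM interface rather than the SGI SHMEM library actually used in the paper, whose corresponding calls such as shmalloc and start_pes are named slightly differently): each process allocates its partition from the symmetric heap, so a get names remote data with the local address of the corresponding location plus the partner's PE number, and the fetched data land in local, cacheable memory.

```c
/* Illustrative sketch of the symmetric-address get used in the SHMEM Radix
 * and LU versions (not the authors' code); OpenSHMEM interface assumed. */
#include <shmem.h>

#define CHUNK 4096

static long *partition;   /* symmetric: same local address on every PE */

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* symmetric allocation: every PE's partition sits at the same address */
    partition = shmem_malloc(CHUNK * sizeof(long));
    for (int i = 0; i < CHUNK; i++)
        partition[i] = me;                  /* fill my own partition        */
    shmem_barrier_all();                    /* data ready on all PEs        */

    /* A process names remote data with its own local address plus a PE
     * number; the get copies it into local, cacheable memory. */
    long local_copy[CHUNK];
    int  right = (me + 1) % npes;
    shmem_getmem(local_copy, partition, CHUNK * sizeof(long), right);

    shmem_barrier_all();
    shmem_free(partition);
    shmem_finalize();
    return 0;
}
```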

4.4 LU

In CC-SAS, each process directly fetches the pivot block data (or the needed pivot row blocks) from the owner, using load instructions. In MPI, however, the owner of a block sends it to the √p other processes that need it once it is produced. The SHMEM implementation replaces the sends with get operations on whole blocks. Get is used instead of put since it brings data into the cache, as in Ocean and Radix, and enables reuse of the buffer used by the get operation.

5 Performance Analysis

Let us compare the performance of the applications under the different programming models. For each application, we first examine speedups, measuring them with respect to the same sequential program for all models. Then we examine per-processor breakdowns of execution time, obtained using various tools available on the machine, to obtain more detailed insight into how time is distributed in each programming model and where the bottlenecks are. We divide the per-processor wall-clock running time into four categories: CPU busy time in computation, CPU stall time waiting for local cache misses (LMEM), CPU stall time for sending/receiving remote data (RMEM), and CPU time spent at synchronization events (SYNC). For CC-SAS programs, with their implicit data access and communication, the available tools do not allow us to distinguish LMEM time from RMEM time, so we are forced to lump them together (MEM = LMEM + RMEM). However, they can be distinguished for the other two models. In the MP model, since we are using asynchronous mode, on the receiver side the SYNC time is the time spent in MPI_Waitall, waiting for the incoming messages for which receives have been posted to arrive in the packet queue, indicating that the data are ready to be copied. During this time, if new messages that are not expected arrive, the receiver will also spend some time processing them, but that time is counted as RMEM time. On the sender side, SYNC time is the time the sender spends adding the control packet to the receiver's incoming queue. The RMEM time is all the time spent in MP functions (like send and receive) excluding the SYNC time. In the SHMEM model, the SYNC time is the global barrier time. The RMEM time is the time spent in get/put operations and collective communication calls; a little synchronization time is included in these operations, but unlike for MPI we do not have the source code for SHMEM, and the available tools cannot tell this time apart.

For a given machine size, problem size is a very important variable in comparing programming models. Generally, increasing problem size reduces the communication-to-computation ratio and will tend to diminish the performance differences among models. Thus, although large problems are important for large machines, it is very important to examine smaller problems too. Of course, we must be careful to pay significant attention only to those problem sizes that are realistic for the application and machine at hand, and that deliver reasonable speedups for at least one of the programming models. Our general approach is to examine a range of potentially interesting problem sizes at each machine size. That said, Figure 1 shows the speedups for FFT, OCEAN, RADIX, and LU under the three programming models for only the largest data set we have run (FFT: M double complex data, OCEAN: grid size, RADIX: 12M integers, LU: matrix). For all these four applications, the SHMEM program works quite well.
The CC-SAS program is close. For MP, however, the initial performance of none of these four applications is satisfactory, even though we are using almost the same algorithms and data structures as in SHMEM. Let us examine why.

Figure 1: Speedups of FFT(M), OCEAN(25), RADIX(12M), and LU(496) for the three models on up to 64 processors.

5.1 Improving MP Performance

Consider FFT as an example. Figure 2(a) shows the time breakdown for a smaller, 2K-point data set for FFT on 64 processors. The times are extremely flat across processors, as they are for OCEAN and RADIX as well, since every processor executes nearly the same number of instructions in these applications. The LMEM time is a little imbalanced, but is not the major bottleneck here. It is the RMEM time and the SYNC time that are very high and extremely unbalanced in the MP version, and these cause the parallel performance to be bad. This is despite the fact that we make a special effort to avoid using rendezvous mode, since it is potentially slower than eager mode, by making the threshold large enough and allocating enough buffer space.

Figure 2: Time breakdown for FFT under the MP model for a 2K-point problem on 64 processors: (a) original, (b) direct copy.

Further analysis tells us that the problem is caused mainly by an extra copy in the send function in the MP implementation. As discussed earlier, only the buffers and other data structures used by the MP library itself to implement send and receive calls are allocated in the shared address space (in both the MPICH and SGI implementations). This means that a sending process cannot directly write data into a receiving process's data structures, since it cannot reference them directly, but can only write the data into the shared buffers from which the receiver can read them. Thus, the data are copied twice. If we can copy the data directly from the source to the destination data structures without using the buffers (the sender will no longer copy the data, only the receiver will), we may be able to improve performance by eliminating one copy.

Eliminating the use of the shared buffer space has other performance benefits as well. Requesting and obtaining a buffer itself takes time. Worse still, for a large number of processors like 64, processes often compete for shared memory resources, causing a lot of contention in the shared memory allocation function. This contention increases RMEM time at the sender, but it also increases SYNC time at the corresponding receivers, which now have to wait longer, and it causes imbalances in both these time components.

Since processes allocate their data in private address spaces in MP programs, eliminating the extra copy (buffering) would normally require the help of the operating system, which can access both address spaces. However, since we have an underlying shared address space machine, we can achieve this goal without involving the operating system if we modify both the application (slightly) and the message-passing library. We increase the size of the shared address space and, in the application, allocate all the logically shared data structures that are involved in communication in this shared address space, even though the data structures are organized exactly the same way as in the original MP program (they are simply allocated in the shared rather than the private address space, by using a shared malloc rather than a regular one).

How sends and receives are called does not change; however, in the MPI implementation, once the send-receive match via the packet queues establishes the source and destination addresses, either application process can directly copy data to or from the application data structures of the other (still using memcpy). In particular, in eager mode the sender now only places the control packet into the receiver's queue, but does not request a shared buffer or copy data. When the match happens, the receiver copies data directly from the sender's data structures (which are in the shared address space). Of course, this means that the sender cannot modify those data structures in the interim (as with a general nonblocking asynchronous send), so additional synchronization might be needed in the application. Without buffers, rendezvous mode now works similarly to eager mode and its overhead is greatly reduced. Short mode remains the same as before, since the data to be copied are small and there is no buffer allocation overhead anyway.

Figure 3: Time breakdown for FFT under the MP model with 2K and 4M problem sizes on 64 processors, with the new MPI implementation.

Figure 4: Speedups of MP and improved MP (MP-NEW) for FFT(M), OCEAN(25), RADIX(12M), and LU(496) on up to 64 processors.

Figure 2(b) shows the new per-processor breakdowns of execution time. Removing the extra copy clearly improves performance dramatically and reduces the imbalances in RMEM and SYNC time. The speedups for the 2K- and M-point problem sizes have increased from 1.26 and to .17 and 55.17, respectively. However, the speedup is still lower than that of CC-SAS or SHMEM. The SYNC and RMEM time components are still high. This brings us to another major source of performance loss in the MP implementations: the locking mechanism used to manage the incoming packet queues. In the original implementations, when a process sends a message to another it obtains a lock on the latter's incoming queue, adds the control-information packet to the queue, and releases the lock. When the receiver receives a message, it also has to use lock/unlock to delete the entry from the queue. This locking and the resulting contention show up as a significant problem, especially for smaller problem sizes.

Performance can be improved by using lock-free queue management, as follows. Instead of locking to add or delete a packet in a shared incoming queue for each process, a fixed packet is used to transfer control information between each pair of processes (thus, there are p^2 packet slots instead of p packet queues). A flag in this fixed packet is used to control the message flow.
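The following is a minimal sketch of such a per-pair, flag-controlled slot (an illustration under stated assumptions, not the modified library source). Because each slot has a single producer and a single consumer, the full flag alone serializes it, and with the direct-copy change described above the slot carries only control information, so the receiver performs the one remaining memcpy straight from the sender's application data structure.

```c
/* Illustrative sketch of the per-pair, flag-controlled control slot (not the
 * modified MPICH source). slot[i][j] carries control information from sender
 * i to receiver j; only i fills it and only j drains it, so a flag suffices
 * and no lock is needed. A real implementation allocates this array in the
 * shared segment and adds memory barriers around the flag updates. */
#include <string.h>

#define MAXP 64

typedef struct {
    volatile int full;        /* 0 = slot free, 1 = control packet present   */
    const void  *src_addr;    /* sender's application data (shared segment)  */
    size_t       len;
    int          tag;
} ctl_slot_t;

ctl_slot_t slot[MAXP][MAXP];  /* p*p fixed slots instead of p locked queues  */

/* sender i: post control information for receiver j (no payload copy here) */
void post_send(int i, int j, const void *src, size_t len, int tag)
{
    ctl_slot_t *s = &slot[i][j];
    while (s->full)
        ;                     /* wait until the previous message is consumed */
    s->src_addr = src;
    s->len      = len;
    s->tag      = tag;
    s->full     = 1;          /* the flag controls the message flow          */
}

/* receiver j: match the packet from sender i and copy the data directly
 * from the sender's application data structure into its own */
void complete_recv(int i, int j, void *dst)
{
    ctl_slot_t *s = &slot[i][j];
    while (!s->full)
        ;                     /* spin until sender i has posted              */
    memcpy(dst, s->src_addr, s->len);   /* the single remaining copy         */
    s->full = 0;              /* free the slot                               */
}
```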
Note that this still provides point-to-point order among messages. The lock-free mechanism further improves the performance of FFT. Indeed, after all these changes to the MP library (mostly) and the application (a little, in how data are allocated in address spaces), the performance of the MP versions is comparable with that of the equivalent CC-SAS programs, at least for this problem size. The final time breakdown is shown in Figure 3, for both this and a larger problem size. In fact, using the final improved MP implementation, we find that the performance of OCEAN, RADIX and LU is also greatly improved. The comparison of the speedups for MP and improved MP (MP-NEW) is shown in Figure 4 for a large data set and in Figure 5 for a small data set.

Figure 5: Speedups of MP and improved MP for FFT(64K), OCEAN(25), RADIX(1M), and LU(2) on up to 64 processors.

Let us now compare the performance under the different programming models for each application. From here on we use the improved MP (no extra copy, and lock-free queue management) in all the applications, to enable exploration of the remaining performance differences among models once these dominant bottlenecks are alleviated; we call it simply MP, even though it violates the pure MP model of using only private address spaces in the applications themselves.

5.2 FFT

The speedups with the three programming models for different data sizes, from 64K to M double complex data points, are shown in Figure 6 (see footnote 1). Speedups are quite similar across models at the smaller processor counts for most problem sizes, with differences becoming substantial only beyond that.

Figure 6: Speedups for FFT under SHMEM, CC-SAS and MP on up to 64 processors with different problem sizes.

Footnote 1: We found that for the MP and SHMEM programs, which use explicit communication, the same communication functions take much longer in the first transpose than in following transposes. Detailed profiling showed that the extra time was in the actual remote data movement operations, like memcpy and bcopy (which are basically identical), used by these functions, which are much more expensive when invoked for the first time on a given set of pages. This page-level cold-start effect, which is not substantial in the CC-SAS case, is eliminated in our results by having each processor touch, in advance, all the remote pages it may need to communicate with later. Simply doing a single load-store reference to each such page, using the machine's shared address space support, suffices, though many other methods will do. This touching is done before our timing measurements begin. The cold-start problem on pages is large only for kernels with a lot of communicated pages, like FFT and Radix (as we will see later); real applications that use these kernels may use them multiple times, amortizing this cost.

Even for larger p, the performance of (the new) MP and SHMEM is quite similar on all data sets. However, compared with CC-SAS, their speedups for smaller problem sizes are much lower. With increasing problem size, their speedups improve and finally catch up with that of CC-SAS. For the M data size on 64 processors, the speedups for all models are about 60. One reason that speedup is so high for big problems is that stall time on local memory relative to busy time may be reduced greatly compared to a uniprocessor run, due to local working sets fitting in the cache in a multiprocessor run while they did not in the uniprocessor case (for small problems they fit in the cache in the uniprocessor case as well, and for very large problems they may fit in neither). Note that the inherent communication-to-computation ratio itself does not diminish rapidly with problem size in FFT (only logarithmically). So although message sizes become larger and amortize overheads much better, communication alone does not account for the large increases in speedup with problem size, even in MP. To illustrate this, Table 2 shows the average ratios across processors (expressed as percentages) of local memory time to busy time for two data sets, one smaller and one larger, using the MP executions as an example. The reduction in this ratio with increasing number of processors shows that the capacity-induced superlinear effect in local access is much larger for the larger data set (a two-fold reduction in the local memory time component when going from 1 to 64 processors) than for the smaller one in this case. Fortunately, this effect applies about equally to all programming models, and is quite clean in our applications even in the CC-SAS case, since they do not have significant capacity misses on remotely allocated data (see footnote 2).

Table 2: Average ratio (in percentage) of local memory time to busy time for the 2K- and 4M-point problem sizes in the MP model.

Footnote 2: CC-SAS has another small problem for making this measurement, in that some of the local misses in the transpose are converted to remote misses; MP and SHMEM do not have this problem since all transposition is done locally, separately from communication.

Although the capacity-induced superlinear effect is real, we can ignore it by replacing the LMEM time in the uniprocessor case with the sum of the LMEM times across processors in the parallel case (that is, the no-cap speedup is (T1 - LMEM1 + Σp LMEMp) / Tp, where T1 and Tp are the uniprocessor and parallel running times). The speedups calculated in this way are smaller, as shown in the no-cap entries in Table 3. For comparison, we also include the actual speedups (including capacity effects). In other applications, such as OCEAN and RADIX, the superlinear capacity effect on local misses is also severe, though again fortunately similar for all models: the no-cap MP speedups for OCEAN for 25-by-25 grids and RADIX for 12M keys are 4 and 25, respectively, while their corresponding actual speedups are and 44.

Table 3: FFT speedup comparison with and without cache capacity effects for the 2K- and 4M-point problem sizes on up to 64 processors.

The per-processor execution time breakdowns for the 2K and 4M problem sizes on 64 processors for MP, CC-SAS, and SHMEM are shown in Figures 3, 7 and 8, respectively.
The time for SHMEM and MP for each problem size is almost the same, and is a little higher than that of CC-SAS. This is primarily due to the extra packing and unpacking operations needed in the SHMEM and MP programs, in which the (noncontiguous) subrows of a transferred √n/p-by-√n/p patch are packed contiguously before they are sent out and unpacked after they have arrived at the destination. In CC-SAS, on the other hand, the data are read individually at the fine granularity of cache lines, so there is no need to pack and unpack them. This difference is imposed by the performance-driven need to make messages larger in SHMEM and MP.

Figure 7: Time breakdown for FFT under the CC-SAS model with 2K and 4M point problems on 64 processors.

Figure 8: Time breakdown for FFT under the SHMEM model with 2K and 4M point problems on 64 processors.

The main differences among models for FFT lie in the data access stall components. The CC-SAS model has a much lower MEM time than the others for smaller data sets and larger p. Recall that we have to lump LMEM and RMEM together in this model since they cannot be separated by the available tools. When n/p is small, so are the messages in MP and SHMEM, so message overhead (of software management in MP as well as of the basic data transfer operations used by both MPI and SHMEM) is not well amortized. This is worse in MP than in SHMEM, both because the producer-consumer communication needed for (control) packet queue management is a poor match for the underlying invalidation-based cache coherence protocol, and especially because an explicit send and a matching receive must be initiated separately for each communication. The latter potentially increases not only messaging overhead and end-point contention but also synchronization time, since the sends and receives have to be posted in timely ways and matched. In the CC-SAS model, the transfers of cache blocks triggered by loads and stores are very efficient for the fine-grained communication needed. With automatic hardware caching, the fetched data also arrive in the cache rather than in main memory, and can be used very efficiently locally. As n/p increases, message size increases and explicit communication with send-receive or (more so) put/get becomes more efficient, so the performance of MP and SHMEM equals that of CC-SAS.

Finally, consider the difference between MEM times in SHMEM and MP. SHMEM's RMEM time is less than that of MP because its one-sided communication is more efficient, as discussed above.

But, surprisingly, SHMEM has a much higher LMEM time. Through further analysis, we find that the greater LMEM time is spent in the transpose phases, specifically during the local data movement needed to unpack the deposited or received data, i.e. to extract the subrows of the transferred square blocks and move them to the correct transposed positions in the local matrix. The data transfer operation we use in SHMEM is put. Unlike with a receive in MP, where the data that are moved to the receiver's data structures are also placed in its cache, the bcopy in a put places the data in the receiver's main memory but not in its cache (see Section 4). This means that when unpacking the data, the MP code reads the data out of the cache while the SHMEM code reads them from main memory, increasing local memory stall time. We verify this by measuring the unpacking separately, as well as by having the destination process of the put touch the buffer before unpacking (thus bringing the data into the cache), in which case the unpacking is as fast as in the MP version. Using get instead of put helps with the caching issue, but communication latency is then not hidden well and synchronization time increases, so overall there is not much difference in performance in FFT. Note that in CC-SAS the transposition of data is done as part of the load-store communication itself, and the data are brought into the destination processor's cache.

5.3 OCEAN

Ocean has many grid computations in each time-step that use many different grids, some of which involve near-neighbor communication and some of which involve no communication. A large fraction of the execution time is spent in a multigrid equation solver in each time-step, which exhibits nearest-neighbor communication but at various levels of a hierarchy of successively smaller grids.

The speedups for OCEAN are shown in Figure 9. At the smaller processor counts, the speedups in the three programming models are similar; however, there are large differences for larger processor counts; in particular, the performance of CC-SAS is now much worse for smaller problem sizes (the opposite of the FFT situation).

Figure 9: Speedups for OCEAN under SHMEM, CC-SAS and MP on up to 64 processors with different problem sizes.

The time breakdowns for the intermediate, 126-by-126 grid size on 64 processors are shown in Figure 10. The busy times are very balanced and similar. The MEM time in all three cases is imbalanced, but it is much higher and more imbalanced in CC-SAS. There are several likely reasons for this behavior of CC-SAS relative to SHMEM and MP, which we unfortunately cannot determine easily because of the lack of available tools (note that the LMEM category for CC-SAS is actually MEM = LMEM + RMEM). One is poor spatial locality at cache-block granularity for remote access at the column-oriented boundaries of the square partitions: only one boundary word is needed in each row, but a whole cache block is fetched. This poor spatial locality is on local rather than remote accesses for MP and SHMEM, since they pack the data contiguously locally before communicating. Another likely possibility is that local capacity misses behave differently across programming models. In MP and SHMEM, a process's partitions of all the different grids are allocated contiguously in its private address space, while in CC-SAS each entire grid is allocated as a large contiguous shared array in the shared address space; even though a process's partition of each grid is contiguous due to the use of 4-D arrays, there is a very large gap between a processor's partitions of different grids in the data layout. This causes many more, and imbalanced, local conflict misses in Ocean, since multiple grids are accessed together in many computations.
A third possibility is that perhaps certain kinds of data and pointer arrays have not been properly placed among the distributed memories in CC-SAS, though they have to be in MP and SHMEM, and these become relatively more of an issue at smaller problem sizes. (We already obtained a great improvement compared to our original program by placing some data structures better; perhaps more work can be done on this, though the lack of appropriate information from the machine makes the problems difficult to diagnose.) Larger problem sizes and smaller machines make local capacity misses dominate, so the difference between models is small.

Figure 10: Time breakdown for OCEAN (126) on 64 processors: (a) CC-SAS, (b) MP, (c) SHMEM.

5.4 RADIX

The speedups are shown in Figure 11. Unlike in FFT and OCEAN, no model performs very well for data sets smaller than 64M integer keys, though SHMEM is much better than the others. On larger data sets, the three become closer, though SHMEM is still the best, followed by MP.

Figure 11: Speedups for RADIX under SHMEM, CC-SAS and MP on up to 64 processors with different problem sizes.

The per-processor time breakdowns for the largest, 12M-key problem on 64 processors are shown in Figure 12. In all models, busy time is very small, although there is some increase in SHMEM and MP due to the additional local work in the permutation (see Section 4). The MEM time is very high, including both LMEM and RMEM (see footnote 3). Communication is very bursty and there is

Footnote 3: Similarly to FFT, for the MP and SHMEM programs the communication operations take much longer in the first communication (permutation) phase. Touching remote pages in advance solves this problem, though unlike in FFT it is unpredictable which remote pages will be communicated with, so all pages in the whole logical array are touched.


Steps in Creating a Parallel Program Computational Problem Steps in Creating a Parallel Program Parallel Algorithm D e c o m p o s i t i o n Partitioning A s s i g n m e n t p 0 p 1 p 2 p 3 p 0 p 1 p 2 p 3 Sequential Fine-grain Tasks Parallel

More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Design of Parallel Algorithms. Models of Parallel Computation

Design of Parallel Algorithms. Models of Parallel Computation + Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes

More information

The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor

The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 6, June 1994, pp. 573-584.. The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor David J.

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced?

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced? Chapter 10: Virtual Memory Questions? CSCI [4 6] 730 Operating Systems Virtual Memory!! What is virtual memory and when is it useful?!! What is demand paging?!! When should pages in memory be replaced?!!

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

University of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2.

University of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2. University of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2. Solutions Problem 1 Problem 3.12 in CSG (a) Clearly the partitioning can be

More information

Plot SIZE. How will execution time grow with SIZE? Actual Data. int array[size]; int A = 0;

Plot SIZE. How will execution time grow with SIZE? Actual Data. int array[size]; int A = 0; How will execution time grow with SIZE? int array[size]; int A = ; for (int i = ; i < ; i++) { for (int j = ; j < SIZE ; j++) { A += array[j]; } TIME } Plot SIZE Actual Data 45 4 5 5 Series 5 5 4 6 8 Memory

More information

Three parallel-programming models

Three parallel-programming models Three parallel-programming models Shared-memory programming is like using a bulletin board where you can communicate with colleagues. essage-passing is like communicating via e-mail or telephone calls.

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

Flynn s Classification

Flynn s Classification Flynn s Classification SISD (Single Instruction Single Data) Uniprocessors MISD (Multiple Instruction Single Data) No machine is built yet for this type SIMD (Single Instruction Multiple Data) Examples:

More information

CS 475: Parallel Programming Introduction

CS 475: Parallel Programming Introduction CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.

More information

Memory Management! How the hardware and OS give application pgms:" The illusion of a large contiguous address space" Protection against each other"

Memory Management! How the hardware and OS give application pgms: The illusion of a large contiguous address space Protection against each other Memory Management! Goals of this Lecture! Help you learn about:" The memory hierarchy" Spatial and temporal locality of reference" Caching, at multiple levels" Virtual memory" and thereby " How the hardware

More information

EECS 570 Final Exam - SOLUTIONS Winter 2015

EECS 570 Final Exam - SOLUTIONS Winter 2015 EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Memory Consistency. Challenges. Program order Memory access order

Memory Consistency. Challenges. Program order Memory access order Memory Consistency Memory Consistency Memory Consistency Reads and writes of the shared memory face consistency problem Need to achieve controlled consistency in memory events Shared memory behavior determined

More information

CS161 Design and Architecture of Computer Systems. Cache $$$$$

CS161 Design and Architecture of Computer Systems. Cache $$$$$ CS161 Design and Architecture of Computer Systems Cache $$$$$ Memory Systems! How can we supply the CPU with enough data to keep it busy?! We will focus on memory issues,! which are frequently bottlenecks

More information

Large Scale Multiprocessors and Scientific Applications. By Pushkar Ratnalikar Namrata Lele

Large Scale Multiprocessors and Scientific Applications. By Pushkar Ratnalikar Namrata Lele Large Scale Multiprocessors and Scientific Applications By Pushkar Ratnalikar Namrata Lele Agenda Introduction Interprocessor Communication Characteristics of Scientific Applications Synchronization: Scaling

More information

Module 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.

Module 10: Design of Shared Memory Multiprocessors Lecture 20: Performance of Coherence Protocols MOESI protocol. MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line

More information

Designing Parallel Programs. This review was developed from Introduction to Parallel Computing

Designing Parallel Programs. This review was developed from Introduction to Parallel Computing Designing Parallel Programs This review was developed from Introduction to Parallel Computing Author: Blaise Barney, Lawrence Livermore National Laboratory references: https://computing.llnl.gov/tutorials/parallel_comp/#whatis

More information

Three basic multiprocessing issues

Three basic multiprocessing issues Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Distributed Shared Memory and Memory Consistency Models

Distributed Shared Memory and Memory Consistency Models Lectures on distributed systems Distributed Shared Memory and Memory Consistency Models Paul Krzyzanowski Introduction With conventional SMP systems, multiple processors execute instructions in a single

More information

Algorithm Engineering with PRAM Algorithms

Algorithm Engineering with PRAM Algorithms Algorithm Engineering with PRAM Algorithms Bernard M.E. Moret moret@cs.unm.edu Department of Computer Science University of New Mexico Albuquerque, NM 87131 Rome School on Alg. Eng. p.1/29 Measuring and

More information

Memory Management! Goals of this Lecture!

Memory Management! Goals of this Lecture! Memory Management! Goals of this Lecture! Help you learn about:" The memory hierarchy" Why it works: locality of reference" Caching, at multiple levels" Virtual memory" and thereby " How the hardware and

More information

Cache introduction. April 16, Howard Huang 1

Cache introduction. April 16, Howard Huang 1 Cache introduction We ve already seen how to make a fast processor. How can we supply the CPU with enough data to keep it busy? The rest of CS232 focuses on memory and input/output issues, which are frequently

More information

Chapter Seven. Large & Fast: Exploring Memory Hierarchy

Chapter Seven. Large & Fast: Exploring Memory Hierarchy Chapter Seven Large & Fast: Exploring Memory Hierarchy 1 Memories: Review SRAM (Static Random Access Memory): value is stored on a pair of inverting gates very fast but takes up more space than DRAM DRAM

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

Memory Management. Goals of this Lecture. Motivation for Memory Hierarchy

Memory Management. Goals of this Lecture. Motivation for Memory Hierarchy Memory Management Goals of this Lecture Help you learn about: The memory hierarchy Spatial and temporal locality of reference Caching, at multiple levels Virtual memory and thereby How the hardware and

More information

Conventional Computer Architecture. Abstraction

Conventional Computer Architecture. Abstraction Conventional Computer Architecture Conventional = Sequential or Single Processor Single Processor Abstraction Conventional computer architecture has two aspects: 1 The definition of critical abstraction

More information

Memory Hierarchies &

Memory Hierarchies & Memory Hierarchies & Cache Memory CSE 410, Spring 2009 Computer Systems http://www.cs.washington.edu/410 4/26/2009 cse410-13-cache 2006-09 Perkins, DW Johnson and University of Washington 1 Reading and

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) A 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

MPI Optimisation. Advanced Parallel Programming. David Henty, Iain Bethune, Dan Holmes EPCC, University of Edinburgh

MPI Optimisation. Advanced Parallel Programming. David Henty, Iain Bethune, Dan Holmes EPCC, University of Edinburgh MPI Optimisation Advanced Parallel Programming David Henty, Iain Bethune, Dan Holmes EPCC, University of Edinburgh Overview Can divide overheads up into four main categories: Lack of parallelism Load imbalance

More information

Memory. From Chapter 3 of High Performance Computing. c R. Leduc

Memory. From Chapter 3 of High Performance Computing. c R. Leduc Memory From Chapter 3 of High Performance Computing c 2002-2004 R. Leduc Memory Even if CPU is infinitely fast, still need to read/write data to memory. Speed of memory increasing much slower than processor

More information

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 341 6.2 Types of Memory 341 6.3 The Memory Hierarchy 343 6.3.1 Locality of Reference 346 6.4 Cache Memory 347 6.4.1 Cache Mapping Schemes 349 6.4.2 Replacement Policies 365

More information

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

arxiv: v1 [cs.dc] 27 Sep 2018

arxiv: v1 [cs.dc] 27 Sep 2018 Performance of MPI sends of non-contiguous data Victor Eijkhout arxiv:19.177v1 [cs.dc] 7 Sep 1 1 Abstract We present an experimental investigation of the performance of MPI derived datatypes. For messages

More information

Parallel Programming Interfaces

Parallel Programming Interfaces Parallel Programming Interfaces Background Different hardware architectures have led to fundamentally different ways parallel computers are programmed today. There are two basic architectures that general

More information

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested

More information

Distributed Shared Memory

Distributed Shared Memory Distributed Shared Memory History, fundamentals and a few examples Coming up The Purpose of DSM Research Distributed Shared Memory Models Distributed Shared Memory Timeline Three example DSM Systems The

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

1 Connectionless Routing

1 Connectionless Routing UCSD DEPARTMENT OF COMPUTER SCIENCE CS123a Computer Networking, IP Addressing and Neighbor Routing In these we quickly give an overview of IP addressing and Neighbor Routing. Routing consists of: IP addressing

More information

Introduction CHAPTER. Practice Exercises. 1.1 What are the three main purposes of an operating system? Answer: The three main puropses are:

Introduction CHAPTER. Practice Exercises. 1.1 What are the three main purposes of an operating system? Answer: The three main puropses are: 1 CHAPTER Introduction Practice Exercises 1.1 What are the three main purposes of an operating system? Answer: The three main puropses are: To provide an environment for a computer user to execute programs

More information

CHAPTER 4 MEMORY HIERARCHIES TYPICAL MEMORY HIERARCHY TYPICAL MEMORY HIERARCHY: THE PYRAMID CACHE PERFORMANCE MEMORY HIERARCHIES CACHE DESIGN

CHAPTER 4 MEMORY HIERARCHIES TYPICAL MEMORY HIERARCHY TYPICAL MEMORY HIERARCHY: THE PYRAMID CACHE PERFORMANCE MEMORY HIERARCHIES CACHE DESIGN CHAPTER 4 TYPICAL MEMORY HIERARCHY MEMORY HIERARCHIES MEMORY HIERARCHIES CACHE DESIGN TECHNIQUES TO IMPROVE CACHE PERFORMANCE VIRTUAL MEMORY SUPPORT PRINCIPLE OF LOCALITY: A PROGRAM ACCESSES A RELATIVELY

More information

Chap. 4 Multiprocessors and Thread-Level Parallelism

Chap. 4 Multiprocessors and Thread-Level Parallelism Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,

More information

Performance Optimization Part II: Locality, Communication, and Contention

Performance Optimization Part II: Locality, Communication, and Contention Lecture 7: Performance Optimization Part II: Locality, Communication, and Contention Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Beth Rowley Nobody s Fault but Mine

More information

Memory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB

Memory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB Memory Consistency and Multiprocessor Performance Adapted from UCB CS252 S01, Copyright 2001 USB 1 Memory Consistency Model Define memory correctness for parallel execution Execution appears to the that

More information

Parallelization Principles. Sathish Vadhiyar

Parallelization Principles. Sathish Vadhiyar Parallelization Principles Sathish Vadhiyar Parallel Programming and Challenges Recall the advantages and motivation of parallelism But parallel programs incur overheads not seen in sequential programs

More information

Caches Concepts Review

Caches Concepts Review Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on

More information

6.1 Multiprocessor Computing Environment

6.1 Multiprocessor Computing Environment 6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,

More information

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

Cache Performance (H&P 5.3; 5.5; 5.6)

Cache Performance (H&P 5.3; 5.5; 5.6) Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st

More information

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors? Parallel Computers CPE 63 Session 20: Multiprocessors Department of Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection of processing

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

Performance analysis basics

Performance analysis basics Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Homework # 2 Due: October 6. Programming Multiprocessors: Parallelism, Communication, and Synchronization

Homework # 2 Due: October 6. Programming Multiprocessors: Parallelism, Communication, and Synchronization ECE669: Parallel Computer Architecture Fall 2 Handout #2 Homework # 2 Due: October 6 Programming Multiprocessors: Parallelism, Communication, and Synchronization 1 Introduction When developing multiprocessor

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Caching Basics. Memory Hierarchies

Caching Basics. Memory Hierarchies Caching Basics CS448 1 Memory Hierarchies Takes advantage of locality of reference principle Most programs do not access all code and data uniformly, but repeat for certain data choices spatial nearby

More information

Memory Consistency and Multiprocessor Performance

Memory Consistency and Multiprocessor Performance Memory Consistency Model Memory Consistency and Multiprocessor Performance Define memory correctness for parallel execution Execution appears to the that of some correct execution of some theoretical parallel

More information

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions Data In single-program multiple-data (SPMD) parallel programs, global data is partitioned, with a portion of the data assigned to each processing node. Issues relevant to choosing a partitioning strategy

More information

A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination

A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination 1 1 A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination David M. Koppelman Department of Electrical & Computer Engineering Louisiana State University, Baton Rouge koppel@ee.lsu.edu

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Chapter 20: Database System Architectures

Chapter 20: Database System Architectures Chapter 20: Database System Architectures Chapter 20: Database System Architectures Centralized and Client-Server Systems Server System Architectures Parallel Systems Distributed Systems Network Types

More information

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 233 6.2 Types of Memory 233 6.3 The Memory Hierarchy 235 6.3.1 Locality of Reference 237 6.4 Cache Memory 237 6.4.1 Cache Mapping Schemes 239 6.4.2 Replacement Policies 247

More information

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache Classifying Misses: 3C Model (Hill) Divide cache misses into three categories Compulsory (cold): never seen this address before Would miss even in infinite cache Capacity: miss caused because cache is

More information

6. Parallel Volume Rendering Algorithms

6. Parallel Volume Rendering Algorithms 6. Parallel Volume Algorithms This chapter introduces a taxonomy of parallel volume rendering algorithms. In the thesis statement we claim that parallel algorithms may be described by "... how the tasks

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra ia a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

CS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches

CS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches CS 61C: Great Ideas in Computer Architecture Direct Mapped Caches Instructor: Justin Hsia 7/05/2012 Summer 2012 Lecture #11 1 Review of Last Lecture Floating point (single and double precision) approximates

More information

Programming with MPI

Programming with MPI Programming with MPI p. 1/?? Programming with MPI Miscellaneous Guidelines Nick Maclaren nmm1@cam.ac.uk March 2010 Programming with MPI p. 2/?? Summary This is a miscellaneous set of practical points Over--simplifies

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008 Parallel Computing Using OpenMP/MPI Presented by - Jyotsna 29/01/2008 Serial Computing Serially solving a problem Parallel Computing Parallelly solving a problem Parallel Computer Memory Architecture Shared

More information