A Comparison of MPI, SHMEM and Cache-coherent Shared Address Space Programming Models on the SGI Origin2000

Hongzhang Shan and Jaswinder Pal Singh
Department of Computer Science, Princeton University

Abstract

We compare the performance of three major programming models, a load-store cache-coherent shared address space (CC-SAS), message passing (MP), and the segmented SHMEM model, on a modern, 64-processor hardware cache-coherent machine, one of the two major types of platforms upon which high-performance computing is converging. We focus on applications that are either regular and predictable or at least do not require fine-grained dynamic replication of irregularly accessed data. Within this class, we use programs with a range of important communication patterns. We examine whether the basic parallel algorithm and communication structuring approaches needed for best performance are similar or different among the models, whether some models have substantial performance advantages over others as problem size and number of processors change, what the sources of these performance differences are, where the programs spend their time, and whether substantial improvements can be obtained by modifying either the application programming interfaces or the implementations of the programming models on this type of platform.

1 Introduction

Architectural convergence has made it common for different programming models to be supported on the same platform, either directly in hardware or via software. Three common programming models in use today are (i) explicit message passing (MP, exemplified by the Message Passing Interface or MPI standard [7]), in which both communication and replication are explicit, (ii) a cache-coherent shared address space (CC-SAS), in which both communication and replication are implicit, and (iii) the SHMEM programming model. SHMEM is like MPI in that communication and replication are explicit and usually made coarse-grained for good performance; however, unlike the send-receive pair in MPI, communication in SHMEM requires processor involvement on only one side (using put or get primitives), and SHMEM allows a process to name or specify remote data via a local name and a process identifier.

On the platform side, high-performance computing is converging to mainly two types of platforms: (i) tightly-coupled multiprocessors, which increasingly support a cache-coherent shared address space in hardware, and in which the hardware support is leveraged to implement the MP and SHMEM models efficiently as well, and (ii) less tightly-coupled clusters of either uniprocessors or such tightly-coupled multiprocessors, in which all the programming models are implemented in software across nodes. From both a user's and a system designer's perspective, this state of affairs makes it important to understand the relative advantages and disadvantages of these three models, both in programmability and in performance, when implemented on both these types of platforms. Our focus in this paper is on the former, tightly-coupled multiprocessor platform. In particular, we examine an SGI Origin2000 machine, a cache-coherent distributed shared memory (DSM) machine, as an aggressive representative that is widely used in high-performance computing. The tradeoffs between models depend on the nature of the applications as well.
For certain classes of irregular, dynamically changing applications, it has been argued that a CC-SAS model has substantial algorithmic and ease-of-programming advantages over message passing that often translate to advantages in performance as well [12, 13]. The best implementations of such applications in the CC-SAS and MP models often look very different. While it is very important to examine the programming model question for such applications, we leave this more complex and subjective question to future work. In this paper, we restrict ourselves to applications that are either regular in their data access and communication patterns or that perform irregular accesses but do not require fine-grained dynamic replication of irregularly communicated remote data. We use applications or kernels for which the basic parallel algorithm structures are very similar across models and the amount of useful data communicated is about the same, so that differences in performance can be attributed to differences in how communication is actually performed. Within this class, we choose programs that cover many of the most interesting communication patterns, including near-neighbor and multigrid (exemplified by Ocean, a computational fluid dynamics application), regular all-to-all personalized (FFT), multicast-oriented (LU), and irregular all-to-all personalized (radix sorting).

In particular, we are interested in the following questions, for which our results will be summarized in Section 6. For these types of fairly regular applications, is it indeed the case that parallel algorithms can be structured in the same way for good performance in all three models? Or do we need to restructure the algorithms to match a programming model? Where are the main differences in high-level or low-level program orchestration? Are there substantial differences in performance under the three models? If so, where are the key bottlenecks in each case? Are they similar or different aspects of performance across models? Can these bottlenecks be alleviated by changing the implementation of the programming model, or do we need to change the algorithms or data structures substantially? If the former, does this require changes in the programming model or interface visible to the application programmer as well?

The rest of the paper is organized as follows. Section 2 briefly examines some related work in comparing the message passing and shared memory programming models.

Section 3 describes the Origin2000 platform and the three programming models. Section 4 describes the applications we used and the programming differences for them among the three models. Performance is analyzed in Section 5, which also examines methods for addressing performance bottlenecks in either the model or the application. Finally, Section 6 summarizes our key conclusions and discusses future work.

2 Related Work

Previous research in comparing models has focused on the CC-SAS and MP models, but not on SHMEM. It can be divided into three groups: research related to hardware-coherent shared address space systems, research related to clusters or other systems in which the CC-SAS model is implemented in software, and research related to irregular applications with naturally fine-grained, dynamic and unpredictable communication and replication needs. For the latter, which are increasingly important, it has been argued that CC-SAS, when implemented efficiently in hardware, has substantial ease-of-programming and likely performance advantages compared to MP [12, 13]. However, a proper evaluation for this class of programs requires a much more involved study of programming issues and is not our focus here. Let us examine the first two groups.

For hardware-coherent systems, Ngo and Snyder [14] compared several CC-SAS programs against MP versions running on the same platform. The CC-SAS programs they used were not written well to take locality into account (i.e. were written somewhat "naively"), and they found such programs to perform worse than the message passing ones. We start in this study with well-written and tuned programs for all models. Chandra et al. [3] compared MP with CC-SAS using simulators of the two programming models and examined where the programs spent their time. They found that the CC-SAS programs can perform as well as message passing programs. Important differences in their study from ours are that they examined only a single problem and machine size for each program, that their study used simulation, which has limitations in accuracy (especially with regard to modeling contention) and in the ability to run large problem and machine sizes, that the hardware platform they simulated (the Thinking Machines CM-5) is now quite dated, and that they used different programs with somewhat less challenging communication patterns than we do (e.g. none so challenging as FFT or Radix sorting). Another simulation study, by Woo et al. [], examined the impact of using a block transfer (message-passing) facility to accelerate hardware-coherent shared memory on a system that provides integrated support for block transfer. They found that block transfer did not improve performance as greatly as had been expected. Both these studies examined differences in traffic generated as well. Kranz et al. showed that message passing can improve the performance of certain primitive communication and synchronization operations over using cache-coherent shared memory [5]. Finally, Klaiber and Levy used both simulation and direct execution to compare the message traffic (not performance) of C* data-parallel programs from which a compiler automatically generates SAS and MP versions [1].

In the second group of related work, researchers have compared the performance of message passing with the CC-SAS model implemented in software at page granularity, on either older message-passing multiprocessors or on very small-scale networks of workstations [, 11].
They found that the CC-SAS model generally performs a little worse. In contrast with these two groups of related work, our study uses well-written programs to compare modern implementations of all three major programming models on a modern hardware-coherent multiprocessor platform, at a variety of problem and machine scales.

3 Platforms and Programming Models

3.1 Platform: SGI Origin2000

The SGI Origin2000 is a scalable, hardware-supported, cache-coherent, non-uniform memory access machine, with perhaps the most aggressive communication architecture among such machines today. The machine we use has 64 processors, organized in nodes with two 195 MHz MIPS R10000 microprocessors each. Each processor has separate 32 KB first-level instruction and data caches, and a unified 4 MB second-level cache with 2-way associativity and a 128-byte block size. The machine has 16 GB of main memory (512 MB per node) with a page size of 16 Kbytes. Each pair of nodes (i.e. 4 processors) is connected to a network router. The interconnect topology across the node pairs (routers) is a hypercube. The peak point-to-point bandwidth between nodes is 1.6 GB/sec (total in both directions). The average uncontended read latencies to access the first word of a cache line are 313 ns for local memory and 796 ns averaged over local and all remote memories on a machine this size, with the furthest remote memory higher still [4]; the latency grows with each additional router hop.

3.2 Parallel Programming Models

The Origin2000 provides full hardware support for a cache-coherent shared address space. Other programming models like MP (here using the Message Passing Interface (MPI) standard primitives) and SHMEM are built in software but leverage the hardware support for a shared address space and efficient communication, for both ease of implementation and performance, as is increasingly the case in high-end tightly-coupled multiprocessors.

3.2.1 CC-SAS

In this model, remotely allocated data are accessed just like locally allocated data or data in a sequential program, using ordinary loads and stores. A load or store that misses in the cache and must be satisfied remotely communicates the data in hardware at cache-block granularity, and automatically replicates it in the local cache. The transparent naming and replication provide programming simplicity, especially for dynamic, fine-grained applications. In all our parallel programs, the initial or parent process spawns off a number of child processes, one for each additional processor. These cooperating processes are assigned chunks of work using static assignment. The synchronization structures used are locks and barriers. Processes are spawned once near the beginning of the program, do their work, and then terminate at the end of the parallel part of the program.

3.2.2 MP

In the message passing model, each process has only a private address space, and must communicate explicitly with other processes to access their (also private) data. Communication is done via explicit send-receive pairs, so the processes on both sides are involved. The sender specifies to whom to send the data but does not specify the destination addresses; these are specified by the matching receiver, in whose address space they are. The data may have to be packed and unpacked at each end to make the transferred data contiguous and hence increase communication performance.
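As a concrete illustration of this packing step (a minimal sketch, not code from the paper; Section 4.2 describes this kind of exchange for Ocean's left and right borders), a noncontiguous grid column is gathered into a contiguous buffer, sent as a single message, and scattered into a ghost column at the receiver. An MPI derived datatype such as MPI_Type_vector could describe the same strided transfer, at some cost in overhead.

```c
/* Illustrative sketch of the pack/send and receive/unpack pattern mentioned
 * above (not the paper's code): a noncontiguous grid column is packed into a
 * contiguous buffer, sent as one message, and unpacked into a ghost column. */
#include <mpi.h>

/* grid is stored row-major with leading dimension ld */
void send_column(const double *grid, int ld, int nrows, int col,
                 int dest, double *packbuf, MPI_Comm comm)
{
    for (int i = 0; i < nrows; i++)          /* pack: gather the strided column */
        packbuf[i] = grid[(size_t)i * ld + col];
    MPI_Send(packbuf, nrows, MPI_DOUBLE, dest, 0, comm);
}

void recv_ghost_column(double *grid, int ld, int nrows, int ghost_col,
                       int src, double *packbuf, MPI_Comm comm)
{
    MPI_Recv(packbuf, nrows, MPI_DOUBLE, src, 0, comm, MPI_STATUS_IGNORE);
    for (int i = 0; i < nrows; i++)          /* unpack into the ghost column */
        grid[(size_t)i * ld + ghost_col] = packbuf[i];
}
```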
While the MP model can be more difficult to program, more so for irregular applications, its potential advantages are better performance for coarse-grained communication and the fact that, once communication is explicitly coordinated with sends and receives, synchronization is implicit in the send-receive pairs in some blocking message passing models.

We began by using the vendor-optimized native MPI implementation (Message-Passing Toolkit 1.2), which was developed starting from the publicly available MPICH [9]. Both use the hardware shared address space and fast communication support to accelerate message passing. We found that the performance of the native SGI implementation and MPICH are quite comparable for our applications, especially for larger numbers of processors. We therefore selected MPICH, since its source code is available. Let us examine how it works at a high level.

The MPICH implementation (like the native SGI one) is faithful to the message passing model in that application data structures are allocated only in private per-process address spaces. Only the buffers and other data structures used by the MPI library itself, to implement send and receive operations, are allocated in the shared address space. The MPI buffers are allocated during the initialization process; they include a shared packet pool for exchanging control information for all messages as well as data for short messages (each packet has header and flag information as well as space for some data), and data buffer space for the data in large messages.

There are three data exchange mechanisms: short, eager and rendezvous. Which mechanism is used in a particular instance is determined by the library and depends on the size of the exchanged data. All copying of data to and from packet queues and data buffers is done with the memcpy function; note that while the hardware support for load-store communication is very useful, an invalidation-based coherence protocol can make such producer-consumer communication inefficient compared to an update protocol or a hardware-supported but non-coherent shared address space.

Short mode. If the message size is smaller than a certain threshold, the sender first requests a packet from the preallocated shared packet pool. The sender copies the data into the packet body itself (using memcpy), fills in the control information and then adds this packet to the incoming queue of the destination process. A receive operation checks the incoming queue and, if the corresponding packet is there, copies the data from the packet into its application data structure and releases the packet. Two other incoming queues per process, called a posted queue and an unexpected-messages queue, are also used by receives to manage the flow of packets and to handle the cases where a receive is posted before the data arrive. If a nonblocking or asynchronous receive is used, the wait function that is called later, before the data are actually needed, performs similar queue management.

Eager mode. If the data length is larger than the short-mode threshold but smaller than another threshold, the transfer uses eager mode. Message data are not kept in the packet queue in this case, only control information is. A send operation first requests a data buffer from the shared memory space and (if successful) copies the data into the buffer using memcpy. It then requests and uses packet queues for control in much the same way as short mode does. When the receiving side receives the packet, it obtains the buffer address from the packet and then copies the data from the buffer to its own application data structure. It then frees the packet and the buffer. Eager mode often offers the highest performance per byte transferred.

Rendezvous mode. If the message is beyond the threshold size for eager mode, or if a buffer large enough cannot be obtained from the shared buffer space for an eager-mode message, rendezvous mode is used. It is similar to eager mode, except that the data are transferred into the shared buffer not when the send operation is called but only when the send-receive match occurs (this means that a sender using nonblocking sends has to be careful not to overwrite the application data too early). A large message may be partitioned by the library into many smaller messages, each of which is managed in this manner. This mode is the most robust, but it may be less efficient than the eager protocol and is not used in our applications.
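To make the buffer-based protocol concrete, here is a minimal sketch of an eager-mode style exchange over the shared segment. It is not the MPICH source: packet_t, shared_alloc and shared_free are hypothetical names, flow control and the posted and unexpected-message queues are reduced to a spin-wait, and memory barriers are omitted. The point to notice is that the payload is copied twice, once by the sender into the shared buffer and once by the receiver out of it; this is exactly the extra copy examined in Section 5.1.

```c
/* A simplified, illustrative sketch of the eager-mode exchange described
 * above (not the actual MPICH source). packet_t, shared_alloc and
 * shared_free are hypothetical names; queue management, flow control and
 * memory barriers are omitted. */
#include <stddef.h>
#include <string.h>

typedef struct {
    volatile int ready;      /* set by the sender, cleared by the receiver  */
    int          src_rank;   /* control information carried in the packet   */
    int          tag;
    size_t       len;
    void        *buf;        /* shared data buffer holding the message body */
} packet_t;

void *shared_alloc(size_t n);    /* hypothetical allocator over the shared  */
void  shared_free(void *p);      /* segment set up at initialization time   */

/* eager send: copy #1, from the application data into a shared buffer */
void eager_send(packet_t *slot, int my_rank, int tag,
                const void *app_data, size_t len)
{
    void *buf = shared_alloc(len);
    memcpy(buf, app_data, len);          /* sender-side copy                */
    slot->src_rank = my_rank;
    slot->tag      = tag;
    slot->len      = len;
    slot->buf      = buf;
    slot->ready    = 1;                  /* post the control packet         */
}

/* eager receive: copy #2, from the shared buffer into the application data */
void eager_recv(packet_t *slot, void *app_data)
{
    while (!slot->ready)
        ;                                /* stand-in for real queue checks  */
    memcpy(app_data, slot->buf, slot->len);
    shared_free(slot->buf);
    slot->ready = 0;                     /* release the packet              */
}
```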
3.2.3 SHMEM

The SHMEM library provides the fastest interprocessor communication for large messages, using data passing and one-sided communication techniques. The two major primitives are put and get. A get is similar to a read in the CC-SAS model. In CC-SAS, an ordinary load instruction is used to fetch a cache block of remote data, and data replication is automatically supported by hardware. In SHMEM, an explicit get operation is used to copy a variable amount of data from another process (using bcopy, which does the same thing as the memcpy used in MP) and explicitly replicate it locally. The get operation specifies the address space (process number) from which to get (copy) the data, the local source address in that (private) address space, the size of the data to fetch, and the local destination address at which to place the fetched data.

In SHMEM, there is no flat, uniformly addressable shared address space or data structures that all processes can load/store to. However, the portions of the private address spaces of processes that hold the logically shared data structures are identical in their data allocation. Thus, a process refers to data in a remote process's partition of a distributed data structure by using an address as if it were referring to the corresponding location in its own partition of that data structure (and by also specifying which process's address space it is referring to), not by using a global address in the larger, logically shared data structure. Unlike in send-receive message passing, a process can refer to local variables in another process's address space when explicitly specifying communication, but unlike in CC-SAS it cannot load/store directly to those variables. A put is the dual of a get; however, each is an independent and complete way of performing data transfer. Only one of them is used per communication, and they are not used as pairs to orchestrate a data transfer as send and receive are. By providing a global segmented address space and by avoiding the need for matching send and receive operations to supply the full naming, the SHMEM model delivers significant programming simplicity over MP, even though it too does not provide fully transparent naming or replication. Table 1 summarizes the properties of the three models, both in general and as implemented on the Origin2000.

4 Applications and Algorithms

We examine applications whose CC-SAS versions are from the SPLASH-2 suite and that are within the class of applications on which we focus, choosing within this class a range of communication patterns and communication-to-computation ratios. The first application, FFT, uses a non-localized but regular all-to-all personalized communication pattern to perform a matrix transposition; i.e. every process communicates with every other, but the data sent to the different processes are different. The communication-to-computation ratio is quite high and diminishes only logarithmically with problem size. The second application, Ocean, exhibits primarily nearest-neighbor patterns, which are very important in practice, but in a multigrid formulation rather than on a single grid. The communication-to-computation ratio is large for small problem sizes but diminishes rapidly with increasing problem size. The third application, Radix sorting, also uses all-to-all personalized communication but in an irregular and scattered fashion, and has a very high communication-to-computation ratio that is independent of problem size and number of processors. The final application, blocked LU factorization of a dense matrix, uses one-to-many non-personalized communication: the pivot block and the pivot row blocks are communicated to √p processors each. However, the communication needs are relatively small compared to load imbalance. The CC-SAS programs for these applications are taken from the SPLASH-2 suite, using the best versions of each application with proper data placement.
Only Radix is modified, to use a prefix tree to accumulate local histograms into global histograms. The CC-SAS implementations are described in [15, 2, 6]. In the following, we discuss only the differences in communication orchestration and implementation across models.

For mostly regular applications such as these, the basic partitioning method and parallel algorithm are usually the same for the CC-SAS and MP programming models. The main difference is that communication is usually sender-based in MP for better performance, and it is structured to communicate in larger messages, as described below. We examined some of the best implementations of MP programs for these applications and kernels obtained from other scientists at a variety of sites, but our transformed SPLASH-2 programs were as good as or better than any of those under message passing. We therefore retained the programs we produced (they also have the benefit of being directly comparable, in a node performance sense, with the CC-SAS programs). When noncontiguous data have to be transferred, we pack/unpack them in the application programs themselves to avoid the buffer malloc/free overhead incurred by the corresponding MPI functions. The MPI functions used are MPI_Send, MPI_Irecv, MPI_Waitall, MPI_Allgather and MPI_Reduce.

Finally, for the SHMEM versions we restructured the MP versions to use put or get rather than send-receive pairs, and to synchronize appropriately. Packing and unpacking regularly structured data is left to the strided get and put operations, which do not have performance problems here. The choice of get or put is based on performance first and ease of programming second, experimenting with both options in various ways to determine which one to use. Using put generally transfers the data earlier (as soon as they are produced, as with a send) and reduces the latency seen by the destination; however, using get brings data into the cache, while put does not push the data into the destination cache (it cannot do so on this and many modern machines), and using get can obtain better reuse of buffers at the destination of the data.

Table 1: Summary of the properties of the three models, both in general and as implemented on the Origin2000.

Naming model for remote data: CC-SAS has a shared address space; MP has none (explicit messages between private address spaces); SHMEM has a segmented, symmetric global address space with explicit operations.
Replication and coherence: implicit and hardware-supported in caches for CC-SAS; explicit, with no hardware support, for MP and SHMEM.
Hardware support: CC-SAS leverages the hardware shared address space, cache coherence, and low-latency communication; MP uses the SAS and low latency for communication through shared buffers and does not need coherence; SHMEM uses the SAS and low latency for direct communication and does not need coherence.
Primitives used for data transfer on the Origin2000: load/store for CC-SAS; memcpy (*) for MP; bcopy (*) for SHMEM.
Communication overhead: CC-SAS is efficient for fixed-size, fine-grain transfers; MP is inefficient for fine grain and efficient for coarse grain; SHMEM is more efficient than MP for both, due to one-sided communication.
Synchronization: explicit and separate from communication in CC-SAS and SHMEM; can be implicit in the explicit communication in MP.
Performance predictability: more difficult for CC-SAS because communication is implicit; easier for MP and SHMEM because communication is explicit.
(*) The memcpy and bcopy routines used by MP and SHMEM differ only in the parameters used, and ultimately call exactly the same underlying data transfer routine.

No prefetching is used in the CC-SAS programs, although we have found that software-controlled prefetching of only remote data improves the performance of FFT by 1-15% and does little for the other applications [1]. The dynamically scheduled processor hides some memory latency, and in the SHMEM and MP cases we use asynchronous (nonblocking) operations to try to hide their latency, with wait function calls used after these operations when necessary to wait for data to leave or arrive. Let us discuss the differences of the MP and SHMEM versions from CC-SAS for the individual applications. The partitioning of work is the same across models in all cases.

4.1 FFT

In the MP implementation, the communication in the transpose phase is sender-initiated for higher performance. Each processor still communicates √n/p subrows of size √n/p to each other processor, but these subrows are disjoint in the local address space; they are therefore packed into a buffer before sending and unpacked implicitly when transposing locally at the destination. Another change we make from the CC-SAS version, based on observed performance, is that we do not use the linear, staggered way of communicating to avoid algorithmic hot spots in the transpose. Rather, the all-to-all personalized communication is performed in p - 1 loop iterations. In each iteration, each processor chooses a unique partner with which to exchange data bidirectionally, as sketched below. After the p - 1 iterations, each processor has exchanged data with every other processor. We experimented with other methods, including using smaller messages (a few subrows at a time) to take advantage of the overlap between communication in the transpose and computation in the local row-wise FFTs before or after it. However, the high cost of messages and the low amount of work between them end up hurting performance.
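As an illustration of this exchange structure (a sketch, not the code used in the paper), the loop below pairs processors with a rank-XOR-step schedule, one common way to enumerate each partner exactly once when p is a power of two; the element type is simplified to double, and the packing and unpacking of each patch are omitted.

```c
/* Illustrative sketch of a pairwise all-to-all exchange schedule for the
 * transpose (not the authors' exact code). Assumes p is a power of two so
 * that rank XOR step enumerates each partner exactly once; packing of the
 * patch into sendbuf and the local unpack/transpose are omitted. */
#include <mpi.h>

void transpose_exchange(double *sendbuf, double *recvbuf,
                        int patch_doubles, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    for (int step = 1; step < p; step++) {       /* p - 1 iterations        */
        int partner = rank ^ step;               /* unique partner per step */
        MPI_Request req;

        /* post the receive first (nonblocking), then send, then wait, so
         * the bidirectional exchange with the partner cannot deadlock */
        MPI_Irecv(recvbuf + (size_t)partner * patch_doubles, patch_doubles,
                  MPI_DOUBLE, partner, 0, comm, &req);
        MPI_Send(sendbuf + (size_t)partner * patch_doubles, patch_doubles,
                 MPI_DOUBLE, partner, 0, comm);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
}
```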
The SHMEM implementation is very similar to the MP implementation, except that it uses put operations rather than send and receive (the sender-initiated put is more efficient than get here due to latency hiding).

4.2 Ocean

In the MP implementation, the grids in this mostly near-neighbor application are partitioned into subgrids in the same way as in the CC-SAS program. A processor sends its upper and lower border data to its neighbors in one message each. When it communicates with its left or right neighbors, the (sub)column of data is noncontiguous; it is therefore first packed locally in the application and then sent in one message to reduce communication overhead, and unpacked into the ghost subcolumn at the other end. Unlike in FFT, the SHMEM implementation uses get operations to receive border data in a receiver-initiated way, due to the advantages of get here, but it uses the SHMEM strided get functions instead of packing the data itself since, unlike in MPI, there is no performance difference here.

4.3 Radix

Our MP implementation follows the same overall structure as the SPLASH-2 CC-SAS program. The first major difference is in how the global histogram is generated from local histograms. In the CC-SAS implementation, this is done using a binary prefix tree. In MPI, the fine-grained communication needed for this turns out to be very expensive. We therefore use an MPI_Allgather to collect the local histograms from all processes and give each process a local copy of all of them. Then, each process computes the global histograms locally. The performance of this phase does not affect overall performance much, which is dominated by the permutation itself. However, having all the histogram information locally greatly simplifies the later computation of parameters for the send/receive functions in the permutation phase.

Another difference is that in the MPI implementation it is extremely expensive to send/receive a message for each permuted key. While the writes to contiguous locations in the destination array in the permutation phase are temporally scattered, the keys that processor i permutes into processor j's partition of the output array will end up falling into that partition in several contiguous chunks, one chunk for each radix digit. We therefore buffer the data locally to compose larger messages before sending them out, which amounts to a local permutation of the data (using the now-local histograms) followed by communication. An interesting question is how to buffer and send the data. One possibility is for processor i to send only one message to each other processor j, containing all of i's keys that are destined for j. Processor j then reorganizes the data to their correct positions in its array. Alternatively, i can send each contiguously-destined chunk of keys separately to j, which can receive them directly into the correct position in its array, leading to multiple messages from each i to each j but no local data reorganization. Our experiments show that the latter performs better than the former on this machine, and we use the latter, though this bears further experimentation that machine access prevented us from performing. A similar local buffering method can be used to reduce the temporal scatteredness of remote writes in the CC-SAS version, but due to the local permutation cost this does not help significantly and we do not use it. Our SHMEM Radix is created from the MP program.
Since all processors know all the histogram information, due to the all-gather communication, get is used instead of put, since it performs better by bringing the data directly into the cache. The symmetric arrangement of each processor's partition of the output array makes this very easy to program.
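To make the symmetric-address naming used here concrete, below is a small, hedged sketch (not the authors' code, and written against the OpenSHMEM interface rather than the SGI SHMEM library actually used in the paper, whose corresponding calls such as shmalloc and start_pes are named slightly differently): each process allocates its partition from the symmetric heap, so a get names remote data with the local address of the corresponding location plus the partner's PE number, and the fetched data land in local, cacheable memory.

```c
/* Illustrative sketch of the symmetric-address get used in the SHMEM Radix
 * and LU versions (not the authors' code); OpenSHMEM interface assumed. */
#include <shmem.h>

#define CHUNK 4096

static long *partition;   /* symmetric: same local address on every PE */

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* symmetric allocation: every PE's partition sits at the same address */
    partition = shmem_malloc(CHUNK * sizeof(long));
    for (int i = 0; i < CHUNK; i++)
        partition[i] = me;                  /* fill my own partition        */
    shmem_barrier_all();                    /* data ready on all PEs        */

    /* A process names remote data with its own local address plus a PE
     * number; the get copies it into local, cacheable memory. */
    long local_copy[CHUNK];
    int  right = (me + 1) % npes;
    shmem_getmem(local_copy, partition, CHUNK * sizeof(long), right);

    shmem_barrier_all();
    shmem_free(partition);
    shmem_finalize();
    return 0;
}
```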

4.4 LU

In CC-SAS, each process directly fetches the pivot block data (or the needed pivot row blocks) from the owner, using load instructions. In MPI, however, the owner of a block sends it to the √p other processes that need it once it is produced. The SHMEM implementation replaces the sends with get operations on whole blocks. Get is used instead of put since it brings data into the cache, as in Ocean and Radix, and enables reuse of the buffer used by the get operation.

5 Performance Analysis

Let us compare the performance of the applications under the different programming models. For each application, we first examine speedups, measuring them with respect to the same sequential program for all models. Then we examine per-processor breakdowns of execution time, obtained using various tools available on the machine, to obtain more detailed insight into how time is distributed in each programming model and where the bottlenecks are. We divide the per-processor wall-clock running time into four categories: CPU busy time in computation, CPU stall time waiting for local cache misses (LMEM), CPU stall time for sending/receiving remote data (RMEM), and CPU time spent at synchronization events (SYNC). For CC-SAS programs, with their implicit data access and communication, the available tools do not allow us to distinguish LMEM time from RMEM time, so we are forced to lump them together (MEM = LMEM + RMEM). However, they can be distinguished for the other two models. In the MP model, since we are using asynchronous mode, on the receiver side the SYNC time is the time spent in MPI_Waitall, waiting for the incoming messages for which receives have been posted to arrive in the packet queue, indicating that the data are ready to be copied. During this time, if new messages that are not expected arrive, the receiver will also spend some time processing them, but that time is counted as RMEM time. On the sender side, SYNC time is the time the sender spends adding the control packet to the receiver's incoming queue. The RMEM time is all the time spent in MP functions (like send and receive) excluding the SYNC time. In the SHMEM model, the SYNC time is the global barrier time. The RMEM time is the time spent in get/put operations and collective communication calls; a little synchronization time is included in these operations, but unlike for MPI we do not have the source code for SHMEM, and the available tools cannot tell this time apart.

For a given machine size, problem size is a very important variable in comparing programming models. Generally, increasing problem size reduces the communication-to-computation ratio and will tend to diminish the performance differences among models. Thus, although large problems are important for large machines, it is very important to examine smaller problems too. Of course, we must be careful to pay significant attention only to those problem sizes that are realistic for the application and machine at hand, and that deliver reasonable speedups for at least one of the programming models. Our general approach is to examine a range of potentially interesting problem sizes at each machine size. That said, Figure 1 shows the speedups for FFT, OCEAN, RADIX, and LU under the three programming models for only the largest data set we have run (FFT: M double complex data, OCEAN: grid size, RADIX: 12M integers, LU: matrix). For all these four applications, the SHMEM program works quite well.
The CC-SAS program is close. For MP, however, the initial performance of none of these four applications is satisfactory, even though we are using almost the same algorithms and data structures as in SHMEM. Let us examine why.

Figure 1: Speedups of FFT(M), OCEAN(25), RADIX(12M), and LU(496) for the three models on up to 64 processors.

5.1 Improving MP Performance

Consider FFT as an example. Figure 2(a) shows the time breakdown for a smaller, 2K-point data set for FFT on 64 processors. The times are extremely flat across processors, as they are for OCEAN and RADIX as well, since every processor executes nearly the same number of instructions in these applications. The LMEM time is a little imbalanced, but is not the major bottleneck here. It is the RMEM time and the SYNC time that are very high and extremely unbalanced in the MP version, and these cause the parallel performance to be bad. This is despite the fact that we make a special effort to avoid using rendezvous mode, since it is potentially slower than eager mode, by making the threshold large enough and allocating enough buffer space.

Figure 2: Time breakdown for FFT under the MP model for a 2K-point problem on 64 processors: (a) original, (b) direct copy.

Further analysis tells us that the problem is caused mainly by an extra copy in the send function in the MP implementation. As discussed earlier, only the buffers and other data structures used by the MP library itself to implement send and receive calls are allocated in the shared address space (in both the MPICH and SGI implementations). This means that a sending process cannot directly write data into a receiving process's data structures, since it cannot reference them directly, but can only write the data into the shared buffers from which the receiver can read them. Thus, the data are copied twice. If we can copy the data directly from the source to the destination data structures without using the buffers (the sender will no longer copy the data, only the receiver will), we may be able to improve performance by eliminating one copy.

Eliminating the use of the shared buffer space has other performance benefits as well. Requesting and obtaining a buffer itself takes time. Worse still, for a large number of processors like 64, processes often compete for shared memory resources, causing a lot of contention in the shared memory allocation function. This contention increases RMEM time at the sender, but it also increases SYNC time at the corresponding receivers, which now have to wait longer, and it causes imbalances in both these time components.

Since processes allocate their data in private address spaces in MP programs, eliminating the extra copy (buffering) would normally require the help of the operating system, which can access both address spaces. However, since we have an underlying shared address space machine, we can achieve this goal without involving the operating system if we modify both the application (slightly) and the message-passing library. We increase the size of the shared address space and, in the application, allocate all the logically shared data structures that are involved in communication in this shared address space, even though the data structures are organized exactly the same way as in the original MP program (they are simply allocated in the shared rather than the private address space, by using a shared malloc rather than a regular one).

How sends and receives are called does not change; however, in the MPI implementation, once the send-receive match via the packet queues establishes the source and destination addresses, either application process can directly copy data to or from the application data structures of the other (still using memcpy). In particular, in eager mode the sender now only places the control packet into the receiver's queue, but does not request a shared buffer or copy data. When the match happens, the receiver copies data directly from the sender's data structures (which are in the shared address space). Of course, this means that the sender cannot modify those data structures in the interim (as with a general nonblocking asynchronous send), so additional synchronization might be needed in the application. Without buffers, rendezvous mode now works similarly to eager mode and its overhead is greatly reduced. Short mode remains the same as before, since the data to be copied are small and there is no buffer allocation overhead anyway.

Figure 3: Time breakdown for FFT under the MP model with 2K and 4M problem sizes on 64 processors, with the new MPI implementation.

Figure 4: Speedups of MP and improved MP (MP-NEW) for FFT(M), OCEAN(25), RADIX(12M), and LU(496) on up to 64 processors.

Figure 2(b) shows the new per-processor breakdowns of execution time. Removing the extra copy clearly improves performance dramatically and reduces the imbalances in RMEM and SYNC time. The speedups for the 2K- and M-point problem sizes have increased from 1.26 and to .17 and 55.17, respectively. However, the speedup is still lower than that of CC-SAS or SHMEM. The SYNC and RMEM time components are still high. This brings us to another major source of performance loss in the MP implementations: the locking mechanism used to manage the incoming packet queues. In the original implementations, when a process sends a message to another it obtains a lock on the latter's incoming queue, adds the control-information packet to the queue, and releases the lock. When the receiver receives a message, it also has to use lock/unlock to delete the entry from the queue. This locking and the resulting contention show up as a significant problem, especially for smaller problem sizes.

Performance can be improved by using lock-free queue management, as follows. Instead of locking to add or delete a packet in a shared incoming queue for each process, a fixed packet is used to transfer control information between each pair of processes (thus, there are p^2 packet slots instead of p packet queues). A flag in this fixed packet is used to control the message flow.
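The following is a minimal sketch of such a per-pair, flag-controlled slot (an illustration under stated assumptions, not the modified library source). Because each slot has a single producer and a single consumer, the full flag alone serializes it, and with the direct-copy change described above the slot carries only control information, so the receiver performs the one remaining memcpy straight from the sender's application data structure.

```c
/* Illustrative sketch of the per-pair, flag-controlled control slot (not the
 * modified MPICH source). slot[i][j] carries control information from sender
 * i to receiver j; only i fills it and only j drains it, so a flag suffices
 * and no lock is needed. A real implementation allocates this array in the
 * shared segment and adds memory barriers around the flag updates. */
#include <string.h>

#define MAXP 64

typedef struct {
    volatile int full;        /* 0 = slot free, 1 = control packet present   */
    const void  *src_addr;    /* sender's application data (shared segment)  */
    size_t       len;
    int          tag;
} ctl_slot_t;

ctl_slot_t slot[MAXP][MAXP];  /* p*p fixed slots instead of p locked queues  */

/* sender i: post control information for receiver j (no payload copy here) */
void post_send(int i, int j, const void *src, size_t len, int tag)
{
    ctl_slot_t *s = &slot[i][j];
    while (s->full)
        ;                     /* wait until the previous message is consumed */
    s->src_addr = src;
    s->len      = len;
    s->tag      = tag;
    s->full     = 1;          /* the flag controls the message flow          */
}

/* receiver j: match the packet from sender i and copy the data directly
 * from the sender's application data structure into its own */
void complete_recv(int i, int j, void *dst)
{
    ctl_slot_t *s = &slot[i][j];
    while (!s->full)
        ;                     /* spin until sender i has posted              */
    memcpy(dst, s->src_addr, s->len);   /* the single remaining copy         */
    s->full = 0;              /* free the slot                               */
}
```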
Note that this still provides point-to-point order among messages. The lock-free mechanism further improves the performance of FFT. Indeed, after all these changes to the MP library (mostly) and the application (a little, in how data are allocated in address spaces), the performance of the MP versions is comparable with that of the equivalent CC-SAS programs, at least for this problem size. The final time breakdown is shown in Figure 3, for both this and a larger problem size. In fact, using the final improved MP implementation, we find that the performance of OCEAN, RADIX and LU is also greatly improved. The comparison of the speedups for MP and improved MP (MP-NEW) is shown in Figure 4 for a large data set and in Figure 5 for a small data set.

Figure 5: Speedups of MP and improved MP for FFT(64K), OCEAN(25), RADIX(1M), and LU(2) on up to 64 processors.

Let us now compare the performance under the different programming models for each application. From here on we use the improved MP (no extra copy, and lock-free queue management) in all the applications, to enable exploration of the remaining performance differences among models once these dominant bottlenecks are alleviated; we call it simply MP, even though it violates the pure MP model of using only private address spaces in the applications themselves.

5.2 FFT

The speedups with the three programming models for different data sizes, from 64K to M double complex data points, are shown in Figure 6 (see footnote 1). Speedups are quite similar across models at the smaller processor counts for most problem sizes, with differences becoming substantial only beyond that.

Figure 6: Speedups for FFT under SHMEM, CC-SAS and MP on up to 64 processors with different problem sizes.

Footnote 1: We found that for the MP and SHMEM programs, which use explicit communication, the same communication functions take much longer in the first transpose than in following transposes. Detailed profiling showed that the extra time was in the actual remote data movement operations, like memcpy and bcopy (which are basically identical), used by these functions, which are much more expensive when invoked for the first time on a given set of pages. This page-level cold-start effect, which is not substantial in the CC-SAS case, is eliminated in our results by having each processor touch, in advance, all the remote pages it may need to communicate with later. Simply doing a single load-store reference to each such page, using the machine's shared address space support, suffices, though many other methods will do. This touching is done before our timing measurements begin. The cold-start problem on pages is large only for kernels with a lot of communicated pages, like FFT and Radix (as we will see later); real applications that use these kernels may use them multiple times, amortizing this cost.

Even for larger p, the performance of (the new) MP and SHMEM is quite similar on all data sets. However, compared with CC-SAS, their speedups for smaller problem sizes are much lower. With increasing problem size, their speedups improve and finally catch up with that of CC-SAS. For the M data size on 64 processors, the speedups for all models are about 60. One reason that speedup is so high for big problems is that stall time on local memory relative to busy time may be reduced greatly compared to a uniprocessor run, due to local working sets fitting in the cache in a multiprocessor run while they did not in the uniprocessor case (for small problems they fit in the cache in the uniprocessor case as well, and for very large problems they may fit in neither). Note that the inherent communication-to-computation ratio itself does not diminish rapidly with problem size in FFT (only logarithmically). So although message sizes become larger and amortize overheads much better, communication alone does not account for the large increases in speedup with problem size, even in MP. To illustrate this, Table 2 shows the average ratios across processors (expressed as percentages) of local memory time to busy time for two data sets, one smaller and one larger, using the MP executions as an example. The reduction in this ratio with increasing number of processors shows that the capacity-induced superlinear effect in local access is much larger for the larger data set (a two-fold reduction in the local memory time component when going from 1 to 64 processors) than for the smaller one in this case. Fortunately, this effect applies about equally to all programming models, and is quite clean in our applications even in the CC-SAS case, since they do not have significant capacity misses on remotely allocated data (see footnote 2).

Table 2: Average ratio (in percentage) of local memory time to busy time for the 2K- and 4M-point problem sizes in the MP model.

Footnote 2: CC-SAS has another small problem for making this measurement, in that some of the local misses in the transpose are converted to remote misses; MP and SHMEM do not have this problem since all transposition is done locally, separately from communication.

Although the capacity-induced superlinear effect is real, we can ignore it by replacing the LMEM time in the uniprocessor case with the sum of the LMEM times across processors in the parallel case (that is, the no-cap speedup is (T1 - LMEM1 + Σp LMEMp) / Tp, where T1 and Tp are the uniprocessor and parallel running times). The speedups calculated in this way are smaller, as shown in the no-cap entries in Table 3. For comparison, we also include the actual speedups (including capacity effects). In other applications, such as OCEAN and RADIX, the superlinear capacity effect on local misses is also severe, though again fortunately similar for all models: the no-cap MP speedups for OCEAN for 25-by-25 grids and RADIX for 12M keys are 4 and 25, respectively, while their corresponding actual speedups are and 44.

Table 3: FFT speedup comparison with and without cache capacity effects for the 2K- and 4M-point problem sizes on up to 64 processors.

The per-processor execution time breakdowns for the 2K and 4M problem sizes on 64 processors for MP, CC-SAS, and SHMEM are shown in Figures 3, 7 and 8, respectively.
The time for SHMEM and MP for each problem size is almost the same, and is a little higher than that of CC-SAS. This is primarily due to the extra packing and unpacking operations needed in the SHMEM and MP programs, in which the (noncontiguous) subrows of a transferred √n/p-by-√n/p patch are packed contiguously before they are sent out and unpacked after they have arrived at the destination. In CC-SAS, on the other hand, the data are read individually at the fine granularity of cache lines, so there is no need to pack and unpack them. This difference is imposed by the performance-driven need to make messages larger in SHMEM and MP.

Figure 7: Time breakdown for FFT under the CC-SAS model with 2K and 4M point problems on 64 processors.

Figure 8: Time breakdown for FFT under the SHMEM model with 2K and 4M point problems on 64 processors.

The main differences among models for FFT lie in the data access stall components. The CC-SAS model has a much lower MEM time than the others for smaller data sets and larger p. Recall that we have to lump LMEM and RMEM together in this model since they cannot be separated by the available tools. When n/p is small, so are the messages in MP and SHMEM, so message overhead (of software management in MP as well as of the basic data transfer operations used by both MPI and SHMEM) is not well amortized. This is worse in MP than in SHMEM, both because the producer-consumer communication needed for (control) packet queue management is a poor match for the underlying invalidation-based cache coherence protocol, and especially because an explicit send and a matching receive must be initiated separately for each communication. The latter potentially increases not only messaging overhead and end-point contention but also synchronization time, since the sends and receives have to be posted in timely ways and matched. In the CC-SAS model, the transfers of cache blocks triggered by loads and stores are very efficient for the fine-grained communication needed. With automatic hardware caching, the fetched data also arrive in the cache rather than in main memory, and can be used very efficiently locally. As n/p increases, message size increases and explicit communication with send-receive or (more so) put/get becomes more efficient, so the performance of MP and SHMEM equals that of CC-SAS.

Finally, consider the difference between MEM times in SHMEM and MP. SHMEM's RMEM time is less than that of MP because its one-sided communication is more efficient, as discussed above.

But, surprisingly, SHMEM has a much higher LMEM time. Through further analysis, we find that the greater LMEM time is spent in the transpose phases, specifically during the local data movement needed to unpack the deposited or received data, i.e. to extract the subrows of the transferred square blocks and move them to the correct transposed positions in the local matrix. The data transfer operation we use in SHMEM is put. Unlike with a receive in MP, where the data that are moved to the receiver's data structures are also placed in its cache, the bcopy in a put places the data in the receiver's main memory but not in its cache (see Section 4). This means that when unpacking the data, the MP code reads the data out of the cache while the SHMEM code reads them from main memory, increasing local memory stall time. We verify this by measuring the unpacking separately, as well as by having the destination process of the put touch the buffer before unpacking (thus bringing the data into the cache), in which case the unpacking is as fast as in the MP version. Using get instead of put helps with the caching issue, but communication latency is then not hidden well and synchronization time increases, so overall there is not much difference in performance in FFT. Note that in CC-SAS the transposition of data is done as part of the load-store communication itself, and the data are brought into the destination processor's cache.

5.3 OCEAN

Ocean has many grid computations in each time-step that use many different grids, some of which involve near-neighbor communication and some of which involve no communication. A large fraction of the execution time is spent in a multigrid equation solver in each time-step, which exhibits nearest-neighbor communication but at various levels of a hierarchy of successively smaller grids.

The speedups for OCEAN are shown in Figure 9. At the smaller processor counts, the speedups in the three programming models are similar; however, there are large differences for larger processor counts; in particular, the performance of CC-SAS is now much worse for smaller problem sizes (the opposite of the FFT situation).

Figure 9: Speedups for OCEAN under SHMEM, CC-SAS and MP on up to 64 processors with different problem sizes.

The time breakdowns for the intermediate, 126-by-126 grid size on 64 processors are shown in Figure 10. The busy times are very balanced and similar. The MEM time in all three cases is imbalanced, but it is much higher and more imbalanced in CC-SAS. There are several likely reasons for this behavior of CC-SAS relative to SHMEM and MP, which we unfortunately cannot determine easily because of the lack of available tools (note that the LMEM category for CC-SAS is actually MEM = LMEM + RMEM). One is poor spatial locality at cache-block granularity for remote access at the column-oriented boundaries of the square partitions: only one boundary word is needed in each row, but a whole cache block is fetched. This poor spatial locality is on local rather than remote accesses for MP and SHMEM, since they pack the data contiguously locally before communicating. Another likely possibility is that local capacity misses behave differently across programming models. In MP and SHMEM, a process's partitions of all the different grids are allocated contiguously in its private address space, while in CC-SAS each entire grid is allocated as a large contiguous shared array in the shared address space; even though a process's partition of each grid is contiguous due to the use of 4-D arrays, there is a very large gap between a processor's partitions of different grids in the data layout. This causes many more, and imbalanced, local conflict misses in Ocean, since multiple grids are accessed together in many computations.
A third possibility is that perhaps certain kinds of data and pointer arrays have not been properly placed among the distributed memories in CC-SAS, though they have to be in MP and SHMEM, and these become relatively more of an issue at smaller problem sizes. (We already obtained a great improvement compared to our original program by placing some data structures better; perhaps more work can be done on this, though the lack of appropriate information from the machine makes the problems difficult to diagnose.) Larger problem sizes and smaller machines make local capacity misses dominate, so the difference between models is small.

Figure 10: Time breakdown for OCEAN (126) on 64 processors: (a) CC-SAS, (b) MP, (c) SHMEM.

5.4 RADIX

The speedups are shown in Figure 11. Unlike in FFT and OCEAN, no model performs very well for data sets smaller than 64M integer keys, though SHMEM is much better than the others. On larger data sets, the three become closer, though SHMEM is still the best, followed by MP.

Figure 11: Speedups for RADIX under SHMEM, CC-SAS and MP on up to 64 processors with different problem sizes.

The per-processor time breakdowns for the largest, 12M-key problem on 64 processors are shown in Figure 12. In all models, busy time is very small, although there is some increase in SHMEM and MP due to the additional local work in the permutation (see Section 4). The MEM time is very high, including both LMEM and RMEM (see footnote 3). Communication is very bursty and there is

Footnote 3: Similarly to FFT, for the MP and SHMEM programs the communication operations take much longer in the first communication (permutation) phase. Touching remote pages in advance solves this problem, though unlike in FFT it is unpredictable which remote pages will be communicated with, so all pages in the whole logical array are touched.


Steps in Creating a Parallel Program Computational Problem Steps in Creating a Parallel Program Parallel Algorithm D e c o m p o s i t i o n Partitioning A s s i g n m e n t p 0 p 1 p 2 p 3 p 0 p 1 p 2 p 3 Sequential Fine-grain Tasks Parallel

More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Design of Parallel Algorithms. Models of Parallel Computation

Design of Parallel Algorithms. Models of Parallel Computation + Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes

More information

The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor

The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 6, June 1994, pp. 573-584.. The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor David J.

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced?

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced? Chapter 10: Virtual Memory Questions? CSCI [4 6] 730 Operating Systems Virtual Memory!! What is virtual memory and when is it useful?!! What is demand paging?!! When should pages in memory be replaced?!!

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

University of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2.

University of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2. University of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2. Solutions Problem 1 Problem 3.12 in CSG (a) Clearly the partitioning can be

More information

Plot SIZE. How will execution time grow with SIZE? Actual Data. int array[size]; int A = 0;

Plot SIZE. How will execution time grow with SIZE? Actual Data. int array[size]; int A = 0; How will execution time grow with SIZE? int array[size]; int A = ; for (int i = ; i < ; i++) { for (int j = ; j < SIZE ; j++) { A += array[j]; } TIME } Plot SIZE Actual Data 45 4 5 5 Series 5 5 4 6 8 Memory

More information

Three parallel-programming models

Three parallel-programming models Three parallel-programming models Shared-memory programming is like using a bulletin board where you can communicate with colleagues. essage-passing is like communicating via e-mail or telephone calls.

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

Flynn s Classification

Flynn s Classification Flynn s Classification SISD (Single Instruction Single Data) Uniprocessors MISD (Multiple Instruction Single Data) No machine is built yet for this type SIMD (Single Instruction Multiple Data) Examples:

More information

CS 475: Parallel Programming Introduction

CS 475: Parallel Programming Introduction CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.

More information

Memory Management! How the hardware and OS give application pgms:" The illusion of a large contiguous address space" Protection against each other"

Memory Management! How the hardware and OS give application pgms: The illusion of a large contiguous address space Protection against each other Memory Management! Goals of this Lecture! Help you learn about:" The memory hierarchy" Spatial and temporal locality of reference" Caching, at multiple levels" Virtual memory" and thereby " How the hardware

More information

EECS 570 Final Exam - SOLUTIONS Winter 2015

EECS 570 Final Exam - SOLUTIONS Winter 2015 EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Memory Consistency. Challenges. Program order Memory access order

Memory Consistency. Challenges. Program order Memory access order Memory Consistency Memory Consistency Memory Consistency Reads and writes of the shared memory face consistency problem Need to achieve controlled consistency in memory events Shared memory behavior determined

More information

CS161 Design and Architecture of Computer Systems. Cache $$$$$

CS161 Design and Architecture of Computer Systems. Cache $$$$$ CS161 Design and Architecture of Computer Systems Cache $$$$$ Memory Systems! How can we supply the CPU with enough data to keep it busy?! We will focus on memory issues,! which are frequently bottlenecks

More information

Large Scale Multiprocessors and Scientific Applications. By Pushkar Ratnalikar Namrata Lele

Large Scale Multiprocessors and Scientific Applications. By Pushkar Ratnalikar Namrata Lele Large Scale Multiprocessors and Scientific Applications By Pushkar Ratnalikar Namrata Lele Agenda Introduction Interprocessor Communication Characteristics of Scientific Applications Synchronization: Scaling

More information

Module 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.

Module 10: Design of Shared Memory Multiprocessors Lecture 20: Performance of Coherence Protocols MOESI protocol. MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line

More information

Designing Parallel Programs. This review was developed from Introduction to Parallel Computing

Designing Parallel Programs. This review was developed from Introduction to Parallel Computing Designing Parallel Programs This review was developed from Introduction to Parallel Computing Author: Blaise Barney, Lawrence Livermore National Laboratory references: https://computing.llnl.gov/tutorials/parallel_comp/#whatis

More information

Three basic multiprocessing issues

Three basic multiprocessing issues Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Distributed Shared Memory and Memory Consistency Models

Distributed Shared Memory and Memory Consistency Models Lectures on distributed systems Distributed Shared Memory and Memory Consistency Models Paul Krzyzanowski Introduction With conventional SMP systems, multiple processors execute instructions in a single

More information

Algorithm Engineering with PRAM Algorithms

Algorithm Engineering with PRAM Algorithms Algorithm Engineering with PRAM Algorithms Bernard M.E. Moret moret@cs.unm.edu Department of Computer Science University of New Mexico Albuquerque, NM 87131 Rome School on Alg. Eng. p.1/29 Measuring and

More information

Memory Management! Goals of this Lecture!

Memory Management! Goals of this Lecture! Memory Management! Goals of this Lecture! Help you learn about:" The memory hierarchy" Why it works: locality of reference" Caching, at multiple levels" Virtual memory" and thereby " How the hardware and

More information

Cache introduction. April 16, Howard Huang 1

Cache introduction. April 16, Howard Huang 1 Cache introduction We ve already seen how to make a fast processor. How can we supply the CPU with enough data to keep it busy? The rest of CS232 focuses on memory and input/output issues, which are frequently

More information

Chapter Seven. Large & Fast: Exploring Memory Hierarchy

Chapter Seven. Large & Fast: Exploring Memory Hierarchy Chapter Seven Large & Fast: Exploring Memory Hierarchy 1 Memories: Review SRAM (Static Random Access Memory): value is stored on a pair of inverting gates very fast but takes up more space than DRAM DRAM

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

Memory Management. Goals of this Lecture. Motivation for Memory Hierarchy

Memory Management. Goals of this Lecture. Motivation for Memory Hierarchy Memory Management Goals of this Lecture Help you learn about: The memory hierarchy Spatial and temporal locality of reference Caching, at multiple levels Virtual memory and thereby How the hardware and

More information

Conventional Computer Architecture. Abstraction

Conventional Computer Architecture. Abstraction Conventional Computer Architecture Conventional = Sequential or Single Processor Single Processor Abstraction Conventional computer architecture has two aspects: 1 The definition of critical abstraction

More information

Memory Hierarchies &

Memory Hierarchies & Memory Hierarchies & Cache Memory CSE 410, Spring 2009 Computer Systems http://www.cs.washington.edu/410 4/26/2009 cse410-13-cache 2006-09 Perkins, DW Johnson and University of Washington 1 Reading and

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) A 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

MPI Optimisation. Advanced Parallel Programming. David Henty, Iain Bethune, Dan Holmes EPCC, University of Edinburgh

MPI Optimisation. Advanced Parallel Programming. David Henty, Iain Bethune, Dan Holmes EPCC, University of Edinburgh MPI Optimisation Advanced Parallel Programming David Henty, Iain Bethune, Dan Holmes EPCC, University of Edinburgh Overview Can divide overheads up into four main categories: Lack of parallelism Load imbalance

More information

Memory. From Chapter 3 of High Performance Computing. c R. Leduc

Memory. From Chapter 3 of High Performance Computing. c R. Leduc Memory From Chapter 3 of High Performance Computing c 2002-2004 R. Leduc Memory Even if CPU is infinitely fast, still need to read/write data to memory. Speed of memory increasing much slower than processor

More information

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 341 6.2 Types of Memory 341 6.3 The Memory Hierarchy 343 6.3.1 Locality of Reference 346 6.4 Cache Memory 347 6.4.1 Cache Mapping Schemes 349 6.4.2 Replacement Policies 365

More information

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

arxiv: v1 [cs.dc] 27 Sep 2018

arxiv: v1 [cs.dc] 27 Sep 2018 Performance of MPI sends of non-contiguous data Victor Eijkhout arxiv:19.177v1 [cs.dc] 7 Sep 1 1 Abstract We present an experimental investigation of the performance of MPI derived datatypes. For messages

More information

Parallel Programming Interfaces

Parallel Programming Interfaces Parallel Programming Interfaces Background Different hardware architectures have led to fundamentally different ways parallel computers are programmed today. There are two basic architectures that general

More information

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms

The levels of a memory hierarchy. Main. Memory. 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms The levels of a memory hierarchy CPU registers C A C H E Memory bus Main Memory I/O bus External memory 500 By 1MB 4GB 500GB 0.25 ns 1ns 20ns 5ms 1 1 Some useful definitions When the CPU finds a requested

More information

Distributed Shared Memory

Distributed Shared Memory Distributed Shared Memory History, fundamentals and a few examples Coming up The Purpose of DSM Research Distributed Shared Memory Models Distributed Shared Memory Timeline Three example DSM Systems The

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

1 Connectionless Routing

1 Connectionless Routing UCSD DEPARTMENT OF COMPUTER SCIENCE CS123a Computer Networking, IP Addressing and Neighbor Routing In these we quickly give an overview of IP addressing and Neighbor Routing. Routing consists of: IP addressing

More information

Introduction CHAPTER. Practice Exercises. 1.1 What are the three main purposes of an operating system? Answer: The three main puropses are:

Introduction CHAPTER. Practice Exercises. 1.1 What are the three main purposes of an operating system? Answer: The three main puropses are: 1 CHAPTER Introduction Practice Exercises 1.1 What are the three main purposes of an operating system? Answer: The three main puropses are: To provide an environment for a computer user to execute programs

More information

CHAPTER 4 MEMORY HIERARCHIES TYPICAL MEMORY HIERARCHY TYPICAL MEMORY HIERARCHY: THE PYRAMID CACHE PERFORMANCE MEMORY HIERARCHIES CACHE DESIGN

CHAPTER 4 MEMORY HIERARCHIES TYPICAL MEMORY HIERARCHY TYPICAL MEMORY HIERARCHY: THE PYRAMID CACHE PERFORMANCE MEMORY HIERARCHIES CACHE DESIGN CHAPTER 4 TYPICAL MEMORY HIERARCHY MEMORY HIERARCHIES MEMORY HIERARCHIES CACHE DESIGN TECHNIQUES TO IMPROVE CACHE PERFORMANCE VIRTUAL MEMORY SUPPORT PRINCIPLE OF LOCALITY: A PROGRAM ACCESSES A RELATIVELY

More information

Chap. 4 Multiprocessors and Thread-Level Parallelism

Chap. 4 Multiprocessors and Thread-Level Parallelism Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,

More information

Performance Optimization Part II: Locality, Communication, and Contention

Performance Optimization Part II: Locality, Communication, and Contention Lecture 7: Performance Optimization Part II: Locality, Communication, and Contention Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Beth Rowley Nobody s Fault but Mine

More information

Memory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB

Memory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB Memory Consistency and Multiprocessor Performance Adapted from UCB CS252 S01, Copyright 2001 USB 1 Memory Consistency Model Define memory correctness for parallel execution Execution appears to the that

More information

Parallelization Principles. Sathish Vadhiyar

Parallelization Principles. Sathish Vadhiyar Parallelization Principles Sathish Vadhiyar Parallel Programming and Challenges Recall the advantages and motivation of parallelism But parallel programs incur overheads not seen in sequential programs

More information

Caches Concepts Review

Caches Concepts Review Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on

More information

6.1 Multiprocessor Computing Environment

6.1 Multiprocessor Computing Environment 6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,

More information

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle

Dynamic Fine Grain Scheduling of Pipeline Parallelism. Presented by: Ram Manohar Oruganti and Michael TeWinkle Dynamic Fine Grain Scheduling of Pipeline Parallelism Presented by: Ram Manohar Oruganti and Michael TeWinkle Overview Introduction Motivation Scheduling Approaches GRAMPS scheduling method Evaluation

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

Cache Performance (H&P 5.3; 5.5; 5.6)

Cache Performance (H&P 5.3; 5.5; 5.6) Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st

More information

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors? Parallel Computers CPE 63 Session 20: Multiprocessors Department of Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection of processing

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

Performance analysis basics

Performance analysis basics Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Homework # 2 Due: October 6. Programming Multiprocessors: Parallelism, Communication, and Synchronization

Homework # 2 Due: October 6. Programming Multiprocessors: Parallelism, Communication, and Synchronization ECE669: Parallel Computer Architecture Fall 2 Handout #2 Homework # 2 Due: October 6 Programming Multiprocessors: Parallelism, Communication, and Synchronization 1 Introduction When developing multiprocessor

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Caching Basics. Memory Hierarchies

Caching Basics. Memory Hierarchies Caching Basics CS448 1 Memory Hierarchies Takes advantage of locality of reference principle Most programs do not access all code and data uniformly, but repeat for certain data choices spatial nearby

More information

Memory Consistency and Multiprocessor Performance

Memory Consistency and Multiprocessor Performance Memory Consistency Model Memory Consistency and Multiprocessor Performance Define memory correctness for parallel execution Execution appears to the that of some correct execution of some theoretical parallel

More information

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions Data In single-program multiple-data (SPMD) parallel programs, global data is partitioned, with a portion of the data assigned to each processing node. Issues relevant to choosing a partitioning strategy

More information

A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination

A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination 1 1 A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination David M. Koppelman Department of Electrical & Computer Engineering Louisiana State University, Baton Rouge koppel@ee.lsu.edu

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Chapter 20: Database System Architectures

Chapter 20: Database System Architectures Chapter 20: Database System Architectures Chapter 20: Database System Architectures Centralized and Client-Server Systems Server System Architectures Parallel Systems Distributed Systems Network Types

More information

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 233 6.2 Types of Memory 233 6.3 The Memory Hierarchy 235 6.3.1 Locality of Reference 237 6.4 Cache Memory 237 6.4.1 Cache Mapping Schemes 239 6.4.2 Replacement Policies 247

More information

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache Classifying Misses: 3C Model (Hill) Divide cache misses into three categories Compulsory (cold): never seen this address before Would miss even in infinite cache Capacity: miss caused because cache is

More information

6. Parallel Volume Rendering Algorithms

6. Parallel Volume Rendering Algorithms 6. Parallel Volume Algorithms This chapter introduces a taxonomy of parallel volume rendering algorithms. In the thesis statement we claim that parallel algorithms may be described by "... how the tasks

More information

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra ia a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software

More information

CS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches

CS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches CS 61C: Great Ideas in Computer Architecture Direct Mapped Caches Instructor: Justin Hsia 7/05/2012 Summer 2012 Lecture #11 1 Review of Last Lecture Floating point (single and double precision) approximates

More information

Programming with MPI

Programming with MPI Programming with MPI p. 1/?? Programming with MPI Miscellaneous Guidelines Nick Maclaren nmm1@cam.ac.uk March 2010 Programming with MPI p. 2/?? Summary This is a miscellaneous set of practical points Over--simplifies

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008 Parallel Computing Using OpenMP/MPI Presented by - Jyotsna 29/01/2008 Serial Computing Serially solving a problem Parallel Computing Parallelly solving a problem Parallel Computer Memory Architecture Shared

More information