
MPI-FM: High Performance MPI on Workstation Clusters

Mario Lauria (visiting at time of writing)
Dipartimento di Informatica e Sistemistica
Universita di Napoli "Federico II"
via Claudio, Napoli, Italy
lauria@nadis.dis.unina.it

Andrew Chien
Department of Computer Science
University of Illinois at Urbana-Champaign
1304 W. Springfield Ave., Urbana, IL 61801, USA
achien@cs.uiuc.edu

Abstract

Despite the emergence of high speed LANs, the communication performance available to applications on workstation clusters still falls short of that available on MPPs. A new generation of efficient messaging layers is needed to take advantage of the hardware performance and to deliver it to the application level. Communication software is the key element in bridging the communication performance gap separating MPPs and workstation clusters. MPI-FM is a high performance implementation of MPI for networks of workstations connected with a Myrinet network, built on top of the Fast Messages (FM) library. Based on the FM version 1.1 released in Fall 1995, MPI-FM achieves a minimum one-way latency of 19 µs and a peak bandwidth of 17.3 MByte/s with common MPI send and receive function calls. A direct comparison using published performance figures shows that MPI-FM running on SPARCstation 20 workstations connected with a relatively inexpensive Myrinet network outperforms the MPI implementations available on the IBM SP2 and the Cray T3D, both in latency and in bandwidth, for messages up to 2 KByte in size.

We describe the critical performance issues found in building a high level messaging library (MPI) on top of a low level messaging layer (FM), and the design solutions we adopted for them. One such issue was the direct and efficient support of common operations like adding and removing a header. Another was the exchange of critical information between the layers, like the location of the destination buffer. These two optimizations are both shown to be necessary, and their combination sufficient to achieve the aforementioned level of performance. The performance contribution of each of these optimizations is examined in some detail. These results delineate a new design approach for low level communication layers, in which a closer integration with the upper layer and an appropriate balance of the communication pipeline stages are the key elements for high performance.

1 Introduction

Growing interest in the use of networks of workstations to perform high performance computation has been spurred by the remarkable growth of their computational performance. Moreover, when compared to their closest competitor (massively-parallel processors based on commodity processors), workstations have other advantages. They are relatively cheap, and their large sales volume attracts investments directed to their rapid improvement. Software is abundant, readily available, and has a large base of established users. The use of networked machines to do distributed computing is not new, and a number of communication libraries which use TCP/IP over Ethernet have been around for some time now (BSD sockets [14], PVM [21]). However, Ethernet and its associated networking protocols were not designed for high performance computing, and their limitations when used for this purpose severely restrict the range of applications that can be run and achieve good parallel performance. Communication protocols require a number of services to work, like timers, buffer management, process protection, and process notification. In traditional protocol implementations, these services are provided by the operating system. But the convenience of relying on the operating system is paid for in terms of additional copies between address spaces, and by the context switch occurring at each system call [7], both of which contribute to overhead and reduce performance. As a result, on a network of workstations, the observed communication latency over TCP/IP is on the order of one half to one millisecond. In the same time interval, a typical workstation RISC processor can execute tens to hundreds of thousands of instructions.

Workstation clusters have recently become attractive for high performance computation due to the introduction of new communication technologies with much improved performance. The fast Local Area Networks (LANs) available today (ATM [5], FDDI, Fibrechannel [1], Myrinet [3]) are, in terms of latency and bandwidth, comparable to the proprietary interconnects found on MPPs. However, without a radical change in the way communication protocols are implemented, applications will not be able to reap the benefits of this new technology. Existing protocol implementations have been shown to achieve only modest performance improvements when used on new network hardware [15]. Our solution is to build new communication software, designed from the start with the objectives of low latency and high bandwidth communication. In the context of the Fast Messages (FM) project [17], we selected the Myrinet network due to its performance, programmability and price/performance ratio. On this network we wrote the FM library, a highly optimized, low latency messaging layer providing a virtual interface for the hardware [17]. The lowest layer of the communication software often loses most of the raw performance, due to architectural barriers (I/O bus) and to the large difference in abstractions (hardware device vs. programming interface). So in this first part of the project the goal was to minimize the performance loss. All choices, including the decision as to which services to include in the interface, were driven by performance considerations. The design of FM addresses some of the critical issues found in building a low level messaging layer: division of labor between the host and the network coprocessor, management of the input/output (I/O) bus, and buffer management. Implemented entirely in user space, FM avoids the high overhead of system calls.

By providing a few key services (buffer management, reliable and in-order delivery), the FM programming interface allows for a leaner, more efficient implementation of the higher level communication layers. FM achieves a short message latency of only 14 µs and a peak bandwidth of 17.6 MB/s, with an Active Messages style interface. The first part of the research project is complete and is documented in [17], and only a few details of the FM interface will be presented here. The second phase of the project constitutes the object of this work, and will be described in greater detail.

After this first phase, the problem of optimizing the FM interface was tackled. For this purpose, another communication layer, the MPI library, was built on top of FM to close the gap toward the user level. Then the entire communication hierarchy was studied to gain a better understanding of the origins of the software overhead. Once identified, the major sources of overhead were removed by modifying the FM interface as required. As a result, only those services that were shown to be strictly indispensable in reducing the overhead, and that could be implemented without substantial performance degradation, made their way into FM. One of the changes we made to FM was adding a simple gather, to support common operations like adding and removing headers. Another was the inclusion of an upcall to allow the exchange of critical information between the layers, like the location of the destination buffer. Most importantly, while all these optimizations individually contributed to performance, only their simultaneous application exposed all their potential benefit, revealing the importance of balancing the messaging layer across all the processors involved (sender, receiver, network interfaces). The performance achieved with the final version of MPI-FM is 19 µs 0-byte latency and 17.3 MB/s bandwidth. When compared with MPI-F, the optimized MPI implementation for the IBM SP2, MPI-FM shows better performance for message sizes of 2 KB or less. We were unable to find any results for networks of workstations that provided comparable performance. A current implementation of MPI on TCP/IP is typically two orders of magnitude worse for latency, and one order for bandwidth, so the comparison is hardly interesting. To the best of our knowledge, MPI-FM is the fastest implementation of MPI available for workstations at the time of writing (Spring 1996).

The remainder of this work is organized as follows. In Section 2, some background information is given on the three basic components of MPI-FM, which are Myrinet, FM for Myrinet and MPI. In Section 3, some related work is examined. In Section 4, the details of the implementation of MPI on top of FM 1.1 are given. The performance measurements are presented in Section 5, along with the analysis of the results. Section 6 will conclude this work with a summary of the contributions, some final considerations, and some topics for future research.

2 Background

Workstations

The workstations used for the measurements are Sun SPARCstation 20/71's. Each has a 75 MHz SuperSPARC-II processor and a 1 MB second level cache (SuperCache), and is rated at 125.8 SPECint92 / 121.2 SPECfp92. The SPARCstation 20 has a Sun-4m architecture, whose main feature is a high performance cache-coherent memory bus (MBus).

In our tests, the large second level cache was particularly important, as it reduced the cost of copying data out of the DMA region (the region of pinned memory in which FM stores incoming messages). The I/O bus (SBus) on which the network interface resides is clocked at 25 MHz and has a peak transfer rate of around 45 MB/s. A major problem with the Sun workstation architecture is that its I/O bus is optimized for large message transfers. Using programmed I/O (i.e., having the CPU move one double-word at a time) instead of the interface's DMA reduces the peak transfer rate to 22 MB/s. In the present version of FM, this represents a major bottleneck for bandwidth and directly limits performance for long messages.

Network

Myrinet is a high speed LAN interconnect which uses bidirectional byte-wide copper links to achieve a physical bandwidth of nearly 80 MB/s in each direction [3]. It uses the interconnect technology developed for the Caltech Mosaic project [1]. A Myrinet network is composed of network interfaces connected to crossbar switches by point-to-point links. The full crossbar switches have four or eight ports, and can be interconnected in an arbitrary topology. They use wormhole routing, which allows the packets to be switched with a latency of only about half a microsecond. The network interface consists of a custom VLSI chip (the LANai), 128 KB of fast SRAM, differential line drivers/receivers for the link, and the SBus control logic. The LANai contains a link interface, a processor and three DMA engines (one each for the incoming channel, outgoing channel, and the SBus). The LANai processor is a rather slow CISC processor, and is clocked at the SBus speed (25 MHz in our machines). Host-LANai interaction is achieved by mapping the interface memory into the host address space. While the host can read/write the interface memory with load/store operations, it cannot start the DMAs. Single word accesses to the LANai memory are rather expensive because they cross the SBus. As in most systems, DMA transfers to/from the host must be performed through a pinned-down DMA buffer in the kernel address space.

FM for Myrinet

Table I lists the three FM primitives. FM differs from a pure message passing paradigm in that there are two asynchronous send operations but no corresponding receives. Rather, each message includes the name of a handler, which is a user-defined function that is invoked upon message arrival, and that will process the carried data as required. When invoked, the FM_extract() function processes the pending received messages, dequeuing them and executing their handlers. FM_send_4() is a specialized version for messages of no more than four words, optimized for latency. The FM interface is a generalization of the Active Message model [22], in that there are no restrictions on the communication operations that can be carried out by the handler. (The user is responsible for avoiding deadlock situations.) Reliable and in-order delivery guarantees can be expensive if implemented in the upper messaging layers [13]. Their cost can be decreased if built directly into the lower level layer, where there is an opportunity to take advantage of some useful features of the network. Through a careful design which exploits the characteristics of the Myrinet architecture, FM offers reliable and in-order guarantees with minimal performance degradation.

    Function                              Operation
    FM_send_4(dest,handler,i0,i1,i2,i3)   Send a four word message
    FM_send(dest,handler,buf,size)        Send a long message
    FM_extract()                          Process received messages

    Table I: The FM 1.1 communication primitives

FM is composed of two parts, the host program and the LANai control program. The division of labour between the host and the interface processor is critical for performance, because of the potential for parallelism in protocol processing but also for bottlenecks if the load is not properly balanced. The relative difference in the speeds of the two processors, and the modest performance of the LANai in absolute terms, suggest assigning most of the work to the host, and keeping the LANai control program as simple as possible. To send a packet, the host processor writes the data directly into the LANai memory, one double-word at a time (programmed I/O). This procedure saves the double cost of copying the data into the DMA buffer, plus synchronizing with the LANai to get it to start the DMA. Since programmed I/O cannot take advantage of the faster burst transfer mode of the SBus, this solution improves latency at the expense of bandwidth. When a packet arrives at a node, the LANai moves it into host memory (DMA buffer) with a DMA transfer. This procedure ensures good bandwidth, quick delivery of the packet, and prompt draining of the network, even when the host program is busy or not available (i.e., descheduled). The packets are deposited into a receive queue in the DMA buffer, from which they will be dequeued and processed by the FM_extract() primitive.

FM uses a fixed packet format. This streamlines the queue management in the LANai control program. It also allows the overlapping of the protocol processing in the sender, network interface and receiver during the transfer of messages. A potential drawback is the increased complexity of the assemble/disassemble operations. Much of the work done on MPI-FM consisted precisely in dealing with this issue. The choice of the packet size was the result of a trade-off between contrasting benefits. A small packet size gives better latency, increases pipelining through the network, and gives potentially better network utilization. A larger size reduces the overheads, and thus gives better bandwidth. The current size of 128 bytes has been found to be the best compromise. Messages longer than 128 bytes are segmented and then reassembled within FM. Given its limited programming interface, the FM library is targeted to language and library developers rather than to end users. The range of applications being considered for development on FM, or already in the works, is testimony to the flexibility of its interface. Besides the MPI message passing library, these are the BSD socket interface, the Converse compiler back-end [12], the Tempest runtime library [11], and the Orca Project parallel object language [2].
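To make the programming model concrete, the following minimal sketch shows how a client of FM 1.1 might use the primitives of Table I. It is illustrative only: the header name "fm.h", the handler prototype, and the way a handler is named in a send call are assumptions on our part, not the actual FM declarations.

```c
/* Illustrative sketch of the FM 1.1 programming model (not actual FM code).
 * Assumed: a header "fm.h" declaring the Table I primitives, and a handler
 * prototype of void handler(void *buf, int size). */
#include <string.h>
#include "fm.h"            /* assumed to declare FM_send_4, FM_send, FM_extract */

static char last_msg[4096];
static int  last_len = 0;

/* User-defined handler: invoked on the receiving node by FM_extract()
 * for every message that names it. It must consume the carried data. */
static void msg_handler(void *buf, int size)
{
    last_len = size < (int)sizeof(last_msg) ? size : (int)sizeof(last_msg);
    memcpy(last_msg, buf, last_len);
}

/* Sending side: pick the short or the long primitive. */
static void deliver(int dest, const char *data, int nbytes)
{
    if (nbytes <= 16) {                       /* at most four words */
        int w[4] = {0, 0, 0, 0};
        memcpy(w, data, nbytes);
        FM_send_4(dest, msg_handler, w[0], w[1], w[2], w[3]);
    } else {
        FM_send(dest, msg_handler, (void *)data, nbytes);   /* segmented by FM */
    }
}

/* Receiving side: there is no explicit receive; the host periodically drains
 * the incoming queue, which runs the handlers of all pending messages. */
static void poll_network(void)
{
    FM_extract();
}
```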

MPI and MPICH

MPI (Message Passing Interface) is a message passing library, with primitive specifications for both C and Fortran. It is the result of the work of the MPI Forum, a committee composed of vendors and users started at the Supercomputing '92 conference with the aim of defining a message passing standard. MPI features a range of functionality, including point-to-point communication, with synchronous and asynchronous communication modes, and collective communication (barrier, broadcast, reduce, gather/scatter). MPI supports multicast and separate communication contexts by means of the communicator construct. A number of datatypes are predefined, and user-defined data types are supported as well. MPI comprises more than 120 functions. Such an intimidating size does not conflict with the possibility of a low overhead, high bandwidth implementation. In fact, this number of primitives is the result of the orthogonal combination of a small number of concepts, so that each primitive is actually capable of a rather streamlined implementation. The MPI standard has been gaining support in the parallel computing community since the presentation of the initial draft standard in 1993 and its formalization at the Message Passing Interface Forum [9]. One of MPI's largest attractions is the number of free implementations that have been made available. The quick and efficient realization of the MPI library on top of FM was made possible by the existence of one of these publicly available implementations.

Among the many MPI implementations, MPICH, developed by Argonne National Laboratory and Mississippi State University, appears to be one of the most popular. (The MPICH and other MPI implementations can be obtained at /mpi/index.html.) MPICH's portability derives from being built atop a restricted number of hardware-independent low level functions, collectively forming an Abstract Device Interface (ADI). Implementing the ADI functions is all that is required to run MPICH on a new platform. The ADI encapsulates the details and complexity of the underlying communication hardware into a separate module. By restricting the services provided to basic point-to-point message passing, it offers the minimum required to build a complete MPI implementation, resulting at the same time in a very general and portable interface. On top of the ADI, the remaining MPICH code implements the rest of the MPI standard, including the management of communicators, derived datatypes, and the implementation of collective operations using point-to-point operations. Table II summarizes the core message passing routines, representing the minimum set required to have a fully functional implementation. The ADI can be described as a virtual message passing interface with split-phase operations. To send or receive a message, a request is posted. A request is a structure containing all the relevant data: pointer to the message, message length, receiver/sender id, tag, and communicator. A second call is required to complete the send or receive operation. A number of ancillary functions are provided to cancel an ongoing communication, get the number of nodes in the network, and probe for the arrival of a specific message.
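The split-phase pattern can be summarized with a small sketch. The request layout and the MPID_* prototypes below are deliberately simplified assumptions made for illustration; the real MPICH ADI request object carries considerably more state.

```c
/* Simplified sketch of the split-phase ADI pattern (post, then complete),
 * in the spirit of Figure 1. The request layout and the MPID_* prototypes
 * are assumptions for illustration, not the actual MPICH declarations. */
typedef struct {
    void *buf;         /* pointer to the message */
    int   len;         /* message length */
    int   partner;     /* destination (send) or source (receive) rank */
    int   tag;
    int   context;     /* communicator context id */
    int   completed;
} ADI_Request;

extern void MPID_Post_send(ADI_Request *req);      /* starts a send     */
extern void MPID_Complete_send(ADI_Request *req);  /* blocks until done */
extern void MPID_Post_recv(ADI_Request *req);      /* starts a receive  */
extern void MPID_Complete_recv(ADI_Request *req);  /* blocks until done */

/* A blocking send, as the ADI expresses it: post the request, complete it. */
static void blocking_send(void *buf, int len, int dest, int tag, int ctx)
{
    ADI_Request req = { buf, len, dest, tag, ctx, 0 };
    MPID_Post_send(&req);
    MPID_Complete_send(&req);
}

/* A blocking receive follows the same two-phase pattern. */
static void blocking_recv(void *buf, int len, int src, int tag, int ctx)
{
    ADI_Request req = { buf, len, src, tag, ctx, 0 };
    MPID_Post_recv(&req);
    MPID_Complete_recv(&req);
}
```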

    Function                 Operation
    MPID_Cancel              Cancel a pending oper.
    MPID_Check_device        Checks for pending oper.
    MPID_Complete_send       Completes a send
    MPID_Complete_recv       Completes a receive
    MPID_End                 Terminates the ADI
    MPID_Init                Initializes the ADI
    MPID_Iprobe              Check if specific msg has arrived
    MPID_Myrank              Rank of calling process
    MPID_Mysize              Number of processes
    MPID_Post_send           Starts a send operation
    MPID_Post_send_ready     Starts a send, ready mode
    MPID_Post_send_sync      Starts a send, synchronous mode
    MPID_Post_recv           Starts a receive operation
    MPID_Test_send           Test for completion of a send
    MPID_Test_recv           Test for completion of a recv

    Table II: The ADI core routines

3 Related Work

A number of other research projects are focusing on the construction of an integrated high performance hardware-software communication subsystem for networks of workstations. U-Net [23] is built using FORE ATM interface cards. It presents an AAL5 programming interface to the applications. As in FM for Myrinet, the interface cards have a processor used to relieve the host processor of some tasks (for U-Net: segmentation and reassembly of messages, message demultiplexing, and DMA transfers to/from the host memory). One problem of ATM networks is the cost of the switch, much higher than the cost of a comparable Myrinet switch. Hamlyn [4] implements a sender-based memory management scheme, and gives applications direct access to the network interface. There are plans to implement it on a proprietary version of Myrinet for HP workstations. Cranium is in many respects similar to Hamlyn, and is to be implemented on an experimental interconnection network, Chaos. However, to our knowledge, at the time of this writing there are no high performance MPI implementations available on these interfaces.

As mentioned earlier, a number of libraries are already available to do parallel computing on networks of workstations. Libraries like PVM and MPI are typically built on top of the TCP/IP protocol found in virtually every workstation. The performance offered by TCP/IP, and consequently delivered to the libraries built on top of it, is one or two orders of magnitude worse than what is available on MPPs. The reason is essentially that both the protocol itself and its traditional implementation within the operating system were designed with objectives other than low latency and high bandwidth in mind. For instance, tolerance to high latency and high error rates, or protection among processes using the network, were primary design objectives.

Using a faster medium than Ethernet does not bring much improvement [15]. Some of the same libraries have been ported to MPPs. For example, two implementations of MPI are available on the SP2. One is a port of MPICH, the other (MPI-F) is a native implementation [10]. Both achieve 33 MB/s of peak bandwidth, with a 0-byte latency of 40.5 µs for MPI-F and 55 µs for the MPICH port. One feature of MPI-F is the packing/unpacking on the fly of complex data types, which exploits a pipe abstraction provided by the underlying communication layer. The MPI for the T3D has been produced by the Edinburgh Parallel Computing Centre (EPCC) in collaboration with Cray Research, Incorporated (CRI). It achieves a minimum latency of 43 µs; we measured a peak bandwidth of only 31.7 MB/s, despite a network which can provide up to 300 MB/s (however, we have not tried all the possible versions of MPI send and receive). MPI is available on the AP1000, where it achieves 332 µs minimum latency and 2.5 MB/s peak bandwidth [19].

4 Design

The approach followed in creating MPI-FM has been that of incrementally refining a straightforward implementation of the ADI. At each step, a different optimization was tested by modifying both the ADI and the FM interface as required, according to the results of the performance analysis. Thanks to this incremental process, it has been possible to precisely measure the impact of each modification. In this section, a general outline of the design is presented, followed by a description of the different versions of MPI-FM resulting from the modifications. In the next section the performance measurements for each version are presented.

4.1 General outline

Figure 1 shows the composition of MPI_Send and MPI_Recv in terms of the different layers' primitives. The MPID prefix denotes the ADI functions. FM_extract and the handler used for this type of receive, async_handler, are connected with a broken line because they are responsible for the advancement of the ADI receive routines, even though they are not called directly by them. Notice how in this particular case an ADI function is defined in terms of two others. It is often the case, both in the ADI and in the MPI layer above it, that a more complex function is defined in terms of simpler ones. For example, collective communication functions are usually implemented in terms of point-to-point primitives. The ADI primitives are semantically much closer to the MPI than to the FM interface. As a matter of fact, only a few lines of code are required in MPICH to define most of the point-to-point MPI functions in terms of ADI primitives. Consequently, much of the complexity is in the ADI-to-FM translation, which amounts to bridging a much wider semantic gap and two rather different programming models. The ADI-to-FM layer contributes the majority of the additional overhead, with the ADI-to-MPI layer contributing only on the order of a microsecond to the latency of a basic send-receive.

Figure 1: In MPI-FM the MPI primitives (MPI_Send and MPI_Recv in this figure) are built on top of the ADI routines (MPID prefix). The core ADI routines are in turn implemented using the FM primitives (FM prefix). (Layers shown: MPI_Send/MPI_Recv in the MPICH layer; MPID_Blocking_send, MPID_Blocking_recv, MPID_Post_send, MPID_Complete_send, MPID_Post_recv, MPID_Complete_recv in the ADI layer; FM_send and FM_extract/async_handler in the FM layer.)

We found the ADI to be quite well suited to matching a low level layer such as FM, due to its generality (no unnecessary restrictions built in) and flexibility (sufficient provisions to support a variety of devices). The key issue in the implementation of the ADI is the difference between the FM and ADI abstractions. As seen in Figure 1, there is not a one-to-one mapping between ADI and FM primitives. The two major differences between the two interfaces are (i) the ADI primitives are split phase ("start send", "complete send"), and (ii) FM does not have an explicit receive operation.

The split-phase nature of the ADI communication primitives implies that a state must be maintained for each pending operation. This is an important difference with FM, whose communication paradigm is stateless. To store the state of each request, two queues have been employed, one on the sender and one on the receiver side. Each element of the queues contains a description of the request and its current state. The two queues will be called send queue and receive queue in the following discussion. On the receive side, an additional queue (unexpected queue) is needed to store unexpected messages, i.e. incoming messages for which a receive request has not yet been posted. Each element of the queue stores a pointer to a temporary buffer, in addition to the description of the message. This queue and the associated buffering would not be needed if messages were kept waiting on the send side and then transferred upon posting of the receive request. But this kind of transfer-on-demand protocol would result in a higher average latency than the eager transfer protocol with buffering on the receive side, as employed by the ADI. In the latter scheme, a late receive would incur only the cost of a memory-to-memory copy, as opposed to the cost of a two way communication.

Instead of an explicit receive operation, FM relies on user-defined functions (handlers) to handle the content of the received messages. Incoming messages are enqueued in the host memory, and are extracted in turn by FM_extract, which executes the respective handler. Implementing the ADI receive primitive meant writing the appropriate handler and carefully placing FM_extract calls in the code. The task of the handlers is to copy the content of the incoming message into the destination buffer whose pointer is found in the receive request for that specific message. A temporary buffer is used in the case of an unexpected message. A different type of handler is required for synchronous operations, which is given the additional task of sending an acknowledgment message to the sender. The numerous calls to FM_extract guarantee that the messages are extracted, and their handlers executed, with the smallest possible delay.
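The receive path just described can be pictured with the following sketch. It is not the MPI-FM source: the structure layouts and the queue helpers are hypothetical, and only the overall flow (match against the posted receive queue, otherwise buffer on the unexpected queue) reflects the description above.

```c
/* Hypothetical sketch of the receive-side handler logic described above
 * (not the MPI-FM source). Structure layouts and the helpers recvq_match()
 * and unexpectedq_append() are invented for illustration. */
#include <stdlib.h>
#include <string.h>

typedef struct {                 /* header the ADI prepends to each message */
    int src, tag, context, length;
} adi_header_t;

typedef struct recv_req {        /* element of the posted receive queue */
    int   src, tag, context;
    void *user_buf;              /* destination buffer given to MPI_Recv */
    int   completed;
    struct recv_req *next;
} recv_req_t;

extern recv_req_t *recvq_match(int src, int tag, int context);
extern void        unexpectedq_append(const adi_header_t *h, void *copy);

/* Handler executed by FM_extract() for an incoming (short) message. */
static void async_handler(void *msg, int size)
{
    adi_header_t *h = (adi_header_t *)msg;
    char *payload   = (char *)msg + sizeof(adi_header_t);
    recv_req_t *req = recvq_match(h->src, h->tag, h->context);

    if (req != NULL) {
        /* A matching receive is posted: copy straight into the user buffer. */
        memcpy(req->user_buf, payload, h->length);
        req->completed = 1;
    } else {
        /* Unexpected message: keep a temporary copy until a receive arrives. */
        void *copy = malloc(h->length);
        memcpy(copy, payload, h->length);
        unexpectedq_append(h, copy);
    }
    (void)size;
}
```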

4.2 The four versions of MPI-FM

Two significant optimizations of the ADI and FM layers are analyzed in this paper. These are a gather mechanism on the send side, and an upcall function on the receive side. The four versions of MPI-FM described are (i) the base implementation, (ii) with the gather only, (iii) with the upcall only, and (iv) the final version with both optimizations.

MPI-FM base implementation

The latency and bandwidth curves of the first version of MPI-FM are shown in Figure 2. The 0-byte latency and the peak bandwidth of MPI-FM are 21 µs and 3.9 MB/s respectively, versus 14 µs and 17.6 MB/s for FM. While the base implementation is quite good in absolute terms, its distance from the FM curves shows that a good share of the FM performance was lost, and there was plenty of room for improvement.

Figure 2: Performance of the MPI-FM base implementation. (a) Latency; (b) bandwidth.

Through detailed measurements, the overhead was broken down into its components. As expected, the biggest components were memory-to-memory copies. These copies are employed in the straightforward implementation of some typical operations, like adding a header to a packet, found in almost every communication layer.

Therefore, exploring ways of eliminating them is of more general interest. Since the copies were being utilized for different purposes throughout the code, we had to come up with a different solution in each case, most of them involving modifications to FM. While these modifications are very specific to the implementation of MPI-FM, an upcoming release of FM will implement the same principles in a much more generalized fashion.

Optimization #1: MPI-FM with upcall

One of the memory-to-memory copy operations targeted for elimination was being used on the receive side in the reassembly of long messages. As shown in Figure 4 (middle), within FM long messages (> 104 bytes) are broken into packets on the send side, then put back together in a temporary buffer on the receive side. Upon completion of the reassembly, the application-specific handler specified in the message is called. The handler usually copies the message content to its final destination within the application. For example, in the case of the ADI implementation, the final destination is the buffer that the MPI user specified in the receive operation (or a temporary buffer, if there is no pending receive). The reassembly of long messages before they are handed out to the application handler represents an additional copy with respect to the case of short messages. Figure 3 shows the FM performance degradation resulting from this additional copy. The difference in the two measurements is in the action performed by the application handler: in one case the message is not touched once reassembled, in the other case it is copied out to a destination buffer.

Figure 3: FM: performance impact of copying the data out of the reassembly buffer on the receive side. (a) Latency; (b) bandwidth.

Figure 4: Segmentation & reassembly in MPI-FM. The message path runs from the MPI_Send buffer (MPI-FM) through the LANai + network and the DMA region (FM), possibly through a reassembly buffer (FM), to the MPI_Recv buffer (MPI-FM); the three cases shown are small messages (< 104 bytes), segmentation & reassembly, and segmentation & reassembly with upcall.

The ideal solution would be to reassemble the message directly in the destination buffer (Figure 4, right). Let's look at this possibility from the point of view of the implementor of FM (or of any similar low level messaging layer). There are two obstacles to overcome: (i) the destination buffer for the incoming message is generally known only to the application code running on top of FM (the ADI in the case of MPI-FM), and (ii) only the message body, i.e. the payload, must go into the destination buffer, not the headers added by FM or the ADI. The location of the destination buffer is in general only known by the application. Furthermore, the application needs to know the identity of the message in order to choose the appropriate buffer. In other words, the knowledge about what buffer to use is constrained both in space (only the application code on the receiver side knows) and in time (the code knows only after the message has arrived). For example, the ADI needs to know the source, tag and communicator of an incoming message to tell which (if any) receive request has been posted for it. Rather than trying to relax these constraints, for example by duplicating on the send side the knowledge about which buffer to use, we adopted an inquiry mechanism employing an upcall [6]. An upcall is a function defined inside an upper layer and invoked by the layer underneath. Whenever the first fragment of a long message arrives, FM asks the application code for a buffer to use as reassembly buffer. This is accomplished by invoking the upcall with the fragment as an argument. The upcall return value, if not null, is assumed to be a pointer to a buffer; if null, FM provides its own temporary buffer. The passed fragment contains sufficient information to determine the identity of the incoming message because it is the head of the message, where the application code puts a header containing the message attributes. (The fragment also contains the header that FM adds to each fragment, but since the FM header is actually appended to the fragment, the user code can safely ignore it.) To allow the application to drop the header of the incoming message before copying it into the destination buffer, the reassembly mechanism of FM has been slightly modified to allow the upcall to increment the pointer to the fragment passed as argument. Once the function has read the information needed for message identification, it can advance the pointer to the start of the payload.

FM will then start copying the fragment into the reassembly buffer starting from the updated position. As mentioned earlier, this mechanism is used only for long messages. Messages no longer than a packet are not segmented and don't require reassembly; thus FM is not involved in pulling them out of the incoming messages queue. That is, in the case of short messages, there are no unnecessary copies to be eliminated.

Another consideration about the upcall is that it exposes the issue of upward/downward interfacing between contiguous layers. A common view is that in a protocol stack a layer must offer a number of services to the one immediately above it, which in turn uses them to implement more complex functionality. A less explored aspect of this interaction is the opposite relationship, that is, the services an upper layer can usefully offer to the one below. It is sometimes the case that a certain piece of information needed by a low level layer is available only in an upper layer. In the MPI-FM case, the information about the buffer to use, needed by FM, is available in the ADI. The ADI in turn first needs to know the identity of the message before it can choose the corresponding buffer. The problem with the traditional layered and rigidly hierarchical model is that it is too restrictive on how functionality can be distributed across the layers. For example, building a rigidly hierarchical MPI-FM would require putting the buffering in FM, rather than in the ADI, so as to avoid the bottom-up exchange of information. Conversely, the adoption of the upcall adds one degree of flexibility, by allowing the low level (FM) to query the upper level (ADI).
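A sketch of what such an upcall might look like on the ADI side is given below. The signature (FM passes a pointer to the fragment pointer, and a non-null return value is taken as the reassembly buffer) is modelled on the description above but is not the actual FM prototype; the header and queue types reuse the hypothetical ones from the earlier receive-side sketch.

```c
/* Hypothetical upcall on the ADI side, modelled on the mechanism described
 * above (not the actual FM/MPI-FM interface). adi_header_t, recv_req_t and
 * recvq_match() are the invented types/helpers from the earlier sketch. */
static void *reassembly_upcall(char **fragment)
{
    adi_header_t *h = (adi_header_t *)(*fragment);
    recv_req_t *req = recvq_match(h->src, h->tag, h->context);

    /* Drop the ADI header: advance the fragment pointer so that FM copies
     * only the payload into the reassembly buffer. */
    *fragment += sizeof(adi_header_t);

    if (req != NULL)
        return req->user_buf;   /* reassemble directly in the MPI_Recv buffer */
    return NULL;                /* let FM fall back to its own temporary buffer */
}
```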
Optimization #2: MPI-FM with gather

The other copy targeted for elimination was the one used on the send side in assembling packets. Typically the code running on top of FM adds its own header to the data before sending it to another node. This header contains information about the message being transmitted, so that the code on the receiving node can identify it. Once utilized, the header can be disposed of and the data retrieved from the message. In this process, the crucial steps are adding the header ahead of a given data buffer, and getting rid of the given message's header. This is a special case of the more general problem of assembling a packet out of several scattered pieces and later disassembling it into pieces at different memory locations. The simple minded approach is to use an intermediate buffer, to copy in the pieces on the send side, or to copy them out on the receive side, at the cost of an additional copy. Figure 5 shows the performance degradation at the FM level due to the additional copy. The same number of bytes is being transmitted in the two tests. But in one case they are from a single contiguous memory area, in the other they are from two separate buffers (the first containing 24 bytes, the second containing the rest) and must first be copied into a contiguous buffer.

This issue arises from the fact that FM, like many other low level messaging layers, deals only with messages that are contiguous in memory. For FM, however, this restriction can be lifted without major changes to its structure. When the FM send primitive is invoked, the message is copied to the network interface a double word at a time in sequence. Since each double word could be copied from an arbitrary location instead, a contiguous message is not strictly required. Thus, FM_send has been modified to accept a two-part message. The first part fits the header size used by the ADI (24 bytes); the second part can be of arbitrary size. The new primitive, FM_gather24, performs a gather on the fly by copying all 24 bytes from the first segment, and all the remaining data from the second segment, directly into the network interface.

Figure 5: FM: performance impact of assembling a contiguous message on the send side. (a) Latency; (b) bandwidth.

When the message is bigger than a packet, the second segment is fragmented as it would be in FM_send. Despite being quite a primitive gather facility, FM_gather24 proved to be sufficient to substantially reduce the overhead of the send side.
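The effect of the gather on the send path can be sketched as follows. The FM_gather24() signature shown here is an assumption based on the description above (a 24-byte first segment plus an arbitrary second segment); the staging-buffer variant mirrors the extra copy that the base implementation had to perform.

```c
/* Sketch contrasting the two send paths discussed above (illustrative only;
 * the FM_gather24 signature is assumed from the text, not taken from FM). */
#include <assert.h>
#include <string.h>

#define ADI_HDR 24                 /* header size used by the ADI */
#define MAX_MSG 16384              /* illustrative staging-buffer bound */

typedef void (*fm_handler_t)(void *buf, int size);
extern void FM_send(int dest, fm_handler_t h, void *buf, int size);
extern void FM_gather24(int dest, fm_handler_t h,
                        const void *hdr24, const void *payload, int size);

/* Base version: header and payload are first copied into one contiguous
 * buffer, because FM_send only accepts contiguous messages. */
static void send_with_copy(int dest, fm_handler_t h,
                           const char hdr[ADI_HDR], const char *payload, int n)
{
    static char staging[ADI_HDR + MAX_MSG];
    assert(n <= MAX_MSG);
    memcpy(staging, hdr, ADI_HDR);
    memcpy(staging + ADI_HDR, payload, n);      /* the extra copy */
    FM_send(dest, h, staging, ADI_HDR + n);
}

/* Optimized version: the two pieces are handed to FM separately and gathered
 * on the fly while being written into the interface memory. */
static void send_with_gather(int dest, fm_handler_t h,
                             const char hdr[ADI_HDR], const char *payload, int n)
{
    FM_gather24(dest, h, hdr, payload, n);
}
```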

Since FM does not have an explicit receive operation with which to build the dual operation, scatter, on the receive side, the two-part scatter had to be implemented elsewhere. The pointer update feature built into the upcall function, although limited to dropping the header, has conveniently substituted for a full-fledged scatter in this context.

The assembly and disassembly of packets is a common function performed by messaging layers. In its simplest form, a header needs to be added in front of a payload, containing information for the corresponding layer on the receiving node. On the receiver side the opposite operation is performed. The ADI, and the underlying FM, are not the only example of this kind of packet manipulation; most of the layers in the ISO and ATM hierarchies perform the same type of operation. A two piece assembly/disassembly is the simplest case of the general gather/scatter functionality which allows a message to be built out of a number of pieces. The need for a bigger number of components is commonly associated with the packing/unpacking of messages containing noncontiguous data types. A more general form of gather/scatter is being considered for implementation in FM; it could be usefully employed in the assembly/disassembly of messages containing MPI derived data types. It is worth noting once more that the most common form of gather is also the simplest, and it can already make a big difference. Whether involving two pieces or more, the crucial issue for the gather/scatter is avoiding additional copies. Copies can be avoided if the capability of accepting messages in multiple parts is preserved all the way down the layer hierarchy to the lowest one, where the actual copying of the data into the network interface memory takes place. Restoring this feature in FM was not only possible, even if in a limited form, but also inexpensive in terms of overhead.

More in general, achieving an effective division of tasks across layers involves choosing the best place in which to implement the services offered. Some services, like reliable delivery, are considered essential in most applications, and thus have to be implemented at some level in the layer stack. Adding a new service has a cost, which can vary widely depending on which layer is chosen for the implementation, the facilities already available in it, and other contingencies. For example, implementing the two-piece gather feature in FM rather than in the ADI improved the overall MPI-FM performance. Since FM is copying the message into the network interface a double word at a time, it does not really make much of a difference if the message is in two pieces rather than in one contiguous block. Instead, implementing the gather within the ADI in the initial version of MPI-FM exacted the cost of the additional copy of the message pieces into a temporary buffer.

Final version: MPI-FM with upcall and gather

The upcall and the gather are orthogonal techniques, and can be used together. Each reduces the processing overhead on its side of the communication: the gather on the send side, and the upcall on the receive side. Thus, their effects can potentially be cumulative.

5 Measurements and analysis

The following sections describe the effects of the gather and the upcall on MPI-FM. The two modifications will be presented separately first, and then combined in what is the final version of MPI-FM. The performance of each version will be contrasted with that of FM and that of the base version, which represent the performance target and baseline respectively.

5.1 Methodology

The network used for the measurements presented in the following is composed of two SPARCstation 20/71's running Solaris 2.4 and connected with a Myrinet network. The network interfaces use the new version of the LANai chip (LANai 3.2), which is a beta version of the LANai 4.0. The compiler used is gcc. The LANai compiler and the Myricom software distribution employed are version 3.0. All measurements have been performed using the timer on the network interfaces (through the MPI_Wtime function), which offers microsecond accuracy. Each measurement is repeated a number of times, and the median time is taken, to filter out spurious results due to process descheduling, CPU load surges, or other factors. Measurements were performed in multiuser mode, but at times when no other users were using the machines and no big jobs (i.e., with %CPU > 1 as shown by the ps -aux command) were being run. In the measurements, the size of the messages refers to the payload only (i.e., the size does not include the header). The latency test is an MPI program which measures the time to send a sequence of messages, every time waiting for the message to be sent back. The time is taken for a sequence long enough to reduce the timer granularity error to acceptable levels.

The communication primitives used are MPI_Send and MPI_Recv. The bandwidth test is an MPI program which measures the time to send a sequence of messages back to back from one node to another. The clock is stopped when the last message is acknowledged. The communication primitives used are MPI_Send, MPI_Irecv and MPI_Waitall. The versions of the send and receive operations used in the tests are the most commonly used, and thus are at the highest premium to be efficient and robust. Because of the structure of the MPICH code, their performance is a good indicator of the performance of the other MPI communication versions in MPICH, as most of the other primitives use the same ADI primitives employed by the basic send and receive. This relationship may not be the same in implementations of MPI which exploit special underlying hardware features for collective communication. In all of the tests performed, the timing was such that the receives happened to always be posted in advance, even without explicit pre-posting or use of barriers. This situation has been explicitly verified, both by instrumenting the code and by writing specific MPI tests.
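For reference, a minimal ping-pong program in the style of the latency test just described might look as follows. This is an illustrative reconstruction, not the authors' benchmark code; the message size and repetition count are arbitrary, and the paper's methodology additionally repeats each measurement and takes the median. The bandwidth test differs in that one node streams messages back to back while the other pre-posts MPI_Irecv's and waits on them with MPI_Waitall.

```c
/* Illustrative MPI ping-pong in the style of the latency test described
 * above (not the authors' code); REPS and SIZE are arbitrary choices. */
#include <stdio.h>
#include <mpi.h>

#define REPS 1000
#define SIZE 1024

int main(int argc, char **argv)
{
    char buf[SIZE] = {0};
    int rank, i;
    MPI_Status st;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)      /* one-way latency = round-trip time / 2 */
        printf("%d bytes: %.1f us one-way\n", SIZE,
               (t1 - t0) / REPS / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}
```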

5.2 The four versions

MPI-FM base version performance

The comparison of the latency and bandwidth curves of FM with those of the first version of MPI-FM was presented in Figure 2. The 0-byte latency and the peak bandwidth of MPI-FM are 21 µs and 3.9 MB/s respectively, against 14 µs and 17.6 MB/s for FM. Most of the FM performance is lost in this version. The difference in latency, that is, the additional overhead contributed by MPI-FM, grows rapidly with message size. At 256 bytes, the latency of MPI-FM is already one order of magnitude higher than that of FM. The peak bandwidth is reduced to just one fourth of the FM peak value. Since the additional overhead keeps increasing with size, it cannot be amortized even with big messages. To facilitate the evaluation of the improvement introduced by each optimization, these curves will be reported in the next graphs as the starting point and the target respectively.

Optimization #1: MPI-FM with upcall

The information regarding the appropriate buffer to use is available only to the ADI code, which needs to know the identity of the incoming message before it can decide which one is the right buffer to give to FM. For this reason, the ADI is interrogated by FM right after the arrival of the first packet of the message, which contains the header and thus the identity of the message itself. The upcall function is also given the task of removing the header once it has utilized its content. Figure 6 shows the latency and the bandwidth of MPI-FM with the addition of the upcall. Both latency and bandwidth present a noticeable improvement for message sizes greater than or equal to 128 bytes.

Figure 6: Performance impact of the upcall in MPI-FM. (a) Latency; (b) bandwidth.

Optimization #2: MPI-FM with gather

Figure 7 shows the latency and the bandwidth when the simple gather mechanism is added to FM. As described earlier, the new version of FM_send() called FM_gather24() accepts a message composed of two parts, a header and a payload. The header has a fixed length of 24 bytes (to suit the needs of the ADI); the payload can have arbitrary length. The use of FM_gather24() removes the need for the additional copy on the send side, with a visible benefit for both latency and bandwidth.

Figure 7: Performance impact of the gather in MPI-FM. (a) Latency; (b) bandwidth.

Final version: MPI-FM with upcall and gather

A global view of the different contributions to performance is shown in Figure 8, in which the graphs of all four different versions of MPI-FM are reported. Both optimizations, when individually applied, contribute to improve latency and bandwidth. However, it is only when the sender and receiver are both optimized that the full performance of MPI-FM can be achieved.

Figure 8: Comparison of the different versions of MPI-FM. (a) Latency; (b) bandwidth.

The impact of the joint application of gather and upcall is more than the sum of the individual contributions. The reason has to do with the pipelining of consecutive messages, and is best explained with the help of Figure 9. In this figure the time taken to transfer a message is decomposed into three components, one for each of the processors involved: sender processor cycles (sender overhead), LANai cycles on both interfaces plus network latency, and receiver processor cycles (receiver overhead). When consecutive messages are allowed to overlap, two cases are possible. In the first case, the sender can start sending a new message as soon as it is done with the previous one, while the receiver is periodically idle, waiting for the arrival of the next one. This is the case of the MPI-FM version with upcall; the bandwidth of the system is still limited by the sender. In the other case, the receiver is slower. In this situation the buffers eventually fill up, and the sender is slowed down by the back pressure exerted by the network.

Figure 9: The pipelining of messages in the bandwidth test. (Panels: sender limited bandwidth vs. receiver limited bandwidth; components: sender processor overhead, LANai + network, receiver processor overhead.)

A new message will be sent only after the receiver has pulled a prior one out of the network on the other side. This is the version with gather; the system bandwidth is receiver limited. With only one optimization at work, the other unoptimized side of the communication becomes the bottleneck. Both are needed at the same time to achieve the full potential of the network. A more general observation is that a message being transferred from one node to another is actually being pipelined through a series of stages. The stages of the pipeline must be lean and tightly coupled for the sake of latency, and well balanced for the sake of bandwidth. In MPI-FM the pipeline is well balanced, and thus performance is good, over all message sizes. This issue is complicated by the fact that different subsystems of a network of workstations are involved. Processors (both on the host and on the network interface), memory hierarchies, I/O buses and network links are the results of different technologies, which evolve at different speeds. For example, I/O buses change much more slowly among successive generations of workstations than processors or memory architectures. This means that a balanced pipeline on one workstation will not necessarily be so on the next generation. Even differences in the memory configuration within the same generation can make a big difference.

The final judgment on how successful the tuning of MPI-FM has been can be made by comparing the curves of the final version to those of FM. Accounting for a left translation due to the inclusion of the ADI header in every message, the MPI-FM curve substantially reproduces the shape of the FM curve (compare the bumps corresponding to the start of the segmentation and reassembly). The additional MPI-FM overhead is now limited to an approximately constant value of around 6 µs. The MPICH/ADI layer is sufficiently "thin" and well matched to the FM interface that it does not contribute a significant amount of overhead. This is in line with the objective of the MPI-FM implementation, of having the biggest possible share of the FM communication performance delivered to the application. The difference between latencies is about 6 µs for short messages. The peak bandwidth of 17.3 MB/s closely approaches the 17.6 MB/s of FM.

5.3 MPI-FM performance in perspective

To put the results achieved with our version of MPI in the right perspective, we compare its performance to that of MPI libraries running on two mainstream MPPs, the IBM SP2 and the Cray T3D.


Memory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Memory hierarchy J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Computer Architecture ARCOS Group Computer Science and Engineering Department University Carlos III of Madrid

More information

Welcome to Part 3: Memory Systems and I/O

Welcome to Part 3: Memory Systems and I/O Welcome to Part 3: Memory Systems and I/O We ve already seen how to make a fast processor. How can we supply the CPU with enough data to keep it busy? We will now focus on memory issues, which are frequently

More information

Plot SIZE. How will execution time grow with SIZE? Actual Data. int array[size]; int A = 0;

Plot SIZE. How will execution time grow with SIZE? Actual Data. int array[size]; int A = 0; How will execution time grow with SIZE? int array[size]; int A = ; for (int i = ; i < ; i++) { for (int j = ; j < SIZE ; j++) { A += array[j]; } TIME } Plot SIZE Actual Data 45 4 5 5 Series 5 5 4 6 8 Memory

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

Disks and I/O Hakan Uraz - File Organization 1

Disks and I/O Hakan Uraz - File Organization 1 Disks and I/O 2006 Hakan Uraz - File Organization 1 Disk Drive 2006 Hakan Uraz - File Organization 2 Tracks and Sectors on Disk Surface 2006 Hakan Uraz - File Organization 3 A Set of Cylinders on Disk

More information

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel Chapter-6 SUBJECT:- Operating System TOPICS:- I/O Management Created by : - Sanjay Patel Disk Scheduling Algorithm 1) First-In-First-Out (FIFO) 2) Shortest Service Time First (SSTF) 3) SCAN 4) Circular-SCAN

More information

Parallel and High Performance Computing CSE 745

Parallel and High Performance Computing CSE 745 Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel

More information

A First Implementation of In-Transit Buffers on Myrinet GM Software Λ

A First Implementation of In-Transit Buffers on Myrinet GM Software Λ A First Implementation of In-Transit Buffers on Myrinet GM Software Λ S. Coll, J. Flich, M. P. Malumbres, P. López, J. Duato and F.J. Mora Universidad Politécnica de Valencia Camino de Vera, 14, 46071

More information

Introduction to Operating Systems. Chapter Chapter

Introduction to Operating Systems. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

Network. Department of Statistics. University of California, Berkeley. January, Abstract

Network. Department of Statistics. University of California, Berkeley. January, Abstract Parallelizing CART Using a Workstation Network Phil Spector Leo Breiman Department of Statistics University of California, Berkeley January, 1995 Abstract The CART (Classication and Regression Trees) program,

More information

Chapter 2: Computer-System Structures. Hmm this looks like a Computer System?

Chapter 2: Computer-System Structures. Hmm this looks like a Computer System? Chapter 2: Computer-System Structures Lab 1 is available online Last lecture: why study operating systems? Purpose of this lecture: general knowledge of the structure of a computer system and understanding

More information

Uniprocessor Computer Architecture Example: Cray T3E

Uniprocessor Computer Architecture Example: Cray T3E Chapter 2: Computer-System Structures MP Example: Intel Pentium Pro Quad Lab 1 is available online Last lecture: why study operating systems? Purpose of this lecture: general knowledge of the structure

More information

Introduction to Operating. Chapter Chapter

Introduction to Operating. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

Lecture 1: January 22

Lecture 1: January 22 CMPSCI 677 Distributed and Operating Systems Spring 2018 Lecture 1: January 22 Lecturer: Prashant Shenoy Scribe: Bin Wang 1.1 Introduction to the course The lecture started by outlining the administrative

More information

Utilizing Linux Kernel Components in K42 K42 Team modified October 2001

Utilizing Linux Kernel Components in K42 K42 Team modified October 2001 K42 Team modified October 2001 This paper discusses how K42 uses Linux-kernel components to support a wide range of hardware, a full-featured TCP/IP stack and Linux file-systems. An examination of the

More information

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 21: Network Protocols (and 2 Phase Commit)

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 21: Network Protocols (and 2 Phase Commit) CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring 2003 Lecture 21: Network Protocols (and 2 Phase Commit) 21.0 Main Point Protocol: agreement between two parties as to

More information

CS610- Computer Network Solved Subjective From Midterm Papers

CS610- Computer Network Solved Subjective From Midterm Papers Solved Subjective From Midterm Papers May 08,2012 MC100401285 Moaaz.pk@gmail.com Mc100401285@gmail.com PSMD01 CS610- Computer Network Midterm Examination - Fall 2011 1. Where are destination and source

More information

4. Hardware Platform: Real-Time Requirements

4. Hardware Platform: Real-Time Requirements 4. Hardware Platform: Real-Time Requirements Contents: 4.1 Evolution of Microprocessor Architecture 4.2 Performance-Increasing Concepts 4.3 Influences on System Architecture 4.4 A Real-Time Hardware Architecture

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Chapter 4. Routers with Tiny Buffers: Experiments. 4.1 Testbed experiments Setup

Chapter 4. Routers with Tiny Buffers: Experiments. 4.1 Testbed experiments Setup Chapter 4 Routers with Tiny Buffers: Experiments This chapter describes two sets of experiments with tiny buffers in networks: one in a testbed and the other in a real network over the Internet2 1 backbone.

More information

Lecture 16. Today: Start looking into memory hierarchy Cache$! Yay!

Lecture 16. Today: Start looking into memory hierarchy Cache$! Yay! Lecture 16 Today: Start looking into memory hierarchy Cache$! Yay! Note: There are no slides labeled Lecture 15. Nothing omitted, just that the numbering got out of sequence somewhere along the way. 1

More information

ECE 650 Systems Programming & Engineering. Spring 2018

ECE 650 Systems Programming & Engineering. Spring 2018 ECE 650 Systems Programming & Engineering Spring 2018 Networking Transport Layer Tyler Bletsch Duke University Slides are adapted from Brian Rogers (Duke) TCP/IP Model 2 Transport Layer Problem solved:

More information

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu

More information

Virtual Memory Outline

Virtual Memory Outline Virtual Memory Outline Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory Other Considerations Operating-System Examples

More information

Chapter 9: Virtual-Memory

Chapter 9: Virtual-Memory Chapter 9: Virtual-Memory Management Chapter 9: Virtual-Memory Management Background Demand Paging Page Replacement Allocation of Frames Thrashing Other Considerations Silberschatz, Galvin and Gagne 2013

More information

Notes based on prof. Morris's lecture on scheduling (6.824, fall'02).

Notes based on prof. Morris's lecture on scheduling (6.824, fall'02). Scheduling Required reading: Eliminating receive livelock Notes based on prof. Morris's lecture on scheduling (6.824, fall'02). Overview What is scheduling? The OS policies and mechanisms to allocates

More information

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on

More information

4. Networks. in parallel computers. Advances in Computer Architecture

4. Networks. in parallel computers. Advances in Computer Architecture 4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors

More information

A Freely Congurable Audio-Mixing Engine. M. Rosenthal, M. Klebl, A. Gunzinger, G. Troster

A Freely Congurable Audio-Mixing Engine. M. Rosenthal, M. Klebl, A. Gunzinger, G. Troster A Freely Congurable Audio-Mixing Engine with Automatic Loadbalancing M. Rosenthal, M. Klebl, A. Gunzinger, G. Troster Electronics Laboratory, Swiss Federal Institute of Technology CH-8092 Zurich, Switzerland

More information

Performance Comparison Between AAL1, AAL2 and AAL5

Performance Comparison Between AAL1, AAL2 and AAL5 The University of Kansas Technical Report Performance Comparison Between AAL1, AAL2 and AAL5 Raghushankar R. Vatte and David W. Petr ITTC-FY1998-TR-13110-03 March 1998 Project Sponsor: Sprint Corporation

More information

BlueGene/L. Computer Science, University of Warwick. Source: IBM

BlueGene/L. Computer Science, University of Warwick. Source: IBM BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours

More information

1/5/2012. Overview of Interconnects. Presentation Outline. Myrinet and Quadrics. Interconnects. Switch-Based Interconnects

1/5/2012. Overview of Interconnects. Presentation Outline. Myrinet and Quadrics. Interconnects. Switch-Based Interconnects Overview of Interconnects Myrinet and Quadrics Leading Modern Interconnects Presentation Outline General Concepts of Interconnects Myrinet Latest Products Quadrics Latest Release Our Research Interconnects

More information

Operating Systems Memory Management. Mathieu Delalandre University of Tours, Tours city, France

Operating Systems Memory Management. Mathieu Delalandre University of Tours, Tours city, France Operating Systems Memory Management Mathieu Delalandre University of Tours, Tours city, France mathieu.delalandre@univ-tours.fr 1 Operating Systems Memory Management 1. Introduction 2. Contiguous memory

More information

LAPI on HPS Evaluating Federation

LAPI on HPS Evaluating Federation LAPI on HPS Evaluating Federation Adrian Jackson August 23, 2004 Abstract LAPI is an IBM-specific communication library that performs single-sided operation. This library was well profiled on Phase 1 of

More information

Group Management Schemes for Implementing MPI Collective Communication over IP Multicast

Group Management Schemes for Implementing MPI Collective Communication over IP Multicast Group Management Schemes for Implementing MPI Collective Communication over IP Multicast Xin Yuan Scott Daniels Ahmad Faraj Amit Karwande Department of Computer Science, Florida State University, Tallahassee,

More information

RICE UNIVERSITY. High Performance MPI Libraries for Ethernet. Supratik Majumder

RICE UNIVERSITY. High Performance MPI Libraries for Ethernet. Supratik Majumder RICE UNIVERSITY High Performance MPI Libraries for Ethernet by Supratik Majumder A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree MASTER OF SCIENCE Approved, Thesis Committee:

More information

HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS

HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS CS6410 Moontae Lee (Nov 20, 2014) Part 1 Overview 00 Background User-level Networking (U-Net) Remote Direct Memory Access

More information

Lecture 1: January 23

Lecture 1: January 23 CMPSCI 677 Distributed and Operating Systems Spring 2019 Lecture 1: January 23 Lecturer: Prashant Shenoy Scribe: Jonathan Westin (2019), Bin Wang (2018) 1.1 Introduction to the course The lecture started

More information

FB(9,3) Figure 1(a). A 4-by-4 Benes network. Figure 1(b). An FB(4, 2) network. Figure 2. An FB(27, 3) network

FB(9,3) Figure 1(a). A 4-by-4 Benes network. Figure 1(b). An FB(4, 2) network. Figure 2. An FB(27, 3) network Congestion-free Routing of Streaming Multimedia Content in BMIN-based Parallel Systems Harish Sethu Department of Electrical and Computer Engineering Drexel University Philadelphia, PA 19104, USA sethu@ece.drexel.edu

More information

CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters

CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters CC MPI: A Compiled Communication Capable MPI Prototype for Ethernet Switched Clusters Amit Karwande, Xin Yuan Dept. of Computer Science Florida State University Tallahassee, FL 32306 {karwande,xyuan}@cs.fsu.edu

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Spring 2018 L17 Main Memory Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 FAQ Was Great Dijkstra a magician?

More information

Introduction to Input and Output

Introduction to Input and Output Introduction to Input and Output The I/O subsystem provides the mechanism for communication between the CPU and the outside world (I/O devices). Design factors: I/O device characteristics (input, output,

More information

Chapter 9: Virtual Memory

Chapter 9: Virtual Memory Chapter 9: Virtual Memory Silberschatz, Galvin and Gagne 2013 Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

I/O Buffering and Streaming

I/O Buffering and Streaming I/O Buffering and Streaming I/O Buffering and Caching I/O accesses are reads or writes (e.g., to files) Application access is arbitary (offset, len) Convert accesses to read/write of fixed-size blocks

More information

DISTRIBUTED EMBEDDED ARCHITECTURES

DISTRIBUTED EMBEDDED ARCHITECTURES DISTRIBUTED EMBEDDED ARCHITECTURES A distributed embedded system can be organized in many different ways, but its basic units are the Processing Elements (PE) and the network as illustrated in Figure.

More information

440GX Application Note

440GX Application Note Overview of TCP/IP Acceleration Hardware January 22, 2008 Introduction Modern interconnect technology offers Gigabit/second (Gb/s) speed that has shifted the bottleneck in communication from the physical

More information

Process 0 Process 1 MPI_Barrier MPI_Isend. MPI_Barrier. MPI_Recv. MPI_Wait. MPI_Isend message. header. MPI_Recv. buffer. message.

Process 0 Process 1 MPI_Barrier MPI_Isend. MPI_Barrier. MPI_Recv. MPI_Wait. MPI_Isend message. header. MPI_Recv. buffer. message. Where's the Overlap? An Analysis of Popular MPI Implementations J.B. White III and S.W. Bova Abstract The MPI 1:1 denition includes routines for nonblocking point-to-point communication that are intended

More information

TR-CS The rsync algorithm. Andrew Tridgell and Paul Mackerras. June 1996

TR-CS The rsync algorithm. Andrew Tridgell and Paul Mackerras. June 1996 TR-CS-96-05 The rsync algorithm Andrew Tridgell and Paul Mackerras June 1996 Joint Computer Science Technical Report Series Department of Computer Science Faculty of Engineering and Information Technology

More information

Multiprocessor scheduling

Multiprocessor scheduling Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.

More information

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [NETWORKING] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Why not spawn processes

More information

Lecture 2: September 9

Lecture 2: September 9 CMPSCI 377 Operating Systems Fall 2010 Lecture 2: September 9 Lecturer: Prashant Shenoy TA: Antony Partensky & Tim Wood 2.1 OS & Computer Architecture The operating system is the interface between a user

More information

req unit unit unit ack unit unit ack

req unit unit unit ack unit unit ack The Design and Implementation of ZCRP Zero Copying Reliable Protocol Mikkel Christiansen Jesper Langfeldt Hagen Brian Nielsen Arne Skou Kristian Qvistgaard Skov August 24, 1998 1 Design 1.1 Service specication

More information

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks Ryan G. Lane Daniels Scott Xin Yuan Department of Computer Science Florida State University Tallahassee, FL 32306 {ryanlane,sdaniels,xyuan}@cs.fsu.edu

More information

Novel Intelligent I/O Architecture Eliminating the Bus Bottleneck

Novel Intelligent I/O Architecture Eliminating the Bus Bottleneck Novel Intelligent I/O Architecture Eliminating the Bus Bottleneck Volker Lindenstruth; lindenstruth@computer.org The continued increase in Internet throughput and the emergence of broadband access networks

More information

CSE398: Network Systems Design

CSE398: Network Systems Design CSE398: Network Systems Design Instructor: Dr. Liang Cheng Department of Computer Science and Engineering P.C. Rossin College of Engineering & Applied Science Lehigh University February 23, 2005 Outline

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.

More information

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide

More information

1 Introduction Myrinet grew from the results of two ARPA-sponsored projects. Caltech's Mosaic and the USC Information Sciences Institute (USC/ISI) ATO

1 Introduction Myrinet grew from the results of two ARPA-sponsored projects. Caltech's Mosaic and the USC Information Sciences Institute (USC/ISI) ATO An Overview of Myrinet Ralph Zajac Rochester Institute of Technology Dept. of Computer Engineering EECC 756 Multiple Processor Systems Dr. M. Shaaban 5/18/99 Abstract The connections between the processing

More information

Lecture 28: Networks & Interconnect Architectural Issues Professor Randy H. Katz Computer Science 252 Spring 1996

Lecture 28: Networks & Interconnect Architectural Issues Professor Randy H. Katz Computer Science 252 Spring 1996 Lecture 28: Networks & Interconnect Architectural Issues Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Review: ABCs of Networks Starting Point: Send bits between 2 computers Queue

More information

MPI and comparison of models Lecture 23, cs262a. Ion Stoica & Ali Ghodsi UC Berkeley April 16, 2018

MPI and comparison of models Lecture 23, cs262a. Ion Stoica & Ali Ghodsi UC Berkeley April 16, 2018 MPI and comparison of models Lecture 23, cs262a Ion Stoica & Ali Ghodsi UC Berkeley April 16, 2018 MPI MPI - Message Passing Interface Library standard defined by a committee of vendors, implementers,

More information

An Analysis of Object Orientated Methodologies in a Parallel Computing Environment

An Analysis of Object Orientated Methodologies in a Parallel Computing Environment An Analysis of Object Orientated Methodologies in a Parallel Computing Environment Travis Frisinger Computer Science Department University of Wisconsin-Eau Claire Eau Claire, WI 54702 frisintm@uwec.edu

More information

Xinu on the Transputer

Xinu on the Transputer Purdue University Purdue e-pubs Department of Computer Science Technical Reports Department of Computer Science 1990 Xinu on the Transputer Douglas E. Comer Purdue University, comer@cs.purdue.edu Victor

More information

Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies. Admin

Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies. Admin Networks: Routing, Deadlock, Flow Control, Switch Design, Case Studies Alvin R. Lebeck CPS 220 Admin Homework #5 Due Dec 3 Projects Final (yes it will be cumulative) CPS 220 2 1 Review: Terms Network characterized

More information

Accelerated Library Framework for Hybrid-x86

Accelerated Library Framework for Hybrid-x86 Software Development Kit for Multicore Acceleration Version 3.0 Accelerated Library Framework for Hybrid-x86 Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8406-00 Software Development Kit

More information

Cache introduction. April 16, Howard Huang 1

Cache introduction. April 16, Howard Huang 1 Cache introduction We ve already seen how to make a fast processor. How can we supply the CPU with enough data to keep it busy? The rest of CS232 focuses on memory and input/output issues, which are frequently

More information

Hardware Implementation of GA.

Hardware Implementation of GA. Chapter 6 Hardware Implementation of GA Matti Tommiska and Jarkko Vuori Helsinki University of Technology Otakaari 5A, FIN-02150 ESPOO, Finland E-mail: Matti.Tommiska@hut.fi, Jarkko.Vuori@hut.fi Abstract.

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information

Shared Memory vs. Message Passing: the COMOPS Benchmark Experiment

Shared Memory vs. Message Passing: the COMOPS Benchmark Experiment Shared Memory vs. Message Passing: the COMOPS Benchmark Experiment Yong Luo Scientific Computing Group CIC-19 Los Alamos National Laboratory Los Alamos, NM 87545, U.S.A. Email: yongl@lanl.gov, Fax: (505)

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected

More information

InfiniBand SDR, DDR, and QDR Technology Guide

InfiniBand SDR, DDR, and QDR Technology Guide White Paper InfiniBand SDR, DDR, and QDR Technology Guide The InfiniBand standard supports single, double, and quadruple data rate that enables an InfiniBand link to transmit more data. This paper discusses

More information

DESIGN AND IMPLEMENTATION OF AN AVIONICS FULL DUPLEX ETHERNET (A664) DATA ACQUISITION SYSTEM

DESIGN AND IMPLEMENTATION OF AN AVIONICS FULL DUPLEX ETHERNET (A664) DATA ACQUISITION SYSTEM DESIGN AND IMPLEMENTATION OF AN AVIONICS FULL DUPLEX ETHERNET (A664) DATA ACQUISITION SYSTEM Alberto Perez, Technical Manager, Test & Integration John Hildin, Director of Network s John Roach, Vice President

More information

RTI Performance on Shared Memory and Message Passing Architectures

RTI Performance on Shared Memory and Message Passing Architectures RTI Performance on Shared Memory and Message Passing Architectures Steve L. Ferenci Richard Fujimoto, PhD College Of Computing Georgia Institute of Technology Atlanta, GA 3332-28 {ferenci,fujimoto}@cc.gatech.edu

More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

The Avalanche Myrinet Simulation Package. University of Utah, Salt Lake City, UT Abstract

The Avalanche Myrinet Simulation Package. University of Utah, Salt Lake City, UT Abstract The Avalanche Myrinet Simulation Package User Manual for V. Chen-Chi Kuo, John B. Carter fchenchi, retracg@cs.utah.edu WWW: http://www.cs.utah.edu/projects/avalanche UUCS-96- Department of Computer Science

More information