Low Latency MPI for Meiko CS/2 and ATM Clusters

Chris R. Jones, Ambuj K. Singh, Divyakant Agrawal
Department of Computer Science
University of California, Santa Barbara
Santa Barbara, CA

This research was supported by the NSF under CCR, CDA, and CCR. Work supported by LANL under UC94-B-A-223 and by a research gift from NEC Japan.

Abstract

MPI (Message Passing Interface) is a proposed message passing standard for the development of efficient and portable parallel programs. An implementation of MPI is presented and evaluated for the Meiko CS/2, a 64-node parallel computer, and a network of 8 SGI workstations connected by an ATM switch and Ethernet.

1. Introduction

A major hurdle of the message passing paradigm has been the lack of a standard library supporting features such as point-to-point and collective communication, tagged message delivery, and synchronization primitives. This motivated a consortium of researchers and practitioners in the parallel computing arena to develop a standard for message passing, the Message Passing Interface (MPI) [12]. Rather than adopt one of the existing message passing libraries, they chose to create a standard with all of the above features by integrating the features provided by existing message passing libraries such as PVM [8], NX/2 [9], p4 [6] and PARMACS [7].

Implementations of high-level message passing libraries such as MPI are often significantly less efficient than lower-level libraries because of their support of high-level programming features and their failure to exploit specific architectural details. In implementing message passing libraries, implementors often have to trade off between low latency for small messages and high bandwidth for large ones. This paper examines the necessary overheads in implementing the MPI library on top of existing user-level libraries for two platforms: a 64-node Meiko CS-2, and a cluster of Silicon Graphics workstations connected by both Ethernet and an ATM (Asynchronous Transfer Mode) switch. The focus of the research is on reducing latencies in a tagged message passing model of MPI.

In MPI, a sending process issues a send operation, which eventually transfers data to a process that issues a matching receive operation. Each process might issue several sends or several receives, so the processes are responsible for matching corresponding send and receive operations. A message communication in MPI therefore involves each send operation sending its envelope to the receiver, the matching of these envelopes to receive tags, and the eventual transfer of data from the sender's buffer to the receiver's buffer. This paper examines methods for the sending of envelopes, the matching of sends and receives, and the sending of data. To minimize latency, a mechanism is proposed which overlaps the transfer of data and send envelopes, buffering data temporarily when necessary at the receiver. However, for large amounts of data, the temporary buffering becomes a major overhead and can limit bandwidth. So, we propose a hybrid implementation in which, above a certain message size threshold, the sending and matching of envelopes occurs first, and then a DMA (direct memory access) from the sender to the receiver is initiated without intermediate buffering.

2. Related Work

MPICH [2] is an implementation of MPI designed for portability that provides a well-defined device layer allowing for easy implementation on new architectures.
It has been implemented for a variety of architectures through the widely portable p4 [6] communications layer, and also includes specialized device layers for many parallel machines, including the Meiko CS-2. xmpi [15] extends MPICH by providing a device driver on top of the x-kernel, a framework for implementing network protocols in the kernel more efficiently than existing frameworks such as STREAMS. As the x-kernel can be used to implement kernel-level network protocols, it could provide more efficient overall implementations than other kernel-level protocols, such as those described in this paper. The authors propose an MPI implementation over ATM AAL5.

The Meiko CS-2 has specialized hardware for providing secure user-level communication using a high speed network [13]. The Meiko implementation of MPI included with the MPICH distribution is based on a tagged message passing widget provided by Meiko, called the tport widget. The tport widget, however, trades off latency in providing high bandwidth communication, and provides no support for collective communications, requiring that MPICH implement collective communications on top of point-to-point mechanisms.

LAM [5] is a parallel software environment that implements MPI on top of its own existing features. Those MPI features which have no close match are independently implemented. Like MPI, it provides a rich environment for parallel programming. The MPI extension allows users to use LAM features as well.

Efficient implementation of the MPI collective communications library on top of Ethernet is examined by Bruck, Dolev, Ho, Rosu and Strong [3]. A user-level reliable transport protocol is given which uses the broadcast nature of the Ethernet for efficiency. Like the broadcast mechanism discussed in this paper, the exploitation of hardware broadcast gives a more efficient implementation than would be possible using only point-to-point communication.

MPI's message delivery guarantees can be unrealizable due to limited resources for message envelopes. Burns and Daoud [4] discuss tactics for overflow detection and reporting in MPI implementations.

Basu, Buch, Vogels and von Eicken [1], as well as Thekkath, Nguyen, Moy and Lazowska [17], discuss the inefficiency of implementing networking protocols in the kernel. They discuss methods for moving parts of the networking protocol into the user level, leaving only the necessary security mechanisms in the kernel. Both papers give implementations of user-level networking protocols for ATM, showing significant improvements over kernel-level implementations. An implementation of a minimal-latency DMA mechanism for ATM is given by Thekkath, Levy and Lazowska [16]. The DMA mechanism is the heart of communication for the Meiko, so a DMA mechanism such as this could be used in conjunction with the Meiko implementation discussed in this paper for a high performance ATM implementation.

3. MPI Standard

MPI [12] is a communication standard to support the development of portable parallel programs. The formal MPI specifications define the following primitives: point-to-point communication, collective communication, process group management and virtual topology management, environmental management, and a profiling interface. In this paper, we restrict our attention primarily to the point-to-point communication primitives:

    MPI_Send(buffer, count, datatype, dest, tag, communicator)
    MPI_Recv(buffer, count, datatype, source, tag, communicator, status)

There are several variants of the MPI_Send call: a buffered mode, synchronous mode, and ready mode, each of which has its own blocking and nonblocking variant. The buffered send, MPI_Bsend, ensures that the completion of the send operation does not depend on the posting of the receive operation. Once the buffered send has been posted, it will complete regardless of the activities of the receiver. The user is responsible for providing the buffer space for the buffered send. As long as this buffer space is monitored to make sure space is available, the send is guaranteed to be buffered correctly.
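
As a concrete illustration of these primitives, the following minimal C program (a generic MPI usage example, not code from the implementation described here) sends four doubles from rank 0 to rank 1 with a standard-mode send and a matching receive:

    /* Minimal MPI point-to-point example (illustrative only). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double buf[4] = {1.0, 2.0, 3.0, 4.0};
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* tag 99 must match the tag given by the receiver (or MPI_ANY_TAG) */
            MPI_Send(buf, 4, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, 4, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, &status);
            printf("rank 1 received %f ... %f\n", buf[0], buf[3]);
        }

        MPI_Finalize();
        return 0;
    }
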
The synchronous mode send, MPI_Ssend, adds the requirement that the sender must wait until the receiver has posted the matching receive, and the send and receive have completed, with the sender's data having been successfully transferred to the receiver's buffer. This allows the sender to know the exact point at which the data transfer occurred. The ready mode send, MPI_Rsend, allows the programmer to give more information to MPI, by informing it that the sender knows that the receiver has already posted a receive.

As far as receive operations go, there are no special modes; all receives match any type of send. However, there is an operation similar to MPI_Recv, MPI_Probe, which polls for incoming messages without actually completing the data transfer. A nonblocking version of this operation, MPI_Iprobe, checks whether a message is available without blocking. The MPI_Bcast call supplies a broadcast mechanism, used when a single process wishes to send data to all other processes in a group.

We have implemented MPI point-to-point and the broadcast collective communication primitive on the Meiko and the ATM cluster. The implementation of broadcast on the Meiko uses the underlying hardware broadcast mechanism, whereas on the ATM network it uses a succession of point-to-point messages.
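
As an illustration of the broadcast primitive, the following generic MPI fragment (again, not code from our implementation) has the root distribute a single integer to every process in the communicator:

    /* Illustrative use of MPI_Bcast: the root distributes a parameter. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, n = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            n = 1024;             /* value known only to the root initially */

        /* every process names the same root (0); afterwards all ranks see n == 1024 */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d has n = %d\n", rank, n);
        MPI_Finalize();
        return 0;
    }
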

4. Implementation of MPI on Meiko

4.1. Design Motivations

As discussed above, MPI point-to-point communication involves transmitting a send envelope, matching a send with a receive, and transferring data from the sender's buffer to the receiver's buffer. There are several choices as to how and when these operations are performed. We discuss some of these options next.

The ability of a receive operation to specify MPI_ANY_SOURCE requires that the matching be done at the receiver. Waiting to transfer the data until a match is made results in high latencies, especially for small messages. This is because the send's envelope must first be sent to the receiver, a match must be made at the receiver, a second trip across the network is necessary to request the data from the sender, and finally another trip is necessary for the sender to actually send the DMA data. If the extra latency is avoided by initiating a transfer simultaneously with a match operation, then temporary buffers will be needed, thus increasing the space requirements. We use a hybrid approach in which the transfer of small messages is overlapped with the matching operation, and the transfer of large messages occurs after the match operation. The crossover point is determined by examining the various latencies.

As far as the matching operation at the receiver goes, we still need to decide whether to use the SPARC (the main 40 MHz processor on a Meiko node) or the Elan (10 MHz communications co-processor) to perform the matching. If we are to match sends and receives only with the SPARC processor, data transfers will not complete until the application program issues an MPI_Wait or MPI_Test operation on a nonblocking operation. For instance, if the receiver has issued a nonblocking operation, and the sender issues a synchronous send, the sender will have to wait for the receiver to issue a completion operation on its receive. To handle this problem, we could utilize the Elan co-processor for matching of sends and receives in the background. However, the slower Elan may not be able to handle the somewhat intensive message matching as quickly as the faster SPARC, so latency could increase. Additionally, since the SPARC must eventually be informed of the completion of a receive, the extra synchronization between the Elan and the SPARC can also increase latency. The existing ANL/MSU MPICH implementation [2] on top of the tport widget uses the Elan for matching in the background. For the sake of comparison, we implemented our matching of sends and receives on the SPARC.

Regardless of matching at the receiver, we would also like the sending processor to be able to issue nonblocking sends very quickly, freeing the processor for other responsibilities. So, we use the Elan to perform this sending in the background, allowing the SPARC to perform other computation. The actual transfer of data from the sender requires allocation of memory at the receiver. The expense of these mechanisms has been researched in detail in an implementation of Active Messages for the Meiko CS-2 [14]. In order to minimize latency, we allocate space for a single send envelope for each sending processor at each receiver.

Figure 1. Meiko transfer mechanisms
Figure 2. Meiko round-trip latency

4.2. Meiko results

As mentioned before, data can be optimistically transferred before the match to a buffer at the receiver, or after the match is made directly to the receiver's address space. The comparison of these two mechanisms is shown in detail in Figure 1. The intersection of the two lines occurs at a message length of 180 bytes. Based on this we choose 180 bytes as the crossover point between the two implementations.
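
The resulting send path can be sketched schematically as follows; the type and function names are illustrative placeholders, not the actual routines of our implementation:

    /* Schematic sketch of the hybrid send path described above. */
    #define EAGER_THRESHOLD 180   /* measured crossover point on the Meiko, in bytes */

    typedef struct { int source, tag, context, nbytes; } envelope_t;            /* hypothetical */
    void send_eager(int dest, const envelope_t *env, const void *buf, int n);      /* assumed */
    void send_rendezvous(int dest, const envelope_t *env, const void *buf, int n); /* assumed */

    void internal_send(int dest, const envelope_t *env, const void *buf, int nbytes)
    {
        if (nbytes <= EAGER_THRESHOLD) {
            /* Small message: ship envelope and data together.  If the matching
             * receive has not been posted, the receiver buffers the data
             * temporarily, so no extra network round trip is required. */
            send_eager(dest, env, buf, nbytes);
        } else {
            /* Large message: send only the envelope.  Once the receiver has
             * matched it against a posted receive, the data is moved by DMA
             * directly into the user's buffer, with no intermediate copy. */
            send_rendezvous(dest, env, buf, nbytes);
        }
    }
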
The mechanism for small transfers allows much lower latencies than the tport mechanism for certain applications. Figure 2 shows the round-trip latency times for varying transfer sizes. Three plots are shown: our MPI implementation where matching is done entirely on the SPARC, the MPICH tport-based MPI implementation which does matching mostly on the Elan, and a plot of Meiko's tport widget without any MPI overheads. As Meiko's tport widget provides simplified tagged message passing directly, it is the lowest latency mechanism, with a 1-byte round-trip latency of 52 µs. The two MPI implementations add significant overheads to this minimal cost by adding the MPI features of communicators, datatypes and different modes of communication.

As MPICH implements MPI directly on top of the tport widget, it can be seen that MPICH adds 158 µs to the 1-byte round-trip latency. Our lower latency SPARC matching implementation decreases these overheads by providing an implementation directly over Meiko DMAs and transactions, decreasing latency significantly with a total round-trip time of 104 µs, about 52 µs higher than the tport widget. The bend in the curve for the SPARC matching MPI implementation, visible around 180 bytes, clearly shows where the high bandwidth mechanism takes over and DMAs are invoked by the receiver after the matching of the send and receive has taken place.

To demonstrate that this implementation still maintains high bandwidth for large transfers, Figure 3 plots bandwidth versus message size for larger messages with the two MPI implementations and the Meiko tport. As seen in the figure, the best possible DMA-provided bandwidth of 39 MBytes/s is nearly reached, and bandwidth is in fact increased as a result of decreasing latency for the SPARC matching implementation.

Figure 3. Meiko bandwidth

5. ATM Implementation

The ATM cluster examined here consists of eight SGI Indy 133-MHz workstations and an SGI Challenge SMP 150-MHz dual-processor. All of these machines have 64 MBytes of RAM, and are connected via a 10 Mbit/s Ethernet and 155 Mbit/s ATM channels. The ATM switch is a Fore Systems ForeRunner ASX-200 with eight 155 Mbit/s ports. Each SGI is connected to the switch by a Fore GIA-200 interface card. These interface cards include an Intel i960 processor dedicated to segmentation and reassembly for the AAL3/4 and AAL5 protocols without using the main processor.

Four different user-level protocols are available to run in this environment. TCP/IP is available with two different signaling protocols to establish the connection. The first, Classical IP, is the standard signaling protocol defined by the ATM Forum. The second, SPANS, is Fore's own signaling protocol. The differences in these protocols only affect connection establishment. In the implementations we will examine, connections are static, so connection setup time is not of major importance. Hence, this paper only examines Classical IP. Fore also offers an API which provides communication on top of ATM adaptation layers 3 and 4 (which are treated identically), and AAL5.

As our goal is a low latency implementation of MPI, it is worth considering the latency of a packet over these different protocols. We would expect that the Fore adaptation layers might be significantly faster, since they have very few overheads. Unfortunately, they are not significantly faster than the Fore TCP or UDP implementations. This is because of the overheads involved in the STREAMS protocol layers used by the Fore API. Figure 4 shows a comparison of the latencies for Fore's implementation of AAL4, TCP and UDP. Except for small message sizes, the latencies of these protocols are indistinguishable from each other. This prompted us to confine our attention only to TCP/IP and UDP as the underlying communication protocols. Our measurements were made using TCP/IP and UDP on an Ethernet as well as the described ATM network.

Figure 4. ATM round-trip latency
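
Round-trip latencies of the kind reported in Figures 2, 4 and 5 are typically measured with a simple ping-pong loop between two processes; the following sketch is a generic example of such a microbenchmark, not our measurement code:

    /* Illustrative ping-pong round-trip latency microbenchmark. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        enum { REPS = 1000, NBYTES = 1 };
        char buf[NBYTES];
        int rank, i;
        double t0, t1;
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf[0] = 0;

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {
                MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("average round-trip time: %f us\n", (t1 - t0) / REPS * 1e6);

        MPI_Finalize();
        return 0;
    }
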
5.1. TCP Implementation

The facilities for communication provided by TCP/IP and the Meiko are quite different. While the Meiko provides mechanisms for manipulation of remote data through DMAs or remote transactions, TCP/IP provides a much different communication mechanism: a reliable stream of data between two processes on opposite ends of a channel. However, many of the issues we confronted during the implementation of MPI for the Meiko are common to both, such as the need to match MPI_Send operations to MPI_Recv operations. In order to reuse the components of our MPI implementation on the Meiko, we decided to implement the underlying primitives (assumed in the Meiko implementation) on top of TCP.
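
One hypothetical way to express this portability layer is as a small set of C prototypes, sketched below; the names and signatures are illustrative only, and the primitives they stand for are described next:

    /* Hypothetical portability layer assumed by the Meiko implementation and
     * re-implemented here over TCP sockets.  Names and signatures are
     * illustrative, not the actual interface. */
    typedef struct mpi_envelope mpi_envelope_t;   /* tag, communicator, source, size */

    int  send_envelope(int dest, const mpi_envelope_t *env);
    int  send_envelope_with_data(int dest, const mpi_envelope_t *env,
                                 const void *data, int nbytes);   /* piggybacked payload */
    int  send_dma_data(int dest, const void *data, int nbytes);   /* bulk transfer after match */
    void set_remote_event(int dest, int event_id);                /* signal completion remotely */
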

Essentially, we needed to implement a method of sending an envelope, a method of sending an envelope with piggybacked data, and a method for setting remote events and sending DMA data. As the communication latencies are quite large when TCP is used, piggybacking data is more important than in the Meiko implementation. As such, the buffer flow control mechanism used in the Meiko implementation is inappropriate, since it assumes only a single outstanding message at any given time. A window protocol allowing multiple outstanding messages would be ideal. Unfortunately, since not all MPI messages can be ordered on the same FIFO because of tags and communicators, standard window protocols are inappropriate. So, we have the receiver keep a reserved amount of memory for each sender, to which the sender sends data optimistically. Once the space is freed, the receiver informs the sender that it can be reused. This allows the sender to optimistically send many messages as long as it knows that free space is available at the receiver.
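
The sketch below illustrates one way such a credit-based envelope send might look in C; the header mirrors the 25 bytes of control information discussed with the results below, but the layout, field names and helper function are assumptions for illustration, not the actual wire format:

    /* Sketch of the per-sender credit scheme described above. */
    #include <stddef.h>
    #include <stdint.h>

    enum { MSG_ENVELOPE = 1, MSG_DMA_REQUEST = 2, MSG_DMA_DATA = 3 };  /* hypothetical codes */

    struct ctrl_header {            /* roughly the 25 bytes of MPI protocol information */
        uint8_t  type;              /* 1 byte: envelope, DMA request, ... */
        uint32_t credit_returned;   /* 4 bytes: receiver buffer space freed back to us */
        uint8_t  envelope[20];      /* 20 bytes: tag, communicator, source, DMA request info */
    };

    int write_fully(int fd, const void *buf, size_t n);   /* assumed helper: loops over write(2) */

    /* Send an envelope with an optional piggybacked payload.  'credit' tracks how
     * much of the receiver's reserved buffer space is still free; it is charged
     * here and replenished whenever the receiver reports freed space in the
     * credit_returned field of a message travelling the other way. */
    int send_envelope(int sock, struct ctrl_header *hdr,
                      const void *data, size_t nbytes, size_t *credit)
    {
        if (nbytes > *credit)
            return -1;                           /* out of credit: caller must wait */
        if (write_fully(sock, hdr, sizeof *hdr) != 0 ||
            write_fully(sock, data, nbytes) != 0)
            return -1;
        *credit -= nbytes;                       /* optimistic send consumed receiver space */
        return 0;
    }
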

5.2. TCP Results

Figure 5 displays the round-trip latency times for messages with both TCP and MPI over TCP. Four plots are shown: two for TCP and MPI over Ethernet, and two for ATM. There is approximately a 150 µs higher round-trip latency for the MPI implementations over TCP. This overhead is caused by the additional transfer of envelopes and control information, and by the cost of performing message matching.

Figure 5. TCP round-trip latency

Table 1 displays the breakdown of these overheads for a 1 byte message. The largest overhead is the 925 µs cost of a round-trip message over Ethernet, and 1065 µs for ATM. The next line in the table is the cost of sending 25 bytes of MPI protocol information. (Of the 25 bytes, 1 byte designates the type of message, such as envelope or DMA. 4 bytes are included for telling the destination how much reserved space has been freed. The last 20 bytes are used for the envelope and DMA request information.) The amount of this overhead is 45 µs on Ethernet and about 5 µs on ATM. The next line measures the overhead of determining the incoming message type. This was measured to be 65 µs on Ethernet and slightly higher (85 µs) on ATM. The next line measures the overhead of receiving the actual envelope once the message type has been determined. Once more, the costs are 65 µs on Ethernet and 85 µs on ATM. These costs are so high because the associated operations need to cross the kernel boundary. The last 35 µs overhead is the cost of actually performing MPI matching. (The cost for receiving the actual data is already included in the latency figures of the first line.)

Table 1. MPI round-trip overheads with TCP

    Overhead                     ATM        Ethernet
    1 byte round-trip latency    1065 µs    925 µs
    25 byte info overhead        5 µs       45 µs
    Read for msg type            85 µs      65 µs
    Read for envelope            85 µs      65 µs
    Overheads for matching       35 µs      35 µs

Figure 6 displays the bandwidth obtained using TCP as the communications layer.

Figure 6. TCP bandwidth

For the sake of completeness, we also implemented MPI using the UDP transport level interface. The UDP implementation is very similar to the one with TCP, with additional measures taken to make the UDP communication reliable. As a consequence of the overhead that arises to make the UDP connection reliable, our results indicated that the performance of the UDP implementation was very similar to that of TCP [10].

6. Applications

6.1. Linear Equation Solver

A linear equation solver for N variables has been implemented which solves the system with an initial phase of computation by the initiator, N phases of broadcasting and computation by all processes, and a final phase of result gathering by the initiator. As the only communication mechanism involved here is the broadcast, the MPI-based program uses the collective communication primitives, implemented using Meiko's hardware broadcast mechanism. The results for the Meiko are shown in Figure 7, which gives times for solving a linear system using from 1 to 32 processes. Two plots are shown: one for the MPICH implementation, which implements broadcast using point-to-point messages, and one for our implementation. We also implemented matrix multiplication; the performance results are similar to those of the linear equation solver.

Figure 7. Meiko Linear Equation Solver

6.2. Pairwise Interactions

Another problem which exploits the use of a parallel computer well is molecular dynamics, where we need to compute the interacting forces within a group of particles. As each particle interacts with every other particle in the group, O(n^2) interactions must be calculated. To parallelize this problem, each processor is in charge of calculating the interactions of P/N particles, where N is the number of processors. The processes communicate in P-1 phases, passing a partition of the particles around in a ring. Each process calculates the interactions between the particles it is permanently assigned and the partition of particles that it has in the current round. Messages are simply passed in a ring, requiring only point-to-point messages. To allow concurrent sending and receiving in the communication phase of each round, a nonblocking send is posted to the next processor in the ring, then a blocking receive is performed, followed by a wait operation to complete the send.

Figure 8 shows the results of running this program to find the forces acting on each of 24 particles, using up to 8 processes of the Meiko. As each processor has a fairly even load, the processes tend to interact at nearly the same time, so a lower latency communication mechanism is beneficial. The high latencies of TCP on the cluster of workstations make this problem scale well only for much larger problem sets. Figure 9 shows the results of finding the forces acting on 128 particles. The ATM shows a clear performance gain, primarily because there is no network contention and fairly large messages are used, exploiting ATM's higher bandwidth.

Figure 8. Meiko Particle Pairwise Interactions
Figure 9. TCP Particle Pairwise Interactions
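
The ring communication phase described above can be sketched with nonblocking MPI calls as follows; buffer management and the interaction computation are assumed helpers, not code from our application:

    /* Each process posts a nonblocking send of the partition it currently holds
     * to the next process in the ring, blocks on a receive from the previous
     * process, then completes the send with a wait. */
    #include <mpi.h>

    void compute_interactions(const double *mine, const double *other, int n);  /* assumed */

    void pairwise_phases(double *send_part, double *recv_part, int part_len,
                         const double *my_particles, int rank, int nprocs)
    {
        int next = (rank + 1) % nprocs;
        int prev = (rank + nprocs - 1) % nprocs;
        int phase;

        for (phase = 0; phase < nprocs - 1; phase++) {
            MPI_Request req;
            MPI_Status  status;

            /* post the send first so sending and receiving proceed concurrently */
            MPI_Isend(send_part, part_len, MPI_DOUBLE, next, phase,
                      MPI_COMM_WORLD, &req);
            MPI_Recv(recv_part, part_len, MPI_DOUBLE, prev, phase,
                     MPI_COMM_WORLD, &status);
            MPI_Wait(&req, &status);

            compute_interactions(my_particles, recv_part, part_len);

            /* the partition just received is the one forwarded in the next phase */
            { double *tmp = send_part; send_part = recv_part; recv_part = tmp; }
        }
    }
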

7. Conclusions

The MPI standard is intended to provide a widely portable message passing interface. Its semantics leave many efficiency issues to the implementor. We have shown three elements that can affect performance on different architectures: the use of a communications co-processor for background processing, the buffering of data at the receiver, and the overlapping of data transfer with message matching. A mechanism was proposed that takes these issues into account, giving efficient implementations on the Meiko CS-2 and on a cluster of workstations communicating with TCP or UDP over Ethernet or ATM.

References

[1] Basu, A., Buch, V., Vogels, W., and von Eicken, T., U-Net: A User-Level Network Interface for Parallel and Distributed Computing. Proceedings of the 15th ACM Symposium on Operating Systems Principles, December.
[2] Bridges, P., Doss, N., Gropp, W., Karrels, E., Lusk, E., and Skjellum, A., Users' Guide to MPICH: A Portable Implementation of MPI. Argonne National Laboratory, May.
[3] Bruck, J., Dolev, D., Ho, C., Rosu, M., and Strong, R., Efficient Message Passing Interface (MPI) for Parallel Computing on Clusters of Workstations. ACM Symposium on Parallel Algorithms and Architectures, July.
[4] Burns, G. and Daoud, R., Robust MPI Message Delivery with Guaranteed Resources. Proceedings of the MPI Developers Conference, June.
[5] Burns, G., Daoud, R., and Vaigl, J., LAM: An Open Cluster Environment for MPI. Ohio Supercomputing Center.
[6] Butler, R. and Lusk, E., User's Guide to the p4 Parallel Programming System. Technical Report ANL-92/17, Argonne National Laboratory.
[7] Calkin, R., Hempel, R., Hoppe, H., and Wypior, P., Portable Programming with the PARMACS Message-Passing Library. Parallel Computing, Special Issue on Message Passing Interfaces, 20 (April 1994).
[8] Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., and Sunderam, V., PVM 3.0 User's Guide and Reference Manual. Technical Report ORNL/TM, Oak Ridge National Laboratory, February 1993.
[9] iPSC/2 and iPSC/860 User's Guide. Intel Corporation, Order Number, April.
[10] Jones, Chris, Low Latency MPI for Meiko CS-2 and ATM Clusters. MS Thesis, Department of Computer Science, University of California at Santa Barbara, July.
[11] Lin, M., Du, D. H. C., Thomas, J. P., and McDonald, J. A., Distributed Network Computing over Local ATM Networks. IEEE Journal on Selected Areas in Communications, Vol. 13, No. 4, May.
[12] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Version 1.1, June.
[13] CS-2 Documentation Set. Meiko World Incorporated.
[14] Schauser, K. and Scheiman, C., Experience with Active Messages on the Meiko CS-2. 9th International Parallel Processing Symposium, Santa Barbara, CA, April.
[15] Singhai, A. and Campbell, R. H., xmpi: An Implementation over x-kernel for ATM Networks. Proceedings of the MPI Developers Conference, June.
[16] Thekkath, C. A., Levy, H. M., and Lazowska, E. D., Efficient Support for Multicomputing on ATM Networks. Technical Report TR, Department of Computer Science and Engineering, University of Washington, April.
[17] Thekkath, C. A., Nguyen, T. D., Moy, E., and Lazowska, E. D., Implementing Network Protocols at User Level. IEEE/ACM Transactions on Networking, Vol. 1, No. 5, October.


More information

Parallel Programming

Parallel Programming Parallel Programming for Multicore and Cluster Systems von Thomas Rauber, Gudula Rünger 1. Auflage Parallel Programming Rauber / Rünger schnell und portofrei erhältlich bei beck-shop.de DIE FACHBUCHHANDLUNG

More information

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Meenakshi Arunachalam Alok Choudhary Brad Rullman y ECE and CIS Link Hall Syracuse University Syracuse, NY 344 E-mail:

More information

The Design and Implementation of a MPI-Based Parallel File System

The Design and Implementation of a MPI-Based Parallel File System Proc. Natl. Sci. Counc. ROC(A) Vol. 23, No. 1, 1999. pp. 50-59 (Scientific Note) The Design and Implementation of a MPI-Based Parallel File System YUNG-YU TSAI, TE-CHING HSIEH, GUO-HUA LEE, AND MING-FENG

More information

Switch Configuration message sent 1 (1, 0, 1) 2

Switch Configuration message sent 1 (1, 0, 1) 2 UNIVESITY COLLEGE LONON EPATMENT OF COMPUTE SCIENCE COMP00: Networked Systems Problem Set istributed: nd November 08 NOT ASSESSE, model answers released: 9th November 08 Instructions: This problem set

More information

Networking interview questions

Networking interview questions Networking interview questions What is LAN? LAN is a computer network that spans a relatively small area. Most LANs are confined to a single building or group of buildings. However, one LAN can be connected

More information

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on

More information

NoWait RPC: Extending ONC RPC to a fully compatible Message Passing System

NoWait RPC: Extending ONC RPC to a fully compatible Message Passing System NoWait RPC: Extending ONC RPC to a fully compatible Message Passing System Thomas Hopfner Franz Fischer Georg Färber Laboratory for Process Control and Real Time Systems Prof. Dr. Ing. Georg Färber Technische

More information

Architecture or Parallel Computers CSC / ECE 506

Architecture or Parallel Computers CSC / ECE 506 Architecture or Parallel Computers CSC / ECE 506 Summer 2006 Scalable Programming Models 6/19/2006 Dr Steve Hunter Back to Basics Parallel Architecture = Computer Architecture + Communication Architecture

More information

RICE UNIVERSITY. High Performance MPI Libraries for Ethernet. Supratik Majumder

RICE UNIVERSITY. High Performance MPI Libraries for Ethernet. Supratik Majumder RICE UNIVERSITY High Performance MPI Libraries for Ethernet by Supratik Majumder A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree MASTER OF SCIENCE Approved, Thesis Committee:

More information

Understanding MPI on Cray XC30

Understanding MPI on Cray XC30 Understanding MPI on Cray XC30 MPICH3 and Cray MPT Cray MPI uses MPICH3 distribution from Argonne Provides a good, robust and feature rich MPI Cray provides enhancements on top of this: low level communication

More information

Homework 1. Question 1 - Layering. CSCI 1680 Computer Networks Fonseca

Homework 1. Question 1 - Layering. CSCI 1680 Computer Networks Fonseca CSCI 1680 Computer Networks Fonseca Homework 1 Due: 27 September 2012, 4pm Question 1 - Layering a. Why are networked systems layered? What are the advantages of layering? Are there any disadvantages?

More information