

Improving PVM Performance Using ATOMIC User-Level Protocol

Hong Xu and Tom W. Fisher
Information Sciences Institute, University of Southern California, Marina del Rey, CA

Abstract

The parallel virtual machine (PVM) software system provides a programming environment that allows a collection of networked workstations to appear as a single concurrent computational resource. The performance of parallel applications in this environment depends on the performance of reliable data transfers between tasks. In this paper, we improve PVM communication performance over the ATOMIC LAN, a gigabit-per-second local area network. This is achieved by separating the PVM data-transmission path from the PVM control-message path and transmitting PVM data messages over a user-level application programming interface (API) provided by the Myrinet ATOMIC interface. The Myrinet-API, although significantly faster than the TCP/IP kernel stack, does not provide reliable communication. Therefore, a user-level protocol has been developed on top of the Myrinet-API to offer reliable, sequenced packet delivery. This protocol provides faster data transfers between PVM tasks over the ATOMIC LAN. Performance results obtained at USC/ISI demonstrate that our version of PVM can reach up to 140 Mbps throughput, which is 94% of the achievable network bandwidth over the ATOMIC LAN.

1 Introduction

Over the past few years, a rapid increase in the performance of microprocessors as well as in the speed of local area networks (LANs) has made workstation clusters a viable platform for solving computation-intensive problems. PVM [1, 2], a message-passing-based software system, provides a programming environment that allows a collection of networked workstations to appear as a single parallel computational resource. PVM currently provides facilities for process control and data communication based on the TCP/IP network protocol suite [3].
By taking advantage of popular TCP/IP implementations, PVM has become a widely used de facto standard for distributed computing on high-performance workstations interconnected through a variety of high-speed networks.

(This work is supported by the Advanced Research Projects Agency through Ft. Huachuca contract #DABT63-93-C-0062 entitled "Netstation Architecture and Advanced Atomic Network". The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Department of the Army, the Advanced Research Projects Agency, or the U.S. Government.)

The performance of PVM applications greatly depends on the performance of reliable data transfers between workstations. There are two basic classes of PVM applications: task-parallel applications and data-parallel applications. For task-parallel programs, the communication latency of synchronization messages is critical to overall application performance. In data-parallel programs, data are partitioned and distributed to all host computers in the PVM configuration. The same operations are performed on each partition of the data by PVM processes running on each host. The PVM processes then communicate with one another to exchange updated data values. In such situations, bulk data transfers frequently occur between the workstations, which can result in performance bottlenecks. Minimizing the communication overhead involved in PVM data transfers for both task-parallel and data-parallel applications requires an especially high-throughput, low-latency form of reliable communication. The ATOMIC [4] LAN is a high-speed network that offers 640 Mbps bandwidth. The ATOMIC workstation cluster at USC/ISI consists of a collection of Sun SPARCstation 20s interconnected through the ATOMIC LAN.
At each host computer, a user-level process can access the ATOMIC LAN through either the TCP/IP kernel network protocol stack or the user-level ATOMIC Myrinet-API. The performance of the general-purpose TCP/IP stack is limited by the overhead of data copies and software layers in the kernel. The Myrinet-API, on the other hand, exhibits higher throughput and lower latency, since it allows a user-level process to access the network directly without going through the kernel. PVM data communication is currently implemented using TCP/IP. Thus, the performance of PVM data transfers over the ATOMIC LAN is limited by the performance of TCP/IP. In this paper, we address how to improve PVM task-to-task data communication performance over the ATOMIC LAN by separating the PVM data-transfer path from the PVM control-message path. The key point behind our work is to replace the critical path of the PVM service with a fast user-level protocol based on the high-bandwidth, low-latency Myrinet-API. This ATOMIC transport-layer

protocol (ATP) provides reliable, sequenced, point-to-point packet delivery for faster data transfers between PVM tasks on different workstations. We observe a significant increase in the performance of PVM task-to-task data communication using the proposed ATP. The performance results show that our version of PVM can reach up to 140 Mbps throughput, which is 94% of the achievable network bandwidth over the ATOMIC LAN. We further conjecture that the performance of future distributed computing applications using our version of PVM over the ATOMIC LAN will also be greatly increased.

The remainder of the paper is organized as follows. Section 2 describes the ATOMIC networking architecture. Section 3 decomposes the PVM service in order to reimplement the critical path of PVM task-to-task data communication using the fast user-level protocol over the Myrinet-API. Section 4 presents our implementation, including the ATP development, as well as our scheme that interleaves memory copies with DMA operations. Section 5 gives the performance results obtained from the USC/ISI ATOMIC workstation cluster. Section 6 discusses related work and Section 7 mentions several areas of future research. Section 8 presents the conclusions.

2 ATOMIC Networking Architecture

The ATOMIC LAN is a high-speed network that offers 640 Mbps bandwidth at an inexpensive per-host cost. It is a switch-based local area network composed of host interface boards and network switches. A network switch in the ATOMIC LAN is a nonblocking crossbar in which packets are simultaneously relayed from distinct incoming ports to distinct outgoing ports with no internal channel collisions. Packets competing for the same outgoing port are arbitrated in round-robin fashion at a per-packet level. The ports on both the switch and the host interface board are bi-directional and are connected to each other by a twisted-pair link with two independent uni-directional channels.
Switch ports and host interface ports can be connected to one another in arbitrary topologies, including trees, meshes, etc. Packet transmission over the ATOMIC LAN has a very low error rate. Packet transfer tests of over 1,000 terabits over the prototype ATOMIC LAN [4] at USC/ISI resulted in no bit errors and no lost packets.

The ATOMIC LAN employs a source-routed, cut-through packet switching technique. In cut-through routing, the header bytes in the packet determine the route, and the remaining data bytes follow in a pipelined fashion. At each intermediate switch, the packet can be advanced into the required outgoing channel as soon as the header is received and decoded. If the header encounters a busy channel, the rest of the packet will hold and block all upstream channel(s) in the routing path. Unlike store-and-forward switching, cut-through routing requires no packet buffering at each intermediate switch and thus provides low packet transmission latency. When a source host sends multiple packets to a destination host, cut-through routing forces packets arriving at the destination to be in order, provided all packets follow the same route.

Host interface boards connect hosts to the network. Each host interface consists of a processor with a dual-port memory. The dual-port on-board memory is accessible to both the on-board processor and the host computer. The host computer can read/write on-board memory through programmed I/O (PIO), and data can be moved between on-board memory and the host computer's main memory via direct memory access (DMA), which is supported on the host interface board. The control program (CP) running on the on-board processor can be loaded from the host computer using PIO. Checksum hardware is also provided on the interface board so that the link-layer checksum of every packet can be computed in hardware while the packet is being either sent or received.
Figure 1: Dual-interface CP for IP and API traffic

The CP on the host interface is designed to allow multiple processes on the host computer to pass data through to the on-board processor at the same time (provided that no two processes access overlapping sections of on-board memory). The CP is also termed the multi-interface CP, since it provides an independent application device interface, also known as an application device channel in [5], for each process on the host computer to communicate directly with the host interface. Figure 1 shows a dual-interface CP that is able to support a host simultaneously running packets over both IP and the Myrinet-API.

In terms of both observed bandwidth and latency, communication over TCP/IP performs worse than communication over the Myrinet-API. While TCP/IP is a kernel networking protocol stack, the Myrinet-API is a user-level communication library that allows a user-level process to directly access and communicate with the host interface board. Using the dual-interface CP provided by Myricom, the peak bandwidth values for TCP/IP, UDP/IP, and the API are 45 Mbps, 55 Mbps, and 150 Mbps, respectively, measured on Sun SPARCstation 20/50s (running SunOS 4.1.3)

interconnected through the ATOMIC LAN.

The original prototype for the ATOMIC LAN was developed by USC/ISI using Caltech's Mosaic chip [6]. The network switches, host interfaces, and system software are currently provided by Myricom [7] as commercial products. Although Myricom has made many improvements in both hardware and software based on the ATOMIC prototype, its commercial product uses the original ATOMIC networking architecture. Researchers at USC/ISI are currently using the Myricom product and continuing ATOMIC research [8], supporting the ATOMIC LAN outside the lab environment for daily computing and networking needs.

3 PVM Service Decomposition

PVM allows a collection of host computers to appear as a single virtual distributed-memory parallel machine. The PVM system is composed of a daemon, known as pvmd in [2], and a library of PVM interface routines. Pvmd is a user-level daemon process running on each host computer that is part of a given PVM configuration. The PVM interface library routines are used by each application process for data communication and process control. A PVM application process is called a PVM task, or task for short. On each host computer, the pvmd serves as a message router that multiplexes messages from local task(s) to the network and demultiplexes messages from the network to local task(s). Pvmd also acts as a coordinator responsible for authentication and process control. The application's computational load is distributed among tasks running on the host computers; the pvmd itself does not do any computation. There is only one pvmd running on each host computer, while there may be multiple tasks running on the same host.
Figure 2: General PVM service model

As shown in Figure 2, in the general PVM service model [2], a message sent from a local task to a task on another host is first, by default, routed to the local pvmd. The message is then forwarded by the local pvmd to the pvmd on the remote host, which finally transfers it to the destination task. Communication between a task and the local pvmd on the same host is achieved using TCP/IP. The pvmds on different hosts communicate with one another through a built-in PVM reliable communication protocol using UDP/IP. Routing messages through pvmds, however, introduces a significant overhead due to extra message copying between PVM tasks and their respective PVM daemons. To improve message-passing performance, a task is allowed to talk to another remote task directly through a TCP/IP connection. Such a direct message-passing scheme is known as message direct routing over TCP in [2], or the TCP path for short. No pvmd is involved in message passing using the TCP path.

In the USC/ISI ATOMIC workstation cluster, communication latency is an average of 1.2 ms and the peak communication bandwidth is 11 Mbps when messages take the default route through the respective pvmds. The latency decreases to an average of 1.1 ms and the peak bandwidth increases to 30 Mbps when messages take direct routes from task to task using TCP/IP. Note that communication over TCP/IP can reach a peak bandwidth of 45 Mbps in the ATOMIC LAN. PVM's task-to-task peak bandwidth is lower due to the extra memory copies for packing the data on the sender side and unpacking the data on the receiver side.
Figure 3: Decomposed PVM service in ATOMIC

Since the Myrinet-API peak bandwidth of 150 Mbps in the ATOMIC LAN is much higher than the TCP/IP peak bandwidth of 45 Mbps, the performance of PVM task-to-task data transfers can be significantly improved by forwarding messages over the Myrinet-API at the user level, instead of through the kernel over TCP/IP. As shown in Figure 3, we propose communicating data over the Myrinet-API to speed up data transmission between PVM tasks. We refer to this approach as using the ATP path. The basic idea is to separate the PVM data-transfer path from the PVM control-message path. As a result, a fast data communication path is provided between PVM tasks while pvmd functionality is maintained for process control. In the next section, the ATP protocol is presented.

4 User-Level Protocol Implementation

This section describes the ATP protocol implementation, which offers reliable point-to-point communication to PVM tasks. The first two subsections describe the underlying packet flow on the host machine as well as how the overhead of memory copies is reduced. The last subsection presents the details of the protocol. Our implementation uses the same task-to-task communication modes as specified in PVM [2]. A receive (pvm_recv()) is a blocking receive which returns only when the data has been received. A send (pvm_send()) is a blocking send which returns when all data has been acknowledged by the receiver and the PVM buffer that holds unacknowledged data is free for reuse.

4.1 Packet Flow on the Host Machine

The basic communication structure between the user-level ATP protocol and the host interface is shown in Figure 4.

Figure 4: Packet flow on the host machine
In this paper, a message refers to a data transfer at the PVM programming level. Conceptually, the size of a single message is unlimited. A packet refers to a data transfer at either the Myrinet-API level or the host interface level. The size of a single packet is limited to 8K bytes, the maximum transfer unit (MTU) for the ATOMIC LAN. API-level packets and interface-level packets use the same MTU. Neither segmentation nor concatenation takes place when packets are DMA-transferred between the PVM buffer and on-board memory. However, segmentation (concatenation) occurs when a message larger than the MTU is copied to (from) PVM user space from (to) the PVM buffer.

The PVM buffer space that is used to buffer send and receive packets is declared uncachable in order to avoid cache coherence problems when this memory is overwritten by DMA upon receiving packets. Both the host CPU and the on-board DMA unit can access the PVM buffer. After an outgoing packet has been DMA-transferred to the host interface, it is stored in a send-packet buffer in on-board memory and waits to be delivered into the network. Similarly, after an incoming packet has been received by the host interface, it is stored in a recv-packet buffer and waits to be DMA-transferred to the host computer. The send-packet and recv-packet buffers are used only by the on-board processor and are not accessed by the host CPU. The only data structures shared between the host CPU and the on-board processor are the send-request queue and the recv-request queue.

The library routines atp_send() and atp_recv() are responsible for providing reliable message sending and receiving. When a packet is sent through atp_send(), the host CPU appends a request at the tail of the send-request queue. The request contains only the memory pointer to the particular packet in the PVM buffer and the packet's size, as opposed to the packet itself. This request will be honored by the on-board processor after all previous requests have been processed. When it is time to service the request, the on-board processor initiates the DMA operation to transfer the packet data to the send-packet buffer using the memory pointer and the data size contained in the request. When the DMA operation is done, the request is removed from the send-request queue by the on-board processor. The send-request queue is implemented as a circular queue with a head pointer and a tail pointer.
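The message-to-packet segmentation described above can be sketched in a few lines; the following is a minimal Python illustration, with field names (seq, tail, data) chosen for clarity rather than taken from ATP's actual packet header.

```python
MTU = 8 * 1024  # 8K-byte maximum transfer unit of the ATOMIC LAN

def segment(message, mtu=MTU):
    """Split one PVM-level message into MTU-sized packets.

    Each packet gets a sequence number; the last packet carries the
    TAIL flag so the receiver knows the whole message has arrived.
    """
    packets = []
    for seq, off in enumerate(range(0, len(message), mtu), start=1):
        packets.append({
            "seq": seq,
            "tail": off + mtu >= len(message),  # TAIL bit set on the last segment
            "data": message[off:off + mtu],
        })
    return packets
```

Reassembly on the receive side is the inverse: each arriving packet's payload is copied into the task's data space until the TAIL-flagged packet is seen.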
The head pointer is modified only by the on-board processor and the tail pointer is modified only by the host CPU. As a result, requests in the send-request queue are serviced in FIFO order. The host CPU has to wait when the send-request queue is full, and the on-board processor has to wait when the queue is empty. The dual-port on-board memory guarantees the atomicity of single-word load and store operations, thus ensuring the consistency of concurrent accesses.

When a packet receive operation is issued in atp_recv(), the host CPU appends a request at the tail of the recv-request queue. Besides the expected packet size and the memory pointer (pointing to the allocated PVM buffer reserved for the expected packet), the request also contains the source PVM task ID and the PVM message tag [2] used to identify the requested PVM message. After the packet arrives at the host interface, the on-board processor verifies the source task ID and the message tag, and initiates the DMA operation to transfer the packet to the reserved memory. Finally, the request is removed from the queue by the on-board processor. For the recv-request queue there is no FIFO ordering of requests. Instead, every request may be examined in order to match an arrived packet, since packets can arrive from different PVM tasks out of order. A packet that does not match any receive request is dropped.

In the multi-interface CP, each application device interface is associated with an independent send-request queue and recv-request queue. As shown in Figure 4, on the local host computer, two PVM tasks (task 0 and task 1) are assigned independent application device interfaces (interface 0 and interface 1) to the host interface. Only the owner task can access its own pair of request queues. On each host, an application device interface is statically assigned to a PVM task during task initialization. This task-interface assignment is known to all other PVM tasks.
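The head/tail discipline of the send-request queue is essentially a single-producer, single-consumer ring buffer: no lock is needed as long as single-word loads and stores are atomic, which the dual-port memory guarantees. A small sketch (the capacity and tuple layout are illustrative assumptions, not the CP's actual data structure):

```python
class RequestQueue:
    """Single-producer/single-consumer circular request queue.

    Mirrors the send-request queue described above: the host CPU
    (producer) advances only the tail, and the on-board processor
    (consumer) advances only the head, so the two sides never write
    the same word. One slot is sacrificed to distinguish full from empty.
    """
    def __init__(self, capacity=8):
        self.slots = [None] * capacity
        self.head = 0  # modified only by the consumer (on-board processor)
        self.tail = 0  # modified only by the producer (host CPU)

    def full(self):
        return (self.tail + 1) % len(self.slots) == self.head

    def empty(self):
        return self.head == self.tail

    def enqueue(self, ptr, size):        # host CPU side
        if self.full():
            return False                 # caller must wait for free space
        self.slots[self.tail] = (ptr, size)
        self.tail = (self.tail + 1) % len(self.slots)
        return True

    def dequeue(self):                   # on-board processor side
        if self.empty():
            return None                  # consumer must wait for a request
        req = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        return req
```

Because each pointer has exactly one writer, requests are naturally serviced in FIFO order without any synchronization beyond word-atomic memory.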
4.2 Reducing Memory Copy Overhead

As shown in Figure 4, a section of uncachable memory space serves as the PVM buffer, which is physically disjoint from the rest of the PVM task data space. On Sun SPARCstations, the PVM buffer has to be uncachable; otherwise, cache coherence problems will occur when the PVM buffer is overwritten by DMA upon receiving packets, leaving the corresponding cache lines with stale data. By using a separate PVM buffer, data has to be moved between PVM task space and the PVM buffer, which introduces the overhead of an extra memory copy.

Figure 5: Interleaving memory copies with DMA operations on the sender

An alternative approach, which can avoid the memory copy, is to make the memory in PVM task data

space uncachable. Then, data could be DMA-transferred to/from the host interface directly, without going through any PVM buffer. However, this approach causes another problem: computation speed is dramatically reduced (by 20% on a Sun SPARCstation 20/50) if the referenced data is allocated in uncachable memory. It is non-trivial and inappropriate for application users to manipulate cachable and uncachable data space at the programming level in order to maintain proper computation speed. For this reason we have chosen to implement the memory copy approach.

For large messages, the overhead of memory copies can be greatly reduced by interleaving them with DMA operations. As shown in Figure 5, a large message is segmented into multiple packets after atp_send() is called. Instead of copying the whole message into the PVM buffer, only the first MTU bytes are copied and filled into the first packet in step 1. In step 2, while the first packet is being DMA-transferred by the host interface, the next MTU bytes are copied and filled into the second packet. On a Sun SPARCstation 20/50, a memory copy of 8K (MTU) bytes takes about 126 microseconds, while a DMA transfer of 8K bytes takes about 210 microseconds over the Sun SBus. Thus, the memory copy of the second 8K bytes into the PVM buffer can take place in parallel with the DMA transfer of the first packet into the send-packet buffer. The same procedure repeats until the last packet is DMA-transferred into on-board memory. Except for the first packet, the overhead of successive memory copies can be dramatically reduced by using this interleaving optimization. The overhead of memory copies at the receiving end is also significantly reduced by interleaving them with DMA operations.

4.3 Reliable Data Protocol

ATP is a user-level, reliable, sequenced packet delivery protocol which has been developed for the ATOMIC workstation cluster to be used directly with the Myrinet-API.
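The benefit of interleaving can be seen with a back-of-the-envelope timing model using the figures quoted above (126 µs per 8K-byte copy, 210 µs per 8K-byte DMA). Because each DMA outlasts a copy, every copy after the first hides completely behind the preceding DMA; this sketch models that, under the simplifying assumption of constant per-packet costs:

```python
COPY_US = 126  # 8K-byte memory copy on a SPARCstation 20/50 (from the text)
DMA_US = 210   # 8K-byte DMA transfer over the Sun SBus (from the text)

def sequential_time_us(n_packets, copy=COPY_US, dma=DMA_US):
    """No overlap: each packet is copied, then DMA-transferred."""
    return n_packets * (copy + dma)

def interleaved_time_us(n_packets, copy=COPY_US, dma=DMA_US):
    """Copy packet i+1 while packet i is on the DMA engine.

    Valid when dma >= copy, as it is here: after the first copy the
    DMA engine never goes idle, so the total is one copy plus n
    back-to-back DMA transfers.
    """
    if n_packets == 0:
        return 0
    return copy + n_packets * dma
```

For a 32K-byte message (four packets) this model gives 1344 µs sequentially versus 966 µs interleaved, a reduction of roughly 28%.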
By taking advantage of the properties of the ATOMIC LAN, our protocol implementation is less complicated and offers high performance.

A Stop-and-Wait Protocol

A PVM message is identified by the PVM task ID and the PVM message tag. If the message size is larger than the MTU of the ATOMIC LAN, it is segmented into multiple packets, each of which is further identified by a unique sequence number. The TAIL bit in the packet header is set for the last packet. Figure 6(a) shows the basic idea of how the protocol works. For simplicity, only the sequence number of each data packet is shown in Figure 6. In Figure 6, we further assume that all packets shown in each scenario are different segments of a single PVM message. The sender atp_send() transmits n packets in a row to the receiver atp_recv(). Atp_send() then stops and waits for an acknowledgement (ACK) that the receiver has indeed received the data. After receiving all n packets, atp_recv() at the remote end sends an ACK back to atp_send(). When the ACK reaches atp_send(), another n packets may then be transmitted to the remote atp_recv(). We refer to this approach as bundle-sending and to the number n as the bundle size. Each ACK consists of two fields. The first field contains the sequence numbers of those packets which were not received by atp_recv(). The second field contains the bundle size offered by atp_recv(). In Figure 6(a), no packet is lost; thus, the first field in each ACK is empty and the second field is n, the size of the next bundle to be sent.

Error Checking and Packet Loss Recovery

The host interface provides a hardware checksum. If the host interface receives a corrupted packet, it drops the packet without informing the host computer. Therefore, if a packet reaches the user-level protocol, it is guaranteed to be a good packet. No software checksum is carried out in ATP. In the ATOMIC LAN, packets from the same source arrive at the same destination in the order they were sent (provided that all packets use the same route).
Thus, atp_recv() can easily tell whether any packets are lost, and which ones, by observing the sequence number of the last received packet of the current bundle. This information is included in the ACK and is sent back to atp_send(). This scheme is known as selective acknowledgement [9]. For example, in Figure 6(b), five packets are sent from atp_send() to the remote atp_recv(). Packets 2 and 3 are lost. As soon as packet 4 is received, atp_recv() realizes the loss of packets 2 and 3 (because, in the ATOMIC LAN, we can assume that no packets arrive out of order). Atp_recv() records which packets were lost and continues receiving packets until either the last packet is received or (if the last packet has also been dropped) atp_recv() times out. At this point, in Figure 6(b), atp_recv() sends back a selective acknowledgement which tells atp_send() to restrict its next bundle size to 2 and to retransmit packets 2 and 3. Atp_send() will then, upon receipt of the selective ACK, carry out the necessary retransmissions and wait for another ACK. This process repeats until the receiver has positively acknowledged all expected data. In Figure 6(b), atp_send() retransmits packets 2 and 3. After these two packets have been successfully received, atp_recv() adjusts its expected bundle size back to 5 and informs atp_send() via the next ACK.

The ATOMIC LAN has a very low error rate, so ACKs are seldom corrupted across cables or network switches. However, an ACK could be dropped by the sender host interface if the sender host computer is overloaded by other traffic being received. In our implementation, if an ACK is lost, atp_send() will time out and retransmit all n data packets. This retransmission scheme can potentially degrade performance, since the entire bundle of packets is retransmitted.
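Because in-order delivery lets the receiver infer every loss from sequence-number gaps, the receiver-side bookkeeping reduces to a few lines. A hedged sketch (the function names and the (missing, bundle) tuple are our own shorthand for the two ACK fields described above, not ATP's actual code):

```python
def lost_packets(received_seqs, last_seq=None):
    """Sequence numbers missing below the highest one seen.

    In-order delivery means any gap below the highest received
    sequence number is a drop. If the TAIL packet itself was
    dropped, the receiver's timeout supplies last_seq (the
    expected final sequence number of the bundle).
    """
    if not received_seqs:
        return list(range(1, (last_seq or 0) + 1))
    high = last_seq if last_seq is not None else max(received_seqs)
    got = set(received_seqs)
    return [s for s in range(1, high + 1) if s not in got]

def make_ack(received_seqs, default_bundle, last_seq=None):
    """Build the two-field ACK: (missing packets, next bundle size).

    After a loss, the next bundle is restricted to just the
    retransmissions; otherwise the receiver re-offers its default
    bundle size.
    """
    missing = lost_packets(received_seqs, last_seq)
    return (missing, len(missing) if missing else default_bundle)
```

Replaying the Figure 6(b) scenario: packets 2 and 3 of a five-packet bundle are lost, so the ACK is ([2, 3], 2); once the retransmitted packets arrive, a clean bundle yields ([], 5) again.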
To avoid dropping ACKs, a special buffer could be created in on-board memory for receiving out-of-band data, including ACKs sent from atp_recv().

As a result, the sender host interface would have room to store ACKs even if its regular recv-packet buffer is full.

Figure 6: Reliable data protocol: (a) bundle sending; (b) packet retransmission; (c) transmission startup and termination

Congestion Control and Flow Control

When congestion occurs in the ATOMIC LAN, Myrinet link-layer back-pressure hardware prevents senders from injecting more packets into the network [10]. Therefore, our protocol does not provide extra software support for congestion control over the ATOMIC LAN. Flow control in the ATOMIC LAN is handled by atp_recv() through the bundle size it advertises to atp_send(). In situations where the receiver workstation is experiencing heavy CPU load, heavy load at its network interface (many packets being sent to it), or both, atp_recv() will begin to notice packets being dropped. At this point, each ACK generated by atp_recv() will include the sequence numbers of all dropped packets as well as smaller bundle sizes for atp_send() to use. In effect, the bundle size will decrease to a certain point (dependent upon how loaded the receiver actually is) and gradually stabilize at a size representing the number of packets the receiver machine can handle without dropping.

Transmission Startup and Termination

In the distributed PVM environment there are no assumptions made regarding the execution order of the sender and receiver. Either the receiver atp_recv() or the sender atp_send() can start first. Our protocol works well in either case.
When atp_recv() is called, it immediately generates an initial READY packet and sends (or broadcasts) it to all host machines the receiver is interested in. This READY packet contains the bundle size advertised by atp_recv(). The READY packet is important so that any sender that has been polling the receiver (as described below) can be informed immediately to begin sending. For example, in Figure 6(c), atp_send() starts earlier than atp_recv(). The READY packet from atp_recv() immediately informs atp_send(), which is busy-waiting for the receiver's signal. As shown in Figure 6(c), a bundle-send begins as soon as the handshake has been established. If there is no immediate response from any valid sender, atp_recv() will busy-wait.

When atp_send() is called, as shown in Figure 6(c), it sends only the initial packet (packet 0) to atp_recv() and then busy-waits for an ACK. This initial packet is an actual data packet that contains the first MTU bytes of the message being sent (or the whole message if it is smaller than the MTU). If atp_recv() is already ready, it will send back an ACK containing the advertised bundle size. In Figure 6(c), atp_recv() is not ready yet; thus, the initial packet will be dropped by the receiver host interface. Atp_send() will eventually time out and retransmit the initial packet. This polling process repeats until atp_recv() is ready and responds by sending back the initial READY packet.
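This polling startup can be illustrated with a toy simulation: the sender keeps retransmitting the initial data packet on each timeout until the receiver is up. Everything here (the attempt counting and the max_attempts cap) is illustrative scaffolding, not part of ATP, which busy-waits indefinitely:

```python
def sender_startup(receiver_ready_at, max_attempts=10):
    """Simulate the polling handshake of Figure 6(c).

    The sender transmits the initial data packet (packet 0) and
    busy-waits; until the receiver has called atp_recv(), the packet
    is simply dropped by the receiver host interface, so the sender
    times out and retransmits. receiver_ready_at is the attempt
    number at which the receiver finally answers.

    Returns the number of transmissions of the initial packet.
    """
    for attempt in range(1, max_attempts + 1):
        if attempt >= receiver_ready_at:
            return attempt  # receiver answers (READY/ACK); bundles begin
        # timeout: initial packet was dropped, retransmit it
    raise TimeoutError("receiver never became ready")
```

The tradeoff discussed next falls directly out of this loop: a shorter timeout means the receiver is discovered sooner after it comes up, but also means more retransmissions of packet 0 while it is still down.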

A tradeoff exists regarding the length of the sender's timeout value. A smaller timeout allows faster recovery if the READY packet sent by atp_recv() is lost. However, the smaller the timeout, the greater the amount of network traffic generated by atp_send() until atp_recv() eventually responds.

To conclude a transmission, atp_send() sets the TAIL bit in the last data packet of the entire message. Upon receipt of this last packet, atp_recv() responds by sending a FIN(ish) packet back to atp_send() before returning. As soon as the FIN packet arrives at the sender, atp_send() also returns.

5 Performance Results

This section shows the performance results obtained in testing PVM task-to-task communication using TCP/IP and ATP over the ATOMIC workstation cluster. In particular, all measurements were made between two Sun SPARCstation 20/50s running SunOS 4.1.3. These two machines are connected directly to each other through a Myrinet switch. In our experiments, the receiver pvm_recv() always starts earlier than the sender pvm_send(). In Figures 7, 8 and 9, the term "TCP path" refers to data transfers through the kernel over TCP/IP, and the term "ATP path" refers to data transfers over the Myrinet-API using ATP.

Figure 7: Latency comparison (latency vs. message size for the TCP and ATP paths)

Figure 7 plots latency values for PVM task-to-task communication using the TCP path and the ATP path. The ATP path latency is only 50% of the TCP path latency. Figure 8 compares the bandwidth values of PVM task-to-task communication using the TCP path and the ATP path. The message sizes measured are those equal to or smaller than the 8K-byte MTU. For these small messages, the bundle size is always one, with no interleaving of memory copies and DMA operations. However, even without using bundle sizes greater than one or the interleaving optimization, communication using the ATP path consistently outperforms communication using the TCP path.
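The bundle size of Section 4.3 also bounds what the stop-and-wait scheme can reach: the sender idles for one ACK round trip per bundle, so larger bundles amortize that idle time. A rough model (the per-packet and ACK round-trip times below are illustrative placeholders, not values measured in this paper):

```python
def bundle_throughput_mbps(n, mtu_bytes=8192, pkt_time_us=210, ack_rtt_us=300):
    """Effective throughput of stop-and-wait bundle-sending.

    n packets go out back to back, then the sender idles for one ACK
    round trip. Since bits per microsecond equals megabits per second,
    the result is directly in Mbps.
    """
    bits = n * mtu_bytes * 8
    return bits / (n * pkt_time_us + ack_rtt_us)
```

The model saturates toward mtu_bytes * 8 / pkt_time_us as n grows, which is why the receiver re-advertises a full bundle size as soon as losses stop.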
[Figure 8: Bandwidth comparison for small messages -- bandwidth (Mbps) vs. message size (Kbytes) for the ATP path and the TCP path]

In Figure 9, the "raw API" bandwidth refers to the bandwidth of user-level data transfers using the unreliable raw Myrinet-API. The "raw API" bandwidth of 150 Mbps is obtained by DMA transfers of data directly to/from user process data space, with no memory copies taking place on either host. This raw API bandwidth offers an upper bound for any user-level protocol using the Myrinet-API. Figure 9 shows that, as the message size increases up to 256 Kbytes, the bandwidth of reliable communication over the Myrinet-API reaches 94% of the "raw API" bandwidth. The experimental results demonstrate that the interleaving optimization greatly reduces the overhead due to memory copying. Without the interleaving optimization, the bandwidth of reliable communication over the Myrinet-API is only 66% of the "raw API" bandwidth.

[Figure 9: Bandwidth comparison for large messages -- bandwidth (Mbps) vs. message size (Kbytes) for the raw API, the ATP path, the ATP path without interleaving, and the TCP path]

Our performance results show that using ATP over the Myrinet-API achieves lower latency and higher bandwidth than using TCP/IP. Thus, we conjecture that our version of PVM will

greatly increase the overall PVM performance over the ATOMIC LAN for bandwidth-sensitive data-parallel applications as well as latency-sensitive task-parallel applications.

6 Related Work

This section discusses only related work regarding reliable communication over the ATOMIC LAN. Related issues of implementing user-level protocols on other network architectures can be found in [11, 12, 13]. The host interface is an alternative place to implement a reliable data protocol. An RPC-based (remote procedure call based) reliable protocol [14] has been implemented in the host interface control program (CP) in order to minimize per-packet overhead and optimize communication latency for small messages. This protocol was specifically designed for device control over the ATOMIC LAN in the USC/ISI Netstation Project [15], and it achieves a very low RTT (round-trip time) for device-control command streams. The major reason driving us to implement reliable communication in the host computer is the limited speed and memory capacity of the on-board processor. With 128 Kbytes of RAM and a 25 MHz LANai processor [10], a Myrinet ATOMIC interface can only afford to support a very limited number of application device interfaces without significantly degrading its performance. Thus, an application device interface may have to be shared by multiple processes among many different user applications, including distributed-computing and real-time applications. There are no assumptions regarding the characteristics of ATOMIC network traffic flow. The same application device interface (to the host interface) can be shared, for example, by a PVM task and a teleconferencing application that does not require reliable data transmission. On the other hand, the on-board LANai processor could be overloaded if reliable communication were supported in the host interface for data transmitted over every application device interface.
For this reason, we think that reliable communication is application-based and that such a protocol should be implemented in the host computer at the transport layer.

7 Future Work

Like task-to-task communication, pvmd-to-pvmd communication can also be implemented using our user-level reliable data protocol in the ATOMIC workstation cluster. For the point-to-point communication problem addressed in this paper, the performance improvement of implementing pvmd-to-pvmd communication over the Myrinet-API may not be significant, since pvmds are not in the critical path of point-to-point communication. However, pvmd is responsible for collective communication involving three or more PVM tasks. For example, the current PVM multicast implementation routes multicast messages only through pvmds. A significant performance improvement is expected from decomposing the pvmd service and implementing PVM collective communication at the Myrinet-API level. The benefit of the approach presented in this paper applies only when hosts in the PVM configuration are directly connected to the same ATOMIC LAN. In a heterogeneous network environment with different kinds of high-speed LAN segments, two hosts may not be able to talk to one another without going through TCP/IP. Our current research addresses how to implement user-level TCP/UDP network protocols over the ATOMIC LAN. As with Myrinet-API communication, we expect the performance of user-level TCP/UDP protocols to be much higher than that of their associated kernel network protocol stacks, and thus to yield greater benefit to user-level applications including PVM.

8 Conclusion

In this paper, we have decomposed the PVM service over the high-speed ATOMIC LAN by separating the data-transfer path from the control-message path. The critical path of PVM point-to-point data communication has been replaced by a fast and reliable user-level protocol implemented over the Myrinet-API.
By interleaving memory copies and DMA transfers, the memory-copy overhead has been greatly reduced, resulting in an enhanced PVM implementation that provides up to 140 Mbps PVM bandwidth over the ATOMIC LAN. The performance results show that our user-level reliable protocol achieves 94% of the bandwidth upper bound established by communication over the unreliable, zero-memory-copy "raw" Myrinet-API. A significant increase in PVM application performance is expected from using our version of PVM over an ATOMIC LAN.

Acknowledgements

The authors would like to thank Jon Postel, Greg Finn, Joe Touch, Celeste Anderson, Ted Faber, Annette DeSchon, and Steve Hotz at USC/ISI for their suggestions regarding this work, and Myricom for the useful dialogue regarding their dual-interface LANai control program.

References

[1] A. Beguelin, J. Dongarra, A. Geist, R. Manchek, S. Otto, and J. Walpole, "PVM: Experiences, current status and future direction," in Supercomputing'93 Proceedings, pp. 765-766, Nov.

[2] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam, PVM 3 User's Guide and Reference Manual. Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, Sept.

[3] J. B. Postel, Transmission Control Protocol. RFC 793, Sept.

[4] R. Felderman, A. DeSchon, D. Cohen, and G. Finn, "ATOMIC: A high-speed local communication architecture," Journal of High Speed Networks, vol. 3, no. 1, pp. 1-30.

[5] P. Druschel, L. L. Peterson, and B. S. Dave, "Experiences with a high-speed network adaptor:

A software perspective," in Proceedings of SIGCOMM'94, pp. 2-13, Sept.

[6] C. L. Seitz, N. Boden, J. Seizovic, and W. Su, "The design of the Caltech Mosaic C multiprocessor," in Proceedings of the Washington Symposium on Integrated Systems, Seattle.

[7] N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, and W.-K. Su, "Myrinet: A gigabit-per-second local area network," IEEE Micro, vol. 15, pp. 29-35, Feb.

[8] J. Touch, A. DeSchon, H. Xu, T. Faber, T. Fisher, and A. Sachdev, "ATOMIC-2: Production use of a gigabit LAN (abstract)," in Proceedings of Gigabit Networking Workshop'95 at INFOCOM'95, Apr.

[9] V. Jacobson and R. Braden, TCP Extensions for Long-Delay Paths. RFC 1072, Oct.

[10] Myricom, Myrinet Link and Routing Specification, Jan.

[11] C. A. Thekkath, T. D. Nguyen, E. Moy, and E. D. Lazowska, "Implementing network protocols at user level," in Proceedings of SIGCOMM'93, pp. 64-73, Sept.

[12] A. Edwards, G. Watson, J. Lumley, D. Banks, C. Calamvokis, and C. Dalton, "User-space protocols deliver high performance to applications on a low-cost Gb/s LAN," in Proceedings of SIGCOMM'94, pp. 14-23, Sept.

[13] C. Maeda and B. Bershad, "Protocol service decomposition for high-performance networking," in Proceedings of SIGOPS'93, pp. 244-255, Dec.

[14] G. G. Finn, "Device control via the network: Interface limitations and issues," in Proceedings of the 4th Annual Principal Investigators Meeting (Networking'94), p. 22, ARPA, Sept.

[15] G. G. Finn, "An integration of network communication with workstation architecture," in Computer Communication Review, Oct.


More information

Transport Protocols Reading: Sections 2.5, 5.1, and 5.2

Transport Protocols Reading: Sections 2.5, 5.1, and 5.2 Transport Protocols Reading: Sections 2.5, 5.1, and 5.2 CE443 - Fall 1390 Acknowledgments: Lecture slides are from Computer networks course thought by Jennifer Rexford at Princeton University. When slides

More information

EEC-682/782 Computer Networks I

EEC-682/782 Computer Networks I EEC-682/782 Computer Networks I Lecture 16 Wenbing Zhao w.zhao1@csuohio.edu http://academic.csuohio.edu/zhao_w/teaching/eec682.htm (Lecture nodes are based on materials supplied by Dr. Louise Moser at

More information

Acknowledgment packets. Send with a specific rate TCP. Size of the required packet. XMgraph. Delay. TCP_Dump. SlidingWin. TCPSender_old.

Acknowledgment packets. Send with a specific rate TCP. Size of the required packet. XMgraph. Delay. TCP_Dump. SlidingWin. TCPSender_old. A TCP Simulator with PTOLEMY Dorgham Sisalem GMD-Fokus Berlin (dor@fokus.gmd.de) June 9, 1995 1 Introduction Even though lots of TCP simulators and TCP trac sources are already implemented in dierent programming

More information

Announcements. IP Forwarding & Transport Protocols. Goals of Today s Lecture. Are 32-bit Addresses Enough? Summary of IP Addressing.

Announcements. IP Forwarding & Transport Protocols. Goals of Today s Lecture. Are 32-bit Addresses Enough? Summary of IP Addressing. IP Forwarding & Transport Protocols EE 122: Intro to Communication Networks Fall 2007 (WF 4-5:30 in Cory 277) Vern Paxson TAs: Lisa Fowler, Daniel Killebrew & Jorge Ortiz http://inst.eecs.berkeley.edu/~ee122/

More information

1 Introduction Myrinet grew from the results of two ARPA-sponsored projects. Caltech's Mosaic and the USC Information Sciences Institute (USC/ISI) ATO

1 Introduction Myrinet grew from the results of two ARPA-sponsored projects. Caltech's Mosaic and the USC Information Sciences Institute (USC/ISI) ATO An Overview of Myrinet Ralph Zajac Rochester Institute of Technology Dept. of Computer Engineering EECC 756 Multiple Processor Systems Dr. M. Shaaban 5/18/99 Abstract The connections between the processing

More information

Applications PVM (Parallel Virtual Machine) Socket Interface. Unix Domain LLC/SNAP HIPPI-LE/FP/PH. HIPPI Networks

Applications PVM (Parallel Virtual Machine) Socket Interface. Unix Domain LLC/SNAP HIPPI-LE/FP/PH. HIPPI Networks Enhanced PVM Communications over a HIPPI Local Area Network Jenwei Hsieh, David H.C. Du, Norman J. Troullier 1 Distributed Multimedia Research Center 2 and Computer Science Department, University of Minnesota

More information

TCP/IP Transport Layer Protocols, TCP and UDP

TCP/IP Transport Layer Protocols, TCP and UDP TCP/IP Transport Layer Protocols, TCP and UDP Learning Objectives Identify TCP header fields and operation using a Wireshark FTP session capture. Identify UDP header fields and operation using a Wireshark

More information

Communication Networks

Communication Networks Communication Networks Spring 2018 Laurent Vanbever nsg.ee.ethz.ch ETH Zürich (D-ITET) April 30 2018 Materials inspired from Scott Shenker & Jennifer Rexford Last week on Communication Networks We started

More information

EE 122: IP Forwarding and Transport Protocols

EE 122: IP Forwarding and Transport Protocols EE 1: IP Forwarding and Transport Protocols Ion Stoica (and Brighten Godfrey) TAs: Lucian Popa, David Zats and Ganesh Ananthanarayanan http://inst.eecs.berkeley.edu/~ee1/ (Materials with thanks to Vern

More information

CMSC 611: Advanced. Interconnection Networks

CMSC 611: Advanced. Interconnection Networks CMSC 611: Advanced Computer Architecture Interconnection Networks Interconnection Networks Massively parallel processor networks (MPP) Thousands of nodes Short distance (

More information

TSIN02 - Internetworking

TSIN02 - Internetworking Lecture 4: Transport Layer Literature: Forouzan: ch 11-12 2004 Image Coding Group, Linköpings Universitet Lecture 4: Outline Transport layer responsibilities UDP TCP 2 Transport layer in OSI model Figure

More information

CE693 Advanced Computer Networks

CE693 Advanced Computer Networks CE693 Advanced Computer Networks Review 2 Transport Protocols Acknowledgments: Lecture slides are from the graduate level Computer Networks course thought by Srinivasan Seshan at CMU. When slides are obtained

More information

EEC-484/584 Computer Networks. Lecture 16. Wenbing Zhao

EEC-484/584 Computer Networks. Lecture 16. Wenbing Zhao EEC-484/584 Computer Networks Lecture 16 wenbing@ieee.org (Lecture nodes are based on materials supplied by Dr. Louise Moser at UCSB and Prentice-Hall) Outline 2 Review Services provided by transport layer

More information

Talk Outline. System Architectures Using Network Attached Peripherals

Talk Outline. System Architectures Using Network Attached Peripherals System Architectures Using Network Attached Peripherals Rodney Van Meter USC/Information Sciences Institute rdv@isi.edu http://www.isi.edu/netstation/ Introduction USC Integrated Media Systems Center Student

More information

Communication Networks

Communication Networks Communication Networks Prof. Laurent Vanbever Exercises week 4 Reliable Transport Reliable versus Unreliable Transport In the lecture, you have learned how a reliable transport protocol can be built on

More information

FB(9,3) Figure 1(a). A 4-by-4 Benes network. Figure 1(b). An FB(4, 2) network. Figure 2. An FB(27, 3) network

FB(9,3) Figure 1(a). A 4-by-4 Benes network. Figure 1(b). An FB(4, 2) network. Figure 2. An FB(27, 3) network Congestion-free Routing of Streaming Multimedia Content in BMIN-based Parallel Systems Harish Sethu Department of Electrical and Computer Engineering Drexel University Philadelphia, PA 19104, USA sethu@ece.drexel.edu

More information

7. TCP 최양희서울대학교컴퓨터공학부

7. TCP 최양희서울대학교컴퓨터공학부 7. TCP 최양희서울대학교컴퓨터공학부 1 TCP Basics Connection-oriented (virtual circuit) Reliable Transfer Buffered Transfer Unstructured Stream Full Duplex Point-to-point Connection End-to-end service 2009 Yanghee Choi

More information

perform well on paths including satellite links. It is important to verify how the two ATM data services perform on satellite links. TCP is the most p

perform well on paths including satellite links. It is important to verify how the two ATM data services perform on satellite links. TCP is the most p Performance of TCP/IP Using ATM ABR and UBR Services over Satellite Networks 1 Shiv Kalyanaraman, Raj Jain, Rohit Goyal, Sonia Fahmy Department of Computer and Information Science The Ohio State University

More information

Operating Systems. 16. Networking. Paul Krzyzanowski. Rutgers University. Spring /6/ Paul Krzyzanowski

Operating Systems. 16. Networking. Paul Krzyzanowski. Rutgers University. Spring /6/ Paul Krzyzanowski Operating Systems 16. Networking Paul Krzyzanowski Rutgers University Spring 2015 1 Local Area Network (LAN) LAN = communications network Small area (building, set of buildings) Same, sometimes shared,

More information

The latency of user-to-user, kernel-to-kernel and interrupt-to-interrupt level communication

The latency of user-to-user, kernel-to-kernel and interrupt-to-interrupt level communication The latency of user-to-user, kernel-to-kernel and interrupt-to-interrupt level communication John Markus Bjørndalen, Otto J. Anshus, Brian Vinter, Tore Larsen Department of Computer Science University

More information

Page 1. Goals for Today" Discussion" Example: Reliable File Transfer" CS162 Operating Systems and Systems Programming Lecture 11

Page 1. Goals for Today Discussion Example: Reliable File Transfer CS162 Operating Systems and Systems Programming Lecture 11 Goals for Today" CS162 Operating Systems and Systems Programming Lecture 11 Reliability, Transport Protocols" Finish e2e argument & fate sharing Transport: TCP/UDP Reliability Flow control October 5, 2011

More information

Lecture 16: On-Chip Networks. Topics: Cache networks, NoC basics

Lecture 16: On-Chip Networks. Topics: Cache networks, NoC basics Lecture 16: On-Chip Networks Topics: Cache networks, NoC basics 1 Traditional Networks Huh et al. ICS 05, Beckmann MICRO 04 Example designs for contiguous L2 cache regions 2 Explorations for Optimality

More information

Outline Computer Networking. Functionality Split. Transport Protocols

Outline Computer Networking. Functionality Split. Transport Protocols Outline 15-441 15 441 Computer Networking 15-641 Lecture 10: Transport Protocols Justine Sherry Peter Steenkiste Fall 2017 www.cs.cmu.edu/~prs/15 441 F17 Transport introduction TCP connection establishment

More information

CSCD 330 Network Programming Winter 2015

CSCD 330 Network Programming Winter 2015 CSCD 330 Network Programming Winter 2015 Lecture 11a Transport Layer Reading: Chapter 3 Some Material in these slides from J.F Kurose and K.W. Ross All material copyright 1996-2007 1 Chapter 3 Sections

More information

Basic Low Level Concepts

Basic Low Level Concepts Course Outline Basic Low Level Concepts Case Studies Operation through multiple switches: Topologies & Routing v Direct, indirect, regular, irregular Formal models and analysis for deadlock and livelock

More information

CMPE150 Midterm Solutions

CMPE150 Midterm Solutions CMPE150 Midterm Solutions Question 1 Packet switching and circuit switching: (a) Is the Internet a packet switching or circuit switching network? Justify your answer. The Internet is a packet switching

More information

Programming Assignment 3: Transmission Control Protocol

Programming Assignment 3: Transmission Control Protocol CS 640 Introduction to Computer Networks Spring 2005 http://www.cs.wisc.edu/ suman/courses/640/s05 Programming Assignment 3: Transmission Control Protocol Assigned: March 28,2005 Due: April 15, 2005, 11:59pm

More information

Process/ Application NFS NTP SMTP UDP TCP. Transport. Internet

Process/ Application NFS NTP SMTP UDP TCP. Transport. Internet PERFORMANCE CONSIDERATIONS IN FILE TRANSFERS USING FTP OVER WIDE-AREA ATM NETWORKS Luiz A. DaSilva y, Rick Lett z and Victor S. Frost y y Telecommunications & Information Sciences Laboratory The University

More information