
Virtual Interface Architecture over Myrinet

EEL Computer Architecture, Dr. Alan D. George
Project Final Report
Department of Electrical and Computer Engineering, University of Florida
Edwin Hernandez
December 1998

Implementation of a Virtual Interface Architecture over Myrinet
Edwin Hernandez - hernande@hcs.ufl.edu

1. Introduction

Network interfaces have been improving over the years, but network protocols add a large amount of overhead that translates into higher latency and lower useful throughput. Traditional protocols such as TCP, UDP, and TP4 are not appropriate for high-performance environments, where lightweight communication is required in order to approach the theoretically achievable bandwidth. In other words, protocols have to be lightweight: they should maximize useful throughput and minimize latency. Several software and hardware companies, seeing this problem, have come up with new ideas. One of them is the Virtual Interface Architecture (VIA); in fact, VIA was born from the combined effort of Compaq, Microsoft, and Intel [VIA98]. The Virtual Interface specification is defined as a standard, several organizations have endorsed it, and version 1.0 of the standard is now available [VIA97]. Several papers have been published on the Virtual Interface Architecture in which Virtual Interfaces (VIs) were implemented to prove the concept as well as to demonstrate the reduction in latency and the gain in bandwidth, using different NICs such as Myrinet and even Ethernet [Dunn98], [Eick98]. This line of work was preceded by PA-RISC network interface architectures [Banks93] and virtual protocols for Myrinet [Rosu95]; moreover, several researchers have tried to localize the bottlenecks and performance improvements in NICs, such as the work by [Davi93] and [Rama93], which established general memory-management concepts as well as I/O handling techniques. As described in Section 5, all the measurements and tests were performed on the Myrinet test-bed in the High Performance Computing and Simulation Research Lab (HCS Lab) at the University of Florida.

2. Background

The Virtual Interface Architecture is a new concept that fits well with the ideas behind high-performance networks and the design of clusters. The VIA tries to boost performance by avoiding excessive data copying and by carrying out its tasks without traversing many protocol layers; these issues are explained in Section 2.1.

2.1. The Virtual Interface Architecture [VIA98]

The VIA attacks the problem of the relatively low achievable performance of inter-process communication (IPC) within a cluster [1]. IPC performance is determined by overhead: the software overhead added to send/receive operations as a message crosses the network. The number of software layers traversed implies a great number of context switches, interrupts, and data copies at each boundary crossing. Although faster processor clocks help in processing the software layers, they are not a decisive factor in recovering the lost performance (cache misses carry large penalties, and layered software implies many branches). Meanwhile, with the introduction of OC-3 ATM, network bandwidth is climbing from a few Mbps to 155 Mbps, with 1 Gbps backbones, but this "raw" bandwidth can almost never be achieved. With these two trends in mind, Intel and the other companies developed the VIA, which can be described in terms of two components: a user agent and a kernel agent. The user agent is the software layer using the architecture; it can be an application or a communication-services layer. The kernel agent is a driver running in protected (kernel) mode; it sets up the tables and structures that allow communication between cooperating processes. VIA accomplishes low latency in a message-passing environment by following these rules:

- Eliminate any intermediate copies of the data.
- Eliminate the need for a driver running in protected kernel mode to multiplex the hardware resource.
- Avoid traps into the operating system whenever possible, avoiding CPU context switches as well as cache thrashing.
- Remove the constraint of requiring an interrupt when initiating an I/O operation.
- Define a simple set of operations that send and receive data.
- Keep the architecture simple enough to be emulated in software.

VIA presents each process with the illusion that it owns the interface to the network. Each VI consists of one send queue and one receive queue, and is owned and maintained by a single process.

[1] Cluster computing consists of short-distance, low-latency, high-bandwidth IPC between multiple building blocks. Cluster building blocks include servers, workstations, and I/O subsystems, all of which connect directly to a network.

A process can own many Virtual Interfaces (VIs), many processes can own many VIs, and the kernel itself can also own a VI. A VI queue is formed by a linked list of variable-length descriptors. To add a descriptor to a queue, the user builds the descriptor and posts it onto the tail of the appropriate work queue; that same user pulls completed descriptors off the head of the work queue on which they were posted. The process that owns the queue can post four types of descriptors: send, remote-DMA/write, and remote-DMA/read descriptors are placed on the send queue of a VI, while receive descriptors are placed on the receive queue. VIA also provides polling and blocking mechanisms to synchronize the user process with completed operations. When descriptor processing completes, the NIC writes a done bit, together with any error bits associated with that descriptor, into the descriptor's status fields; this act transfers ownership of the descriptor from the NIC back to the process that originally posted it. Completion queues are an additional construct that allows completion notifications from multiple work queues to be coalesced into a single queue; the two work queues of one VI can be associated with completion queues independently of one another. The descriptors mentioned above are constructs that describe the work to be done by the network interface, very similar to the architecture proposed in [Davi93] (a small sketch of these queue mechanics follows the list below). Send/receive descriptors contain one control segment and a variable number of data segments. Remote-DMA/write and remote-DMA/read descriptors contain one additional address segment following the control segment and preceding the data segments. The VIA also provides:

- Immediate data: a 32-bit data field carried inside a descriptor.
- Ordering: descriptor order is preserved within a FIFO queue. Consistency is easy to maintain for send/receive and remote-DMA/write; a remote-DMA/read, however, is a round-trip transaction and is not complete until the requested data is returned from the remote node/endpoint.
- Work-queue scheduling: there is no implicit ordering relationship between descriptors on different VIs, so the scheduling discipline depends on the algorithm used by the NIC.
- Memory protection: ensures that a user process cannot send out of, or receive into, memory that it does not own.
- Virtual-address translation: performed when the kernel agent registers a memory region. The kernel agent performs ownership checks (the request comes from a user agent), pins the pages into physical memory, and records the virtual-to-physical address translation for the region.
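To make the queue mechanics concrete, the following is a minimal C++ sketch of a descriptor with a done bit and a FIFO work queue, in the spirit of the constructs just described. The names (DescType, WorkQueue, and so on) are illustrative only; they are not the VIP_* types defined in the VIA specification.

    #include <cstdint>
    #include <deque>

    // Illustrative VIA-style descriptor: the NIC sets 'done' (plus any
    // error bits in 'status') to hand ownership back to the poster.
    enum class DescType { Send, Recv, RdmaWrite, RdmaRead };

    struct DataSegment {
        void*    address;   // registered buffer
        uint32_t length;    // bytes to transfer
    };

    struct Descriptor {
        DescType    type;
        DataSegment data;       // one segment for brevity; VIA allows several
        uint32_t    immediate;  // optional 32-bit immediate data
        bool        done;       // written by the NIC on completion
        uint32_t    status;     // error bits written by the NIC
    };

    // A work queue is a FIFO of descriptors: post on the tail, reap from the head.
    class WorkQueue {
        std::deque<Descriptor*> q_;
    public:
        void post(Descriptor* d) { d->done = false; q_.push_back(d); }

        // Poll the head; a descriptor is handed back only once its done bit is set.
        Descriptor* poll() {
            if (!q_.empty() && q_.front()->done) {
                Descriptor* d = q_.front();
                q_.pop_front();
                return d;
            }
            return nullptr;  // head not complete yet
        }
    };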

[Figure 1. VI Architectural Model: a VI consumer (application, OS communication interface, and VI user agent) runs in user mode and issues send/receive/RDMA-read/RDMA-write operations on the send and receive queues of its VIs; the VI kernel agent sits in kernel mode, and the VI network adapter services the queues.]

3. MODEL DESIGN

For this class project there was not enough time to build a hardware, on-chip implementation of the VI using the Myrinet interface; however, it is quite possible to interact with the Myrinet card and emulate the Virtual Interface purely in software. For this reason Appendix 1 gives the source code in C++, which sits on top of the Myrinet layer; the performance enhancements reached therefore will not be as high as expected, but the model followed by this project still holds with some modifications. Note that RDMA transfers as well as error handling were left aside for this project; the only concerns in the design of the VI were to:

- Initialize the VI and interact with it
- Implement the send and receive queues
- Implement the completion queues
- Use the standard data types mentioned in the specification [VIA97]
- Use a small ECHO/REPLY server application for the performance tests

In addition, the software drives the Myrinet adapter in DMA mode for its transfers. At a very early stage this was not given much attention, but it turned out not to be a good performance "booster" and the measurements obtained are quite low; this aspect is explained in Section 5.

The basic objects used are:

Myrinet: in charge of send, receive, and initialization; it talks directly to the Myrinet card.
- init(): initializes the interface; here it is needed to change the direction of the Myrinet DMA transfer. In other words, the interface first sends data and then must be reinitialized to receive the reply from the server; the same happens at the server.
- Send() and Recv(): post to and interact with the shmem* structure, reading data from the Myrinet's SRAM and posting data into it.

VI: the Virtual Interface object, containing the send queue, the receive queue, and the completion queue. Its class members are described below; a sketch of the post/process pair follows the list.
- NIC: references the instance of the interface being used, in this case Myrinet.
- CQ, SendQ, RecvQ: the queues of descriptors, handled as List objects. The List object was also developed for this project and contains all the functions of a linked list.
- SetupDescriptorSendRecv(): initializes a descriptor, for send or for receive, to or from the work queues. The descriptor created here has the data type VIP_DESCRIPTOR defined in the VIA specification.
- ViPostSend(): posts the send descriptor to the queue; it does not transmit the data.
- ViProcessSend(): pops the first descriptor from the queue and starts delivering the content it points to (DS[0].Local.Data.Address) to the shmem->sendbuffer pointer. This is the real send.
- ViPostRecv(): posts a receive descriptor into the receive queue; it carries the addresses where the incoming data will be stored. An application can also receive a descriptor from the other end, depending on the protocol used; for this basic application the receive descriptor is formed at the receiving peer.
- ViProcessRecv(): reception happens through this method; like ViProcessSend(), it pops the first element of the receive queue and writes whatever is read from the NIC object into the destination address.
- EchoServer(): member function that makes the process act as a server, waiting for incoming data and replying with the same data.
- EchoClient(): member function that sends a block of MTU (Maximum Transfer Unit) size to the other end, waits for a reply, and compares what was sent against the content of the received data.
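As a concrete illustration of the post/process split, here is a minimal sketch of how ViPostSend() and ViProcessSend() might look in this emulation, assuming simplified stand-ins for the shmem buffer and the Myrinet wrapper (the real definitions live in VIPL.h and Appendix 1):

    #include <cstring>
    #include <cstdint>

    // Simplified stand-ins; illustrative only.
    struct Descriptor {
        void*    address;  // DS[0].Local.Data.Address in the specification
        uint32_t length;
        bool     done;
    };

    struct SharedMem { char sendbuffer[8192]; };  // window onto Myrinet SRAM

    class Myrinet {
    public:
        SharedMem* shmem = nullptr;
        void init() { /* stub: reroute the DMA transfer direction */ }
        void Send(uint32_t /*nbytes*/) { /* stub: kick the adapter's DMA engine */ }
    };

    class VI {
        Myrinet*    NIC;
        Descriptor* SendQ[64];          // FIFO of posted send descriptors
        int         head = 0, tail = 0; // no wrap/overflow checks, for brevity
    public:
        explicit VI(Myrinet* nic) : NIC(nic) {}

        // Posting only enqueues the descriptor; no data moves yet.
        void ViPostSend(Descriptor* d) { d->done = false; SendQ[tail++ % 64] = d; }

        // Processing pops the head descriptor and copies the data it points
        // to into the shared send buffer: this is the real send.
        void ViProcessSend() {
            Descriptor* d = SendQ[head++ % 64];
            std::memcpy(NIC->shmem->sendbuffer, d->address, d->length);
            NIC->Send(d->length);
            d->done = true;             // ownership returns to the poster
        }
    };

The same pattern applies on the receive side: ViPostRecv() queues a descriptor naming a destination address, and ViProcessRecv() copies whatever arrives from the NIC into that address.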

properties. However, it was not implemented exactly as stated there: it was modified to fit the requirements of this project and the resources of the HCS Lab. As stated before, two aspects were left out of the VI application implementation, plus a third:

a) Threads and multi-threading. To keep the VI clean of the vices of the other protocols, it would be imperative to use a library of lightweight threads; otherwise all the overhead introduced by traditional thread libraries would distort the results.

b) Remote DMA reads and writes. There are two main reasons for leaving these aside: first, they require direct memory manipulation, which is not permitted without the proper system-administrator rights; second, the standard is not entirely clear on how to achieve them.

c) Error handling was not implemented, so an error-free environment should be assumed for all the results.

4. EXPERIMENTS

Experiments were directed at three main areas: latency, throughput, and the time overhead attributed to the client and server of the application developed. They were also designed to follow the performance results obtained by [Berry97] and [Eick98]. In fact, the values gathered by those teams show much higher performance than the values gathered at the HCS Lab; the likely reasons are a VI implementation in hardware rather than a software emulation, and a better command of the Myrinet architecture in terms of modes of operation and ways to improve the data transfers. First of all, the SAN used here consisted of two computers, viking and vigilante, both Sun Ultra-2s interconnected through the Myrinet switch version 1.0; Berry and his team used 200 MHz Pentium Pros on a PCI bus, and not much is specified about the application they used. The set of experiments selected consisted of:

- Throughput over Myrinet with and without the VI
- Latency with and without the VI
- Time distribution at the client and server using the VI

The results and analysis are shown in Section 5.

5. RESULTS AND ANALYSIS

The mode of operation chosen for the Myrinet adapter was DMA transfer; as shown in Figure 2, it was not the best option, but it fulfilled the requirement of an easy implementation.
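Before looking at the curves, note how the numbers were collected: latency is taken as the ECHO/REPLY round trip divided by two. A minimal sketch of such a measurement loop, with hypothetical helper names standing in for the EchoClient() path of Section 3 and Unix gettimeofday() timing, is:

    #include <sys/time.h>

    // Hypothetical stand-ins for the echo client path (the real calls go
    // through the VI queues and the Myrinet object).
    static void echo_send(const char* /*buf*/, int /*nbytes*/) { /* stub */ }
    static void echo_wait_reply(char* /*buf*/, int /*nbytes*/) { /* stub */ }

    static double now_usec() {
        timeval tv;
        gettimeofday(&tv, 0);
        return tv.tv_sec * 1e6 + tv.tv_usec;
    }

    // One-way latency estimated as half the echo round-trip time,
    // averaged over many iterations to smooth out timer resolution.
    double one_way_latency_usec(char* buf, int payload, int iters) {
        double t0 = now_usec();
        for (int i = 0; i < iters; ++i) {
            echo_send(buf, payload);
            echo_wait_reply(buf, payload);
        }
        double t1 = now_usec();
        return (t1 - t0) / iters / 2.0;
    }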

[Figure 2. Throughput measurements for Myrinet using different modes of operation: throughput (Mbytes/s) versus payload in bytes for DMA transfers, mem_map transfers, and a TCP_STREAM test.]

As shown there, the DMA transfer does not improve performance even when the payload is only 64 bytes long; moreover, a TCP_STREAM test run with netperf achieves better performance. But the main goal of this paper is not to find a better mode of operation for Myrinet; it is to show that the VIA is a sound concept that can be used in SANs. With this in mind, everything found from here on would simply carry over, scaled by whatever performance factor a better mode of operation provides. The first measurement is the latency with and without the VI: RAW-Myrinet represents the application without the VI overhead (bulk data transfers), and VI-Myrinet represents the latency of the ECHO/REPLY round trip divided by two.

[Figure 3. Latency of the Myrinet interface (microseconds versus payload in bytes) for raw Myrinet data transfers and for the VI on top of Myrinet.]

From Figure 3 it can be concluded that the increase in latency is roughly constant and no greater than 25%. If the results shown here are compared with the ones reported by Berry, the latencies differ by a ratio of about 4:1, ours being the larger. However, one aspect not taken into account by Berry or by this project is that the VIA specification defines a maximum transfer unit of 32 KB, while all the measurements by the other researchers were taken at payloads no greater than 8 KB.

[Figure 4. Throughput of the VI and of the raw data transfer with Myrinet (Mbytes/sec versus payload), comparing VI-Myrinet and DMA-Myrinet.]

In terms of throughput, performance drops by about 40% between the raw data transfer and the transfer done through the VI. This value was not expected, and unfortunately there are no references to compare against: VI performance is generally compared between a kernel-agent implementation and a VI emulation, not against raw transfer performance. Beyond the throughput itself, it is necessary to find where the performance bottleneck stands, in other words where that 40% is lost. To discover this, time stamping was performed throughout the client and server applications. Although the measurement was made at both the client and the server, the client is the more representative of the two peers.

For this reason, Figure 5 shows the distribution of time across every stage of the application. In the VIA, descriptor processing is basically negligible, and most of the time is spent in the data transfer and in the reception of the reply (waiting for the receive descriptor).

[Figure 5. Time distribution of the ECHO/REPLY application at the client (MTU = 8192). The stages are: setting the send descriptor, posting the descriptor, processing the send, the Myrinet send (DMA), waiting for the receive descriptor, receiving the descriptor (CQ ready), and reading the data. The Myrinet DMA send and the wait for the reply dominate at 39% and 50%; the remaining stages account for roughly 0-5% each.]

This behavior is expected, because the roughly 30% spent in Myrinet-to-Myrinet transfer occurs at both ends: if about 12% is spent at each end, that represents approximately 24% of overhead, and the 50% spent waiting, of which about 30% is the server's reply, leaves roughly 10% more, ending up at 30-35% of processing overhead. This processing overhead is constant, as shown in Figure 6, and reaching better performance levels is a matter of improving the Myrinet-to-Myrinet transmission.

[Figure 6. Time spent processing descriptors (microseconds versus payload) on the client side and the server side.]

This behavior (in Figure 6) is explained by the algorithm itself: a block is sent from the client to the server, the client waits for that block of data, the server copies the data pointed to by the descriptor and writes a send-queue descriptor with the same data, and the data is sent back to the client. In other words, only one descriptor is needed per send, and all work queues handle only one element at a time.

6. CONCLUSIONS

First, a proof of concept has been achieved at the HCS Lab: the philosophy of the VIA in a SAN can be applied, reducing the complexity of the OSI model and of layered protocols. The implementation developed introduces an overhead of 10% for descriptor and data processing at each end. For an echo/reply application the average overhead, taking raw Myrinet transmission using DMA as the reference, is 40%. The average latency added by the VIA to the application is at most 25%. It turns out that DMA transfer was not the best choice; using another technique is recommended.

7. FUTURE RESEARCH

Further research could be done on the implementation. First, improve the Myrinet-to-Myrinet communication by using mem_map instead of DMA. Implement error checking and remote DMA reads and writes. Make use of SCALE_Threads or another lightweight thread library, and use a multi-issue implementation. In addition, the completion, receive, and send queues could be switched from a simple FIFO to something more efficient, such as a hash table, which would keep processing overhead low and could improve performance.

ACKNOWLEDGEMENTS

I would like to thank Wade Cherry for his introduction to and explanation of the LANai and Myrinet applications. I would also like to thank the team at Intel for defining the VIPL.H library, for providing the source code for free use through the Internet, and for their Visual C++ application, from which I gathered many ideas and finally understood the philosophy of VIA.

REFERENCES

[Banks93] Banks, D. and Prudence, M., "A High-Performance Network Architecture for a PA-RISC Workstation", IEEE Journal on Selected Areas in Communications, Vol. 11, No. 2, February 1993.

[Berry97] Berry, F. and Deleganes, E., "The Virtual Interface Architecture Proof-of-Concept Performance Results", Intel Corp. white paper.

[Davi93] Davie, B., "Architecture and Implementation of a High-Speed Host Interface", IEEE Journal on Selected Areas in Communications, Vol. 11, No. 2, February 1993.

[Dunn98] Dunning, D., et al., "The Virtual Interface Architecture", IEEE Micro, Vol. 18, No. 2, April 1998.

[Eick98] Von Eicken, T. and Vogels, W., "Evolution of the Virtual Interface Architecture", IEEE Computer, November 1998.

[Rama93] Ramakrishnan, K., "Performance Considerations in Designing Network Interfaces", IEEE Journal on Selected Areas in Communications, Vol. 11, No. 2, February 1993.

[Rosu95] Rosu, M., "Processor Controller Off-Processor I/O", Cornell University, Grant ARPA/ONR N J-1866, August 1995.

[Steen97] Steenkiste, P., "A High Speed Network Interface for Distributed-Memory Systems: Architecture and Applications", ACM Transactions on Computer Systems, Vol. 15, No. 1, February 1997.

[Wels98] Welsh, M., et al., "Memory Management for User-Level Network Interfaces", IEEE Micro, Vol. 18, No. 2, April 1998.

Web Pages

[VIA98]
[INT98]

APPENDICES

Appendix 1. Source Code for the VIA_Server
