Operating System Support for High-Performance Networking, A Survey

Zhiqing Liu and Peng Wang
Computer and Information Science Department, Indiana University Purdue University Indianapolis, 723 W. Michigan Street, SL280, Indianapolis, IN
{zliu, pwang}@cs.iupui.edu

Abstract
Crucial applications require direct and efficient access to emerging high-performance networks. This paper surveys the networking subsystem overhead for high-performance networking and the current operating system techniques to address that overhead, and aims to identify a few directions for future research.

Keywords. operating systems, networking subsystem design

I. Introduction

Many crucial applications, such as teleconferencing, biomedical tele-visualization, and distributed computing, are becoming feasible with emerging high-performance network services. These network services, such as FDDI, ATM, and HIPPI, provide high bandwidth and low latency. For example, FDDI operates at 100 Mbps, and SONET/ATM OC-12 operates at 622 Mbps. Some high-speed LANs, such as Myrinet and Gigabit Ethernet, and SONET/ATM OC-48 network backbones can operate at 1 Gbps and above. However, these high-performance network services are yet to be fully utilized by demanding applications running on modern workstations, in part because operating systems lack support for delivering high-volume network traffic from the physical network level to the application level. Specifically, traditional TCP/IP, the Internet protocol suite, sustains a much lower throughput than the bandwidth of a high-performance network; its latency is high, and the CPU can easily be saturated. Since the late 1980s, much research has focused on identifying the bottlenecks in the conventional network subsystem and improving its performance. Various obstacles have been identified, and a wide variety of designs, techniques, and implementations have been presented to this end, which are surveyed in this paper. Section 2 presents our criteria for evaluating the efficiency of the network subsystem in the context of modern workstations. Section 3 analyzes the interaction between hardware and software components of the network subsystem to identify the overhead. Based on this analysis, Section 4 discusses different techniques for addressing the identified overhead. Conclusions are drawn in Section 5.

II. Evaluation Criteria

Before presenting our evaluation criteria for the network subsystem, we briefly review its hardware and software components.

A. Hardware components

The hardware components of the network subsystem in a modern workstation include the network adapter, the bus architecture, the hierarchical memory (caches and the main memory), and the CPU. A modern network adapter usually has a dedicated CPU, DMA engines, and a substantial buffer space organized as a FIFO or other more sophisticated structures. It may also provide more advanced functionality such as outboard buffering and/or protocol processing. With a suitable driver, a network adapter can operate at the network speed without problem. Steenkiste [Ste94] reported a range of high-speed network adapters operating at Mbps; Druschel [Dru94a] reported that the Osiris network adapter could generate an aggregate I/O bandwidth of over 1.1 Gbps when adjusted for the ATM cell overhead. Connecting the CPU and the main memory unit, the memory bus is on the critical path through which network data flow.
While the network data may or may not go through the I/O bus depending on the architecture, they must go through the memory bus at least once to get into the main memory. Processing the data on the CPU (e.g., copying and checksumming) further increases the number of times the data pass through the memory bus, which severely reduces the effective memory bandwidth. Specifically, the effective memory bandwidth for processing the network data is reduced to B/N, where B is the memory bandwidth and N is the number of times the network data pass over the memory bus. For example, data that cross an 800 Mbps memory bus four times see an effective bandwidth of only 200 Mbps. Once network data are placed in the main memory, a cache system can help reduce memory bus traffic and lower the average memory access latency, if spatial locality of the data can be effectively exploited. Modern desktop systems with a PCI bus can achieve more than 1 Gbps peak memory bus bandwidth [Gal99], the bandwidth that the memory subsystem can deliver during a burst-mode memory transfer. But most of the time the memory bus cannot achieve this peak bandwidth, and its sustained bandwidth is much lower. For example, Druschel's experiment [Dru93a] on a DEC 5000/200 shows a sustained read bandwidth of 300 Mbps and a write bandwidth of 570 Mbps, compared to a peak bandwidth of 800 Mbps; copy bandwidth is worse (100 Mbps).

Although L1 and L2 caches are much faster, a cache's effectiveness heavily depends on the locality of reference of the operating system and application data. Moreover, the research of Pagels [Pag94] shows that caches are not effective in eliminating the main memory traffic associated with network data access.

B. Software components

It is a standard and common practice to design and implement the software components of the network subsystem in a layered approach (e.g., the ISO reference model and the TCP/IP protocol suite). In the layered approach, conceptually, sending a message means transferring it down through successive layers of protocol software on the sender, across the network, and then up through successive layers of protocol software on the receiver. The layering approach is based on the principle that a layer at the destination receives exactly the same object sent by the layer of the same level at the source. The traditional role of the operating system is to mediate and multiplex the access of multiple application processes to its computing resources. It not only provides services that the networking components need, such as task scheduling, synchronization, virtual memory management, buffer management, and cross-domain data transfer, but also protects the underlying hardware and the OS itself from accidental or malicious access. The networking components should be incorporated with the OS effectively, to provide satisfactory services to the applications in need and to utilize the full capability of the underlying hardware components. We can identify three different approaches to arranging the network subsystem with respect to the operating system:

Monolithic OS. Traditionally, most of the network protocol stack is considered part of a monolithic operating system and incorporated into the OS kernel. The reasons for this arrangement are mainly policy and performance. The key policies are fairness (e.g., in multiplexing packet streams) and prevention of starvation. High-performance networking may require the ability to control timing and task scheduling, manipulate virtual memory directly, fully control peripheral devices, and so on. In addition, it is natural to put device drivers under the protection of the operating system. UNIX systems such as BSD UNIX and SunOS are in this category.

Microkernel OS. Microkernel systems like Mach [Mae92, Acc86], representing a trend in operating system design toward a modular structure to achieve ease of distribution, validation, configuration, and maintenance, present a different point of view. As a server-based operating system, Mach provides support only for scheduling, virtual memory, and cross-address-space IPC (Inter-Process Communication); higher-level services such as UNIX emulation and network communication are implemented in user-level servers. This kind of design is intrinsically ineffective in supporting high-performance networking [Dru94a, Pag94]: different parts of the network subsystem reside in different protection domains, so I/O requests and associated data may have to cross additional protection boundaries and scheduling points. More cross-domain data transfers and more context switches may thus be needed than in a monolithic kernel, leading to poor locality of memory reference, increased memory bus load, reduced application throughput, and increased network latency.

Bypassing the OS.
Because software overhead has become the bottleneck for high-performance networking, an alternative is to bypass the operating system and access the network adapter directly. One example is the Application Device Channel (ADC) [Dru93a, Dru93b, Dru94a, Dru94b, Dru96], which uses techniques such as on-board demultiplexing, interrupt-handling offloading, and the protected data transfer ability of a high-performance ATM host adapter to give an application process restricted but direct access to the network adapter, bypassing the operating system kernel. Several user-level communication architectures have been developed to remove the operating system from the critical communication path in high-performance local area networks [Mai96, Bho98]. Rodrigues et al. [Rod97] presented an implementation called fast sockets, a local-area communication layer based on the Berkeley Active Message system that exports the Berkeley Sockets programming interface. To maintain communication compatibility, fast sockets transparently reverts to the standard TCP/IP protocols for wide-area communication. Fast sockets realizes a round-trip transfer time of 60 us and a maximum transfer bandwidth of 33 MB/s between two UltraSPARC-1s connected by a Myrinet network, using a packet size of 441 bytes. However, because fast sockets is a user-level library, it has problems achieving full compatibility with the traditional Sockets abstraction. For example, it cannot be concurrently shared between two processes (e.g., via the fork system call); its state is lost upon the exec or exit call; and so on.

C. Evaluation Criteria

Four criteria are generally used in the literature to evaluate the performance of the networking subsystem under different circumstances, namely end-to-end throughput, round-trip latency, CPU utilization, and memory utilization. While end-to-end throughput and round-trip latency are the most used criteria, CPU utilization is also an essential one for high-performance networking. The potential of high bandwidth has little value in practice if communication overhead leaves no CPU power to process the data. Thus any approach to connecting workstations to high-performance networks must preserve the ability to run the application while interacting with the network traffic: the application must be able to gain sufficient processor resources to send or absorb the traffic. Another related criterion that is even less noticed is memory utilization. Most high-performance network subsystems make heavy use of DMA to transfer data directly between the network adapter and the main memory, without involving the CPU. However, DMA competes with the application for the memory bus, and thus may cause memory access contention and decrease processor efficiency during periods of heavy DMA traffic. In addition, the use of DMA can result in incorrect data being transferred as a result of inconsistency between the cache and the main memory.

The DMA engine fetches old data from the memory if new data appear only in the cache (in systems with write-back caches). The software or hardware must provide consistency by flushing the data to memory before starting the DMA. This increases both the memory traffic and the processing overhead. Thus, the overall impact of a network subsystem using DMA on memory utilization needs to be carefully evaluated. To the best of the authors' knowledge, only one paper in the literature [Smi93] has evaluated the impact of an implemented network subsystem on memory utilization. In addition to the criteria discussed above, other issues such as architecture cost, application and driver programming interfaces, and software complexity and compatibility also arise. This is because much research effort is put into designing an optimal architecture for moving data between the application and the network interface without CPU intervention, and thus achieving so-called one-copy or zero-copy TCP/IP. We will address these issues as they arise. Packet loss rate is another issue, arising when we consider a connectionless, unreliable protocol such as UDP. A satisfactory packet loss rate is usually application dependent; for real-time video, for example, a 5% loss rate is usually tolerable.

III. Networking Overhead

The overall performance of network I/O on a workstation depends on both hardware and software components of the network subsystem. From the hardware viewpoint, the bandwidth of high-speed networks is approaching the hardware limit of modern workstations. It has also been noted that the main memory bandwidth of next-generation workstations is not likely to enjoy order-of-magnitude increases. So far, it can be seen that, among the hardware components of the network subsystem of a workstation-class computer, the memory bus is the potential bottleneck in dealing with network I/O. The networking software thus has to be optimized to reduce the software overhead and the memory bandwidth usage, so that adequate bandwidth is delivered to the processes in need in a fair manner and, at the same time, enough system resources are left for the application programs to execute normally. However, as Clark and others pointed out [Cla82, Ste94], there is no single dominant source of inefficiency in the networking software. Implementing efficient networking software involves looking at all the parts of the network subsystem, including the network interface and protocol processing, and painstakingly tuning the entire subsystem. Clark [Cla82] discussed trade-offs between modularity and efficiency in protocol implementation and identified possible outstanding overheads in protocol processing; Clark et al. [Cla89] analyzed the TCP processing overhead; Clark and Tennenhouse [Cla90] presented architectural considerations for high-speed protocols; Dalton et al. [Dal93] and Steenkiste [Ste94] analyzed bottlenecks in the traditional network subsystem; Pagels et al. [Pag94] analyzed cache and TLB effectiveness in processing network I/O; the research of Druschel [Dru93a, Dru93b, Dru94a, Dru94b, Dru96] and Kay [Kay93a, Kay93b, Kay95] focused on the overhead in traditional network subsystems and approaches to improvement; Wolman [Wol94] performed a latency analysis of TCP on an ATM network. These studies revealed that traditional network subsystems cannot make efficient use of the core workstation hardware resources: the memory bus, the cache, and the CPU.
Broadly speaking, there are two kinds of overhead in processing network I/O: data-touching overhead, or per-byte cost, and non-data-touching overhead, or per-packet cost. Data-touching operations such as checksumming and software copying contribute mainly to the memory traffic, because each byte of data needs to be touched and goes through the memory bus at least once. For example, checksumming requires each byte to be read from the memory, and copying requires each byte to be read from a source address in the memory and written back to a destination address. The non-data-touching overhead mainly involves CPU processing, which includes transport protocol processing, context switching (process switching and interrupt handling), data link protocol processing (dealing with the network adapter), buffer management, and so on. Data-touching overhead and non-data-touching overhead scale differently with respect to the packet size: data-touching overhead grows linearly with the size of the packets, while non-data-touching overhead remains roughly the same. For this reason, people tend to use large packets to amortize the per-packet cost and thus increase the end-to-end throughput, which in turn increases the data-touching overhead. It is necessary to find an optimal packet size in this case.
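To make the trade-off concrete, the two kinds of overhead can be folded into a simple throughput model. The sketch below is illustrative only; the per-packet and per-byte cost constants are assumptions chosen for the example, not measurements from the cited studies.

```c
#include <stdio.h>

/* Illustrative cost model: total time per packet is a fixed
 * per-packet (non-data-touching) cost plus a per-byte
 * (data-touching) cost that scales with packet size. */
#define PER_PACKET_US 100.0   /* assumed per-packet cost (us) */
#define PER_BYTE_US     0.01  /* assumed per-byte cost (us/byte) */

static double throughput_mbps(double bytes)
{
    double us = PER_PACKET_US + PER_BYTE_US * bytes; /* time per packet */
    return (bytes * 8.0) / us;                       /* bits/us = Mbps */
}

int main(void)
{
    /* Throughput rises with packet size as the per-packet cost is
     * amortized, but saturates at 8/PER_BYTE_US = 800 Mbps, the
     * ceiling set by the data-touching overhead alone. */
    for (double s = 64; s <= 65536; s *= 4)
        printf("%6.0f bytes -> %7.1f Mbps\n", s, throughput_mbps(s));
    return 0;
}
```

Under this model, large packets amortize the per-packet cost, but the achievable throughput saturates at the limit imposed by the per-byte cost, which is why data-touching overhead dominates for large packets.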

A. Data-touching operations

Druschel [Dru94a] categorized the potential causes of the data-touching operations into the following four categories:

Device-memory transfers: Data must be moved between the main memory and network/device adapters. The techniques used to achieve device-memory transfers are DMA and PIO.

Cross-domain transfers: Protection necessitates the transfer of data between protection domains (i.e., address spaces). In the simplest case, the networking data are handled by a single application process running on top of a monolithic kernel, and must thus cross the user-kernel boundary. In general, additional user processes such as window managers, multimedia servers, and the OS servers of a microkernel operating system may introduce additional domain boundary crossings into the I/O data path. Three techniques are available to eliminate or reduce data copying in cross-domain data transfer, namely shared virtual memory, virtual page remapping, and copy-on-write (COW).

Data manipulations: Data manipulations inspect and possibly modify every word in a network data unit. If the end-to-end throughput of the TCP/IP suite is considered, data manipulations mainly involve checksumming. In general, however, the most costly and complicated data manipulations, such as presentation conversions, happen at the presentation layer [Cla90]. Strategies used to improve the performance of data manipulations include hardware support for data manipulation and integrated layer processing (ILP).

Application Programming Interface: The application programming interface (API) defines the services and operations that an operating system provides to application programs. In particular, it defines the syntax and semantics of the operations (i.e., system calls) exported by the OS. I/O data buffers appear in this definition as arguments to the operations. The argument-passing semantics defined by the interface can have a significant impact on the efficiency of I/O data transfers between the OS kernel and applications. In particular, the BSD API has copy semantics, such that networking data are copied to and from the application address space. That is, during a read system call, the application buffer is overwritten with the input data. After a write operation completes, it is assumed that the output data have been copied from the application buffer, which can then be reused by the application. The application invoking the write operation is blocked until the operation completes.

Networking data usually go through the memory bus several times during the course of processing. For example, in a traditional BSD system at the sender side, the application writes the data into a buffer in its address space; the socket layer copies the data into a system buffer; the transport protocol reads the data to calculate the checksum; and the data link layer copies the data to the network adapter. In total, the data cross the memory bus at least six times (not counting the TCP retransmission copy). Similarly, the data need to cross the memory bus at least six times at the receiver side. Based on the latency analyses of Kay and Pasquale [Kay93b] and Wolman et al. [Wol94], checksumming is more costly than user-to-kernel, kernel-to-user, and kernel-to-device software copying. However, copying from device to memory should be more costly than copying between user and kernel, because both copies do the same amount of work and the cache should improve the latter's performance. Data-touching operations are generally memory-bound. Both checksum computation and software copying forcibly bind the fast CPU to the slow memory and waste valuable CPU cycles in processing every byte of data. This causes significant overhead and leads to low throughput and high latency; for this reason, an increase in network load can easily saturate the CPU. In addition, checksum implementations may not be optimized for modern RISC processors [Kay93a]. RFC 1071, which describes the Internet checksum, includes a generic checksum algorithm written in C and optimized algorithms written in assembly language for three different machine architectures: a CISC microprocessor (the Motorola 68020), a vector supercomputer (the Cray), and a CISC mainframe (the IBM 3090). None of the algorithms described is well suited to RISC processors. The generic algorithm relies on extended integer arithmetic support that is not usually found in RISC processors: a carry bit and an instruction that adds the carry bit to two operands. The Cray algorithm uses vector operations, an even less common feature of RISC microprocessors. The IBM algorithm uses a number of branches, which cause inefficient operation on pipelined processors such as RISC microprocessors. Kay and Pasquale made a number of improvements to the checksum algorithm, such as reading memory in units of 32-bit words rather than 16-bit words, loop unrolling, and use of the pipeline [Kay93a].
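A minimal sketch of such an optimized checksum appears below. It follows the general structure of the RFC 1071 portable C version with the word-widening and unrolling improvements just described, rather than reproducing Kay and Pasquale's actual code, and it assumes a 32-bit-aligned buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Internet checksum over a 32-bit-aligned buffer: 32-bit reads and
 * 4-way unrolling reduce loop and memory-access overhead on
 * RISC-style processors; carries are folded at the end. */
uint16_t inet_checksum(const void *buf, size_t len)
{
    const uint32_t *p32 = buf;
    uint64_t sum = 0;

    while (len >= 16) {             /* four 32-bit words per iteration */
        sum += p32[0]; sum += p32[1];
        sum += p32[2]; sum += p32[3];
        p32 += 4;
        len -= 16;
    }
    while (len >= 4) {              /* leftover 32-bit words */
        sum += *p32++;
        len -= 4;
    }
    const uint16_t *p16 = (const uint16_t *)p32;
    if (len >= 2) {                 /* leftover 16-bit word */
        sum += *p16++;
        len -= 2;
    }
    if (len)                        /* trailing odd byte, native order */
        sum += *(const uint8_t *)p16;

    while (sum >> 16)               /* fold the sum down to 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```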
B. Non data-touching operations

Kay and Pasquale [Kay93b] categorized the potential causes of the non-data-touching operations into the following six categories (based on Ultrix 4.2a, a BSD-derived system):

Data structure manipulation: This comprises manipulations of various data structures other than mbufs. These data structures are socket buffers, IP defragmentation queues, and interface (device) queues. The socket buffer is a data structure in which a limited amount of data is enqueued either for or by a transport protocol such as TCP. UDP makes no use of the socket buffer structure for outgoing messages, and only uses it as a finite-length queue upon reception. In contrast, TCP uses the socket buffer to implement the sliding-window flow control on both the sending and receiving sides. The device queue is a data structure in which outgoing data are enqueued by the link layer until the network controller is prepared to process them, and incoming data are enqueued by the device driver until IP processes them upon a software interrupt.

Error checking: This is the category of checks for user and system errors, such as parameter checking on the socket system calls. Error checking can be further divided into two subcategories: assorted checks for errors within the socket layer, and checks specifically for incorrect user arguments to the system calls for sending and receiving messages.

Buffer management: Buffer editing (which is distinguished from data manipulations that require the inspection and/or modification of each word of data) can be expressed as a combination of operations to create, share, clip, split, concatenate, and destroy buffers. When naively implemented, these operations may require physical copying of the buffers. In BSD UNIX, however, mbufs are used to store network data, and manipulations are performed inexpensively by using references where possible; buffer management is therefore considered part of the non-data-touching overhead. All mbuf operations are in this category, among which allocation and freeing of mbufs are the most time-consuming. Other operations include copying mbuf chains (done to a certain extent with pointers), defragmentation (implemented by joining two linked lists of mbufs), and message length checking. TCP spends more time in these operations because it must make a copy of each message in case retransmission is necessary. Allocation and deallocation of mbufs are expensive because they need to be performed a number of times per message [Kay93b]. Although Wolman [Wol94] argued that this is an artifact of a particular buffer management implementation choice rather than an inherent protocol behavior, Clark [Cla89] pointed out that buffer management can easily grow to swamp the protocol itself. Clark's analysis yielded around 10% buffer management cost in receiving a control packet.

In [Kay93b], data structure manipulations and mbuf manipulations together account for around 23% of the total non-data-touching processing time.

Operating system operation: This includes support for sockets, synchronization overhead (sleep/wakeup), the time needed for sleeping processes to start running again, the scheduling of software interrupts to process incoming packets, and the software interrupt handler that dequeues incoming packets and directs their processing. The reports on operating system overhead [Cla89, Kay93b] are rather mixed, partly due to the difficulty of cleanly decoupling the system calls from the rest of the networking subsystem. But surprisingly, transfer-of-control operations in this category are relatively inexpensive.

Protocol-specific operations: This includes protocol-specific work such as setting header fields and maintaining protocol state. This category is more narrowly defined than what is often called protocol processing. For example, although checksumming is usually considered part of protocol processing in TCP, UDP, and IP, it is categorized separately as a data-touching operation because it is data specific but not protocol specific. This category may include operations in the layers for device drivers, IEEE 802 encapsulation, ARP, IP, and TCP or UDP; operations for finding protocol control blocks given a TCP or UDP header; operations for checking that a route already exists for a connection; and operations for setting up the protocol control block to reflect the current state properly. Although TCP is a complicated protocol, it is not the overhead source often observed in packet processing, and it could support very high speeds if properly implemented [Cla89]. Despite the TCP protocol's size, it consumes only 13% of the total TCP/IP processing time [Kay93b]. Clark and Wolman et al. estimated that a typical path through TCP is only about 200 instructions, and that TCP itself could support a throughput of 530 Mbps [Cla89, Wol94].

Other: This category includes all the operations that are too small to measure, such as the symmetric multiprocessing locking mechanism. But whether or not it includes a significant portion of the operating system operations is unclear. Considering the large percentage it occupies in the non-data-touching overhead (22%), whether operating system operations are really not that expensive is inconclusive.

Arranged in descending order of overhead time, these six categories are: protocol-specific operations, other, mbuf operations, operating system operations, error checking, and data structure manipulations. In summary, the traditional network subsystem creates a lot of memory traffic while processing network I/O, and it cannot make efficient use of the cache to reduce that traffic. A heavy load is put on the CPU at each step of network data processing by forcing it not only to perform per-packet processing, but also to wait for the slow memory and perform per-byte processing. In practice, the poor performance of the network subsystem is usually a combined result of the above effects. The cost of data-touching overhead grows with the size of packets. The majority of TCP and UDP traffic on a traditional LAN consists of small messages (less than 200 bytes), for which non-data-touching operations dominate the network software processing overhead [Kay93b].
Only a small fraction (16%) of the total time spent processing TCP messages is due to data-touching operations (checksumming and data movement); in the case of UDP, data-touching operations are a significant factor, but do not overwhelm the non-data-touching overhead. Recent applications on high-performance networks tend to transfer large messages. Because non-data-touching overhead can be amortized over large messages, data-touching overhead is becoming more predominant. Obviously, data-touching operations such as checksumming and software copying should be the target of optimization: optimizing a single data-touching operation can produce a large improvement in performance. Unfortunately, this is not the case for non-data-touching operations. Overhead time is spread more evenly among the non-data-touching operations, so reducing a single non-data-touching overhead, such as TCP protocol-specific processing, does not have a relatively significant effect on overall performance. Thus, a wide range of optimizations to non-data-touching operations would be needed to produce a significant performance improvement.

IV. Current Techniques

Significant research efforts are appropriately devoted to optimizing data-touching operations, primarily the copying operations.

A. Checksum

Checksumming is arguably the most time-consuming data-touching operation, and numerous approaches have been proposed to deal with it, in addition to optimizing the checksum implementation itself [Kay93a, Cha93]:

Merging of copying and checksumming to reduce the memory traffic [Cla82, Cla90, Jac93, Cha93, Wol94]. Instead of being implemented as separate routines and called separately, copying and checksumming can be combined into a single operation that works word by word: it reads a word into a register, adds it to a sum register, and then writes the word out to its destination. This approach saves a read operation over performing them separately (see the sketch after this list).

Checksum off-loading, performing the checksum on the network adapter [Kle95, Chu96, Gal99]. This approach requires extra adapter hardware or extra processing power from the network adapter CPU. For example [Kay95], DEC's FDDI adapter, without checksum off-loading support, is based on a relatively simple 16-bit MC68000 processor, while SGI's FDDI adapters, which support checksum off-loading, require a more sophisticated 32-bit AMD processor.

Checksum elimination [Kay93a, Dru94b, Wol94, Rod97]. On many local area networks such as Ethernet, FDDI, and ATM, a CRC already exists at the link level, so it has been suggested that the Internet checksum is redundant and could therefore be eliminated, assuming no corruption occurs at the host machine. However, this approach is only suitable for local-area traffic, where packets go from a source host to a destination without passing through any IP routers. It is unwise to turn off the checksum protection in any wide-area context without considerable study.
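The first approach above can be sketched as follows; this is an illustrative implementation, assuming 32-bit-aligned buffers whose length is a multiple of four bytes, not the code of any of the cited systems.

```c
#include <stddef.h>
#include <stdint.h>

/* Combined copy-and-checksum: each word is loaded once, added to the
 * running sum while it sits in a register, and then stored, saving
 * one pass over memory compared with separate copy and checksum
 * routines. */
uint16_t copy_and_checksum(uint32_t *dst, const uint32_t *src, size_t words)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < words; i++) {
        uint32_t w = src[i];   /* one read from memory */
        sum += w;              /* checksum while in a register */
        dst[i] = w;            /* one write to memory */
    }
    while (sum >> 16)          /* fold carries into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```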

B. DMA vs. PIO

Several techniques exist to avoid data copying and thus reduce the memory traffic; software copying between the memory and the network device can be avoided by using DMA instead of PIO. With PIO, the CPU is directly responsible for moving data between the adapter and the main memory: to send a frame, the CPU sits in a tight loop that first reads a word from the memory and then writes it to the adapter; to receive a frame, the CPU reads words from the adapter and writes them to the memory. With DMA, the adapter reads and writes the main memory without any CPU involvement; the CPU simply gives the adapter a memory address, and the adapter reads from (or writes to) it. When DMA is used, the adapter reads and writes host memory, and does not need to buffer frames on the adapter (except for a few bytes of buffering to stage data between the bus and the link). The CPU is therefore responsible for providing the adapter with a pair of buffer descriptor lists: one to transmit out of and one to receive into. A buffer descriptor list is an array of address/length pairs. When receiving frames, the adapter uses as many buffers as it needs to hold the incoming frame. Separate frames are placed in separate buffers, although a single frame may be scattered across multiple buffers; the latter feature is usually referred to as scatter-read. In practice, scatter-read is used when the network's maximum frame size is so large that it would be wasteful to allocate all buffers big enough to contain the largest possible arriving frame. A mechanism is then used to link together all the buffers that make up a single frame. Output works in a similar way: when the host has a frame to transmit, it puts a pointer to the buffer that contains the frame in the transmit-descriptor list. Devices that support so-called gather-write allow the frame to be fragmented across multiple buffers. In practice, gather-write is more widely used than scatter-read, because outgoing frames are often constructed in a piecemeal fashion, with more than one protocol contributing to a buffer. By the time a message finishes the protocol stack and is ready to be transmitted, it consists of a buffer that contains the aggregate header (the collection of headers attached by the various protocols that processed the message) and a separate buffer that contains the application's data. Scatter-read and gather-write allow DMA transfers from and to non-contiguous physical page frames, which greatly simplifies physical memory management in the OS and helps avoid copying data into contiguous storage.
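The descriptor lists described above might be declared as in the following sketch. Descriptor layouts are adapter-specific, so the field names and flag values here are illustrative assumptions rather than any real device's format.

```c
#include <stdint.h>

/* Illustrative DMA buffer descriptor: an address/length pair plus
 * status flags; real adapters define their own layouts. */
struct dma_desc {
    uint64_t addr;   /* physical address of the buffer */
    uint32_t len;    /* length of the buffer in bytes */
    uint32_t flags;  /* e.g., OWNED_BY_ADAPTER, END_OF_FRAME */
};

#define OWNED_BY_ADAPTER 0x1  /* adapter may use this descriptor */
#define END_OF_FRAME     0x2  /* last buffer of a scattered frame */

/* The driver maintains two rings: the adapter transmits out of one
 * and receives into the other. A frame scattered across several
 * receive buffers (scatter-read) is reassembled by walking the ring
 * until a descriptor marked END_OF_FRAME is found. */
struct dma_ring {
    struct dma_desc desc[256];
    unsigned head;   /* next descriptor the adapter will use */
    unsigned tail;   /* next descriptor the driver will reclaim */
};
```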
In the case of PIO, the network adapter must contain some amount of buffering to hold the frames that the CPU copies between the main memory and the adapter. The basic fact that necessitates buffering is that, with most operating systems, one can never be sure when the CPU will get around to doing something, so one has to be prepared to wait for it. One important question that must be addressed is how much memory is needed on the adapter. At least one frame of memory is needed in each of the transmitting and receiving directions. In addition, PIO-based adapters usually have additional memory that can hold a small number of incoming frames until the CPU can get around to copying them into the main memory. Although the computer-system axiom that memory is cheap would seem to suggest putting a large amount of memory on the adapter, this is impractical because the adapter memory must be of the more expensive dual-ported type, since both the CPU and the adapter read and write it. PIO-based adapters typically have something on the order of 64 to 256 KB of memory, although certain adapters have as much as 1 MB. With PIO, scatter-read and gather-write must be implemented in software. The following summarizes the advantages and overheads of DMA in contrast to PIO, beyond their requirements on buffering [Ste94]:

DMA requires only one transfer over the memory bus, whereas PIO requires two. Using DMA thus reduces the memory traffic.

Most high-speed memory buses depend heavily on the use of large-burst transfers to achieve high throughput, which is compatible with DMA. When the CPU is involved, as in the case of PIO, its capacity is often limited to moving single words or small bursts, thus limiting the network throughput.

DMA can work in parallel with computation on the CPU, which means that fewer cycles are lost to communication by the application. However, how much overlap can be achieved depends both on the available memory bandwidth and on the application's cache hit rate, because contention for main memory access may induce processor stalls during periods of heavy DMA traffic.

The use of DMA can result in incorrect data being transferred as a result of inconsistencies between the cache and the main memory. The DMA engine fetches old data from the memory if new data appear only in the cache (in systems with write-back caches). The software must provide consistency by flushing the cached data before starting DMA. A similar consistency problem occurs on receive, which can be avoided with cache invalidation. Both cache flushing and invalidation are expensive and can reduce system throughput significantly. Fortunately, modern workstations often guarantee cache consistency in hardware because of their support for multiprocessing.

When data are transferred to and from the user domain with DMA, the pages holding the data must be locked in memory for the duration of the transfer; this is called page-wiring or page-pinning. Per-operation locking is not necessary for DMA access to dedicated system buffers or shared user-system buffers, since they can be locked permanently.

DMA's asynchronous data copying adds overhead, because the CPU and the adapter have to synchronize at the end of a DMA session, typically using an interrupt.

The trade-off between using DMA and PIO depends heavily on the host architecture, the software, and the circumstances in which they are used. Although DMA can achieve a higher peak throughput than PIO and is widely used in high-speed network adapters nowadays, there are situations in which PIO is preferable to DMA.

First, computations that occur in the kernel, such as checksumming, can sometimes be integrated with the PIO data movement, saving one trip to the main memory. Second, after a programmed data movement from the network adapter to the main memory, the data are in the cache; this can reduce memory traffic if the data are accessed while they remain in the cache. In summary, PIO typically has a higher per-byte overhead, while DMA has a high per-transfer overhead; as a result, PIO is typically more efficient for short transfers, and DMA for longer ones.

C. Cross-domain data transfer

Implementations of cross-domain data transfer in the traditional network subsystem incur high overhead; it is essential to implement cross-domain data transfer (in the simplest case, user/kernel data transfer) efficiently. The design of a cross-domain data transfer facility can be analyzed in terms of its transfer model and semantics, its transfer method, and its data structures [Pas94]. Three models exist for transferring data between domains, and each one has certain disadvantages:

In the copy model, data are copied from one domain to another; that is, the original still resides in the source domain and an exact copy resides in the destination domain. While semantically simple and flexible, the copy model is difficult to implement efficiently.

In the move model, data are removed from the source domain and placed in the destination domain. In this model, the source domain loses the data after the transfer. To avoid losing them, a process in the source domain would have to make a private copy before the transfer (or somehow arrange to have the data transferred back, making the loss temporary).

In the shared model, after the data are transferred, processes in both the source and destination domains have access to the same data. Any modifications made to the data by processes in either domain are visible to processes in the other. This model has the disadvantage that, after the transfer, modifications to the transferred data by a process in one domain can affect processes in the other domain that depend on the data. Since these modifications are asynchronous among the processes, explicit synchronization may be required. Such coupling of the source and destination domains increases both the programming complexity and the chance of error propagation across domains.

Two transfer methods exist for transferring data between domains, each having its own applicability and limitations:

The physical transfer method involves moving data in physical memory (i.e., moving each word of the data from the source domain's physical memory to the destination domain's physical memory). Physical transfers promote flexibility: because the transfer size granularity is the byte (or possibly the word), we can transfer data from any location and of any size to a destination space that can begin at any location and whose size is exactly the data's size. However, the flexibility provided by physically transferring data is overshadowed by its primary disadvantage: high overhead in time and space. Physically transferring a word involves two memory accesses, forces the CPU to wait for the slow memory, increases latency, and reduces throughput. In fact, during cross-domain data transfers, in most cases the domains do not require access to the data, and the cost of transferring data should depend on the cost of accessing it.
Furthermore, the amount of physical memory used during a physical transfer is twice the size of the data object being transferred (enough to store at the destination while reading from the source). Transfers must be delayed if sufficient physical memory is temporarily unavailable. Both of these factors degrade performance.

The virtual transfer method involves moving data in virtual memory. In other words, the transfer maps a region in the destination domain to the physical pages that contain the data to be transferred, which are already mapped in the source domain. These physical pages are referred to as transfer pages. (This discussion assumes the page-based virtual memory architecture used in most modern operating systems.) With virtual transfers, the transfer size granularity is the physical page. The transferred data must be contained in the one or more transfer pages exclusively; these pages can contain no other data. For reasons of data privacy, the kernel must clear the unused portion of a newly allocated buffer page, which can incur a substantial cost relative to the overall overhead of buffer management. A similar problem occurs if headers at the front of a received network packet contain sensitive data that must be hidden from user processes. Since entire pages are the actual units of transfer, the data's destination address must be at the same relative offset from the page boundary as the source address. Moreover, the size of the destination space must equal the total size of the transfer pages, which is usually greater than the data's size (unless that happens to be exactly a multiple of the page size). The partial use of memory pages therefore also requires more pages per amount of network data, resulting in increased physical memory consumption, page allocation and remapping overhead, and demand for TLB entries.

Physical transfer is generally applied to the copy model. It does not make much sense with the move model, since erasing the data in the source domain simply adds cost. When the physical memory of the source and destination domains is separate, physical transfers in the shared model require an underlying process to keep the copies in memory consistent. If physical transfer is too costly, an alternative is virtual transfer, which can be either virtual copy or virtual move. Virtual copy, conforming to the copy model, maps a region in the destination domain to the transfer pages while not affecting their mapping in the source domain. If processes in both domains only read the data, the virtual copy is as good as a physical one. If a process tries to modify the data, a physical copy of the page containing the data is made, so that the modifications do not affect the data in the other domain. This is also called copy-on-write. Accent supports copy-on-write [Fit86] for inter-process communication: it first maps message data from the sending process's domain to kernel virtual memory, and then to the receiving process's domain.

A physical copy is made by the kernel if the sending process modifies the pages before the receiving process receives them. Implementations using the copy-on-write mechanism are often complex compared to physical copying. Furthermore, servicing a copy-on-write fault and physically copying the data are wasteful when the source domain does not need the data, as is the case in processing network I/O.

Virtual move, conforming to the move model, unmaps transfer pages from the source domain and maps them into a region in the destination domain. Unlike virtual copy, no copy-on-write mechanism is necessary, making this scheme relatively simple and efficient. Tzou and Anderson used virtual move with optional lazy page remapping for inter-process communication in DASH [Tzo91]. In their design, a special region called the IPC region in each domain is used. When a message is sent, the ownership of its data pages is transferred from the sender to the receiver, and the data pages are assigned the same virtual addresses in the receiver's domain as they had in the sender's. In this way, the message is kept in the IPC region. Druschel and Peterson's experiments indicate that virtual move with page remapping is still memory-bound, since the CPU is stalled waiting for the cache to fill approximately half of the time [Dru93c]. This is likely to remain the case as the gap between CPU and memory speeds widens. Both virtual move with page remapping and virtual copy with copy-on-write require careful implementation to achieve low latency. Since virtually all modern operating systems employ a two-level virtual memory system, mapping changes require the modification of both the low-level, machine-dependent page tables and the high-level, machine-independent data structures. (Lazy page remapping means that the high-level data structures are always updated on remapping, while a page is added to the low-level memory map of a virtual address space on demand by the page fault handler.) Moreover, most modern architectures require flushing of the corresponding TLB entries after a change of mappings. The time it takes to switch to supervisor mode, acquire the necessary locks on VM data structures, change the VM mappings (perhaps at several levels) for each page, perform TLB/cache consistency actions, and return to user mode poses a limit on the achievable performance. Note also that the time for a page remapping operation increases with the size of the mapped data.

D. Shared virtual memory

In the shared model, shared virtual memory allows regions in both the source and destination domains to be mapped to the same transfer pages. This avoids data transfer and its associated costs altogether, by statically sharing virtual memory among two or more domains. For example, the DEC Firefly RPC facility uses a pool of buffers globally and permanently shared among all domains [Sch90]. Since all domains have read and write access permissions to the entire pool, the following problems exist [Dru93a]: globally shared memory compromises security; pairwise shared memory requires copying when data are either not immediately consumed or are forwarded to a third domain; and group-wise shared memory requires that the data path of a buffer always be known at allocation time. All forms of shared memory may compromise protection between the sharing domains.
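The sharing model (and its protection weakness) can be illustrated with a user-level analogue built on POSIX shared memory; the kernel facilities cited above are of course implemented differently, and error handling and proper synchronization are omitted here for brevity.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Minimal user-level analogue of a statically shared buffer pool:
 * two processes map the same physical pages, so "transferring" a
 * buffer is just passing its offset, with no data copy. It also
 * inherits the model's weakness: either side can scribble on the
 * pool at any time. */
int main(void)
{
    const size_t POOL = 64 * 1024;
    int fd = shm_open("/bufpool", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, POOL);
    char *pool = mmap(NULL, POOL, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);

    if (fork() == 0) {                  /* "destination domain" */
        sleep(1);                       /* crude synchronization */
        printf("received: %s\n", pool); /* reads without copying */
        _exit(0);
    }
    strcpy(pool, "packet payload");     /* "source domain" writes */
    wait(NULL);
    shm_unlink("/bufpool");
    return 0;
}
```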
Data transfer between domains must be organized in a way upon which the domains' processes agree. The organization can affect performance by forcing physical copies, either in preparation for or during the transfer. Of the three data structure organizations we consider (i.e., unstructured, structured, and semistructured), the simplest and most common is an unstructured array: only raw data are transferred, and no other information, such as pointers, imposes further structure. However, the transferred data are often not in the form of a single array; they may be stored in pieces organized by a more complicated data structure such as a tree. Consequently, before the transfer, the data must be linearized (i.e., physically copied) into a single array containing only raw data, as in the case of kernel-user copying when receiving network data. Thus, if processes agree upon the unstructured organization for transferring data, physical copying generally takes place in preparation for the transfer.

Structured data have special organizations and may include pointers in their structures. In particular, the transferred data are not stored in a single array, so destination-domain processes must have methods to access the data according to the structures. Generally, potential receivers do not know these methods, which must also be communicated. Furthermore, if the data are embedded with pointers that are virtually addressed, a translation is required in the new address space.

Semistructured data consist of pointers and raw data arrays, with each pointer referring to a raw data array. This organization avoids the disadvantages of the unstructured and structured organizations. It does not have the single-array requirement of unstructured data, so linearization is not necessary. While semistructured data have associated pointers, these pointers are not embedded in the raw data and can thus be located quickly for possible translation. Furthermore, the access method is commonly known and does not have to be specially communicated. A semistructured data organization has additional benefits: we gain the flexibility of locating raw data at arbitrary locations within pages that contain no other data, which we can then use as transfer pages for virtual transfer methods. Moreover, semistructured data match well with scatter-read/gather-write DMA.

E. Other Data-Touching Operations

As we pointed out, data-touching operations may also include presentation conversions (encryption, compression, etc.) and OS kernel, server, or application processing (in a microkernel system). General strategies to improve the performance of data-touching operations include hardware support and integrated layer processing (ILP). Hardware support for data manipulations can reduce the CPU load and, when properly integrated, reduce the memory traffic. Hardware-supported checksumming is one example; another is hardware video (de)compression. However, hardware support may be only a short-term solution, due to its complexity requirements on hardware, which seem too constraining for innovative high-bandwidth applications.

ILP confronts the traditional layered scheme of the communication protocol stack. Layered protocol suites provide isolation between the functional modules of distinct layers. Isolation facilitates the implementation of subsystems whose scopes are restricted to a small subset of the suite's layers. Thus, each layer may have its own data storage and perform data manipulations independently, which creates a lot of unnecessary memory traffic. ILP can minimize the memory references resulting from data-touching operations at different layers. It is a strategy for implementing communication software that avoids repeated memory references when several data-touching operations are performed. In ILP, the data-touching operations from different protocols are combined into a pipeline: a word of the data is loaded into a register, manipulated by the data-touching operations while remaining in the register, and finally stored, all before the next word is processed. In this way, a combined series of data-touching operations references memory only once, instead of potentially accessing memory once per distinct operation [Cla90]. Another key advantage of ILP is the increased locality of data reference [Abb93]. The combination of checksumming and copying in TCP/IP is a degenerate case of ILP, in that the two data manipulations belong to the same protocol. Issues that need to be addressed in an ILP implementation include satisfying ordering constraints, accommodating awkward data manipulations, and reconciling different views of the data [Abb93]:

Satisfying ordering constraints: Traditional protocol suites often impose precedence or ordering constraints that limit the opportunities for ILP. These constraints rule out simply extracting the data manipulations and integrating them. One example of an ordering constraint is that many data manipulations can only be performed once the data unit is in order; the protocol must make sure that the data are in order, at least within a certain range, before performing the manipulations. Another example is that part of the data must be extracted from the network before it can be demultiplexed, and it is very hard to combine extraction with other data-touching operations, except perhaps error detection. Demultiplexing complicates the reordering of operations across layer boundaries.

Accommodating awkward data manipulations: Different protocol data manipulations may require different-sized units of data, and some can change the quantity of data.

Reconciling different views of data: A single message looks quite different at different layers in a series of protocols, as layers add or remove headers. Hence, adjacent protocols do not share a common definition of what data to manipulate (e.g., one protocol's data are another protocol's header, and are nonexistent to a third protocol).

ILP may also be used to address the multiple memory accesses and poor locality that arise in a microkernel system. However, since the software modules reside in different protection domains, the cost of transferring control across a domain boundary each time a word of data is passed is greater than the savings from the eliminated memory accesses. Efficient ways of performing the integration have yet to be developed for this case. One suggestion [Abb93] is to minimize the number of address spaces in which data manipulations are applied to a given message. Since certain data manipulations such as presentation formatting may have to be located in the application's address space, it is further suggested that all data manipulations be located in the application.
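As an illustration only, the following sketch integrates two data-touching operations from different layers into a single ILP-style loop; the XOR transform is a toy stand-in for a presentation-layer manipulation, not from any cited system, and the checksum is deliberately computed over the words as received, before the transform, reflecting the ordering constraints discussed above.

```c
#include <stddef.h>
#include <stdint.h>

/* Schematic ILP loop: a transport-layer checksum and a (placeholder)
 * presentation-layer transform are applied to each word while it sits
 * in a register, so the data cross the memory bus once rather than
 * once per operation. Assumes 32-bit-aligned data, length in words. */
uint16_t ilp_receive(uint32_t *dst, const uint32_t *src,
                     size_t words, uint32_t key)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < words; i++) {
        uint32_t w = src[i];  /* single load */
        sum += w;             /* layer 1: checksum the received word */
        w ^= key;             /* layer 2: toy presentation transform */
        dst[i] = w;           /* single store into the user buffer */
    }
    while (sum >> 16)         /* fold carries into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```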
F. Additional Issues

Buffer management. Buffer management schemes depend on the design decision about the data structure organization for the transferred data. In other words, to reduce physical copying, buffer management should provide an abstract data type that represents the abstraction of a single, contiguous buffer; an instance of such an abstract buffer type might be stored in memory as a sequence of not necessarily contiguous fragments. Both mbufs and x-kernel messages [Dru93c] are buffer management schemes for this kind of semistructured data structure (a sketch of this organization appears at the end of this section).

Application program interface. As to the application program interface (API), three problems are associated with the Unix read and write system calls as they relate to avoiding data copying: these system calls allow data buffers with arbitrary alignment and length, require contiguous data buffers, and have copy semantics. An API interacts closely with the semantics of cross-domain data transfer and the internal representations of the data. In practice, an implementation can choose either to keep the traditional Unix API or to change it. For example, an implementation could use an API with shared semantics [Dru94a]. By changing the API, the implementation loses compatibility with legacy applications, which have to be re-implemented to work.

Integrated design. Currently, most research focuses on so-called host interface design to improve the end-to-end performance of the TCP/IP suite in a traditional monolithic kernel, without an integrated design. However, the API, the cross-domain data transfer facility, and buffer management must be integrated in a manner that takes their subtle interactions into account. Consider, for example, a system whose buffer management is restricted to the kernel, in which a virtual copy facility is used for the cross-domain data transfer, and whose operating system supports a UNIX-like API. In this case, data units from the source device are placed in main memory buffers, and some buffer editing occurs as part of the in-kernel I/O processing (e.g., reassembly of network packets). When a data unit represented by a semistructured buffer reaches the user/kernel boundary, it must be linearized, despite the use of a virtual copy facility. The reason is that the interface defines data buffers to be contiguous. Since the API allows applications to specify an arbitrarily aligned buffer address and length, the buffer's first and last addresses may not be aligned with page boundaries. Consequently, the data transfer facility may be forced to copy the portions of the first and last pages that are overlapped by the buffer.
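The semistructured organization corresponds closely to what the POSIX scatter-gather interface exposes. The following minimal example uses the standard writev call, with stdout standing in for a network descriptor, to hand a header fragment and a payload fragment to the kernel in one call, without linearizing them in user space.

```c
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* A message as semistructured data: pointers to raw fragments (a
 * protocol header and a payload) rather than one linear array. The
 * writev() gather-write passes both fragments to the kernel in a
 * single call, with no linearizing copy in user space. */
int main(void)
{
    char header[]  = "HDR|";             /* aggregate protocol header */
    char payload[] = "application data"; /* application's data buffer */

    struct iovec iov[2] = {
        { .iov_base = header,  .iov_len = strlen(header)  },
        { .iov_base = payload, .iov_len = strlen(payload) },
    };
    ssize_t n = writev(STDOUT_FILENO, iov, 2);  /* gather-write */
    if (n < 0)
        perror("writev");
    return 0;
}
```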


More information

Operating System Concepts

Operating System Concepts Chapter 9: Virtual-Memory Management 9.1 Silberschatz, Galvin and Gagne 2005 Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

ECE 650 Systems Programming & Engineering. Spring 2018

ECE 650 Systems Programming & Engineering. Spring 2018 ECE 650 Systems Programming & Engineering Spring 2018 Networking Transport Layer Tyler Bletsch Duke University Slides are adapted from Brian Rogers (Duke) TCP/IP Model 2 Transport Layer Problem solved:

More information

4. Hardware Platform: Real-Time Requirements

4. Hardware Platform: Real-Time Requirements 4. Hardware Platform: Real-Time Requirements Contents: 4.1 Evolution of Microprocessor Architecture 4.2 Performance-Increasing Concepts 4.3 Influences on System Architecture 4.4 A Real-Time Hardware Architecture

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Seventh Edition By William Stallings Objectives of Chapter To provide a grand tour of the major computer system components:

More information

MEMORY MANAGEMENT/1 CS 409, FALL 2013

MEMORY MANAGEMENT/1 CS 409, FALL 2013 MEMORY MANAGEMENT Requirements: Relocation (to different memory areas) Protection (run time, usually implemented together with relocation) Sharing (and also protection) Logical organization Physical organization

More information

Operating System Performance and Large Servers 1

Operating System Performance and Large Servers 1 Operating System Performance and Large Servers 1 Hyuck Yoo and Keng-Tai Ko Sun Microsystems, Inc. Mountain View, CA 94043 Abstract Servers are an essential part of today's computing environments. High

More information

Operating Systems Unit 6. Memory Management

Operating Systems Unit 6. Memory Management Unit 6 Memory Management Structure 6.1 Introduction Objectives 6.2 Logical versus Physical Address Space 6.3 Swapping 6.4 Contiguous Allocation Single partition Allocation Multiple Partition Allocation

More information

Growth. Individual departments in a university buy LANs for their own machines and eventually want to interconnect with other campus LANs.

Growth. Individual departments in a university buy LANs for their own machines and eventually want to interconnect with other campus LANs. Internetworking Multiple networks are a fact of life: Growth. Individual departments in a university buy LANs for their own machines and eventually want to interconnect with other campus LANs. Fault isolation,

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

I/O Systems. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

I/O Systems. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic) I/O Systems Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) I/O Systems 1393/9/15 1 / 57 Motivation Amir H. Payberah (Tehran

More information

Local Area Network Overview

Local Area Network Overview Local Area Network Overview Chapter 15 CS420/520 Axel Krings Page 1 LAN Applications (1) Personal computer LANs Low cost Limited data rate Back end networks Interconnecting large systems (mainframes and

More information

Advanced Computer Networks. End Host Optimization

Advanced Computer Networks. End Host Optimization Oriana Riva, Department of Computer Science ETH Zürich 263 3501 00 End Host Optimization Patrick Stuedi Spring Semester 2017 1 Today End-host optimizations: NUMA-aware networking Kernel-bypass Remote Direct

More information

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks

An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks An Empirical Study of Reliable Multicast Protocols over Ethernet Connected Networks Ryan G. Lane Daniels Scott Xin Yuan Department of Computer Science Florida State University Tallahassee, FL 32306 {ryanlane,sdaniels,xyuan}@cs.fsu.edu

More information

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [NETWORKING] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Why not spawn processes

More information

CSE 120 Principles of Operating Systems Spring 2017

CSE 120 Principles of Operating Systems Spring 2017 CSE 120 Principles of Operating Systems Spring 2017 Lecture 12: Paging Lecture Overview Today we ll cover more paging mechanisms: Optimizations Managing page tables (space) Efficient translations (TLBs)

More information

15 Sharing Main Memory Segmentation and Paging

15 Sharing Main Memory Segmentation and Paging Operating Systems 58 15 Sharing Main Memory Segmentation and Paging Readings for this topic: Anderson/Dahlin Chapter 8 9; Siberschatz/Galvin Chapter 8 9 Simple uniprogramming with a single segment per

More information

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel Chapter-6 SUBJECT:- Operating System TOPICS:- I/O Management Created by : - Sanjay Patel Disk Scheduling Algorithm 1) First-In-First-Out (FIFO) 2) Shortest Service Time First (SSTF) 3) SCAN 4) Circular-SCAN

More information

OS DESIGN PATTERNS II. CS124 Operating Systems Fall , Lecture 4

OS DESIGN PATTERNS II. CS124 Operating Systems Fall , Lecture 4 OS DESIGN PATTERNS II CS124 Operating Systems Fall 2017-2018, Lecture 4 2 Last Time Began discussing general OS design patterns Simple structure (MS-DOS) Layered structure (The THE OS) Monolithic kernels

More information

Outline. V Computer Systems Organization II (Honors) (Introductory Operating Systems) Advantages of Multi-level Page Tables

Outline. V Computer Systems Organization II (Honors) (Introductory Operating Systems) Advantages of Multi-level Page Tables Outline V22.0202-001 Computer Systems Organization II (Honors) (Introductory Operating Systems) Lecture 15 Memory Management (cont d) Virtual Memory March 30, 2005 Announcements Lab 4 due next Monday (April

More information

Chapter 8: Virtual Memory. Operating System Concepts

Chapter 8: Virtual Memory. Operating System Concepts Chapter 8: Virtual Memory Silberschatz, Galvin and Gagne 2009 Chapter 8: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

440GX Application Note

440GX Application Note Overview of TCP/IP Acceleration Hardware January 22, 2008 Introduction Modern interconnect technology offers Gigabit/second (Gb/s) speed that has shifted the bottleneck in communication from the physical

More information

Silberschatz and Galvin Chapter 15

Silberschatz and Galvin Chapter 15 Silberschatz and Galvin Chapter 15 Network Structures CPSC 410--Richard Furuta 3/30/99 1 Chapter Topics Background and motivation Network topologies Network types Communication issues Network design strategies

More information

Computer System Overview

Computer System Overview Computer System Overview Operating Systems 2005/S2 1 What are the objectives of an Operating System? 2 What are the objectives of an Operating System? convenience & abstraction the OS should facilitate

More information

CSE 120 Principles of Operating Systems

CSE 120 Principles of Operating Systems CSE 120 Principles of Operating Systems Spring 2018 Lecture 10: Paging Geoffrey M. Voelker Lecture Overview Today we ll cover more paging mechanisms: Optimizations Managing page tables (space) Efficient

More information

Chapter 9 Memory Management Main Memory Operating system concepts. Sixth Edition. Silberschatz, Galvin, and Gagne 8.1

Chapter 9 Memory Management Main Memory Operating system concepts. Sixth Edition. Silberschatz, Galvin, and Gagne 8.1 Chapter 9 Memory Management Main Memory Operating system concepts. Sixth Edition. Silberschatz, Galvin, and Gagne 8.1 Chapter 9: Memory Management Background Swapping Contiguous Memory Allocation Segmentation

More information

CS 5520/ECE 5590NA: Network Architecture I Spring Lecture 13: UDP and TCP

CS 5520/ECE 5590NA: Network Architecture I Spring Lecture 13: UDP and TCP CS 5520/ECE 5590NA: Network Architecture I Spring 2008 Lecture 13: UDP and TCP Most recent lectures discussed mechanisms to make better use of the IP address space, Internet control messages, and layering

More information

Review: Hardware user/kernel boundary

Review: Hardware user/kernel boundary Review: Hardware user/kernel boundary applic. applic. applic. user lib lib lib kernel syscall pg fault syscall FS VM sockets disk disk NIC context switch TCP retransmits,... device interrupts Processor

More information

Chapter III. congestion situation in Highspeed Networks

Chapter III. congestion situation in Highspeed Networks Chapter III Proposed model for improving the congestion situation in Highspeed Networks TCP has been the most used transport protocol for the Internet for over two decades. The scale of the Internet and

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

OPERATING SYSTEMS. After A.S.Tanenbaum, Modern Operating Systems 3rd edition Uses content with permission from Assoc. Prof. Florin Fortis, PhD

OPERATING SYSTEMS. After A.S.Tanenbaum, Modern Operating Systems 3rd edition Uses content with permission from Assoc. Prof. Florin Fortis, PhD OPERATING SYSTEMS #8 After A.S.Tanenbaum, Modern Operating Systems 3rd edition Uses content with permission from Assoc. Prof. Florin Fortis, PhD MEMORY MANAGEMENT MEMORY MANAGEMENT The memory is one of

More information

Introduction to Protocols

Introduction to Protocols Chapter 6 Introduction to Protocols 1 Chapter 6 Introduction to Protocols What is a Network Protocol? A protocol is a set of rules that governs the communications between computers on a network. These

More information

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,

More information

OPERATING SYSTEM. Chapter 9: Virtual Memory

OPERATING SYSTEM. Chapter 9: Virtual Memory OPERATING SYSTEM Chapter 9: Virtual Memory Chapter 9: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating Kernel Memory

More information

Computer System Overview OPERATING SYSTEM TOP-LEVEL COMPONENTS. Simplified view: Operating Systems. Slide 1. Slide /S2. Slide 2.

Computer System Overview OPERATING SYSTEM TOP-LEVEL COMPONENTS. Simplified view: Operating Systems. Slide 1. Slide /S2. Slide 2. BASIC ELEMENTS Simplified view: Processor Slide 1 Computer System Overview Operating Systems Slide 3 Main Memory referred to as real memory or primary memory volatile modules 2004/S2 secondary memory devices

More information

Chapter 6. What happens at the Transport Layer? Services provided Transport protocols UDP TCP Flow control Congestion control

Chapter 6. What happens at the Transport Layer? Services provided Transport protocols UDP TCP Flow control Congestion control Chapter 6 What happens at the Transport Layer? Services provided Transport protocols UDP TCP Flow control Congestion control OSI Model Hybrid Model Software outside the operating system Software inside

More information

19: Networking. Networking Hardware. Mark Handley

19: Networking. Networking Hardware. Mark Handley 19: Networking Mark Handley Networking Hardware Lots of different hardware: Modem byte at a time, FDDI, SONET packet at a time ATM (including some DSL) 53-byte cell at a time Reality is that most networking

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2016 Lecture 2 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 2 System I/O System I/O (Chap 13) Central

More information

Chapter 8 Virtual Memory

Chapter 8 Virtual Memory Chapter 8 Virtual Memory Contents Hardware and control structures Operating system software Unix and Solaris memory management Linux memory management Windows 2000 memory management Characteristics of

More information

Operating Systems. Designed and Presented by Dr. Ayman Elshenawy Elsefy

Operating Systems. Designed and Presented by Dr. Ayman Elshenawy Elsefy Operating Systems Designed and Presented by Dr. Ayman Elshenawy Elsefy Dept. of Systems & Computer Eng.. AL-AZHAR University Website : eaymanelshenawy.wordpress.com Email : eaymanelshenawy@yahoo.com Reference

More information

CH : 15 LOCAL AREA NETWORK OVERVIEW

CH : 15 LOCAL AREA NETWORK OVERVIEW CH : 15 LOCAL AREA NETWORK OVERVIEW P. 447 LAN (Local Area Network) A LAN consists of a shared transmission medium and a set of hardware and software for interfacing devices to the medium and regulating

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

LINUX OPERATING SYSTEM Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

LINUX OPERATING SYSTEM Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science A Seminar report On LINUX OPERATING SYSTEM Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED

More information

Distributed Systems Theory 4. Remote Procedure Call. October 17, 2008

Distributed Systems Theory 4. Remote Procedure Call. October 17, 2008 Distributed Systems Theory 4. Remote Procedure Call October 17, 2008 Client-server model vs. RPC Client-server: building everything around I/O all communication built in send/receive distributed computing

More information

CSC Operating Systems Fall Lecture - II OS Structures. Tevfik Ko!ar. Louisiana State University. August 27 th, 2009.

CSC Operating Systems Fall Lecture - II OS Structures. Tevfik Ko!ar. Louisiana State University. August 27 th, 2009. CSC 4103 - Operating Systems Fall 2009 Lecture - II OS Structures Tevfik Ko!ar Louisiana State University August 27 th, 2009 1 Announcements TA Changed. New TA: Praveenkumar Kondikoppa Email: pkondi1@lsu.edu

More information

Announcements. Computer System Organization. Roadmap. Major OS Components. Processes. Tevfik Ko!ar. CSC Operating Systems Fall 2009

Announcements. Computer System Organization. Roadmap. Major OS Components. Processes. Tevfik Ko!ar. CSC Operating Systems Fall 2009 CSC 4103 - Operating Systems Fall 2009 Lecture - II OS Structures Tevfik Ko!ar TA Changed. New TA: Praveenkumar Kondikoppa Email: pkondi1@lsu.edu Announcements All of you should be now in the class mailing

More information

OS Design Approaches. Roadmap. OS Design Approaches. Tevfik Koşar. Operating System Design and Implementation

OS Design Approaches. Roadmap. OS Design Approaches. Tevfik Koşar. Operating System Design and Implementation CSE 421/521 - Operating Systems Fall 2012 Lecture - II OS Structures Roadmap OS Design and Implementation Different Design Approaches Major OS Components!! Memory management! CPU Scheduling! I/O Management

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

Operating- System Structures

Operating- System Structures Operating- System Structures 2 CHAPTER Practice Exercises 2.1 What is the purpose of system calls? Answer: System calls allow user-level processes to request services of the operating system. 2.2 What

More information

CS533 Concepts of Operating Systems. Jonathan Walpole

CS533 Concepts of Operating Systems. Jonathan Walpole CS533 Concepts of Operating Systems Jonathan Walpole Improving IPC by Kernel Design & The Performance of Micro- Kernel Based Systems The IPC Dilemma IPC is very import in µ-kernel design - Increases modularity,

More information

Chapter 4. Routers with Tiny Buffers: Experiments. 4.1 Testbed experiments Setup

Chapter 4. Routers with Tiny Buffers: Experiments. 4.1 Testbed experiments Setup Chapter 4 Routers with Tiny Buffers: Experiments This chapter describes two sets of experiments with tiny buffers in networks: one in a testbed and the other in a real network over the Internet2 1 backbone.

More information

Operating Systems. Operating System Structure. Lecture 2 Michael O Boyle

Operating Systems. Operating System Structure. Lecture 2 Michael O Boyle Operating Systems Operating System Structure Lecture 2 Michael O Boyle 1 Overview Architecture impact User operating interaction User vs kernel Syscall Operating System structure Layers Examples 2 Lower-level

More information

Chapter 8 Virtual Memory

Chapter 8 Virtual Memory Operating Systems: Internals and Design Principles Chapter 8 Virtual Memory Seventh Edition William Stallings Operating Systems: Internals and Design Principles You re gonna need a bigger boat. Steven

More information

Operating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings

Operating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Operating systems are those

More information

Notes based on prof. Morris's lecture on scheduling (6.824, fall'02).

Notes based on prof. Morris's lecture on scheduling (6.824, fall'02). Scheduling Required reading: Eliminating receive livelock Notes based on prof. Morris's lecture on scheduling (6.824, fall'02). Overview What is scheduling? The OS policies and mechanisms to allocates

More information

Introduction to Operating Systems. Chapter Chapter

Introduction to Operating Systems. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find

More information

CS533 Concepts of Operating Systems. Jonathan Walpole

CS533 Concepts of Operating Systems. Jonathan Walpole CS533 Concepts of Operating Systems Jonathan Walpole Lightweight Remote Procedure Call (LRPC) Overview Observations Performance analysis of RPC Lightweight RPC for local communication Performance Remote

More information

COMPUTER SCIENCE 4500 OPERATING SYSTEMS

COMPUTER SCIENCE 4500 OPERATING SYSTEMS Last update: 3/28/2017 COMPUTER SCIENCE 4500 OPERATING SYSTEMS 2017 Stanley Wileman Module 9: Memory Management Part 1 In This Module 2! Memory management functions! Types of memory and typical uses! Simple

More information

Operating Systems 2010/2011

Operating Systems 2010/2011 Operating Systems 2010/2011 Input/Output Systems part 2 (ch13, ch12) Shudong Chen 1 Recap Discuss the principles of I/O hardware and its complexity Explore the structure of an operating system s I/O subsystem

More information

CIS Operating Systems Memory Management Address Translation. Professor Qiang Zeng Fall 2017

CIS Operating Systems Memory Management Address Translation. Professor Qiang Zeng Fall 2017 CIS 5512 - Operating Systems Memory Management Address Translation Professor Qiang Zeng Fall 2017 Outline Fixed partitions Dynamic partitions Con$guous alloca$on: Each process occupies a con$guous memory

More information

Midterm II December 4 th, 2006 CS162: Operating Systems and Systems Programming

Midterm II December 4 th, 2006 CS162: Operating Systems and Systems Programming Fall 2006 University of California, Berkeley College of Engineering Computer Science Division EECS John Kubiatowicz Midterm II December 4 th, 2006 CS162: Operating Systems and Systems Programming Your

More information

Chapter 3 - Memory Management

Chapter 3 - Memory Management Chapter 3 - Memory Management Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 3 - Memory Management 1 / 222 1 A Memory Abstraction: Address Spaces The Notion of an Address Space Swapping

More information

Virtualization, Xen and Denali

Virtualization, Xen and Denali Virtualization, Xen and Denali Susmit Shannigrahi November 9, 2011 Susmit Shannigrahi () Virtualization, Xen and Denali November 9, 2011 1 / 70 Introduction Virtualization is the technology to allow two

More information

Chapter 13: I/O Systems. Operating System Concepts 9 th Edition

Chapter 13: I/O Systems. Operating System Concepts 9 th Edition Chapter 13: I/O Systems Silberschatz, Galvin and Gagne 2013 Chapter 13: I/O Systems Overview I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations

More information

Novel Intelligent I/O Architecture Eliminating the Bus Bottleneck

Novel Intelligent I/O Architecture Eliminating the Bus Bottleneck Novel Intelligent I/O Architecture Eliminating the Bus Bottleneck Volker Lindenstruth; lindenstruth@computer.org The continued increase in Internet throughput and the emergence of broadband access networks

More information

Chapter 8: Memory- Management Strategies. Operating System Concepts 9 th Edition

Chapter 8: Memory- Management Strategies. Operating System Concepts 9 th Edition Chapter 8: Memory- Management Strategies Operating System Concepts 9 th Edition Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation

More information

I/O CANNOT BE IGNORED

I/O CANNOT BE IGNORED LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.

More information

Module 15: Network Structures

Module 15: Network Structures Module 15: Network Structures Background Topology Network Types Communication Communication Protocol Robustness Design Strategies 15.1 A Distributed System 15.2 Motivation Resource sharing sharing and

More information

CIS Operating Systems Memory Management Address Translation for Paging. Professor Qiang Zeng Spring 2018

CIS Operating Systems Memory Management Address Translation for Paging. Professor Qiang Zeng Spring 2018 CIS 3207 - Operating Systems Memory Management Address Translation for Paging Professor Qiang Zeng Spring 2018 Previous class What is logical address? Who use it? Describes a location in the logical memory

More information

Chapter 17: Distributed Systems (DS)

Chapter 17: Distributed Systems (DS) Chapter 17: Distributed Systems (DS) Silberschatz, Galvin and Gagne 2013 Chapter 17: Distributed Systems Advantages of Distributed Systems Types of Network-Based Operating Systems Network Structure Communication

More information

The Importance of Non-Data Touching Processing Overheads in TCP/IP

The Importance of Non-Data Touching Processing Overheads in TCP/IP The Importance of Non-Data Touching Processing Overheads in TCP/IP Jonathan Kay and Joseph Pasquale Computer Systems Laboratory Department of Computer Science and Engineering University of California,

More information

Implementation and Analysis of Large Receive Offload in a Virtualized System

Implementation and Analysis of Large Receive Offload in a Virtualized System Implementation and Analysis of Large Receive Offload in a Virtualized System Takayuki Hatori and Hitoshi Oi The University of Aizu, Aizu Wakamatsu, JAPAN {s1110173,hitoshi}@u-aizu.ac.jp Abstract System

More information

Chapter 3. Top Level View of Computer Function and Interconnection. Yonsei University

Chapter 3. Top Level View of Computer Function and Interconnection. Yonsei University Chapter 3 Top Level View of Computer Function and Interconnection Contents Computer Components Computer Function Interconnection Structures Bus Interconnection PCI 3-2 Program Concept Computer components

More information

VM and I/O. IO-Lite: A Unified I/O Buffering and Caching System. Vivek S. Pai, Peter Druschel, Willy Zwaenepoel

VM and I/O. IO-Lite: A Unified I/O Buffering and Caching System. Vivek S. Pai, Peter Druschel, Willy Zwaenepoel VM and I/O IO-Lite: A Unified I/O Buffering and Caching System Vivek S. Pai, Peter Druschel, Willy Zwaenepoel Software Prefetching and Caching for TLBs Kavita Bala, M. Frans Kaashoek, William E. Weihl

More information

Data and Computer Communications. Protocols and Architecture

Data and Computer Communications. Protocols and Architecture Data and Computer Communications Protocols and Architecture Characteristics Direct or indirect Monolithic or structured Symmetric or asymmetric Standard or nonstandard Means of Communication Direct or

More information

Chapter 11: Implementing File-Systems

Chapter 11: Implementing File-Systems Chapter 11: Implementing File-Systems Chapter 11 File-System Implementation 11.1 File-System Structure 11.2 File-System Implementation 11.3 Directory Implementation 11.4 Allocation Methods 11.5 Free-Space

More information

Lecture 11: Networks & Networking

Lecture 11: Networks & Networking Lecture 11: Networks & Networking Contents Distributed systems Network types Network standards ISO and TCP/IP network models Internet architecture IP addressing IP datagrams AE4B33OSS Lecture 11 / Page

More information

Outlook. Background Swapping Contiguous Memory Allocation Paging Structure of the Page Table Segmentation Example: The Intel Pentium

Outlook. Background Swapping Contiguous Memory Allocation Paging Structure of the Page Table Segmentation Example: The Intel Pentium Main Memory Outlook Background Swapping Contiguous Memory Allocation Paging Structure of the Page Table Segmentation Example: The Intel Pentium 2 Backgound Background So far we considered how to share

More information

File. File System Implementation. Operations. Permissions and Data Layout. Storing and Accessing File Data. Opening a File

File. File System Implementation. Operations. Permissions and Data Layout. Storing and Accessing File Data. Opening a File File File System Implementation Operating Systems Hebrew University Spring 2007 Sequence of bytes, with no structure as far as the operating system is concerned. The only operations are to read and write

More information

Following are a few basic questions that cover the essentials of OS:

Following are a few basic questions that cover the essentials of OS: Operating Systems Following are a few basic questions that cover the essentials of OS: 1. Explain the concept of Reentrancy. It is a useful, memory-saving technique for multiprogrammed timesharing systems.

More information

Networking for Data Acquisition Systems. Fabrice Le Goff - 14/02/ ISOTDAQ

Networking for Data Acquisition Systems. Fabrice Le Goff - 14/02/ ISOTDAQ Networking for Data Acquisition Systems Fabrice Le Goff - 14/02/2018 - ISOTDAQ Outline Generalities The OSI Model Ethernet and Local Area Networks IP and Routing TCP, UDP and Transport Efficiency Networking

More information

Layered Architecture

Layered Architecture 1 Layered Architecture Required reading: Kurose 1.7 CSE 4213, Fall 2006 Instructor: N. Vlajic Protocols and Standards 2 Entity any device capable of sending and receiving information over the Internet

More information

CS 428/528 Computer Networks Lecture 01. Yan Wang

CS 428/528 Computer Networks Lecture 01. Yan Wang 1 CS 428/528 Computer Lecture 01 Yan Wang 2 Motivation: Why bother? Explosive growth of networks 1989, 100,000 hosts on the Internet Distributed Applications and Systems E-mail, WWW, multimedia, distributed

More information