Network Function Virtualization and Messaging for Non-Coherent Shared Memory Multiprocessors


Mike Schlansker, Jean Tourrilhes, Sujata Banerjee, Puneet Sharma
Hewlett Packard Labs, HPE

Keyword(s): NFV; Cluster; Datacenter; Non-Coherent Shared Memory

Abstract: This technical report describes a datacenter-scale processing platform for Network Function Virtualization (NFV). The platform implements high-performance messaging on a non-coherent shared memory fabric. This includes features for fast messaging as well as get and put operations that provide high-performance remote memory access at datacenter scale. The report explores NFV as an important application that can be accelerated on this platform.

External Posting Date: April 28, 2016 [Fulltext]
Internal Posting Date: April 28, 2016 [Fulltext]

Copyright 2016 Hewlett Packard Enterprise Development LP

1. Non-Coherent Shared Memory Multiprocessors

There are a number of advantages to multiprocessor hardware architectures that share memory. In these architectures a large number of processors share memory to support efficient and flexible communication within and between processes running on one or more operating systems. At small scale, this established and mature shared memory multiprocessor (SMP) technology is used in multi-core processor chips from multiple hardware vendors.

When shared memory is deployed over a very large number of processors, significant benefits are possible. Large-scale shared memory machines offer the potential for fine-grained non-volatile data sharing across large systems that is not possible with traditional cluster computers using fast networks. These systems exploit principles of Memory Driven Computing (MDC) [Bresniker], which uses large-scale, persistent, and word-addressable storage to support important big-data processing applications. MDC architectures allow high-performance word-level access to a large persistent data store that is shared across many compute nodes. These shared memory architectures exploit the benefits of emerging word-addressable non-volatile storage devices such as the memristor.

However, the potential benefits of large-scale shared memory come with significant obstacles. Previous coherent shared memory architectures such as the SGI Origin [Laudon] and Stanford DASH [Hennesy] developed principles for scaling coherent caches across large multiprocessor systems. But years of experience have shown that preserving coherence across large memory fabrics introduces performance limitations. Cache coherence using snooping or distributed cache directories requires complex cache management protocols that send messages to invalidate or update copies of cached data whenever new data is written. The performance of such large shared memory systems is hard to predict and often disappointing, as programmers fail to understand hidden bottlenecks that arise when they write parallel code that generates significant cache coherence traffic. Programs run slowly while they wait for hidden coherence signals needed to preserve a common view of shared data across the distributed hardware.

Prior work has investigated the design of non-coherent multiprocessor systems [Yang]. We are investigating this style of architecture as we explore non-coherent systems similar to the architectures explored in the Hewlett Packard Enterprise The Machine project [Packard]. These architectures have the potential to provide high performance at much larger communication scale than traditional shared memory machines. In this report, we explore communication and networking software for similar large-scale non-coherent shared-memory architectures.

Figure 1 illustrates a non-coherent shared memory multiprocessor system. The system includes multiple shared memory multi-processor nodes where each node is a coherent shared memory multiprocessor along with its local RAM.

Each node is a conventional multicore computer system that runs an operating system with possible virtualization software for guest VMs. In addition, large-scale non-volatile RAM storage is attached to the fabric. Each node provides a node-to-fabric interface which connects all nodes in a single load-store domain through a Non-Coherent Memory Fabric (NCMF).

When loads and stores are executed on the same SMP node, conventional memory coherence is preserved between these operations. A load operation on any processor within the node can address any given local or non-local memory address. The load will always see any value resulting from a prior store to the same memory location. This coherence property arises because SMP hardware ensures that all caches within the SMP preserve a common, up-to-date view of memory. But memory behavior is more complex when loads and stores are executed on distinct nodes, because the non-coherent memory fabric design avoids the high hardware cost and lost performance needed to provide fabric-wide coherence.

[Figure 1: Non-Coherent Shared Memory Multiprocessor. Each SMP node combines a coherent SMP, local RAM, an Ethernet interface into an Ethernet fabric, and a fabric interface; fabric switches join the nodes and fabric-attached non-volatile RAM into a single load-store domain over the Non-Coherent Memory Fabric (NCMF).]

Non-coherent shared memory allows simplified hardware. As a result, inter-node communication requires that complex memory coherence operations be explicitly programmed in software. Store operations executed on a source node may not be visible to load operations executed on a destination node unless explicit cache operations are executed to enforce the flow of data from the source node's cache through the fabric and to the desired memory location.

Non-coherent shared memory makes expensive cross-node coherence transactions visible to programmers. If programmers fail to understand the flow of data between the distributed caches within the compute nodes, subtle and difficult-to-detect program bugs will arise. But with deep knowledge of data flow, the programmer is positioned to modify programs to reduce the number of explicit cache flush, cache invalidate, or other operations needed to force the exchange of data between processors. This can improve program performance and scalability.
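To make this explicit data-flow management concrete, the following sketch shows the kind of flush-and-fence sequence that software must supply when hardware no longer provides cross-node coherence. It assumes x86-style cache-line-flush intrinsics and a buffer already mapped into fabric memory; both are stand-in assumptions, since an actual NCMF platform would supply its own flush and invalidate primitives.

```c
/* Minimal sketch of explicit cross-node visibility over a non-coherent
 * fabric, assuming x86-style cache-line flush intrinsics. A real NCMF
 * platform would substitute its own flush/invalidate primitives, and the
 * mapping of `buf` into fabric memory is assumed to already exist. */
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence, _mm_lfence */

#define CACHELINE 64

/* Producer: write a payload, then push the dirty lines out toward fabric memory. */
static void publish(volatile uint8_t *buf, const void *src, size_t len,
                    volatile uint64_t *ready_flag)
{
    memcpy((void *)buf, src, len);
    for (size_t off = 0; off < len; off += CACHELINE)
        _mm_clflush((const void *)(buf + off));     /* evict data lines */
    _mm_sfence();                                   /* order data before flag */
    *ready_flag = 1;
    _mm_clflush((const void *)ready_flag);          /* make the flag visible */
    _mm_sfence();
}

/* Consumer: poll the flag, discard any stale cached copy, then read. */
static void consume(volatile const uint8_t *buf, void *dst, size_t len,
                    volatile uint64_t *ready_flag)
{
    while (*ready_flag == 0)
        _mm_clflush((const void *)ready_flag);      /* re-fetch from memory */
    _mm_lfence();
    for (size_t off = 0; off < len; off += CACHELINE)
        _mm_clflush((const void *)(buf + off));     /* drop stale lines */
    _mm_lfence();
    memcpy(dst, (const void *)buf, len);
}
```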

Each node provides an Ethernet interface to support global-scale communications. Ethernet may be required for communication across large datacenters that exceed the appropriate scope for a non-coherent shared memory architecture. We plan to integrate communications software for shared memory with communications software for traditional network communications in order to provide an architecture that exploits non-coherent shared memory benefits and yet still achieves arbitrarily large scale. This document currently focuses on shared memory communications.

2. NFV for Large Non-Coherent Multiprocessors

Future non-coherent multiprocessor architectures provide a number of important features that may prove useful for future NFV processing needs. First, they are based on a coherent shared memory multiprocessor (SMP) single-node building block. These SMPs run multiple hardware threads of execution which share local DRAM and on-chip caches. The single-node SMP provides a high-performance, parallel, single-chip processing platform for NFV processing. Threads of execution share memory efficiently through on-chip processor caches and local DRAM. Techniques such as avoiding packet copying and passing packet streams by pointer can be used to design efficient single-node NFV processing systems. It is critical that any efficient NFV system is based on a powerful and efficient single-node NFV processor.

But conventional SMPs do not scale. For NFV applications that require complex packet processing at very high data rates, NFV systems will require the use of many nodes (or SMP processors). The traditional way to scale processing beyond a single node is to create packet processing pipelines that stream across Ethernet between nodes. This requires packet copying and expensive processing between nodes and results in wasted CPU cycles and wasted memory space needed to transfer and copy packets. An alternative approach is to use the enhanced scalability of non-coherent shared memory systems to define a large-scale system for NFV processing.

2.1. NFV Processing System Architecture

Figure 2 below shows an abstract system that is similar to Figure 1, but reorganized to identify important sub-systems. For simplicity, Ethernet connectivity is not shown. We define our NFV machine (SMPM) as a conventional shared memory multi-processor, as previously used by the NetVM project [Hwang] to process chained NFV services. The SMPEXM extended machine is a larger non-coherent system that we use to explore function chaining over a large-scale non-coherent shared memory. We often refer to the SMPMs as nodes within a larger SMPEXM system. We use these definitions to distinguish between coherent, modest-scale NFV parallel processing and non-coherent, large-scale NFV processing techniques.

Modern high-performance computing systems use InfiniBand or Converged Ethernet to implement lossless messaging transport across large datacenter network fabrics [Vienne]. These lossless fabrics support RDMA-style programming models that we adopt in this work. We hope to improve on this style of remote memory access using closer integration of memory access capabilities into computer system hardware and software. Prior research also explored supporting messaging and RDMA-style communications using Ethernet NICs that are attached directly to the coherent memory of a multiprocessor chip along with a conventional Ethernet network [Schlansker].

Our SMPEXM system is similar to the Scale-Out NUMA design presented at ASPLOS 2012 [Novakovic]. However, at the lowest hardware level, our design uses native hardware support for non-coherent remote memory operations. This includes direct hardware support for processor-initiated load and store operations that are mapped to remote physical addresses and traverse the SMPEXM non-coherent memory fabric, as supported in The Machine [Packard].

[Figure 2: Extended Machine (SMPEXM). Multiple SMPM nodes, each with an OS instance, a multi-CPU processor, local RAM, and an Ethernet interface, are joined with fabric-attached NVM through the non-coherent memory fabric.]

Sections 3-9 below describe the ZMSG software architecture, which is designed to implement messaging for large non-coherent shared-memory multi-processors. ZMSG is intended as a general tool for multiple applications. Section 10 describes techniques to accelerate NFV service chaining using ZMSG with a large non-coherent shared memory hardware system. The remainder of this section outlines some of the research opportunities for optimized NFV processing that can be explored on this architecture.

2.2. Eliminating Copies in Shared Memory Systems

Memory-based inter-node communication has the potential to enhance overall processing efficiency. We are developing communications techniques that are built on shared memory and that use fast messaging or other approaches for inter-process communications among nodes in a non-coherent memory system. This eliminates software overheads needed to pass data through an Ethernet device. Memory-based transport is lossless, eliminating the need for software to retransmit control messages or any other messages that cannot be dropped.

Node-to-node copying of packets is particularly wasteful when most of the packet data is carried along and only required in rare cases where deeper inspection is needed, or only when the packet reaches a final output to be retransmitted back onto a network. Network Functions (NFs) sometimes process the packet header while packet data is copied without inspection. Often a few packets that start a new flow receive special inspection to determine subsequent processing for all packets in the flow. In these cases, there are good opportunities to use shared memory to reduce unnecessary copying of packet data.

Non-coherent-fabric architectures enable alternative NFV solutions. NFV architects can make careful choices regarding when to copy full packet data, or when to copy much shorter packet metadata along with a packet handle. The packet handle allows subsequent access to the full packet when needed. A packet handle can be passed from node to node within a non-coherent load-store domain and used by any node in the packet processing pipeline. The handle can be translated to a local address and used to directly copy detailed packet data across the memory interconnect. Cross-fabric access is performed using software library functions that use low-level non-coherent memory operations to bypass local caches and reach directly across the non-coherent memory fabric to get data from its remote memory location. While not as efficient as local cache or local DRAM access, non-coherent access can be more efficient than copying packet data using high-performance networking techniques, such as RDMA, to remotely access data.
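As one concrete illustration of the metadata-plus-handle idea, the sketch below defines a packet descriptor that pipeline stages could pass instead of the full packet. The type and field names are invented for illustration and are not part of any ZMSG or NetVM interface.

```c
/* Hypothetical packet descriptor passed between NFV pipeline stages in place
 * of the full packet. Only the fields below cross the fabric on every hop;
 * the payload stays where the packet was first written and is fetched through
 * the fabric-wide handle only when an NF actually needs deep inspection. */
#include <stdint.h>

typedef struct {
    uint64_t region_handle;   /* fabric-wide handle of the buffer region      */
    uint32_t offset;          /* packet offset inside that region             */
    uint32_t length;          /* total packet length in bytes                 */
} packet_handle_t;

typedef struct {
    packet_handle_t pkt;      /* how to reach the full packet, if ever needed */
    uint8_t  headers[128];    /* copy of the first bytes (L2-L4 headers)      */
    uint32_t flow_hash;       /* precomputed 5-tuple hash for flow lookup     */
    uint16_t ingress_port;    /* where the packet entered the system          */
    uint16_t flags;           /* e.g. "needs deep inspection", "drop", ...    */
} packet_desc_t;
```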

2.3. Using Non-volatile Storage

Traditional NFV systems use volatile DRAM to temporarily store high-bandwidth packet streams. This transient data cannot be stored for a substantial length of time due to the limited availability of DRAM storage. However, future NFV goals may include rapid write or read access to large amounts of persistent data. Persistent data could be sample packets, sample flow traces, derived statistics, rule databases, or other packet processing measurements or instructions that are large, must be quickly accessed, and cannot be easily reconstructed if lost due to a power outage.

New NVRAM technologies provide a combination of benefits not possible with prior memory and disk. When the amount of data that needs to be stored becomes excessive, DRAM becomes very expensive and consumes too much power. When extracted data is randomly stored or accessed from a database, disk seek latency imposes severe bottlenecks. Word-addressable NVRAM devices circumvent key limitations and may provide new persistent data storage possibilities for packet stream processing.

Often, packet processing is about finding rare events within a large volume of data, much like searching for the proverbial needle in a haystack. When a packet stream enters the NFV system, little is known regarding the long-term need for the packet. But as processing proceeds, a flow might be determined to be malicious, it might be identified as a target of lawful intercept, or other triggers may identify a flow or set of packets as one that requires rigorous analysis and long-term storage. NVRAM augments the storage hierarchy with new high-bandwidth and low-latency capabilities that can be considered for use in NFV systems. This allows rapid data storage and access that survives power outages or system restarts and may assist in some types of NFV processing.

2.4. Security in Shared Memory Systems

Security is a critically important design feature for future large-scale computer systems. The consolidation of computing needs onto shared hardware infrastructure requires strong access control and partitioning, as clients with conflicting business interests often share hardware. We hope to better understand the secure processing needs for NFV systems.

We hope to explore security needs for NFV and how this may relate to secure communication APIs, secure system-wide memory allocation, or other security issues for large-scale multiprocessor systems that relate to NFV.

3. Messaging is important

A key component of our architecture work focuses on implementing high-performance messaging communications for large non-coherent shared memory multiprocessors. Efficient communication across the extended machine is essential to most tasks. Our low-level messaging architecture is designed to support both modern high-performance messaging interfaces as well as traditional software interfaces such as TCP/IP sockets. A goal for this effort is to design and evaluate non-coherent memory architectures along with appropriate messaging interfaces as a high-performance platform for important NFV applications and to tune and refine software and hardware prototypes for these applications.

3.1. Messaging simplifies programs

Shared memory has been studied as a complex and powerful parallel programming abstraction for about half a century. Even after all this work, highly parallel programs built on shared memory are generally recognized as a class of programs that are hard to scale, verify, and debug. Shared memory parallel programs often exhibit problems such as decreasing performance with increasing scale, race conditions, deadlocks, starvation, and other hard-to-debug program errors. Techniques are needed that exploit shared memory hardware without unleashing the full complexity of shared memory programming across all applications.

The construction of parallel programs using messaging is a trusted technology in both scientific and commercial applications that limits program complexity and helps us understand and predict program correctness and performance. The use of messaging does not preclude the use of more complex shared memory programming techniques, which can still be used in critical program sections that are carefully written to exploit special shared memory benefits while avoiding shared memory pitfalls.

3.2. Messaging limits fault dependences

When a multi-threaded program is developed using shared memory, the program is typically understood as a single fault domain. When any participating execution thread crashes during the execution of this parallel program, shared memory is left in an unknown state which adversely affects the execution of all participating threads. In one common situation, a thread needs exclusive access to modify an important object and locks access to that object in the shared memory. The thread then crashes before it completes the modification and unlocks the object. A critical resource is now permanently locked and unavailable for use by any other thread. Thus, the entire parallel program may deadlock after a single thread crashes.

Distributed parallel systems are routinely constructed using cooperating processes with messaging to exchange data. When one of the processes crashes, the effect of this failure on other processes is easier to understand. Messages can no longer be sent to or received from the faulty process, but normal execution continues for all of the other processes.

Thus, messaging is often used to simplify process interactions within the design of fault-tolerant systems.

When messaging is layered over shared memory, we still face difficulties in fault isolation among processes that use shared memory through their use of messaging. But now, problems of shared memory fault detection and fault recovery are restricted to a smaller amount of carefully written code within a messaging library. The construction of a fault-tolerant messaging library is a critical goal that is not yet adequately addressed within our work. This is an area where traditional Ethernet-hardware-based messaging has significant advantages.

4. The ZMSG Messaging Architecture

We are developing the ZMSG messaging software architecture, which implements high-performance lossless messaging for large-scale non-coherent shared-memory multiprocessors. The central role of ZMSG is to provide high-performance inter-node communications that can transport messages across non-coherent shared memory and can be extended to incorporate node-to-node transport across Ethernet or other networks. ZMSG defines a low-level API on which a number of higher-level communication services will be layered. Goals include high performance, scalability, and security. The ZMSG design is evolutionary in nature, and early goals pursue a software-only approach for message delivery using operations such as load, store, cache flush, and memory atomic operations that are performed by today's multicore processors from multiple vendors.

4.1. Global Shared Memory Overview

ZMSG is designed for large non-coherent shared memory systems. For scalability, local communications within an SMPM and global communications across SMPMs are treated differently. For local communications, the improved performance and simplicity of a fully-coherent shared memory can be used. For communications among nodes, care must be taken to accommodate non-coherent shared memory limitations.

[Figure 3: Global Shared Memory. A cluster manager and a global memory allocation service coordinate SMPM nodes; on each node, a node manager, a kernel service, and user processes map kernel and user address apertures into a shared physical address aperture.]

Since no specialized NIC or other hardware is currently planned to assist ZMSG in its communication tasks, ZMSG runs using kernel processes that provide autonomous capabilities that can be overlapped with user execution. ZMSG uses one or more CPU cores within multiprocessor nodes to run kernel processes that provide the autonomous network capabilities which, in a conventional system, are provided by NIC hardware.

Figure 3 provides an overview of shared memory as used by the ZMSG system. A cluster manager controls interactions among multiple SMPM nodes. We define an aperture as a contiguous range of memory addresses that can be used by each process to access shared memory locations that are accessed by other processes through their own apertures. Software is written to control memory management hardware and to create these apertures specifically for the purpose of cross-node communications. The cluster manager performs all tasks needed to coordinate the use of apertures among the ZMSG nodes. A shared physical address aperture can be mapped into each of the nodes to support communications between all nodes. One function of the cluster manager is to acquire a physical memory aperture from a global memory allocation service and to make the same aperture available to kernel software running on each of the ZMSG nodes. This provides non-coherent memory access to physical memory that can be shared among all of the nodes. It should be possible to build a parallel and distributed cluster manager for very large clusters. To achieve adequate management performance, this may involve the introduction of hierarchical management or some other means to decompose cluster management into independent parallel tasks.
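The sketch below shows one way a process might map a window of such a shared physical address aperture into its address space. The device path and mmap-based interface are assumptions for illustration; on a real ZMSG node the node manager and fabric driver would provide this mapping after performing their checks.

```c
/* Minimal sketch of mapping part of the shared physical address aperture into
 * a process. The device path and offset convention are hypothetical; on a real
 * ZMSG node the node manager would hand out the mapping after its security
 * checks rather than letting a process open the fabric device directly. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    const size_t window = 16 * 1024 * 1024;      /* 16 MiB of the aperture   */
    const off_t  offset = 0;                     /* offset within aperture   */

    int fd = open("/dev/ncmf_aperture", O_RDWR); /* hypothetical fabric dev  */
    if (fd < 0) { perror("open"); return 1; }

    void *base = mmap(NULL, window, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, offset);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* `base` now addresses fabric memory shared with other nodes; stores
     * still require explicit flushes before remote nodes can observe them. */
    volatile unsigned long *slot = (volatile unsigned long *)base;
    *slot = 42;

    munmap(base, window);
    close(fd);
    return 0;
}
```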

Each node runs a node manager, a ZMSG kernel service, and a number of user processes. The node manager is responsible for processing local management requests. This may involve checks for correctness or security checks needed before setting up a connection. The kernel service provides shared services that are autonomous and not synchronized with user process execution. User processes are typical endpoints for ZMSG communications. The node manager is also responsible for opening up user and kernel virtual address apertures into the shared physical address aperture. A kernel address aperture allows protected access into data shared by many other systems. The user address aperture is created to support direct user-to-user communications that bypass all operating system interaction.

4.2. Four quadrants of communication performance

The ZMSG library is designed for good performance in four communication extremes. We can then refine this architecture for additional performance improvements. Figure 4 illustrates extreme behaviors in communication performance. Two vertical columns separate simple two-party communications from more complex all-to-all communications. In a two-party communication, a stream of messages is sent from a single transmitter to a single receiver. We call our solution architecture for this problem Bicomm. A simplified view of Bicomm is shown in Figure 5.

Figure 4: Communication Extremes

                   Two-Party           All-to-all
  Short Message    Bicomm immediate    Datagram immediate
  Long Message     Bicomm indirect     Datagram indirect

Each ZMSG interface supports both send and receive operations. For each forward communication path we provide a return path, which is needed to support message acknowledgement and buffer release management. A client desiring a unidirectional port can send and not receive on that port, or receive and not send on that port, if this is desired. ZMSG communications are in order and lossless.

[Figure 5: Simplified View of ZMSG Bicomm. Two users exchange messages through paired send and receive interfaces, each backed by a receive queue.]

In more complex all-to-all communications, each receiver must be prepared to receive a message from one of many senders. If each receiver were limited to using a separate Bicomm interface to connect to each of many senders, then receivers would need to poll many interfaces while looking for each inbound message. Instead, we provide a lossless Datagram interface which allows each receive interface to be used as a common destination for many senders. Figure 6 presents a simplified view of Datagram.

[Figure 6: Simplified View of ZMSG Datagram. Many users send through the Datagram service into a single receiver's receive queue, while each sender also has its own receive queue for return traffic.]

The two rows in Figure 4 separate short-message from long-message communications. Short messages may be start or completion signals or other short command or data strings. Performance limits for short messages are usually measured as achieved low latency or achieved high message rate. Long messages are used to move a large volume of data between program threads. Here, performance is often measured as an achieved high data rate.

ZMSG uses the immediate mode to move short messages and the indirect mode to move long messages. While the immediate mode passes the actual data contents through send and receive interfaces, the indirect mode passes a handle or pointer to data instead of the actual data through the interface. The movement of data in the indirect mode will be performed by a user copy loop, by DMA hardware, or by system software that mimics desired but missing DMA hardware.
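To make the two interface styles concrete, the sketch below contrasts how a client might use them. Every identifier is hypothetical; the report does not define a public ZMSG API at this level of detail.

```c
/* Hypothetical usage sketch contrasting the two ZMSG interface styles. The
 * types and prototypes are declared here only so the sketch is self-contained;
 * none of these names come from the report, and error handling is omitted. */
#include <stddef.h>

typedef struct zmsg_bicomm zmsg_bicomm_t;
typedef struct zmsg_lport  zmsg_lport_t;
typedef unsigned long      zmsg_remote_id_t;
enum { ZMSG_MTU = 2048 };

void   zmsg_bicomm_send(zmsg_bicomm_t *p, const void *buf, size_t len);
size_t zmsg_bicomm_recv(zmsg_bicomm_t *p, void *buf, size_t cap);
void   zmsg_dgram_send(zmsg_lport_t *l, zmsg_remote_id_t dst,
                       const void *buf, size_t len);
size_t zmsg_dgram_recv(zmsg_lport_t *l, zmsg_remote_id_t *src,
                       void *buf, size_t cap);

/* Two-party Bicomm: one fixed peer, no per-message addressing. */
void bicomm_example(zmsg_bicomm_t *port)
{
    char req[64] = "short immediate-mode request";
    zmsg_bicomm_send(port, req, sizeof req);          /* in-order, lossless   */

    char reply[ZMSG_MTU];
    size_t n = zmsg_bicomm_recv(port, reply, sizeof reply);  /* from the peer */
    (void)n;
}

/* All-to-all Datagram: one receive queue shared by many senders, so every
 * send names a destination and every receive reports a trusted source. */
void datagram_example(zmsg_lport_t *lport, zmsg_remote_id_t server)
{
    char req[64] = "request to one of many possible peers";
    zmsg_dgram_send(lport, server, req, sizeof req);

    zmsg_remote_id_t src;
    char msg[ZMSG_MTU];
    zmsg_dgram_recv(lport, &src, msg, sizeof msg);    /* src may be any peer  */
}
```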

4.3. ZMSG security overview

We need an architecture that can ensure secure communications. Fine-grained access controls are needed to ensure the authenticity of communicating parties as well as the privacy of communications. Our goal is to engineer a system where security features may penalize the performance of the connection setup process but should not penalize message transfer performance after connection setup is complete.

Our architecture uses a software approach which assumes that while user software cannot be trusted, kernel software can be trusted. We assume that secure procedures will be developed to authenticate kernel software and ensure that kernel code is trusted. While future hardware may assist in providing additional security, it may not allow complex fine-grained security policies for a large number of logical communication endpoints. At this time, ZMSG relies on kernel-level protection for shared memory access to limit the scope of data access to authorized clients. ZMSG does not currently use encryption. Of course, any ZMSG client can encrypt data before sending it over ZMSG.

5. ZMSG's Bicomm Two-Party Communications

Bicomm provides high-performance two-party communications across ZMSG. This is done using OS-bypass interfaces which allow direct user-to-user communications without operating system overhead on each message send and receive. A secure protocol interacts with the ZMSG manager to provide access into a shared region that is mapped as an aperture into the address space of two users. Since trusted ZMSG kernel software provides two-party access to page-protected user memory, we can ensure that only authorized processes can access data through the shared user aperture provided by the ZMSG manager. Each of the endpoint processes accesses the shared region with library software that implements lock-free queues using appropriate cache flush operations. A simplified Bicomm API provides easy-to-use send and receive message constructs that insulate the user from the low-level complexity of efficiently transferring cache lines across non-coherent shared memory.
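As a rough illustration of such a queue, the sketch below shows a single-producer, single-consumer ring placed in the shared user aperture. The slot layout, fixed message size, and flush helper are assumptions and are not the actual Bicomm queue format.

```c
/* Illustrative single-producer/single-consumer ring over non-coherent shared
 * memory, loosely in the spirit of the Bicomm queues described above. The
 * slot layout and the flush helper are assumptions, not the ZMSG format. */
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>

#define SLOTS     64
#define SLOT_SIZE 256           /* immediate-mode payload bytes per slot */

struct slot { uint64_t seq; uint8_t data[SLOT_SIZE]; };

struct ring {                   /* lives inside the shared user aperture */
    volatile uint64_t head;     /* written by consumer                   */
    volatile uint64_t tail;     /* written by producer                   */
    struct slot slots[SLOTS];
};

static void flush_range(const volatile void *p, size_t len)
{
    for (size_t off = 0; off < len; off += 64)
        _mm_clflush((const void *)((const volatile char *)p + off));
    _mm_sfence();
}

/* Producer side: copy the payload, flush it, then publish the new tail. */
static int ring_send(struct ring *r, const void *msg, size_t len)
{
    uint64_t t = r->tail;
    if (t - r->head == SLOTS || len > SLOT_SIZE)
        return -1;                               /* full or oversized      */
    struct slot *s = &r->slots[t % SLOTS];
    memcpy(s->data, msg, len);
    s->seq = t + 1;
    flush_range(s, sizeof *s);                   /* data before tail       */
    r->tail = t + 1;
    flush_range(&r->tail, sizeof r->tail);       /* make tail visible      */
    return 0;
}

/* Consumer side: re-read the tail from memory, copy out, advance the head. */
static int ring_recv(struct ring *r, void *msg)
{
    flush_range(&r->tail, sizeof r->tail);       /* drop stale cached tail */
    uint64_t h = r->head;
    if (h == r->tail)
        return -1;                               /* empty                  */
    struct slot *s = &r->slots[h % SLOTS];
    flush_range(s, sizeof *s);                   /* fetch fresh slot data  */
    memcpy(msg, s->data, SLOT_SIZE);
    r->head = h + 1;
    flush_range(&r->head, sizeof r->head);       /* return the slot        */
    return 0;
}
```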

5.1. Bicomm Security

The protocol used to set up two-party Bicomm connections provides a first example of how all low-level ZMSG security mechanisms operate. Similar principles are used to secure ZMSG Datagrams and ZMSG RDMA apertures.

Figure 7 illustrates the procedure to set up a secure Bicomm connection. In step 1, user 1 creates a unique secret name for the new Bicomm connection. This can be done using a secure naming service or by generating a unique random name of sufficient length. User 1 then submits a Create Bicomm command along with the secret name to the ZMSG cluster manager, requesting a new Bicomm channel. A secure communication path to the cluster manager is required for this transaction. The ZMSG manager verifies that the secret name is not already in use and, if all is OK, replies to this action with a port handle that can be used for high-performance messaging. In step 2, user 1 then uses a secure key distribution mechanism ("ESP" is shown in Figure 7) to provide the secret name for user 1's Bicomm channel to user 2 (and only to user 2). In step 3, user 2 submits a Join Bicomm command using the same name to the ZMSG manager. After checking to be sure that the referenced Bicomm already exists, the ZMSG manager creates the second Bicomm port and returns a port handle to user 2. Setup is now complete, and either user is free to issue send and receive commands on their respective Bicomm ports. Either user can be confident that the parties sending or receiving messages on this Bicomm connection are the two parties that exchanged a secret name in step 2.

Each Bicomm connection uses a memory region that is mapped into the physical address space of the SOCs for both communicating endpoints. Some subset of that memory region is mapped as a Bicomm port into the user address space of both communicating processes.

[Figure 7: Bicomm secure connection setup. User 1 creates the Bicomm with a unique secret (step 1), shares the secret with user 2 over ESP (step 2), and user 2 joins with the same secret (step 3); the ZMSG manager returns port handles that front lock-free queues in a two-user page within a shared all-kernel memory region obtained from the global memory manager.]

6. ZMSG's Datagram Service for Multi-Way Communications

6.1. Datagram overview

The ZMSG kernel Datagram service provides the most flexible messaging (along with the largest overhead) among our ZMSG communication interfaces. Its flexibility arises from a user-to-kernel-to-user architecture. By passing all messages through trusted kernel software, we gain two powerful capabilities. First, shared queues are accessed by trusted kernel code. This contributes to both the reliability and the security of the shared Datagram service. Second, the kernel can provide autonomous hardware DMA acceleration that is not synchronized with the sending or receiving processes. This is needed for autonomous transport such as RDMA.

[Figure 8: ZMSG Datagram Overview. User Lports with logical send/receive queues in coherent shared memory feed the kernel send and receive services; each node's kernel physical receive queue, reached through a remote map over the non-coherent shared memory fabric, exchanges messages with the physical ports of other nodes.]

The kernel Datagram service implements lossless Datagram-style messaging. A single physical port and multiple logical ports (Lports) can be deployed on each node. Figure 8 provides an overview of a node with its physical port and its Lports. Each physical port provides a receive queue that can be used as a destination for message insertion by many remote senders. Physical receive queues are accessible only by the kernel and provide a foundation for message transport across the non-coherent memory fabric.

When a user places a message into an Lport send queue, a kernel send service observes the arrival of the message and begins processing. The identity of the destination physical port is determined, and a remote physical port map is used to identify the address location of the remote physical port. The send service on this node enqueues the data into the physical receive queue of the remote physical port. This enqueue transaction is carefully designed to support multi-threaded insertion over non-coherent shared memory. The message is now processed by the remote receive service.

When a message is deposited into a physical receive queue, the receive service observes the message arrival and begins to process the message. The destination Lport identifier is extracted from the message, a lookup is performed to find the address location of the Lport's receive queue, and the message is delivered. The message is now available in a receive queue within the receiving user's virtual memory.
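A rough sketch of the message framing and kernel-side demultiplexing that this implies is shown below. The header fields, queue helpers, and function names are invented for illustration; the actual ZMSG message format is not specified in this report.

```c
/* Illustrative framing and kernel-side demux for the Datagram service. The
 * header layout, queue types, and helper names are invented for this sketch;
 * the report does not specify the actual ZMSG wire format. */
#include <stdint.h>
#include <stddef.h>

#define FABRIC_MTU 2048

struct dgram_hdr {
    uint32_t dst_lport;      /* logical port on the destination node        */
    uint32_t src_pport;      /* sender's physical port (trusted source id)  */
    uint32_t src_lport;      /* sender's logical port                       */
    uint32_t length;         /* payload bytes that follow                   */
};

struct dgram_msg {
    struct dgram_hdr hdr;
    uint8_t payload[FABRIC_MTU];
};

/* Assumed helpers: the physical receive queue lives in fabric memory and the
 * Lport receive queues live in node-local coherent memory. */
int  pport_rxq_pop(struct dgram_msg *out);                  /* 0 on success  */
int  lport_rxq_push(uint32_t lport, const struct dgram_msg *m);
void lport_notify(uint32_t lport);                          /* wake receiver */

/* Kernel receive service: drain the shared physical receive queue and
 * deliver each message into the destination Lport's local receive queue. */
void kernel_receive_service(void)
{
    struct dgram_msg m;
    while (pport_rxq_pop(&m) == 0) {
        if (m.hdr.length > FABRIC_MTU)
            continue;                         /* malformed; drop it          */
        if (lport_rxq_push(m.hdr.dst_lport, &m) == 0)
            lport_notify(m.hdr.dst_lport);    /* signal a blocked receiver   */
        /* else: should not happen with credit-based flow control (Sec. 6.5) */
    }
}
```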

6.2. Protocol to add secure physical ports

A ZMSG ring is defined as a collection of physical ports that communicate with each other. The ring is used to establish kernel-level communications between multiple nodes. Each physical port on a ring can communicate with any other physical port on that ring. More than one ring can be defined, as might occur when a datacenter is partitioned among multiple tenants. Shared memory hardware capabilities may limit each node's access to specific rings. When ZMSG is initialized, a secure protocol ensures that kernel access to each ZMSG ring is authorized.

[Figure 9: Physical port creation protocol. The cluster manager issues Create Ring to the ZMSG manager (step 1), distributes the ring name to authorized kernel modules over ESP (step 2), and each kernel module issues Create Port for its own physical port on the ring (step 3), receiving a port handle in return.]

Figure 9 illustrates the steps used to create and ensure secure access to a ZMSG ring. In step 1, a cluster manager creates a unique (possibly secret) ring name. A Create Ring command is sent to the ZMSG manager along with the suggested ring name, and a reply indicates success. Step 2 uses a key exchange mechanism (shown as "ESP") to distribute the ring name over a secure channel to each of the kernel entities that is authorized to access the ring. In step 3, each of the kernel entities independently creates its own physical port. Each kernel entity creates a physical port name that is unique among the physical ports on the ring. Each kernel entity submits a create command with ring and physical port name parameters to the ZMSG manager. The ZMSG manager creates the physical port and returns a handle to the requesting kernel entity.

6.3. Protocol to add secure Lports

After each kernel entity has created a physical port, any client process that needs ZMSG communication services can create an Lport, which serves as an endpoint for Datagram communications. Figure 10 shows the Lport creation protocol. A client wishing to create an Lport submits a unique (possibly secret) Lport name, along with a resource request, to its physical port. The physical port replies with an Lport handle which can be used for subsequent high-performance communications. The resource request specifies the size of the requested receive buffer. Large receive buffers are needed for large messages and for high fan-in communications when many senders may send simultaneously to the same receive port. While this completes the creation of an Lport, no communication with any other Lport has been authorized. Establishing an actual communication capability is shown in the next section.

[Figure 10: Lport Creation Protocol. User code submits a Create Logical Port request with a resource reservation to its physical port and receives an Lport handle; many Lports, polled through coherent memory, share one physical port on the physical clique.]

6.4. Adding an Lport-to-Lport Connection

A connection can be added between any two Lports that are attached to physical ports on the same ring. Only a single connection can be added between a pair of Lports. Establishing a connection provides both an authorization to communicate and the resources (credits) that are needed to communicate without potential data loss.

[Figure 11: Adding an Lport connection. Two clients exchange port names over ESP (step 1), then each submits an Open Remote Connection request to its own Lport (steps 2 and 3) and receives a local remote-port ID.]

Figure 11 illustrates the steps needed to create a connection between a pair of Lports. Connection creation begins in step 1 when both clients use some key exchange mechanism (shown as "ESP") to exchange physical port and Lport names as a means to authorize remote access. In step 2, a client submits an open remote connection request to the Lport. The request provides the name of the remote physical port, the name of the remote Lport, and a resource request to obtain buffer credits needed to send on this connection. The open connection request returns the local ID for the secure remote port. This local ID serves as a destination address for a send operation or as a trusted source address for a receive operation. The other client duplicates the step 2 procedure (on the other side of the connection) in step 3. After all steps are complete, bidirectional communication is established between the clients who exchanged keys in step 1.
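Putting Sections 6.2 through 6.4 together, one side of a connection might be set up roughly as in the sketch below. All identifiers are hypothetical and only mirror the sequence of management operations described above.

```c
/* Hypothetical end-to-end setup sequence for one side of an Lport-to-Lport
 * connection, mirroring Sections 6.2-6.4. All types and functions are
 * invented for illustration; the real control-plane API is not public. */
#include <stddef.h>

typedef struct zmsg_lport zmsg_lport_t;
typedef unsigned long     zmsg_remote_id_t;

/* Assumed control-plane calls (the kernel has already joined the ring and
 * created this node's physical port per Section 6.2). */
zmsg_lport_t    *zmsg_lport_create(const char *lport_name, size_t rx_bytes);
int              exchange_names_out_of_band(const char *my_pport,
                                             const char *my_lport,
                                             char *peer_pport, char *peer_lport);
zmsg_remote_id_t zmsg_open_remote(zmsg_lport_t *lp,
                                  const char *peer_pport,
                                  const char *peer_lport,
                                  unsigned send_credits);

zmsg_remote_id_t setup_one_side(void)
{
    /* Section 6.3: create a local Lport with a receive buffer reservation. */
    zmsg_lport_t *lp = zmsg_lport_create("secret-lport-name", 64 * 1024);

    /* Section 6.4 step 1: exchange port names over a secure channel (ESP). */
    char peer_pport[64], peer_lport[64];
    exchange_names_out_of_band("my-pport", "secret-lport-name",
                               peer_pport, peer_lport);

    /* Section 6.4 steps 2/3: each side opens the remote connection and gets
     * a local ID usable as a send destination and trusted receive source. */
    return zmsg_open_remote(lp, peer_pport, peer_lport, /*credits=*/16);
}
```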

6.5. Credit based flow control

ZMSG supports lossless Datagram messaging, and when a stream of messages is sent we must guarantee that each message will be delivered. When many sending Lports transmit messages to the same receiving Lport, a mechanism is needed to delay senders that compete for access to a shared receiver which has limited receive bandwidth and buffer memory. Such lossless transport requires end-to-end flow control. It is critical that any messages that have been deposited into the ring can be drained into target Lport buffers, to prevent physical port congestion from causing a loss of service among users who compete for physical port access. Credit-based flow control limits the rate at which new messages can be submitted to each sending Lport so that the physical ring can always be drained and every message in a physical receive queue can be moved into its destination logical port without dropping messages. Credits are managed in kernel software to ensure congestion-free transmission through the shared physical ring.

Each Lport has a receive queue that provides a fixed number of message slots that are allocated when the Lport is created. Each slot can contain a single message whose maximal size (in immediate mode) is limited to a number of bytes specified by a fabric MTU. When a receive buffer is shared among many senders, each sender needs a credit to guarantee an empty buffer slot at the remote target receiver before the message is accepted by the ZMSG message server and sent to that receiver. When the message is sent, the sender's pool of available credits is decremented by one. When the message is delivered, processed, and de-allocated, the credit is returned to the sender from which the message came. This refreshes the sender's pool of credits and allows a continuous stream of messages from senders to receivers. Dynamic adjustment of the number of credits that each sender controls is possible as each sender's needs change, but this feature is not yet implemented. In the future, receive buffer allocation and credit management could consider the size of each message. Currently, all messages are treated as maximal in size.

6.6. Cross-node signaling

The ZMSG kernel Datagram service can support signaling across a non-coherent memory fabric among multiple SOCs. A client can send a signaling message having a destination address that specifies an Lport. The ZMSG kernel receive service processes the received message from the physical port and deposits a message into the Lport receive queue to provide metadata associated with the signal. The kernel service thread can also send a Linux signal to a client thread which implements a signal handler that may be suspended and waiting for that signal.

While our current ZMSG API is not faithful to the InfiniBand verb specification, our signaling API tries to follow InfiniBand Verbs. Current software supports a completion queue as a component of each Lport. Each completion action associated with an Lport inserts a completion queue entry into the Lport's completion queue. A completion channel can be associated with the Lport. A process can poll the completion queue and, after polling an empty queue for some period of time, the process can wait for a signal (block) on the associated notification channel. This reduces the wasted CPU cycles associated with a user process that is reading an empty input channel. After a signal is received notifying the user process of a newly received message, the process resumes polling to receive additional messages that are now present.
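A receive loop that follows this poll-then-block pattern might look like the sketch below. The completion-queue calls are hypothetical and only loosely modeled on InfiniBand Verbs, as the text above suggests.

```c
/* Illustrative poll-then-block receive loop in the spirit of Section 6.6.
 * The cq/channel API below is hypothetical (loosely Verbs-like), invented so
 * the control flow can be shown; it is not the actual ZMSG signaling API. */
typedef struct zmsg_cq      zmsg_cq_t;       /* per-Lport completion queue  */
typedef struct zmsg_channel zmsg_channel_t;  /* blocking notification chan  */
typedef struct { unsigned lport; unsigned bytes; } zmsg_cqe_t;

int  zmsg_cq_poll(zmsg_cq_t *cq, zmsg_cqe_t *cqe);   /* 1 = entry, 0 = empty */
void zmsg_channel_arm(zmsg_cq_t *cq);                /* request a wakeup     */
void zmsg_channel_wait(zmsg_channel_t *ch);          /* block until signaled */
void handle_message(const zmsg_cqe_t *cqe);

void receive_loop(zmsg_cq_t *cq, zmsg_channel_t *ch)
{
    enum { SPIN_BUDGET = 10000 };            /* poll this long before blocking */
    for (;;) {
        zmsg_cqe_t cqe;
        int spins = 0;
        while (zmsg_cq_poll(cq, &cqe) == 0) {        /* busy-poll while hot  */
            if (++spins > SPIN_BUDGET) {
                zmsg_channel_arm(cq);                /* ask for a signal     */
                if (zmsg_cq_poll(cq, &cqe))          /* re-check, avoid race */
                    break;
                zmsg_channel_wait(ch);               /* sleep until wakeup   */
                spins = 0;
            }
        }
        handle_message(&cqe);                        /* then resume polling  */
    }
}
```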

6.7. Autonomous Transport (RDMA)

The ZMSG kernel Datagram service is an appropriate platform for the deployment of autonomous data transport such as RDMA. A major obstacle arises for platforms that do not have memory mover (or DMA) hardware. While a kernel software thread of execution can be substituted for missing DMA hardware, a single software thread must provide copy support for multiple Lport clients. The performance of a shared copy service that is implemented using a single thread of execution may be disappointing. We advocate and look toward the incorporation of powerful DMA hardware into architectures supporting ZMSG.

7. ZMSG User-to-User Datagram

A faster and simpler version of the Datagram service is also planned but not yet implemented. This service is called the user-to-user Datagram interface. Since this service eliminates kernel intervention, it can provide lower latency between its client ports. However, without kernel intervention, the service cannot implement autonomous RDMA transport and, since all code is untrusted user code, the interface cannot enforce selective security among its ports.

A single shared memory region is mapped into the user virtual address space of a set of user processes. Each client invokes user library code to manipulate non-coherent data structures within this shared region. The invocation of a send command identifies the address of the destination receive queue, checks the receive queue for a full condition and, if there is sufficient space, pushes the new message into the queue. A receive command returns the message at the head of the receive queue.

Here, the untrusted user code running at each SOC can be modified so as to perform arbitrary modifications on data structures in the shared region. Clients that share a memory region must trust each other to behave properly. The architecture cannot prevent denial of service, cannot selectively allow transport among pairs of ports, and messages are subject to falsification of source origin.

The simplicity of the user-to-user Datagram service eliminates costly messaging overhead. Unlike the kernel Datagram service, each message is sent directly from a sending client thread to a receiving client thread without processing by any intermediate thread. A similar kernel-to-kernel Datagram service can also be defined. This is designed exactly as the user-to-user Datagram, except that all clients are kernel execution threads. This can be used to provide a general kernel-to-kernel messaging facility.
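The check-for-space-then-push behavior of these Datagram queues, with many senders targeting one receive queue, is illustrated by the bounded multi-producer queue sketched below. The sketch assumes the shared region is coherent among the participating senders, or that the platform offers fabric atomics with equivalent semantics; it shows only the claim-then-publish structure, not the ZMSG data layout.

```c
/* Illustrative bounded multi-producer, single-consumer queue (per-cell
 * sequence numbers, Vyukov-style). All layout choices are assumptions made
 * for this sketch, not the ZMSG user-to-user Datagram format. */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define SLOTS     64            /* power of two */
#define MSG_BYTES 256

struct cell {
    _Atomic uint64_t seq;       /* publication state of this cell */
    uint8_t data[MSG_BYTES];
};

struct mpsc_queue {
    _Atomic uint64_t tail;      /* claimed by many senders */
    uint64_t head;              /* advanced by the single consumer */
    struct cell cells[SLOTS];
};

void mpsc_init(struct mpsc_queue *q)
{
    for (uint64_t i = 0; i < SLOTS; i++)
        atomic_store(&q->cells[i].seq, i);
    atomic_store(&q->tail, 0);
    q->head = 0;
}

/* Many senders: claim a cell with a CAS on the tail, fill it, publish it. */
int mpsc_send(struct mpsc_queue *q, const void *msg, size_t len)
{
    if (len > MSG_BYTES) return -1;
    uint64_t pos = atomic_load(&q->tail);
    for (;;) {
        struct cell *c = &q->cells[pos % SLOTS];
        uint64_t seq = atomic_load_explicit(&c->seq, memory_order_acquire);
        intptr_t dif = (intptr_t)(seq - pos);
        if (dif == 0) {
            if (atomic_compare_exchange_weak(&q->tail, &pos, pos + 1)) {
                memcpy(c->data, msg, len);
                atomic_store_explicit(&c->seq, pos + 1, memory_order_release);
                return 0;
            }
        } else if (dif < 0) {
            return -1;               /* queue full: the sender lacks a slot */
        } else {
            pos = atomic_load(&q->tail);
        }
    }
}

/* Single consumer: take the head cell once its seq says it is published. */
int mpsc_recv(struct mpsc_queue *q, void *msg)
{
    struct cell *c = &q->cells[q->head % SLOTS];
    uint64_t seq = atomic_load_explicit(&c->seq, memory_order_acquire);
    if ((intptr_t)(seq - (q->head + 1)) < 0)
        return -1;                   /* empty or not yet published */
    memcpy(msg, c->data, MSG_BYTES);
    atomic_store_explicit(&c->seq, q->head + SLOTS, memory_order_release);
    q->head++;
    return 0;
}
```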

8. Indirect messaging modes

The ZMSG indirect mode supports the transport of one or more object handles within a message from a sending process to a receiving process. The sending and receiving processes potentially run on distinct nodes running separate OSs. A handle is a reference to a contiguous region of shared memory that can be translated into a starting virtual address by each process that references that region. Any word within the region can be addressed using an offset from the beginning virtual address that is identified by the handle. When a handle is initialized, a means is provided (e.g. a function or table entry) to translate the handle into a local virtual address suitable for accessing the object. When a region is shared across multiple nodes, a location-independent handle can be sent between nodes and translated into a local virtual address that is used to load or store data inside the region.

The rest of this section focuses on the use of handles for indirect-mode communication between concurrent threads that communicate across multiple nodes in a load-store domain. When using handles, the actual copying of data (if necessary) can be performed by each user process rather than by a messaging service. Since each node runs many threads in parallel, copying can be performed with more parallelism than would be possible if a single kernel thread were used to implement a shared (software) DMA copy service.

ZMSG messages can be used to support either a send-side or receive-side update model, described in more detail below. Each of these communication models communicates across a load-store domain by making a change to shared data and informing one or more remote nodes about the completion of this data update.
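A location-independent handle of this kind might be represented and translated as in the sketch below; the two-field encoding and the per-node region table are assumptions chosen for illustration.

```c
/* Illustrative location-independent handle and per-node translation table.
 * The encoding (region id + offset) and table layout are assumptions; the
 * report only requires that some function or table entry perform this
 * translation on each node. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t region_id;      /* names a registered shared region, cluster-wide */
    uint64_t offset;         /* byte offset of the object within that region   */
} zhandle_t;

struct region_entry {        /* one entry per registered region, per node      */
    void   *local_base;      /* where this node mapped the region, or NULL     */
    size_t  length;
};

#define MAX_REGIONS 1024
static struct region_entry region_table[MAX_REGIONS];

/* Translate a handle received in a message into a local virtual address.
 * Returns NULL if the region is unknown here or the offset is out of range. */
static void *zhandle_to_ptr(zhandle_t h, size_t object_len)
{
    if (h.region_id >= MAX_REGIONS)
        return NULL;
    struct region_entry *re = &region_table[h.region_id];
    if (re->local_base == NULL || h.offset + object_len > re->length)
        return NULL;
    return (char *)re->local_base + h.offset;
}
```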

8.1. ZDMA RDMA for ZMSG

Figure 12 illustrates pseudocode which explains an RDMA interface we call ZDMA, which is built on ZMSG to provide system-wide communications. This augments messaging as implemented by IP sockets, MPI, or some other messaging library. Unlike messaging, ZDMA provides address-based remote memory access that can be used across large multiprocessor systems. Our ZDMA interface can be hosted on a variety of platforms since low-level RDMA transport can be implemented using coherent shared memory, non-coherent shared memory, or a networked cluster using RDMA hardware (such as RoCE or InfiniBand). The pseudocode shows the concepts of local buffer registration, key exchange, and remote buffer registration that are needed to enable a connection which allows fast get and put access across a non-coherent shared memory cluster.

Two SMPM machines lie within a larger SMPEXM cluster. The left-hand machine holds a B1 buffer that is local to its memory. In a first step, the B1 buffer is allocated by a function that returns its address (B1A). A local registration defines the B1 buffer as a buffer that can be remotely manipulated using get and put operations. This registration provides a cluster-wide memory manager with information about the buffer's global location within the cluster. Registration ensures that the physical location of the B1 buffer remains invariant so that cross-OS access (which is unaware of any paging operations) can safely reference data. This local registration returns a unique handle (B1H) that globally identifies the buffer. The handle is sent through a message communication channel (e.g. using a TCP socket) to a process running on the SMPM2 machine.

[Figure 12: ZDMA buffer registration and remote access. SMPM1 (running OS1) holds buffer B1, SMPM2 (running OS2) holds buffer B2, and both coordinate through the SMPEXM memory manager:

    SMPM1 (OS1)                          SMPM2 (OS2)
    B1A = Alloc(B1len)                   B2A = Alloc(B2len)
    B1H = L_Register(B1A, B1len)         B2H = L_Register(B2A, B2len)
    MSG_Send(B1H)                        B1H = MSG_Rcv()
                                         (B1RA, B1len) = R_Register(B1H)
    Store(B1A+offsetx, data)             Put(B2A, offset1, B1RA, offset2, len)
    Load(B1A+offsety, data)              Get(B1RA, offset3, B2A, offset4, len)]

The pseudocode for the right-hand machine illustrates remote access into SMPM1's B1 buffer. The right-hand machine allocates and registers a B2 buffer for use in subsequent get and put operations. In order for SMPM2 to access SMPM1's B1 buffer, a handle exchange is performed as SMPM2 reads the handle value using conventional messaging. The B1H handle value can be used in a remote registration call to acquire the buffer's address (B1RA) and length. While such remote addresses can be implemented using non-coherent load and store operations that directly manipulate remote memory, these low-level operations are too complex for most users. Instead, Get and Put library functions are provided to simplify this process.

After both buffers are allocated and registered, get and put operations use handles to provide fast user-mode access to remote memory over the non-coherent fabric. Our pseudocode shows this remote access using a local update loop which updates buffer B1 on the left and a remote access loop which copies data between B1 and B2 on the right. The local update loop on SMPM1 accesses local memory using conventional coherent load and store operations through a pointer to B1 (B1A) in local memory. SMPM2 accesses remote memory using get and put operations that retrieve or insert data from or to remote memory. After all buffer registrations have been performed, fast user-mode get and put operations can be used to randomly access data within a remote buffer. Local and remote nodes run asynchronously, and additional tools are used to synchronize the exchange of data when needed.
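Under those registrations, Get and Put might be built on the remotely mapped address roughly as sketched below. The x86-style flush intrinsics stand in for the platform's real cache-bypass or flush operations, and the argument order simply mirrors the pseudocode in Figure 12; none of this is the actual ZDMA implementation.

```c
/* Illustrative Get/Put built on a remotely mapped address, in the spirit of
 * the ZDMA pseudocode above. R_Register is assumed to have mapped the remote
 * buffer at `remote_base` (B1RA); x86-style flush intrinsics stand in for the
 * platform's real cache-bypass or flush operations on a non-coherent fabric. */
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>

#define CACHELINE 64

static void flush_lines(const volatile void *p, size_t len)
{
    for (size_t off = 0; off < len; off += CACHELINE)
        _mm_clflush((const void *)((const volatile char *)p + off));
    _mm_sfence();
}

/* Put: copy local bytes into the remote buffer and push them to the fabric. */
static void zdma_put(void *local_base, size_t local_off,
                     volatile void *remote_base, size_t remote_off, size_t len)
{
    volatile char *dst = (volatile char *)remote_base + remote_off;
    memcpy((void *)dst, (char *)local_base + local_off, len);
    flush_lines(dst, len);              /* make the stores visible remotely */
}

/* Get: discard any stale cached copy of the remote range, then copy it in. */
static void zdma_get(volatile void *remote_base, size_t remote_off,
                     void *local_base, size_t local_off, size_t len)
{
    volatile char *src = (volatile char *)remote_base + remote_off;
    flush_lines(src, len);              /* force a fresh fetch over the fabric */
    _mm_lfence();
    memcpy((char *)local_base + local_off, (void *)src, len);
}
```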

8.2. Data Ownership

A non-coherent update occurs when distinct nodes run execution threads that concurrently update a shared object stored within a non-coherent shared memory. These distributed updates lead to two serious problems. First, performance is lost when mutually-exclusive updates are enforced across non-coherent shared memory. Slow cross-fabric atomics are used to lock data objects for exclusive modification. Second, when any thread of execution crashes, any shared object that the crashed thread may have modified could be permanently left in a corrupt state. This makes it difficult to contain faults and provide program fault tolerance.

Per-node data-write ownership is used to eliminate distributed updates across a non-coherent shared memory. We say that a data object is owned by a single node if that node, and only that node, modifies data within the object. If each data object has a unique owner, then distributed updates are no longer needed. All requests to update owned data are executed within a single (multiprocessor) node, allowing the use of more efficient local and coherent atomic operations. When a process within a node fails, data owned by that node may be left in a corrupted state, but data owned by all other nodes has not been corrupted. Developing techniques to promote data ownership can simplify fault containment and produce systems that are more tolerant of failures.

8.3. Send-side Update (no ownership)

In a send-side update, a sender updates globally shared data using store and flush operations that are executed by the sending node before an immediate-mode message or signal is sent from the sender to one or more receivers to inform potential receivers of the update completion. A receiver can poll for update messages at a receive port, or the receiver can block while waiting for a signal to indicate the update completion. For example, much like an RDMA put, a sending thread could copy data to update a remote object and then use ZMSG's messaging to signal the completion of the update. Without clear rules for data-write ownership, fault tolerance may suffer. When node hardware or software fails, any object that can be modified by that node may be left in a corrupt state. Without clear ownership rules this may include almost any object in the system.

8.4. Receive-side Update (owned by receiver)

In a receive-side update, a sender decides to update shared data that it does not own. The sender sends a message to the unique owner of that data. The message requests that the receiver perform an update action on an object described by its object handle. The update is completed at the receiver and then acknowledged as complete by returning a message back to the sender. Any node can update any object by sending a request to the owner of that object for a remote update. A node can self-update objects that it owns. This protocol ensures that each data object is owned by a single node which is responsible for all updates to that object. Using this protocol, when any node's software or hardware fails, all data owned by other nodes can be assumed to be in a correct state.
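A receive-side update exchange might be framed as in the sketch below; the message layout and helper names are hypothetical and only mirror the owner-applies-the-update protocol just described.

```c
/* Illustrative receive-side update: a non-owner packages the update as a
 * message; the owning node applies it with local (coherent) operations and
 * acknowledges. All structures and helpers are invented for this sketch. */
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t region_id; uint64_t offset; } zhandle_t;

struct update_req {
    zhandle_t target;          /* object to update, named by its handle     */
    uint32_t  update_len;      /* bytes of new data that follow             */
    uint64_t  request_id;      /* echoed in the acknowledgement             */
    uint8_t   new_data[256];
};

struct update_ack { uint64_t request_id; int32_t status; };

/* Assumed messaging and translation helpers (as in the earlier sketches). */
int   zmsg_send(unsigned dst_lport, const void *msg, size_t len);
void *zhandle_to_ptr(zhandle_t h, size_t len);    /* valid only on the owner */

/* Non-owner side: ask the owner to perform the update. */
int request_update(unsigned owner_lport, zhandle_t target,
                   const void *data, uint32_t len, uint64_t req_id)
{
    struct update_req req = { .target = target, .update_len = len,
                              .request_id = req_id };
    if (len > sizeof req.new_data) return -1;
    memcpy(req.new_data, data, len);
    return zmsg_send(owner_lport, &req, sizeof req);
}

/* Owner side: apply the update locally, then acknowledge the sender. */
void handle_update(const struct update_req *req, unsigned reply_lport)
{
    void *obj = zhandle_to_ptr(req->target, req->update_len);
    struct update_ack ack = { .request_id = req->request_id,
                              .status = (obj != NULL) ? 0 : -1 };
    if (obj != NULL)
        memcpy(obj, req->new_data, req->update_len);  /* local, coherent */
    zmsg_send(reply_lport, &ack, sizeof ack);
}
```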


More information

Multifunction Networking Adapters

Multifunction Networking Adapters Ethernet s Extreme Makeover: Multifunction Networking Adapters Chuck Hudson Manager, ProLiant Networking Technology Hewlett-Packard 2004 Hewlett-Packard Development Company, L.P. The information contained

More information

Accessing NVM Locally and over RDMA Challenges and Opportunities

Accessing NVM Locally and over RDMA Challenges and Opportunities Accessing NVM Locally and over RDMA Challenges and Opportunities Wendy Elsasser Megan Grodowitz William Wang MSST - May 2018 Emerging NVM A wide variety of technologies with varied characteristics Address

More information

Networks and distributed computing

Networks and distributed computing Networks and distributed computing Abstractions provided for networks network card has fixed MAC address -> deliver message to computer on LAN -> machine-to-machine communication -> unordered messages

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604

More information

ECE519 Advanced Operating Systems

ECE519 Advanced Operating Systems IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (10 th Week) (Advanced) Operating Systems 10. Multiprocessor, Multicore and Real-Time Scheduling 10. Outline Multiprocessor

More information

Network Interface Architecture and Prototyping for Chip and Cluster Multiprocessors

Network Interface Architecture and Prototyping for Chip and Cluster Multiprocessors University of Crete School of Sciences & Engineering Computer Science Department Master Thesis by Michael Papamichael Network Interface Architecture and Prototyping for Chip and Cluster Multiprocessors

More information

Datacenter replication solution with quasardb

Datacenter replication solution with quasardb Datacenter replication solution with quasardb Technical positioning paper April 2017 Release v1.3 www.quasardb.net Contact: sales@quasardb.net Quasardb A datacenter survival guide quasardb INTRODUCTION

More information

MODELS OF DISTRIBUTED SYSTEMS

MODELS OF DISTRIBUTED SYSTEMS Distributed Systems Fö 2/3-1 Distributed Systems Fö 2/3-2 MODELS OF DISTRIBUTED SYSTEMS Basic Elements 1. Architectural Models 2. Interaction Models Resources in a distributed system are shared between

More information

Distributed Systems Exam 1 Review Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed Systems Exam 1 Review Paul Krzyzanowski. Rutgers University. Fall 2016 Distributed Systems 2015 Exam 1 Review Paul Krzyzanowski Rutgers University Fall 2016 1 Question 1 Why did the use of reference counting for remote objects prove to be impractical? Explain. It s not fault

More information

GFS: The Google File System

GFS: The Google File System GFS: The Google File System Brad Karp UCL Computer Science CS GZ03 / M030 24 th October 2014 Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one

More information

Multiprocessor Systems. Chapter 8, 8.1

Multiprocessor Systems. Chapter 8, 8.1 Multiprocessor Systems Chapter 8, 8.1 1 Learning Outcomes An understanding of the structure and limits of multiprocessor hardware. An appreciation of approaches to operating system support for multiprocessor

More information

SMD149 - Operating Systems - File systems

SMD149 - Operating Systems - File systems SMD149 - Operating Systems - File systems Roland Parviainen November 21, 2005 1 / 59 Outline Overview Files, directories Data integrity Transaction based file systems 2 / 59 Files Overview Named collection

More information

I/O Systems. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

I/O Systems. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic) I/O Systems Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) I/O Systems 1393/9/15 1 / 57 Motivation Amir H. Payberah (Tehran

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen

More information

Chapter 13: I/O Systems

Chapter 13: I/O Systems Chapter 13: I/O Systems DM510-14 Chapter 13: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations STREAMS Performance 13.2 Objectives

More information

Multiprocessors & Thread Level Parallelism

Multiprocessors & Thread Level Parallelism Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction

More information

Operating system Dr. Shroouq J.

Operating system Dr. Shroouq J. 2.2.2 DMA Structure In a simple terminal-input driver, when a line is to be read from the terminal, the first character typed is sent to the computer. When that character is received, the asynchronous-communication

More information

Introduction to Operating Systems. Chapter Chapter

Introduction to Operating Systems. Chapter Chapter Introduction to Operating Systems Chapter 1 1.3 Chapter 1.5 1.9 Learning Outcomes High-level understand what is an operating system and the role it plays A high-level understanding of the structure of

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Computer-System Organization (cont.)

Computer-System Organization (cont.) Computer-System Organization (cont.) Interrupt time line for a single process doing output. Interrupts are an important part of a computer architecture. Each computer design has its own interrupt mechanism,

More information

Operating Systems : Overview

Operating Systems : Overview Operating Systems : Overview Bina Ramamurthy CSE421 8/29/2006 B.Ramamurthy 1 Topics for discussion What will you learn in this course? (goals) What is an Operating System (OS)? Evolution of OS Important

More information

Silberschatz and Galvin Chapter 12

Silberschatz and Galvin Chapter 12 Silberschatz and Galvin Chapter 12 I/O Systems CPSC 410--Richard Furuta 3/19/99 1 Topic overview I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O requests to hardware operations

More information

IT 540 Operating Systems ECE519 Advanced Operating Systems

IT 540 Operating Systems ECE519 Advanced Operating Systems IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (5 th Week) (Advanced) Operating Systems 5. Concurrency: Mutual Exclusion and Synchronization 5. Outline Principles

More information

Chapter 1: Introduction

Chapter 1: Introduction Chapter 1: Introduction Chapter 1: Introduction What Operating Systems Do Computer-System Organization Computer-System Architecture Operating-System Structure Operating-System Operations Process Management

More information

OceanStor 9000 InfiniBand Technical White Paper. Issue V1.01 Date HUAWEI TECHNOLOGIES CO., LTD.

OceanStor 9000 InfiniBand Technical White Paper. Issue V1.01 Date HUAWEI TECHNOLOGIES CO., LTD. OceanStor 9000 Issue V1.01 Date 2014-03-29 HUAWEI TECHNOLOGIES CO., LTD. Copyright Huawei Technologies Co., Ltd. 2014. All rights reserved. No part of this document may be reproduced or transmitted in

More information

The Client Server Model and Software Design

The Client Server Model and Software Design The Client Server Model and Software Design Prof. Chuan-Ming Liu Computer Science and Information Engineering National Taipei University of Technology Taipei, TAIWAN MCSE Lab, NTUT, TAIWAN 1 Introduction

More information

Chapter 12: I/O Systems

Chapter 12: I/O Systems Chapter 12: I/O Systems Chapter 12: I/O Systems I/O Hardware! Application I/O Interface! Kernel I/O Subsystem! Transforming I/O Requests to Hardware Operations! STREAMS! Performance! Silberschatz, Galvin

More information

Chapter 13: I/O Systems

Chapter 13: I/O Systems Chapter 13: I/O Systems Chapter 13: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations STREAMS Performance Silberschatz, Galvin and

More information

Chapter 12: I/O Systems. Operating System Concepts Essentials 8 th Edition

Chapter 12: I/O Systems. Operating System Concepts Essentials 8 th Edition Chapter 12: I/O Systems Silberschatz, Galvin and Gagne 2011 Chapter 12: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem Transforming I/O Requests to Hardware Operations STREAMS

More information

4. Hardware Platform: Real-Time Requirements

4. Hardware Platform: Real-Time Requirements 4. Hardware Platform: Real-Time Requirements Contents: 4.1 Evolution of Microprocessor Architecture 4.2 Performance-Increasing Concepts 4.3 Influences on System Architecture 4.4 A Real-Time Hardware Architecture

More information

HPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser

HPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser HPX High Performance CCT Tech Talk Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 What s HPX? Exemplar runtime system implementation Targeting conventional architectures (Linux based SMPs and clusters) Currently,

More information

Flavors of Memory supported by Linux, their use and benefit. Christoph Lameter, Ph.D,

Flavors of Memory supported by Linux, their use and benefit. Christoph Lameter, Ph.D, Flavors of Memory supported by Linux, their use and benefit Christoph Lameter, Ph.D, Twitter: @qant Flavors Of Memory The term computer memory is a simple term but there are numerous nuances

More information

Chapter 3: Process Concept

Chapter 3: Process Concept Chapter 3: Process Concept Chapter 3: Process Concept Process Concept Process Scheduling Operations on Processes Inter-Process Communication (IPC) Communication in Client-Server Systems Objectives 3.2

More information

Chapter 3: Process Concept

Chapter 3: Process Concept Chapter 3: Process Concept Chapter 3: Process Concept Process Concept Process Scheduling Operations on Processes Inter-Process Communication (IPC) Communication in Client-Server Systems Objectives 3.2

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Background. 20: Distributed File Systems. DFS Structure. Naming and Transparency. Naming Structures. Naming Schemes Three Main Approaches

Background. 20: Distributed File Systems. DFS Structure. Naming and Transparency. Naming Structures. Naming Schemes Three Main Approaches Background 20: Distributed File Systems Last Modified: 12/4/2002 9:26:20 PM Distributed file system (DFS) a distributed implementation of the classical time-sharing model of a file system, where multiple

More information

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8.

Multiprocessor System. Multiprocessor Systems. Bus Based UMA. Types of Multiprocessors (MPs) Cache Consistency. Bus Based UMA. Chapter 8, 8. Multiprocessor System Multiprocessor Systems Chapter 8, 8.1 We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than

More information

Accelerated Library Framework for Hybrid-x86

Accelerated Library Framework for Hybrid-x86 Software Development Kit for Multicore Acceleration Version 3.0 Accelerated Library Framework for Hybrid-x86 Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8406-00 Software Development Kit

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MOVING FORWARD WITH FABRIC INTERFACES

MOVING FORWARD WITH FABRIC INTERFACES 14th ANNUAL WORKSHOP 2018 MOVING FORWARD WITH FABRIC INTERFACES Sean Hefty, OFIWG co-chair Intel Corporation April, 2018 USING THE PAST TO PREDICT THE FUTURE OFI Provider Infrastructure OFI API Exploration

More information

02 - Distributed Systems

02 - Distributed Systems 02 - Distributed Systems Definition Coulouris 1 (Dis)advantages Coulouris 2 Challenges Saltzer_84.pdf Models Physical Architectural Fundamental 2/58 Definition Distributed Systems Distributed System is

More information

CSCI 4717 Computer Architecture

CSCI 4717 Computer Architecture CSCI 4717/5717 Computer Architecture Topic: Symmetric Multiprocessors & Clusters Reading: Stallings, Sections 18.1 through 18.4 Classifications of Parallel Processing M. Flynn classified types of parallel

More information

IO virtualization. Michael Kagan Mellanox Technologies

IO virtualization. Michael Kagan Mellanox Technologies IO virtualization Michael Kagan Mellanox Technologies IO Virtualization Mission non-stop s to consumers Flexibility assign IO resources to consumer as needed Agility assignment of IO resources to consumer

More information

02 - Distributed Systems

02 - Distributed Systems 02 - Distributed Systems Definition Coulouris 1 (Dis)advantages Coulouris 2 Challenges Saltzer_84.pdf Models Physical Architectural Fundamental 2/60 Definition Distributed Systems Distributed System is

More information

UNIT IV -- TRANSPORT LAYER

UNIT IV -- TRANSPORT LAYER UNIT IV -- TRANSPORT LAYER TABLE OF CONTENTS 4.1. Transport layer. 02 4.2. Reliable delivery service. 03 4.3. Congestion control. 05 4.4. Connection establishment.. 07 4.5. Flow control 09 4.6. Transmission

More information

It also performs many parallelization operations like, data loading and query processing.

It also performs many parallelization operations like, data loading and query processing. Introduction to Parallel Databases Companies need to handle huge amount of data with high data transfer rate. The client server and centralized system is not much efficient. The need to improve the efficiency

More information

Low latency, high bandwidth communication. Infiniband and RDMA programming. Bandwidth vs latency. Knut Omang Ifi/Oracle 2 Nov, 2015

Low latency, high bandwidth communication. Infiniband and RDMA programming. Bandwidth vs latency. Knut Omang Ifi/Oracle 2 Nov, 2015 Low latency, high bandwidth communication. Infiniband and RDMA programming Knut Omang Ifi/Oracle 2 Nov, 2015 1 Bandwidth vs latency There is an old network saying: Bandwidth problems can be cured with

More information

Multiprocessor Systems. COMP s1

Multiprocessor Systems. COMP s1 Multiprocessor Systems 1 Multiprocessor System We will look at shared-memory multiprocessors More than one processor sharing the same memory A single CPU can only go so fast Use more than one CPU to improve

More information

Operating Systems: Internals and Design Principles, 7/E William Stallings. Chapter 1 Computer System Overview

Operating Systems: Internals and Design Principles, 7/E William Stallings. Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles, 7/E William Stallings Chapter 1 Computer System Overview What is an Operating System? Operating system goals: Use the computer hardware in an efficient

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

Boundary control : Access Controls: An access control mechanism processes users request for resources in three steps: Identification:

Boundary control : Access Controls: An access control mechanism processes users request for resources in three steps: Identification: Application control : Boundary control : Access Controls: These controls restrict use of computer system resources to authorized users, limit the actions authorized users can taker with these resources,

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information

Utilizing Linux Kernel Components in K42 K42 Team modified October 2001

Utilizing Linux Kernel Components in K42 K42 Team modified October 2001 K42 Team modified October 2001 This paper discusses how K42 uses Linux-kernel components to support a wide range of hardware, a full-featured TCP/IP stack and Linux file-systems. An examination of the

More information

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 1 Introduction Modified by: Dr. Ramzi Saifan Definition of a Distributed System (1) A distributed

More information

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter Lecture Topics Today: Advanced Scheduling (Stallings, chapter 10.1-10.4) Next: Deadlock (Stallings, chapter 6.1-6.6) 1 Announcements Exam #2 returned today Self-Study Exercise #10 Project #8 (due 11/16)

More information

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568 FLAT DATACENTER STORAGE Paper-3 Presenter-Pratik Bhatt fx6568 FDS Main discussion points A cluster storage system Stores giant "blobs" - 128-bit ID, multi-megabyte content Clients and servers connected

More information

Operating System Architecture. CS3026 Operating Systems Lecture 03

Operating System Architecture. CS3026 Operating Systems Lecture 03 Operating System Architecture CS3026 Operating Systems Lecture 03 The Role of an Operating System Service provider Provide a set of services to system users Resource allocator Exploit the hardware resources

More information

Client Server & Distributed System. A Basic Introduction

Client Server & Distributed System. A Basic Introduction Client Server & Distributed System A Basic Introduction 1 Client Server Architecture A network architecture in which each computer or process on the network is either a client or a server. Source: http://webopedia.lycos.com

More information

Current Topics in OS Research. So, what s hot?

Current Topics in OS Research. So, what s hot? Current Topics in OS Research COMP7840 OSDI Current OS Research 0 So, what s hot? Operating systems have been around for a long time in many forms for different types of devices It is normally general

More information

Real-Time Programming

Real-Time Programming Real-Time Programming Week 7: Real-Time Operating Systems Instructors Tony Montiel & Ken Arnold rtp@hte.com 4/1/2003 Co Montiel 1 Objectives o Introduction to RTOS o Event Driven Systems o Synchronization

More information

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2 Introduction :- Today single CPU based architecture is not capable enough for the modern database that are required to handle more demanding and complex requirements of the users, for example, high performance,

More information

Chapter 2 Computer-System Structure

Chapter 2 Computer-System Structure Contents 1. Introduction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threads 6. CPU Scheduling 7. Process Synchronization 8. Deadlocks 9. Memory Management 10. Virtual

More information

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi 1 Lecture Notes 1 Basic Concepts Anand Tripathi CSci 8980 Operating Systems Anand Tripathi CSci 8980 1 Distributed Systems A set of computers (hosts or nodes) connected through a communication network.

More information

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs 1 Anand Tripathi CSci 8980 Operating Systems Lecture Notes 1 Basic Concepts Distributed Systems A set of computers (hosts or nodes) connected through a communication network. Nodes may have different speeds

More information

by Brian Hausauer, Chief Architect, NetEffect, Inc

by Brian Hausauer, Chief Architect, NetEffect, Inc iwarp Ethernet: Eliminating Overhead In Data Center Designs Latest extensions to Ethernet virtually eliminate the overhead associated with transport processing, intermediate buffer copies, and application

More information

Following are a few basic questions that cover the essentials of OS:

Following are a few basic questions that cover the essentials of OS: Operating Systems Following are a few basic questions that cover the essentials of OS: 1. Explain the concept of Reentrancy. It is a useful, memory-saving technique for multiprogrammed timesharing systems.

More information

Lecture 1 Introduction (Chapter 1 of Textbook)

Lecture 1 Introduction (Chapter 1 of Textbook) Bilkent University Department of Computer Engineering CS342 Operating Systems Lecture 1 Introduction (Chapter 1 of Textbook) Dr. İbrahim Körpeoğlu http://www.cs.bilkent.edu.tr/~korpe 1 References The slides

More information

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel

Chapter-6. SUBJECT:- Operating System TOPICS:- I/O Management. Created by : - Sanjay Patel Chapter-6 SUBJECT:- Operating System TOPICS:- I/O Management Created by : - Sanjay Patel Disk Scheduling Algorithm 1) First-In-First-Out (FIFO) 2) Shortest Service Time First (SSTF) 3) SCAN 4) Circular-SCAN

More information

Chapter 1: Introduction

Chapter 1: Introduction Chapter 1: Introduction Silberschatz, Galvin and Gagne 2009 Chapter 1: Introduction What Operating Systems Do Computer-System Organization Computer-System Architecture Operating-System Structure Operating-System

More information

QuickSpecs. Overview. HPE Ethernet 10Gb 2-port 535 Adapter. HPE Ethernet 10Gb 2-port 535 Adapter. 1. Product description. 2.

QuickSpecs. Overview. HPE Ethernet 10Gb 2-port 535 Adapter. HPE Ethernet 10Gb 2-port 535 Adapter. 1. Product description. 2. Overview 1. Product description 2. Product features 1. Product description HPE Ethernet 10Gb 2-port 535FLR-T adapter 1 HPE Ethernet 10Gb 2-port 535T adapter The HPE Ethernet 10GBase-T 2-port 535 adapters

More information

Concurrent Preliminaries

Concurrent Preliminaries Concurrent Preliminaries Sagi Katorza Tel Aviv University 09/12/2014 1 Outline Hardware infrastructure Hardware primitives Mutual exclusion Work sharing and termination detection Concurrent data structures

More information

Module 16: Distributed System Structures

Module 16: Distributed System Structures Chapter 16: Distributed System Structures Module 16: Distributed System Structures Motivation Types of Network-Based Operating Systems Network Structure Network Topology Communication Structure Communication

More information

The Google File System (GFS)

The Google File System (GFS) 1 The Google File System (GFS) CS60002: Distributed Systems Antonio Bruto da Costa Ph.D. Student, Formal Methods Lab, Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur 2 Design constraints

More information

A unified multicore programming model

A unified multicore programming model A unified multicore programming model Simplifying multicore migration By Sven Brehmer Abstract There are a number of different multicore architectures and programming models available, making it challenging

More information