The Anubis Service. Paul Murray. Internet Systems and Storage Laboratory, HP Laboratories Bristol. HPL-2005-72, June 8, 2005*


The Anubis Service
Paul Murray
Internet Systems and Storage Laboratory, HP Laboratories Bristol
HPL-2005-72
June 8, 2005*

Keywords: timed model, state monitoring, failure detection, network partition

Anubis is a fully distributed state monitoring and failure detection service designed to report distributed system states in a timely and consistent manner. It is based on a temporal model of distributed computing that includes the notion of network partitions. Here we describe the properties of the Anubis service and the protocols that implement them, along with a description of the Anubis service user interface.

* Internal Accession Date Only. Approved for External Publication. Copyright 2005 Hewlett-Packard Development Company, L.P.

The Anubis Service

Paul Murray
Hewlett-Packard Laboratories, Bristol, Filton Road, Bristol BS34 8QZ, UK
murray@hp.com

Abstract

Anubis is a fully distributed state monitoring and failure detection service designed to report distributed system states in a timely and consistent manner. It is based on a temporal model of distributed computing that includes the notion of network partitions. Here we describe the properties of the Anubis service and the protocols that implement them, along with a description of the Anubis service user interface.

1 Introduction

The purpose of an application management system is to configure and maintain applications that run in some computing infrastructure. In Grid and Utility computing [7][10][17] the infrastructure is shared among users, and their applications are given access to it as needed. In this context we are interested in distributed applications that are programmed to be reconfigured on the fly, so that they can be adapted to the available pool of resources, recover from failures, or scale to accommodate changing workloads, with management behaviour that automatically determines when and how to do this adaptation. The management behaviour associated with adaptive applications can be quite complex and application specific. A centralized control point can simplify these issues, but leaves the system exposed to failure of the control point and network partitioning. Another alternative is to fully distribute the control function. This approach is exposed to the complexity of distributed coordination problems. In either case the ability to interpret the actual state of the system is critical to providing effective dynamic management behaviour. Observation of actual system state in a dynamic distributed environment is the key point addressed by the Anubis service. Anubis provides the ability to discover objects, monitor their states, and detect their failure or absence from anywhere in the system.
These three capabilities are provided under a single state based interface that constitutes our view of actual system state. State observation allows multiple management agents to take local actions based on local decision processes, but this is only useful if they can exhibit globally consistent behaviour. Some particular points that affect consistency in this context are as follows:

- The management agents may not be aware of any causality relationships among system components. The interaction among components is not tracked by the management agents, so causal ordering cannot be used as a concurrency model.
- Systems may partition, implying an inability for agents to observe components they are separated from. We would like some form of state consistency criteria that respects partitions, which in turn requires the ability to monitor communication and track partitioning behaviour.
- Any basic state will have a single source but possibly multiple observers (e.g. management agents). We need to maintain consistency between the component and each observer, and among the multiple observers.
- A management agent may choose to compare state observations to determine satisfaction of a distributed predicate or to infer a distributed state (these two actually being equivalent). So we

need to maintain consistency between multiple distributed state observations such that multiple agents can consistently evaluate distributed predicates.

Anubis implements a temporal consistency model and provides time bound guarantees for disseminating state information across a distributed system. Its state delivery semantics provide management agents with a programming model that allows them to coordinate actions without any interaction other than simple state observation. Additionally, state observation encompasses discovery and failure detection, bringing several management functions under a single framework. The focus of this paper is the Anubis service itself; we do not cover examples of its use, although it has been used in several adaptive management systems (e.g. [2] and [9]). A discussion of use cases from these systems can be found in [13].

In the next section we describe work that is related to our own, including the foundations that we have built upon and alternative systems that use different approaches. Section 3 describes the approach taken in the Anubis service. In section 4 we provide a formal statement of the Anubis state observation properties and the protocols used to implement them. This section focuses on the communication layers of our service as they provide the base semantics. The service architecture and implementation are outlined in section 5, including the higher level state dissemination arrangement. The programmatic interface is described in some detail in section 6 along with a toy example of its use. Finally, we conclude in section 7.

2 Related Work

Anubis draws most influence from the fail-aware timed model. The timed model [4] characterizes the degree of asynchrony in a distributed system in terms of communication delay, scheduling delay, and rate of clock divergence, and argues that usually a distributed system will operate within a set of time bounds for these parameters and that when it does it exhibits quantifiable timeliness properties.
Fail-awareness [5] adds the notion of an exception indicator that identifies when these properties cannot be relied upon, allowing the development of protocols that exploit the timed model. In [6] and [12] the authors describe group membership protocols and (in the latter) group communication, based on the timed model. Anubis is based on a variation of the timed model that uses approximately synchronised system clocks. We define the notion of a partition view, a single node's view of timeliness in communication paths, and define consistency properties that relate the partition views of multiple nodes to one another. Our version of the fail-aware indicator is a stability predicate that links the consistency properties of stable and unstable partitions. This approach provides the base set of semantics for communication within an Anubis system. We then build state dissemination protocols that exploit the base semantics to derive our state consistency properties.

In [11] the authors describe an approach to global predicate evaluation that uses a partial temporal order among states constructed from timestamps obtained from approximately synchronised clocks (known as rough real time in their terminology). Their approach allows multiple observers to consistently evaluate predicates over distributed states, providing the ability for multiple management agents to make consistent decisions based entirely on their own observations of distributed states. The state consistency properties of Anubis are intended to support this type of global predicate evaluation.

The Anubis partition view is a type of group membership that has temporal consistency properties. This can be contrasted with the view synchrony properties used by many group membership protocols such as ISIS [3] or Totem [1]. View synchrony defines an order constraint relative to group membership changes that is established using commit-style protocols among group members.
Anubis partition properties are based on a partial order obtained by monitoring the timeliness of communication paths in the background. Anubis has a constant communication overhead in both fail-free and failure cases.

The state dissemination arrangement in Anubis might be compared to publish-subscribe event notification systems such as TIB/Rendezvous [14]. However, Anubis does not perform event notification; it performs state observation. The important difference is that Anubis provides an interpretation of the current state of objects in the system, not their past transitions, and has semantics based on this notion.

Astrolabe is a scalable distributed system monitoring service described in [16]. Astrolabe directly addresses scalability in a way that our service does not. It is modelled on DNS, collecting information in zones with the additional ability to propagate aggregated information between zones. Its programming model is based on standard database interfaces and an extended form of SQL to define aggregation queries that determine how information is propagated. Astrolabe has a consistency model called eventual consistency, a weaker model than our temporal consistency. Astrolabe does not identify when eventual consistency has been achieved, and it is not achieved in the event of network partitions. However, the weakness of this model is deliberate as it supports their objective of high scalability.

High availability cluster management systems such as HP MC/ServiceGuard [8] include functionality comparable to that of Anubis. Most of these systems use a form of replicated database to store system configuration (including application configuration) using a transactional consistency model. They are very much targeted towards failover recovery management and do not scale beyond a few tens of nodes. The functionality they provide can be implemented by management agents that use Anubis to observe system state. Although this approach would result in similar functionality it targets larger and more flexible environments.

3 Anubis

We developed the Anubis service to provide a simple state-based interface suited to automated adaptive management agents.
In our approach each part of the computing system represents its own state; there is no separate copy maintained in a repository, database, or centralized server. Anubis is a fully distributed, self-organizing, state dissemination service that provides the ability to discover objects, monitor their states, and detect their failure or absence. The service can be viewed as part of an application management framework, and any component can use the service by implementing the appropriate interface to make its own states accessible and to access the states of other components.

Anubis deals with communication network partitions. When a network partition occurs management agents may wish to coordinate adaptive actions, but due to delays in communication and possible asymmetry in the communication paths they may have different views of the state of the system. So we have constructed a consistency model that deals with asymmetric partitions as well as partitions with symmetric communication paths. To do this we used an approach based on timed communication paths that exploits the timed model as outlined in section 2.

3.1 Providers, Listeners, and States

We introduce the following terminology: a state is an arbitrary value; a provider is a named object that has a state that may change over time; a listener is a named object that observes the states of providers. It is the job of the Anubis service to distribute provider states to matching listeners, and we say that a listener observes the state of a provider when it has a copy of its state value. We do not require that providers have unique names, so a single listener contains a collection of states, each one representing a matching provider. This arrangement is depicted in Figure 1 below. A component that wants to make part of its state visible through Anubis does so by encapsulating that state in a provider object (as done by component Y). If a component wants to observe a state in the system it does so via a listener object (as done by component X).
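The provider/listener arrangement can be sketched as a minimal in-process registry. This is not the Anubis API (the real interface is described in section 6); all class and method names here are illustrative:

```java
import java.util.*;

// Toy in-process sketch of the provider/listener arrangement.
// Not the Anubis API; names are illustrative only.
final class StateRegistry {
    // providers need not have unique names, so one name maps to many states
    private final Map<String, List<String>> providers = new HashMap<>();

    // a provider publishes its current state under a (possibly shared) name
    void provide(String name, String state) {
        providers.computeIfAbsent(name, k -> new ArrayList<>()).add(state);
    }

    // a listener observes the states of all providers with a matching name
    List<String> listen(String name) {
        return providers.getOrDefault(name, Collections.emptyList());
    }
}
```

Mirroring Figure 1, a component Y would call `provide("Y", "Y is running")` and a component X would observe that state with `listen("Y")`; because names are not unique, a second provider under the same name would simply appear as a second state in the listener's collection.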

Figure 1: Component interaction with Anubis. Component Y exposes the state "Y is running" through a provider; component X observes that state through a matching listener.

Throughout this paper we use the terms provider, listener, state, and state observation. We emphasise that providers have a state and listeners observe states. We will define temporal consistency properties relating state observations.

3.2 Partitions and Stability

The consistency properties provided by Anubis are based on a notion of timeliness in communication paths. The Anubis service is implemented by a peer-group of servers spread across a distributed system. Each node monitors its communication paths with its peers and determines a timeliness relation that we call its partition. By definition a partition is a communication relation that is timely, so it exhibits synchronous behaviour. Under normal circumstances all peers will behave in a timely manner and so will all observe the same partition. When this appears to be the case, a server will determine that its partition is stable. A stable partition is not only synchronous, but transitively so. Anubis disseminates state information within partitions. The nature of partitions and stability means that one peer server can determine how long it takes to perform the dissemination and so establish a temporal consistency property for state observation.

At times the peer-group may be unstable or go through membership transitions due to new peers turning up or leaving, or due to network partitions or failures. At these points the consistency properties would be disrupted. Moreover, even when a peer determines that the system is stable, it can only really be sure that it has been stable so far, not that it will remain stable. The partition properties and the resultant state observation consistency properties we describe deal with these cases.
The state observation properties we describe support the following capabilities:

- Time-bound Observation: a listener will discover the current state of a provider within a known period of time.
- Non-existence Observation: if a provider leaves the partition (due to failure or network problems) a listener will observe its absence within a known period of time. More generally, when stable, a listener can determine that a matching provider does not exist within a known period of time.
- Distributed Predicate Evaluation: the states observed by listeners can be correlated with respect to time.
- Group Membership: a listener observes the state of all providers that use the same name. So the listener implicitly knows the collection of providers, and hence Anubis implements group membership.
- Mutually Concurrent Agreement: two different listeners listening to the same providers will detect consistent states within a known time bound of each other; a property called

mutually concurrent agreement. This property allows multiple agents to take consistent actions, based on observations, without communicating with each other.

In the next section we formally define the Anubis observation consistency properties and the protocols that implement them.

4 Model

We model a distributed software system that runs on a computer infrastructure of compute nodes connected by communication networks. The model is based on the notion of processes that run on compute nodes and time values that are provided by the compute nodes (system clocks). The model assumes processes can communicate via the networks using a message passing paradigm, but attempts to abstract that communication by referring to properties of the information passed in this way. The model is indexed by a notional real time that is not explicitly observable by any component being modelled. The basics of the model are as follows.

A process is a finite processing agent. We use Λ to denote the set of all possible processes and System to denote a finite set of processes that is fixed to represent a given system. A process p has a sequence of current states indexed by a notional real time RT, modelled using the set of natural numbers:

    σ_p(t), σ_p(t+1), ..., σ_p(t+n)

where σ_p(t) is the current state of process p at time t. For simplicity we assume that real time has sufficient granularity to capture each subsequent state of a process in a different time slot, but a single state may persist for multiple time slots. Communication between processes is represented by changes in state that encapsulate the information that has been communicated. We preserve causality by insisting that information cannot be obtained by the receiving process before it exists at the sending process. Each process p has as part of its state a local clock C_p that provides local time values. These values are monotonically increasing with real time and are again modelled using the set of natural numbers.
We make a fundamental assumption about local clocks: they have a bounded drift rate ρ from real time that is small in the sense that it is insignificant over short time intervals. Hence, over short intervals we approximate the progress of a local clock to the progress of real time:

    C_p(t) + α = C_p(t + α) ± αρ ≈ C_p(t + α), where α is small

Naturally, "small" is open to interpretation. In the Anubis context, critical time measurements relate to intervals of no more than a few seconds, or at most minutes. We propose that the clock drift over these time intervals is negligible, as supported by the analysis in [4]. In practice clock drift will be controlled using a network time protocol.

In the following we will use the symbols p, q and z to represent processes; r, s, and t to represent real time values; and δ, μ and ω to represent quantities of real time (real time values are additive). We also use the notation [r, s] to represent closed intervals, (r, s) to represent open intervals of real time, and mix the two (e.g. [r, s) is closed at the start and open at the end).

4.1 Observation Consistency

The Anubis service gives the user the ability to define state providers that are open for observation, and state listeners that can observe the state associated with providers anywhere in the system. Providers and listeners reside at processes and are encoded in their state. A listener observes a provider so long as the state information it represents can be transferred to the listening process under a given consistency constraint. We define observable_p(t) to be the set of providers that can be observed by a process p at real time t with the following consistency constraint.

(1) ∀p, q ∈ System: stable_p(t) ⇒ (observable_q(t) ⊆ observable_p(t)) ∨ (observable_q(t) ∩ observable_p(t) = ∅)

The constraint uses a stability predicate stable_p(t) that is defined later. Stability is a property of communication relative to the process p at time t. This constraint means that if a process is stable, then each other process in the system can observe a subset of the providers that it can, or a disjoint set of providers. They cannot observe a partially overlapping set of providers or a superset.

4.2 Heartbeat Protocol

The heartbeat protocol determines timeliness of one way connections from other processes. The notions of a timely process and a timely connection from that process are synonymous. Timeliness is defined as follows:

(2) Let μ = a fixed length of time
(3) Let ω > 2μ
(4) ∀p, q ∈ System: recentheartbeat_p(q, t) ≡ σ_p(t) contains state representing a heartbeat from q timestamped no earlier than C_p(t) − μ
(5) ∀p, q ∈ System: quiescing_p(q, t) ≡ ∃r, s ∈ RT: t − ω ≤ r < s < t ∧ timely_p(q, r) ∧ timely_p(q, s)
(6) ∀p, q ∈ System: timely_p(q, t) ≡ recentheartbeat_p(q, t) ∨ quiescing_p(q, t) ∨ (p = q)

Processes always have their own state by definition, so they are always timely relative to themselves (as indicated by the last term in (6) above). Note that we specify everything in terms of real time. Heartbeats are time stamped using local clock times, and the definition of recentheartbeat uses a comparison of time stamps to local time; however it defines timeliness in terms of real time. Definition (4) is left informal deliberately as we believe its meaning is clear.

In addition to timestamps, the heartbeat messages can piggy-back other state information. Upper protocol layers maintain state information that is communicated in heartbeats. Each heartbeat carries only the version of the state information that was current when it was sent. We use I to denote this information. The information determined by process p about process q at time t is denoted I_p(q, t).
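Definitions (2)–(6) can be sketched directly, treating times as millisecond longs. The constant values and method names are assumptions of this sketch, not taken from the Anubis implementation:

```java
// Sketch of definitions (2)-(6): timeliness of a one-way connection.
// Constants and names are illustrative assumptions, not the Anubis code.
final class Timeliness {
    static final long MU = 1_000;      // (2) fixed heartbeat bound, ms (assumption)
    static final long OMEGA = 2_500;   // (3) quiescing window, chosen > 2*MU

    // (4) recentheartbeat: the last heartbeat from q carries a timestamp
    // no earlier than the local clock reading minus MU
    static boolean recentHeartbeat(long lastStamp, long localClock) {
        return lastStamp >= localClock - MU;
    }

    // (5) quiescing: q was timely at two earlier instants r < s within OMEGA of now
    static boolean quiescing(long r, long s, long now) {
        return now - OMEGA <= r && r < s && s < now;
    }

    // (6) timely: recently heard from, quiescing, or q is the local process itself
    static boolean timely(boolean isSelf, boolean recent, boolean quiescing) {
        return isSelf || recent || quiescing;
    }
}
```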
State information passed in heartbeats has the property of being timely:

(7) ∀p, q ∈ System: timely_p(q, t) ⇒ I_p(q, t) is defined

(8) ∀p, q ∈ System: timely_p(q, t) ⇒ ∃s ≤ t: I_p(q, t) = I_q(q, s)

Note the subtlety here: not all state information will be communicated, even if it is timely. Only information that was current at the time a heartbeat is sent gets communicated. In addition, unreliable transports may omit heartbeats. We assume reliable transports do not omit heartbeats and deliver them in order when they are timely. We assume reliable transports provide no guarantees at all when untimely.

4.3 Connection Set

We define the connection set and connection set counter as follows, where CS_p(q, t) and CSCounter_p(q, t) denote the connection set and counter of q as known to p through heartbeat information (CS_p(p, t) being p's own):

(9) ∀p ∈ System: CS_p(p, t) = { q | timely_p(q, t) }
(10) ∀p ∈ System, ∀s, t ∈ RT: s < t ∧ CS_p(p, s) ≠ CS_p(p, t) ⇒ CSCounter_p(p, s) < CSCounter_p(p, t)
(11) ∀p ∈ System: CS_p(p, t) = CS_p(p, t + 1) ⇒ CSCounter_p(p, t) = CSCounter_p(p, t + 1)

By substituting CS and CSCounter for state information I in (7) and (8) in the heartbeat protocol we can see that each process has access to connection sets and connection set counters for each process it considers timely: i.e. the ones in its own connection set. The connection set counter allows one process to detect that another process's connection set has changed, even if the change is not reflected in the connection set delivered in the heartbeat. This is how we deal with the fact that not all connection set changes will be communicated.

4.4 Stability

We define stability as follows:

(12) Let δ = 2μ (the length of the stability interval)
(13) ∀p ∈ System, t ∈ RT: consistent_p(t) ≡ ∀q ∈ CS_p(p, t): CS_p(q, t) = CS_p(p, t)
(14) ∀p ∈ System, t ∈ RT: stable_p(t) ⇒ consistent_p(t)
(15) ∀p ∈ System, t ∈ RT: stable_p(t) ⇒ ∀q ∈ CS_p(p, t), ∀s ∈ (t − δ, t]: CSCounter_p(q, s) = CSCounter_p(q, t)

In other words, the process is stable when all the connection sets it knows about are consistent (14) and have all remained unchanged for the interval δ (15). We call δ the stability interval. Notice that stability is relative to a process.
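The stability test of definitions (12)–(15) can be sketched as follows: a node is stable when every connection set it knows about equals its own and no counter has changed for an interval of δ. The constants and representation are assumptions of this sketch:

```java
import java.util.*;

// Sketch of stability (12)-(15): consistent connection sets, unchanged for DELTA.
// Not the Anubis implementation; representation and names are illustrative.
final class Stability {
    static final long MU = 1_000;      // heartbeat bound, ms (assumption)
    static final long DELTA = 2 * MU;  // (12) the stability interval

    // (13) consistent: every connection set p knows about equals its own
    static boolean consistent(Set<String> own, Collection<Set<String>> known) {
        for (Set<String> cs : known) {
            if (!cs.equals(own)) return false;
        }
        return true;
    }

    // (14)+(15) stable: consistent, and no counter change observed for DELTA
    static boolean stable(boolean consistent, long lastCounterChange, long now) {
        return consistent && now - lastCounterChange >= DELTA;
    }
}
```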

4.5 Partition Protocol

Each process has a view of the system that may change over time, called its partition. Partitions satisfy the following property (the partition property):

(16) ∀p, q ∈ System: stable_p(t) ⇒ (partition_q(t) ⊆ partition_p(t)) ∨ (partition_q(t) ∩ partition_p(t) = ∅)

The partition protocol derives process stability and its partition. The protocol operates by communicating information about timely connections in heartbeats. The partition is established by maintaining the largest set that satisfies the following rules:

(17) ∀p ∈ System, t ∈ RT: stable_p(t) ⇒ partition_p(t) = CS_p(p, t)
(18) ∀p ∈ System, t ∈ RT: partition_p(t) ⊆ CS_p(p, t)
(19) ∀p, q ∈ System: q ∈ partition_p(t) ⇒ p ∈ CS_p(q, t)
(20) ∀p ∈ System, ∀s, t ∈ RT: s < t ∧ (∀r ∈ [s, t): ¬stable_p(r)) ⇒ partition_p(t) ⊆ partition_p(s)

Definition (20) states that the partition of a process can only shrink so long as the process is unstable. The only time a partition can grow is when it is stable, at which point it becomes equal to the connection set (17). Rules (18) and (19) define further constraints on the partition. The Appendix contains proof that these rules construct a partition view that satisfies the partition property (16).

4.6 Leader Election

We associate a leader leader_p(t) with each process p that satisfies the following properties:

(21) ∀p, q ∈ System, t ∈ RT: stable_p(t) ∧ q ∈ partition_p(t) ∧ stable_q(t) ⇒ leader_p(t) = leader_q(t)
(22) ∀p ∈ System, t ∈ RT: stable_p(t) ∧ q = leader_p(t) ⇒ leader_q(t − δ) = q

The first of these properties states that all stable members of the same partition agree on the leader. The second states that the leader of a stable process already selected itself leader, regardless of stability. Note that the second does not state that the selected process still considers itself leader. The simplest leader election protocol that supports these properties is based on a total order among processes: the process with the greatest order in the partition is the chosen leader. The proofs for this protocol are trivial.
The first can be derived from the partition property (16) and the second from lemma (24) in the appendix.
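The total-order election described above can be sketched directly: each process deterministically picks the greatest member of its partition, so stable processes with equal partition views agree on the leader without any communication. Lexicographic string order stands in here for whatever total order an implementation actually uses:

```java
import java.util.*;

// Sketch of the total-order leader election: the process with the greatest
// order in the partition is the leader. Lexicographic order is a stand-in.
final class LeaderElection {
    static String leader(Set<String> partition) {
        // deterministic: any process with the same partition view picks the same leader
        return Collections.max(partition);
    }
}
```

Because the choice is a pure function of the partition view, property (21) follows from the partition property: two stable members of the same partition see the same set and therefore compute the same maximum.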

4.7 State Dissemination

We can state the following time-bounded state dissemination property for communication in partitions:

(23) ∀p, q ∈ System: q ∈ partition_p(t) ⇒ ∃s ∈ (t − δ, t]: I_p(q, t) = I_q(q, s)

This property states that any information passed within a partition is known to be no more than δ time units old. The property is proved for our partition protocol in the appendix. We noted earlier that communication using heartbeats over an unreliable transport does not guarantee that information will not be omitted, only that any information received is timely. However, if we use ordered, reliable communication transports then all information sent in heartbeats will be received in order within the known time bound so long as the communication is timely.

We consider a provider (or rather the provider's state value) to be observable by a process p at time t if the provider resides at a process q ∈ partition_p(t), and we use observable_p(t) to denote the set of provider state values that are observable by process p at time t. Hence the observation consistency property (1) is derived directly from the partition property (16). We restate the observation consistency property here:

    ∀p, q ∈ System: stable_p(t) ⇒ (observable_q(t) ⊆ observable_p(t)) ∨ (observable_q(t) ∩ observable_p(t) = ∅)

This property is distinct from actual observation, which is a property of listeners not processes. Just because a process is able to observe a provider, it does not mean it contains a listener for that provider, or that the associated state value for that provider has been communicated. State information is passed in heartbeats as required, so state observation is subject to the time-bounded state dissemination property above. In essence, a provider becomes observable by a listener as soon as it enters the listener's partition. It will actually be observed within δ time units (so long as it remains in the partition). In practice we communicate provider state values between processes on demand: we need to locate and bind processes that need to disseminate information.
This is done using an arrangement that exploits the leader process, as described in the implementation section below. This discovery step extends the initial time bound for observation of a provider state value to 3δ. However, this is the upper limit, and once discovery is done the time-bounded state dissemination property above applies.

5 Implementation

The Anubis service is implemented entirely in Java as a peer-group of servers using the layered architecture shown in Figure 2 below. A server in the implementation directly corresponds to a process in the system model described in the previous section. At the lowest level the UDP and TCP protocols are used to implement multicast and point-to-point ordered communication respectively.

Figure 2: Anubis server architecture (server interface with providers and listeners, predicate evaluation, state manager, partition manager, timed communication, and UDP/TCP communication layers).

5.1 Timed Communication

The next layer implements timed communication between servers and provides connection-based messaging to the upper layers. Anubis assumes the existence of a separate clock synchronization protocol to keep system clocks aligned. The communication path between servers is considered timely if the servers are able to exchange messages within the chosen time bound δ. Timeliness is checked by using time stamps to measure the transmission delay of heartbeat messages sent at regular intervals (of less than μ). The value of δ is chosen to allow for a reasonable clock skew, transmission delay, and processor scheduling delay, each of which contributes to the perceived communication delay. We say that a server is timely with respect to the local server if its communication is timely. A server is always timely with respect to itself.

When there are no other messages to send, heartbeats are transferred using UDP multicast. A single heartbeat is delivered to all servers listening on the multicast group and used for measuring the delay from its source. The protocol creates a messaging overhead that is linear in the number of servers and dependent on the frequency of heartbeats. When upper layers wish to send messages to other servers they request that a message connection is established. At this point the timed communication layer switches to using a TCP connection with that particular server. This connection is also timed and provides reliable, ordered delivery. Message connections are removed when they are no longer needed, as they add to the communication overhead. The timed communication layer presents a view of all the servers that are currently timely; this is the connection set defined in section 4.3.
The timed communication layer will only open message connections with servers in its connection set. All errors on a message connection are interpreted as untimely communication: an inability to deliver messages on time. If a message connection becomes untimely it drops out of the connection set. So the timed communication layer provides a very simple use model for messaging: either a server is in the connection set and any messages sent to it (or received from it) are delivered within the known time bound δ, or it is dropped from the connection set. All communication errors are represented by a change in the connection set; no other error notifications are given. Lastly, the timed communication layer piggy-backs a copy of its current connection set, with an associated time stamp, each time it sends a heartbeat message. This information is used by the partition manager.
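The timeliness check described above can be sketched as follows. This is an illustrative simplification, not the Anubis source: the class and method names are invented, and clock synchronization is assumed so that the receiver's clock can be compared directly with the sender's time stamp.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the timed-communication timeliness check: a heartbeat
// carries the sender's send-time stamp; the receiver compares the
// perceived delay (which folds in clock skew, transmission delay and
// scheduling delay) against the bound delta, and drops untimely
// servers from the connection set. No other error notification exists.
public class TimedComms {
    private final long delta;                      // time bound, ms
    private final Set<String> connectionSet = new HashSet<>();

    public TimedComms(long delta) { this.delta = delta; }

    // Called when a heartbeat from 'server' arrives; 'sendTime' is the
    // stamp placed by the sender, 'now' is the local (synchronized) clock.
    public void onHeartbeat(String server, long sendTime, long now) {
        long perceivedDelay = now - sendTime;
        if (perceivedDelay <= delta) {
            connectionSet.add(server);             // timely: in the set
        } else {
            connectionSet.remove(server);          // untimely: dropped
        }
    }

    public Set<String> connectionSet() { return connectionSet; }
}
```

Under this model, upper layers never see message errors; they only observe servers entering and leaving the connection set.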

5.2 Partition Manager

The partition manager layer implements a protocol that derives the partition view with the properties defined in section 4.5. Because the timed communication layer piggy-backs its connection set on heartbeats, the partition manager has access to a recent connection set from each of the timely servers, where recent means no older than δ time units. The partition manager uses these connection sets to derive the partition view, and its stability, according to the rules defined in section 4.5, as follows:

1. If there is a change in the connection sets and they are all consistent (i.e. the same), stamp the local connection set with the time now+δ, mark the partition view unstable, but do not change the membership.
2. If the connection sets are consistent and unchanged and the local clock time has reached the recorded time stamp, set the partition view equal to the local connection set and mark it stable.
3. If a remote server drops out of the local connection set, or the local server drops out of a remote server's connection set, remove the remote server from the partition view and mark it unstable.
4. If a new server appears in the connection set, mark the partition view unstable but do not change the membership.
5. If any other changes occur in the connection sets, mark the partition view unstable but do not change the membership.

Rule 1 identifies that the connection set may be transitive, but due to the communication delay we cannot be certain that all servers have recognized this fact yet; some may even think the partition view is larger. Rule 2 defines stability: at this point all servers in the partition view have definitely observed the transitivity and definitely have partition views that are subsets of the local connection set. Rules 1, 3, 4, and 5 all demonstrate that any change in the connection sets represents instability. We note that there is only one condition under which a partition view may get larger, and that is when it becomes stable in rule 2.
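The five rules above can be sketched as a small state machine over connection sets. This is a simplified illustration under stated assumptions, not the Anubis implementation: names are invented, connection sets are plain sets of server identifiers, and the sketch does not re-stamp when a consistent set changes membership (rule 1 is only triggered after an inconsistency).

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the partition-manager stability rules. 'reported' maps each
// timely server to the connection set it last piggy-backed on a heartbeat;
// 'local' is the local connection set from the timed communication layer.
public class PartitionManager {
    private final long delta;
    private Set<String> partitionView = new HashSet<>();
    private boolean stable = false;
    private long stabilityTime = Long.MAX_VALUE;   // the now+delta stamp

    public PartitionManager(long delta) { this.delta = delta; }

    public void update(Set<String> local, Map<String, Set<String>> reported, long now) {
        boolean consistent = reported.values().stream().allMatch(local::equals);
        if (consistent) {
            if (stabilityTime == Long.MAX_VALUE) {
                // Rule 1: newly consistent -- stamp now+delta, stay unstable
                stabilityTime = now + delta;
                stable = false;
            } else if (now >= stabilityTime) {
                // Rule 2: consistent, unchanged, stamp reached -- stable;
                // the only case in which the partition view may grow
                partitionView = new HashSet<>(local);
                stable = true;
            }
        } else {
            // Rules 3-5: any inconsistency means instability
            stable = false;
            stabilityTime = Long.MAX_VALUE;
            // Rule 3: remove servers no longer in the local connection set
            partitionView.removeIf(s -> !local.contains(s));
        }
    }

    public boolean isStable() { return stable; }
    public Set<String> partitionView() { return partitionView; }
}
```

The delay between rule 1 and rule 2 is what guarantees that, by the time the local view is declared stable, every other server has had time to observe the same connection sets.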
Rule 2 is where the partition view differs from the connection set reported by the timed communication layer, and it enforces the property that if one server is stable with partition view P then the partition views of all other servers are either subsets of P or do not intersect P.

The partition manager is also responsible for the leader election protocol. The rules defined in section 4.6 are used to select a candidate that is in the local partition view. All servers with the same partition view elect the same leader, regardless of stability. The partition view protocol does not change the communication pattern of the timed communication layer at all. All its communication is provided by data piggy-backed on regular heartbeat messages, so there is no messaging overhead for the protocol and minimal processing overhead. By encoding information such as connection sets as bit sets, we are able to cover more than a thousand Anubis servers in a single MTU payload.

Lastly, the partition manager layer also generates a partition base time reference. All connection sets are time stamped. When the local server is unstable its base time reference is undefined, but when stable the reference is equal to the latest connection set time stamp. All stable servers in the same partition will agree on the base time reference. This can be used as a basis for consistency among users when correlating similar distributed state predicates.

5.3 State Manager

The state manager layer sits on top of the partition manager layer and provides state dissemination between providers and listeners within partitions. The state manager implements a very simplistic location protocol. The server that is elected leader by the partition manager becomes a global locator

and all servers register the sets of provider names and listener names they contain. The global locator informs servers with providers which servers hold corresponding listeners. A server with a provider is responsible for forwarding the provider's state value to any matching listeners. If a provider (or listener) deregisters, its server informs matching listeners (or providers). All these message exchanges are one way. The fact that the communication paths are timed implies reliable delivery within the known time bound. If there is a change in the partition membership, the state manager will treat remote providers and listeners on servers now outside the local partition as if they had deregistered. The delivery of state information among providers and listeners is based on the partition view reported by the partition manager and implements the state consistency properties defined. Our implementation adds time stamps to state values when they change. The approximately synchronized clocks allow comparison of independent states as described in [11]. New states are transferred immediately from the provider to all matching listeners and arrive within the known time bound δ. This is sufficient to perform distributed predicate evaluation over observed states with the same consistency properties as state values. The partition base time reference defines a time at which all servers have a known consistency, providing a simple means for all servers in a single partition to perform consistent evaluations.

6 Programmatic Interface

There is a single Java interface class for the Anubis service, called AnubisLocator. In addition there are three Java base classes that represent providers, listeners and values, called AnubisProvider, AnubisListener and AnubisValue respectively, and an interface called AnubisStability.

6.1 AnubisLocator

The AnubisLocator interface has six methods that are of interest to the user.
These are: registerProvider(...), deregisterProvider(...), registerListener(...), deregisterListener(...), registerStability(...), and deregisterStability(...). The first two, as the names suggest, register and deregister providers respectively. The signatures for these are given below. Each takes an AnubisProvider class instance as its one parameter. This is the provider that is being registered/deregistered, described in the next section. When a provider has been registered with the Anubis service using registerProvider(...) it becomes visible to the service and observable within its partition, so listeners will become aware of its presence as appropriate. When it has been deregistered using deregisterProvider(...) it will cease to be visible to the service and any listeners that are aware of it will become aware of its absence (i.e. they will recognise that the corresponding value has gone).

void registerProvider(AnubisProvider provider);
void deregisterProvider(AnubisProvider provider);

The second two register and deregister listeners. The signatures for these are given below. Each takes an AnubisListener class instance as its one parameter. This is the listener that is being registered/deregistered, described in section 6.3. When a listener is registered with the Anubis service it will become aware of values as appropriate; when it is deregistered it will cease to be informed of any values (its behaviour becomes undefined).

void registerListener(AnubisListener listener);
void deregisterListener(AnubisListener listener);

The last two methods register and deregister a stability notification interface. The signatures are given below and the interface is described in section 6.5. This interface is used to inform the user when the server becomes stable or unstable, and of the partition's base time reference.

void registerStability(AnubisStability stability);
void deregisterStability(AnubisStability stability);

The AnubisLocator interface is re-entrant. By this we mean that it is safe to call the interface during an upcall from the Anubis service.

6.2 AnubisProvider

The AnubisProvider class is used to provide values to the Anubis service. The user creates a named object by constructing an instance of this class. The class has two basic methods of interest: AnubisProvider(...), the constructor, and setValue(...), which is used to set the current value associated with the class. The signatures for these methods are given below.

AnubisProvider(String name);
void setValue(Object obj);

When the AnubisProvider class has been constructed its default value will be null. Anubis recognises null as a valid value and differentiates it from the absence of a value. A value is only absent when there is no provider for it. The only restriction on the value that can be set is that it is a serializable class, i.e. it implements the java.io.Serializable interface. These values will be communicated by passing their serialized form in messages. The setValue(...) method can be called at any time, before or after registering the provider. Listeners may observe the value that the provider holds at the time it is registered with Anubis. If this is not desired, the setValue(...) method should be used to set the value before registering the provider using the AnubisLocator interface.

6.3 AnubisListener

The AnubisListener class is used to observe values in the Anubis service. This is an abstract class because it contains methods that must be defined by the user, so the AnubisListener class must be specialised.
The AnubisListener can be thought of as a view on the observable set of states in the local partition. It will contain a collection of all the values associated with the name of the listener, one for each provider with that name. The values in the collection are represented by instances of the AnubisValue class, described below. The user will be notified when changes in the collection occur using upcalls. These changes will occur according to the time-bounded state dissemination property (taking the initial discovery delay into account). A new AnubisValue will appear within 3δ of there being an associated provider in the partition, so long as the local Anubis server is stable. Updates of values already in the collection will occur within δ of the associated provider being updated, regardless of stability. Values will be removed from the collection within δ of the associated provider exiting the local server's partition, whether due to deregistering or failure, and regardless of stability.

6.3.1 Basic Use

The following basic methods are of interest: AnubisListener(...), the constructor; newValue(AnubisValue value), which indicates a change in value (either a new provider has appeared or the value provided by an existing one has changed); and removeValue(AnubisValue value), which indicates that a value has disappeared (either because the provider has deregistered or become partitioned; either way it is no longer visible). The signatures of these methods are given below.

AnubisListener(String name);

void newValue(AnubisValue value);
void removeValue(AnubisValue value);

When the AnubisListener class is constructed it is given a name. This is the name that will be used to identify values in the Anubis service. The name cannot be changed once assigned. When the AnubisListener class instance is registered via the AnubisLocator interface it will become aware of any values associated with its name that are observable. This occurs as follows:

1. For each provider that is observable through Anubis, an instance of AnubisValue will be created in the AnubisListener's collection and the newValue(...) upcall will be invoked, passing the AnubisValue instance as the parameter of the call. A single AnubisValue class instance directly corresponds to a single AnubisProvider instance, and it will be created within 3δ of the provider's appearance in the local partition, so long as the local server is stable.
2. If the value associated with a provider changes, the value associated with the corresponding AnubisValue instance changes and the newValue(...) upcall will be invoked again, referring to the appropriate AnubisValue. This will happen within δ of the change occurring, regardless of stability.
3. If a provider deregisters or disappears, the corresponding instance of AnubisValue will be removed from the AnubisListener's collection and the removeValue(...) upcall will be invoked, passing the AnubisValue instance as its parameter. The AnubisValue instance will contain the null value. This will happen within δ of the provider's disappearance.

6.3.2 Specialising the value handling

AnubisValue can be specialised to create a user-defined subclass. The AnubisListener uses a factory method called createValue(...) to construct instances of AnubisValue. If the AnubisValue class is specialised, the user will need to over-ride the definition of createValue(...) to construct the new subclass instead. The signature of createValue(...) is given below.
AnubisValue createValue(ProviderInstance i);

The presence of the ProviderInstance class in the interface is regrettable, as it is an internal representation of which the user should not be aware. The AnubisValue class constructor takes this parameter, so the user's constructor should look something like:

MyAnubisValue(ProviderInstance i) {
    super(i);
    /* any other code */
}

A common use pattern for this feature is to turn upcalls on the AnubisListener class into upcalls on instances of the user class related to the values, as in the following example:

public class MyAnubisListener extends AnubisListener {

    public MyAnubisListener(String name) {
        super(name);
    }

    public AnubisValue createValue(ProviderInstance i) {
        return new MyAnubisValue(i);
    }

    public void newValue(AnubisValue value) {
        ((MyAnubisValue)value).newValue();
    }

    public void removeValue(AnubisValue value) {
        ((MyAnubisValue)value).removeValue();
    }
}

6.3.3 Further convenience methods

There are a few other methods included in the AnubisListener class for the user's convenience. These are given below:

String getName();
long getUpdateTime();
int size();
Collection values();

The getName() method returns the name of the listener. The getUpdateTime() method returns the most recent time stamp associated with any value in the collection. Note that the most recent time stamp does not necessarily correspond to the most recent addition or change to the collection. Although Anubis provides time-bounded state dissemination properties, and ordered updates from a particular provider, it does not provide time-ordered updates between multiple providers. The size() method returns the number of values in the listener's collection. The values() method returns a collection view of the listener's AnubisValue class instances.

6.4 AnubisValue

The AnubisValue class is used as a container for the actual values associated with a listener. The value contained by an AnubisValue instance is a copy of the object that was set for the associated provider. In addition to containing the actual value, the AnubisValue class holds some extra information, including the name associated with the value (the same name used by the listener and the provider), the time stamp assigned to the value (the time stamp is set by the provider when the value is set or when the provider registers, and by the local node when the provider disappears), and a globally unique identifier for the provider. The signatures for these methods are given below.
String getName();
String getInstance();
long getTime();
Object getValue();

The getName() method returns the name, getInstance() returns the global identifier, getTime() returns the time stamp, and getValue() returns the actual value object. Note that when a value has been removed, as indicated by the AnubisListener.removeValue(...) upcall, it will contain the null value and the time stamp will reflect the time that the corresponding provider deregistered or the time that the local node determined its disappearance.

6.4.1 Using the unique identifier

The globally unique identifier (or instance) is actually of little use to the user unless he wants to recognise a provider that disappeared (say, due to a network partition) and then reappeared. In this eventuality the provider will still have the same unique identifier, although when it reappears the listener will construct a new AnubisValue container for it. This can be contrasted with the case where a provider deregisters and then registers again, in which case Anubis will treat it as a new provider and assign a new globally unique identifier.

6.5 AnubisStability

AnubisStability is an interface used to notify users when the local server becomes stable or unstable, and of the base time reference for the partition when stable. The interface contains a single method:

void stability(boolean stable, long timeRef);

When the interface is registered, the local server uses it to make an initial call that indicates its current stability status, and then makes subsequent calls when its status changes. If the server is stable the timeRef parameter will be the base time reference for the local partition. If unstable, the timeRef parameter will be set to -1.

6.6 Examples

In this section we give an example use of the Anubis service interfaces to implement a group membership monitor. In this scenario we create an object that observes a collection of objects in the system and their status. In this case the status will indicate whether they are in the initialising or started states of their life cycle. Their absence will indicate either that they don't exist, or that they did but have terminated or were otherwise removed from the system. The objects that are going to be observed use the provider interface. Each object has three simple steps to implement:

1. It obtains a reference to the Anubis service interface.
locator = (AnubisLocator)sfResolve("anubisLocator");

We assume here that the object has access to a configuration/service interface location system that finds the Anubis service interface. In our example we are using SmartFrog [2][15].

2. It creates an instance of the provider class and sets its value.

provider = new AnubisProvider(myName);
provider.setValue("initialising");

3. It registers the provider class with the Anubis service.

locator.registerProvider(provider);

These steps are performed during an initialisation phase and result in the object state represented by the value "initialising" being visible in the Anubis service under the name held in the variable myName. During a start phase the value of the provider is set to "started" to indicate its new state. When terminating, the object deregisters the provider from the Anubis service. The group membership monitor specialises the AnubisListener base class and uses a specialised version of the AnubisValue class. These are implemented as the GroupListener and GroupMember classes below.

public class GroupListener extends AnubisListener {

    public GroupListener(String name) {
        super(name);
    }

    public AnubisValue createValue(ProviderInstance providerInstance) {
        return new GroupMember(providerInstance, this);
    }

    public void newValue(AnubisValue v) {
        System.out.println("New value in group " + getName());
        ((GroupMember)v).newValue();
    }

    public void removeValue(AnubisValue v) {
        System.out.println("Removed value from group " + getName());
        ((GroupMember)v).removeValue();
    }
}

The GroupListener class is an AnubisListener that has been specialised to create instances of the GroupMember class instead of the default AnubisValue. This is done by over-riding the createValue(...) method as shown. We have also implemented the newValue(...) and removeValue(...) methods in GroupListener to invoke calls on the instance of GroupMember referred to in the upcall. The purpose here is to convert upcalls on the listener into upcalls on the classes that represent the values themselves. The GroupMember class simply implements a version of the AnubisValue class that outputs messages on the output stream to indicate what is happening. The class has the two additional methods called by the GroupListener class.

public class GroupMember extends AnubisValue {
    private GroupListener group;

    public GroupMember(ProviderInstance i, GroupListener group) {
        super(i);
        this.group = group;
    }

    public void newValue() {
        System.out.println("New value in group " + group.getName() + " = " + getValue());
    }

    public void removeValue() {
        System.out.println("Removed value from group " + group.getName());
    }
}

Any object that wants to observe a group of objects using the group listener class simply constructs an instance of GroupListener and registers it with Anubis as follows:

locator = (AnubisLocator)sfResolve("anubisLocator");
listener = new GroupListener(groupName);
locator.registerListener(listener);

When instances of the providers observed by these classes appear, change value or disappear, the code above simply outputs messages to indicate what has been observed in the group. For example, if listening to the name "fred" and an object initialises itself, creating the provider as shown above, the listener would output:

New value in group fred                  (from GroupListener)
New value in group fred = initialising   (from GroupMember in initialising state)

And when the object is started the listener would output:

New value in group fred                  (from GroupListener)
New value in group fred = started        (from GroupMember in started state)

Notice that the instance of the GroupMember class does not change when it has a new value. The listener does understand that it is the same object that was initialising that is now started. Note also that the values that have been passed are the strings "initialising" and "started". Any serializable objects can be used as values; they do not need to be strings. If the user wants to distinguish the identity or location of the objects being observed, this information should be encoded in the state values. Likewise, the user could obtain a remote interface to the object being observed if a reference to it is encoded in the state value.

7 Conclusion

We have described the implementation of the Anubis state monitoring service and the semantics that it provides. Anubis enables multiple user agents spread across a distributed system to attain a consistent view of the state of objects in the system according to a temporal consistency model, providing a sufficient foundation for fully distributed management of distributed systems.