RATATOSKR: WIDE-AREA ACTUATOR RPC OVER GRIDSTAT WITH TIMELINESS, REDUNDANCY, AND SAFETY


By ERLEND SMØRGRAV VIDDAL

A thesis submitted in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE IN COMPUTER SCIENCE

WASHINGTON STATE UNIVERSITY
School of Electrical Engineering and Computer Science

DECEMBER 2007

To the Faculty of Washington State University:

The members of the Committee appointed to examine the thesis of ERLEND SMØRGRAV VIDDAL find it satisfactory and recommend that it be accepted.

ACKNOWLEDGEMENT

I would like to thank my advisor Dave Bakken for his advice and guidance throughout my studies at WSU, and for taking an active interest in the well-being of his students beyond professional obligations. I would also like to thank Carl Hauser and Min Sik Kim for taking the time to be on my committee, and especially Carl Hauser for his help with the work on my thesis. Further, I would like to thank all past and current members of the GridStat team for their great work, and for valuable discussion and contributions on my research. A special thanks goes to my friends in Norway and in Pullman, and to my family, for their continuing support during my research and for making my stay here much more enjoyable.

Finally, I would like to thank the organizations that have provided financial support for education and research. In particular, I have received a stipend from The Norwegian State Educational Loan Fund and tuition reduction from Washington State University. In addition, my research has been supported in part by grants CNS (CT-CS: Trustworthy Cyber Infrastructure for the Power Grid (TCIP)) and CCR from the US National Science Foundation.

PUBLICATIONS

Erlend S. Viddal, Stian Abelsen, David Bakken and Carl Hauser, Ratatoskr: Wide-Area Actuator RPC over GridStat with Timeliness, Redundancy, and Safety, in DSN '08: Proceedings of the International Conference on Dependable Systems and Networks (DSN '08). To be submitted in Fall 2007.

Abstract

by Erlend Smørgrav Viddal, M.S.
Washington State University
December 2007

Chair: David E. Bakken

The development of the communication infrastructure for the North American electrical power grid has failed to fully incorporate important developments in the field of computer science, affecting the stability and efficiency of the power grid as a whole. The current power-grid communication standard, SCADA, utilizes protocols specialized for centralized communication, hampering the communication between field sites that is key to envisioned improvements in power-grid safety and efficiency. Further, a number of different proprietary communication protocols are in use, making communication between power utility companies very difficult.

GridStat is a communication infrastructure designed for a power grid environment that solves many of the problems with the current situation. GridStat uses a specialization of the publish-subscribe middleware paradigm, status dissemination, that takes advantage of the semantics of status data to provide flexible acquisition of power-grid data with multiple dimensions of QoS semantics. The middleware approach enables communication between utilities independent of proprietary network protocols, and allows enhanced network features such as forwarding data through multiple redundant paths. While GridStat provides excellent support for data acquisition, the publish-subscribe architecture supports only one-way communication and provides syntax and semantics unsuitable for control communications.

This thesis presents Ratatoskr, a novel scheme for control of actuators using GridStat communication. It constructs a two-way communication channel on top of GridStat publish/subscribe paths, and utilizes the QoS semantics and middleware properties GridStat provides. For control communication Ratatoskr uses remote procedure call (RPC), providing programmer friendliness and familiarity. The QoS semantics of GridStat are drawn upon to provide the timeliness required for power-grid operation. Reliability concerns are addressed by providing three redundancy schemes: ACK/resend, transmitting multiple copies of a single packet, and spatial redundancy through GridStat's redundant routing paths feature. Additionally, pre- and post-condition expressions over GridStat status variables are built into call semantics. The architecture and design of Ratatoskr are presented, along with results from an evaluation of a prototype implementation.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
PUBLICATIONS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES

CHAPTER

1. INTRODUCTION
    Current Power Grid Communication Infrastructure
    GridStat
    Ratatoskr
    Contributions of Thesis
    Organization of Thesis

2. BACKGROUND AND RELATED WORK
    Middleware
    Remote Procedure Call
    Failure Semantics
    CORBA
    Publish/Subscribe
    Status Dissemination
    GridStat
    Architecture

3. THE RATATOSKR RPC MECHANISM
    Definition of terms
    Two-way Communication over a Publish-Subscribe Framework
    Properties of the 2WoPS Protocol
    Reliability Measures
    The Ratatoskr RPC
    RPC semantics
    Pre- and Post-Conditions
    Limitations
    Assumptions

4. DESIGN OF RATATOSKR
    Design of the 2WoPS Transport Protocol
    Modules
    Sending Process
    Design of the RPC Mechanism
    Modules
    Use of Reflection and Serialization
    RPC Flow
    Pre- and Post-conditions

5. EVALUATION
    Evaluation Procedure
    Topology
    Network Fault Model
    Evaluation Testbed
    Processes
    Hardware
    Garbage Collection Handling
    Java Virtual Machine Arguments
    Experiment Procedure
    Result Data
    Expected Results
    Experimental Results
    Resiliency of Temporal Redundancy
    Resiliency of Spatial Redundancy
    Comparison to Traditional RPC

6. CONCLUSION AND FUTURE WORK
    Concluding Remarks
    Future Work
    Long Term Connections
    Fault Tolerance Level Calculation
    Extensions to the 2WoPS Protocol
    Extensions to the RPC Mechanism
    Security
    Future Evaluations

BIBLIOGRAPHY

LIST OF TABLES

    Comparison of Redundancy Techniques
    Expected Failure Rates for Redundancy Techniques
    Calculated End to End Loss Compared to per Link Loss

LIST OF FIGURES

    Ratatoskr Module Stack
    Sending Process for the 2WoPS Protocol
    RPC Send Process
    Call Process When Failing Pre-Condition
    Call Process With Post-Condition
    Evaluation Topology
    Comparison of Performance With and Without Garbage Collection Compensation
    Early Success for Temporal Redundancy over Varying Omission Fault Rates
    Early Success for Spatial Redundancy over Varying Duration Faults
    Early Success for Varying Redundancy and Loss
    Average Calltimes for Various Redundancy with Full Loss
    Cumulative distributions of number of timeouts per call

CHAPTER ONE

INTRODUCTION

The North American electric power grid is among the largest and most complex systems created by man. Its critical mission of balancing changing demand and generation of power involves coordinating diverse sets of components over very large areas and across a large number of utility domains. This balancing process requires extensive communication between components in the Grid for monitoring system state and controlling actuator devices. The development of the grid communication infrastructure has failed to incorporate important developments in the field of computer science, affecting the stability and efficiency of the power grid as a whole, [2]. GridStat is a communication infrastructure designed for a power grid environment that would solve many of the problems with the current situation, but it does not conveniently support control communication, [8]. This thesis proposes a novel scheme for control of actuators using GridStat communication.

1.1 Current Power Grid Communication Infrastructure

In the 1960s, utilities started shifting from mainly using field personnel and telephone communication for control of the power grid to electronic schemes. Today the predominant Grid communication architecture is SCADA (Supervisory Control and Data Acquisition). The SCADA architecture has not changed notably from its origins. It is a centralized approach, in which a manned regional control center gathers data from and issues control signals to devices in geographically dispersed field sites. Early systems were developed without any official standards, resulting in numerous proprietary protocols. SCADA systems have since developed incrementally, and often incorporate a blend of new and old communication technology. Topologies are predominantly varieties of star shapes, and protocols are mostly designed solely for communication between control center and field sites, [12].

With increasing stress on the transmission network, distribution models growing more complex, and looming threats of terrorism and cyber security risks, there is a pressing need for better monitoring of grid dynamics and improved control schemes, [2]. The inherently inflexible SCADA architecture is unable to accommodate this. Communication between utilities is mostly done by telephone between operators, making observation and containment of grid-wide phenomena such as rolling blackouts very difficult. Fast automated control schemes involving substation-to-substation communication have yet to be standardized, and are implemented using expensive, specialized point-to-point links, [2]. The Intelligrid project, a vision of a future power grid created by an international consortium of power researchers, industry representatives, equipment manufacturers and government representatives, argues for several applications of substation-to-substation, substation-to-field-equipment, and field-equipment-to-field-equipment communication, yet it does not propose a wide-area communication mechanism, [6]. IEC is a widely accepted standard for substation automation that includes standardized self description of devices independent of brand and an event-driven communication model, [17]. While IEC holds great potential for improved substation control, it does not specify a wide-area network mechanism in itself. Continued incremental development of the existing centralized and inflexible communication structure will severely inhibit potential growth in power-grid efficiency and stability.

1.2 GridStat

GridStat is a framework for power-grid communication centered around a middleware network for power-grid data acquisition, [8]. It provides a flexible communication scheme with the reliability and timeliness required in a power-grid network.
GridStat routes traffic on top of existing communications infrastructure through a series of application-layer routers, overcoming the inherent heterogeneity of legacy networking technology. The unifying middleware framework creates a flexible overlay topology on top of the centralized designs of existing power-grid network infrastructures, allows for easy interoperability between power utility companies despite their use of proprietary transport protocols, and offers abstractions to network services, in addition to several other features well suited for a power-grid infrastructure that are less relevant in the context of this thesis.

GridStat follows the publish-subscribe (pub-sub) paradigm. A device can publish status information either directly to the GridStat framework, or through an intermediary middleware publisher module, possibly located on another computer. The GridStat framework makes the information available as one or more status variables, published values that are regularly updated. Applications may retrieve status updates by subscribing to status variables through a GridStat subscriber interface. The GridStat framework forwards status updates from the publisher through the application-layer routers and finally to the subscriber. This overlay-network scheme allows GridStat to offer a wide range of network features independent of the underlying technology. The most important of these are multicast and redundant forwarding paths (for fault tolerance). In addition to offering functionality beyond that provided by the underlying network, GridStat improves the network Quality of Service (QoS), the nonfunctional properties of the network. QoS enhancements provided by GridStat include bounded delay, reliability and security.

Currently GridStat forwards status updates in a one-way, pub-sub fashion, addressing the data acquisition needs of a grid operations infrastructure. While it would be possible to forward control commands using the existing status update mechanism, such communication would be cumbersome with the pub-sub interface, and in many cases would require feedback on operation success, which is impossible over the one-way paths.
Use of SCADA protocols for control while restricting use of GridStat to data acquisition would require modifying inflexible proprietary legacy code for each new control operation introduced, and would not be able to utilize the flexible topology and interoperability introduced with GridStat. Use of other existing QoS-enabled control schemes would require implementing an overlay transport protocol to allow interoperability and flexible topologies, which is redundant when GridStat already provides middleware routing. Further, existing solutions would not be designed with the capabilities already found in GridStat in mind, and mechanisms exploiting these would have to reside in the application layer, voiding any advantages that could be achieved by designing use of these features into the control semantics.

1.3 Ratatoskr

This thesis proposes a power grid control scheme, Ratatoskr¹, using GridStat publications and subscriptions for communication. Ratatoskr is designed primarily for control of field sites from a control center, but use between field sites is imaginable. Remote Procedure Call (RPC) semantics are used because of their programmer friendliness and familiarity. Some of the traditional RPC features, especially transparency towards local procedure calls, are downplayed to better support the reliability and timeliness aspects required of a power grid control scheme.

Reliability concerns are addressed by providing three redundancy schemes: ACK/resend, transmitting multiple copies of a single packet, and spatial redundancy through GridStat's redundant routing paths feature. ACK/resend represents a tradeoff between the timeliness and the reliability of the call, while multiple copies and redundant paths trade network resources for reliability. Since the desired tradeoff parameters might vary between applications, Ratatoskr exposes these parameters to the programmer, along with other QoS properties of the call.

Further, Ratatoskr allows pre- and post-conditions, which are predicate expressions, to be placed on the procedure calls. Pre-conditions are evaluated before the execution of a call, and will abort the call if the expression is not satisfied. Post-conditions are evaluated after the execution of a call, and the result is returned to the client application to indicate system state. Pre- and post-conditions in Ratatoskr may use status variables published to GridStat in the expressions, accommodating usage of data from remote locations. These predicates are built into the call semantics, providing standardized usage patterns, simplifying reuse and providing the option of delayed execution of post-conditions.

Pre-conditions are tested before a call is carried out on the server side, aborting execution if the expression fails. Calls may then verify a safe system state before potentially dangerous operations, such as avoiding re-energizing a line if manned maintenance is scheduled in an endpoint substation at the time. Post-conditions are carried out on the server after a call has completed, possibly after a specified delay. This allows grid programmers to verify the effects of operations. Power grid field sites often contain various mechanical devices which affect each other in complex ways, and the outcome of an operation could be unexpected even if the operation itself was successful.

¹ In Norse mythology, Ratatoskr is a squirrel running around the great life-tree Yggdrasil, carrying insults between mythological creatures living on the branches.

1.4 Contributions of Thesis

The research contributions of this thesis are:

- Design and implementation of a novel control scheme for an electrical power grid environment where remote procedure calls are transported over a QoS-enabled one-way publish-subscribe middleware network (GridStat).

- Design and implementation of three distinct techniques for redundancy, offering a tradeoff between worst-case deadline, use of network resources and resiliency towards a variety of network failure categories. Applications are allowed fine control of redundancy semantics.

- Design and implementation of pre- and post-condition mechanisms built into the RPC semantics, providing additional functionality over an application-level implementation and allowing for a standardized mechanism for control signals between utilities.

- An experimental evaluation quantifying the tradeoffs between the redundancy techniques and their performance.
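The pre- and post-condition call semantics described in Section 1.3 can be illustrated with a small Python sketch. It is only a local toy under stated assumptions, not Ratatoskr's implementation: the function `guarded_call`, its parameters, and the status variable names are hypothetical, and a plain dictionary stands in for the status variables that a server would obtain through GridStat subscriptions.

```python
import time


def guarded_call(operation, status, pre=None, post=None, post_delay=0.0):
    """Run operation only if pre(status) holds, then evaluate post(status).

    status is a dict standing in for the latest values of subscribed
    status variables (hypothetical; not the GridStat API).
    Returns (executed, result, post_ok). post_ok is None when the call was
    aborted or no post-condition was given.
    """
    if pre is not None and not pre(status):
        return (False, None, None)        # aborted: pre-condition failed
    result = operation()
    if post is None:
        return (True, result, None)
    if post_delay > 0:
        time.sleep(post_delay)            # optional delayed evaluation
    return (True, result, bool(post(status)))


# Hypothetical status variables for the line re-energizing example.
status = {"maintenance_scheduled": True, "line_energized": False}


def energize_line():
    status["line_energized"] = True
    return "energized"


def no_maintenance(s):
    return not s["maintenance_scheduled"]


def is_energized(s):
    return s["line_energized"]


# Maintenance is scheduled, so the call is aborted before execution.
blocked = guarded_call(energize_line, status, pre=no_maintenance, post=is_energized)

# Once maintenance is over, the call runs and the post-condition is checked.
status["maintenance_scheduled"] = False
done = guarded_call(energize_line, status, pre=no_maintenance, post=is_energized)
```

Returning the post-condition outcome separately from the call result mirrors the idea that an operation can succeed while its effect on the system is not the expected one.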

1.5 Organization of Thesis

The rest of this thesis is organized as follows: Chapter 2 summarizes related work and gives an introduction to GridStat required for understanding the contributions of this work. An overview of the Ratatoskr RPC mechanism and its underlying transport protocol is found in Chapter 3. Chapter 4 details the design of a prototype implementation. Chapter 5 presents the findings of an experimental evaluation of the prototype. Finally, Chapter 6 provides a summary of future work and the conclusion.

CHAPTER TWO

BACKGROUND AND RELATED WORK

This chapter gives an overview of relevant technologies, an overview of the GridStat framework architecture and details on the GridStat design related to the Ratatoskr mechanism. A more detailed introduction to GridStat can be found in [8].

2.1 Middleware

Distributed computing involves processes on separate machines cooperating, commonly over a network. If there are differences in the runtime environments of the interacting processes, such as data representation, some sort of translation must be performed between processes to ensure correct interaction. Middleware is software layered between the OS and the application, offering abstractions to inter-process interactions and providing any needed translation services between process environments. Many different types of middleware interaction styles exist, accommodating a wide range of distributed system architectures.

2.2 Remote Procedure Call

Remote Procedure Call (RPC), first presented in [4], is a style of middleware providing abstractions for remote execution of code in a client-server fashion. Client applications call remote procedures through an interface similar in syntax to local procedures at the client, and the RPC mechanism handles packing the call with parameters and sending it over the network, executing the code corresponding to the call at the server, and transmitting the result back to the client application. Remote procedure calls allow for return values in spite of the traditional sense of a procedure as a returnless call. RPC calls are by nature synchronous and blocking. A frequent design goal in RPC systems has been to make remote calls indistinguishable from local calls both in syntax and semantics, although the latter has been shown to be impossible, [30].
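As a rough illustration of the stub behaviour described above, the following Python sketch packs a call and its parameters into a message, dispatches it at a server, and hands the result back to the caller. It is a loopback toy, not the Ratatoskr design: the names `Server`, `ClientStub` and `register`, and the choice of JSON as the wire format, are assumptions made for this example, and the "network" is a direct function call.

```python
import json


class Server:
    """Holds the procedures that clients may invoke remotely."""

    def __init__(self):
        self._procedures = {}

    def register(self, name, func):
        self._procedures[name] = func

    def handle(self, request_bytes):
        # Unpack the call, run the matching procedure, pack the reply.
        request = json.loads(request_bytes.decode("utf-8"))
        func = self._procedures.get(request["proc"])
        if func is None:
            reply = {"ok": False, "error": "unknown procedure"}
        else:
            reply = {"ok": True, "result": func(*request["args"])}
        return json.dumps(reply).encode("utf-8")


class ClientStub:
    """Gives the client a call interface that resembles a local call."""

    def __init__(self, transport):
        # transport: callable taking request bytes and returning reply bytes.
        self._transport = transport

    def call(self, proc, *args):
        request = json.dumps({"proc": proc, "args": list(args)}).encode("utf-8")
        # The caller blocks here until the reply is available.
        reply = json.loads(self._transport(request).decode("utf-8"))
        if not reply["ok"]:
            raise RuntimeError(reply["error"])
        return reply["result"]


server = Server()
server.register("add", lambda a, b: a + b)
stub = ClientStub(server.handle)    # loopback stands in for the network
total = stub.call("add", 2, 3)
```

The remote failure surfaces as an exception at the client, which is one small way in which a remote call cannot be made fully indistinguishable from a local one.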

2.2.1 Failure Semantics

Unlike a local procedure, a remote procedure call may fail during remote operation while the local client process continues to operate correctly. Such failures could stem from errors during network transfer or failure during server execution. The failure semantics of an RPC mechanism are defined by the way remote failures are handled and the guarantees of successful execution provided to the client application. As any network in practice can be made reliable by resending messages until an acknowledgment (ACK) is received, there are mainly three schools of thought for failure semantics, [28]:

- At-least once: Provides a guarantee that an RPC procedure is successfully executed given eventually reliable communication, but allows for repeated executions of the same call. This may be achieved by having the client repeatedly send a call until a result is received. The server executes all calls, no matter if they have been executed before, and sends results upon successful execution. This provides a strong guarantee, but at-least once is only practical for idempotent procedures.

- At-most once: Provides a guarantee that execution of an RPC procedure is attempted exactly once at the server given eventually reliable communication, but does not guarantee that the attempted execution is successful. A client retries sending a call until it receives a response from the server. To ensure that the call is attempted at most once, redundant calls are filtered at the server, possibly using logs in stable storage to retain filtering after a server crash. The server must respond negatively to filtered calls so the client knows when to stop sending. When the client receives a negative response, the execution status of the call is uncertain.

- Exactly once: Provides a guarantee that the RPC is executed exactly once at the server, and so is the ideal case. This is impossible in the general RPC paradigm, as the RPC mechanism is active only before and after application-level execution of a call on the server, and thus cannot infer the success of execution if the server fails between these, [29]. This can in some cases be resolved through cooperation with the overlying application, but this must be at the expense of programmability, mechanism complexity and frequent writes to stable storage, and is seldom used in practice.

While the aforementioned paradigms ideally rely on an eventually reliable network, it is often not practical to resend messages an unbounded number of times until success. The solution is most often to utilize no-loss transport protocols, that is, transport protocols that perform sends using ACK/retry schemes and report back the delivery status of the send. While this type of transport protocol gives a high probability of delivery even over a faulty network, the overhead and high duration bound of such sends have given rise to a subdivision of at-most once semantics. Maybe once semantics provide zero-or-once execution semantics, but differ from regular at-most once in that the underlying network sends are not acknowledged and so are not resent. This best-effort communication scheme provides a lower bound for end-to-end call times and has little overhead, but at the cost of low reliability compared to regular at-least-once.

CORBA

Common Object Request Broker Architecture (CORBA) is a comprehensive standard for interoperability between distributed object frameworks, [9]. Distributed objects are processes offering remote execution that are treated as abstract objects to separate the remote execution interface from the underlying implementation and platform. While CORBA is not strictly an RPC mechanism, the most common mechanism for making calls to distributed objects is so close to RPC in both syntax and semantics that it is relevant for this thesis. Many extensions to CORBA have been proposed, among them extensions targeting real-time operation, [11], and fault-tolerance¹, [10]. CORBA allows for the use of any underlying transport protocol, but dynamic configuration of communication protocols is not standardized and is left to be specified by vendors, [24].

¹ It should be noted that Fault Tolerant CORBA focuses on fault tolerance through replication of services, while Ratatoskr focuses on replication of communication.
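The difference between at-least-once and at-most-once handling of a resent call, as described under Failure Semantics above, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the class names and the per-call `call_id` are hypothetical, and the duplicate filter is kept in memory rather than in stable storage, so it would not survive a server crash.

```python
class AtLeastOnceServer:
    """Executes every request it receives, including resent duplicates."""

    def __init__(self, procedure):
        self._procedure = procedure

    def handle(self, call_id, arg):
        return ("result", self._procedure(arg))


class AtMostOnceServer:
    """Attempts a given call_id at most once and rejects duplicates."""

    def __init__(self, procedure):
        self._procedure = procedure
        self._seen = set()    # in memory only; lost on a crash

    def handle(self, call_id, arg):
        if call_id in self._seen:
            # Negative response: tells the client to stop resending, but
            # leaves the outcome of the original attempt unknown to it.
            return ("duplicate", None)
        self._seen.add(call_id)    # record the attempt before executing
        return ("result", self._procedure(arg))


counter = {"value": 0}


def increment(amount):
    # Not idempotent: repeating the call changes the state again.
    counter["value"] += amount
    return counter["value"]


at_least = AtLeastOnceServer(increment)
first = at_least.handle(call_id=1, arg=5)
resent = at_least.handle(call_id=1, arg=5)    # client resent after a lost reply

counter["value"] = 0
at_most = AtMostOnceServer(increment)
first_filtered = at_most.handle(call_id=1, arg=5)
resent_filtered = at_most.handle(call_id=1, arg=5)
final_value = counter["value"]
```

With the non-idempotent `increment`, the at-least-once server applies the resent call a second time, while the at-most-once server filters it and leaves the state unchanged.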

Real-time CORBA

Real-time CORBA is an extension to CORBA for interoperability between frameworks accommodating real-time distributed systems. The extensions emphasize resource management in addition to the introduction of extensive call prioritization semantics, including mapping to OS thread prioritization. Real-time CORBA supports setting transport protocol QoS properties upon object binding, [26]. This allows setting policies per invocation by rebinding for each invocation. Real-time CORBA is a mature standard with several field-tested implementations. For example, the TAO ORB is being used for operation flight programs by the Boeing corporation, [27]. Two strategies for using existing implementations of Real-time CORBA for actuator control in the power grid would be to route Real-time CORBA traffic directly on top of utility networks, or to route Real-time CORBA traffic over a middleware layer that overcomes incompatibilities.

An alternative to using Ratatoskr over GridStat for actuator control is to employ Real-time CORBA on top of QoS-aware networking technologies, such as ATM or DiffServ IP. Such a Real-time CORBA approach would provide timely control messages. Further, network-level fault tolerance may be achieved by using multiple temporally redundant sends of each network packet. In addition to temporal redundancy, Ratatoskr uses the GridStat redundant paths feature to provide fault tolerance against network faults. In Chapter 5, an evaluation of the performance of the fault-tolerance capabilities of Ratatoskr shows that redundant path routing provides fault tolerance against certain fault categories that affect all temporally redundant sends along a single path. We are not aware of any wide-area network technology providing routing with redundant paths.
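The contrast between temporal and spatial redundancy drawn above can be made concrete with a deliberately simple probability model, sketched here in Python. The model is an assumption of this example and is not taken from the evaluation in Chapter 5: each copy is dropped independently with probability `omission_rate`, each path is independently unusable for the whole call with probability `path_outage_rate`, and the function name and parameters are hypothetical.

```python
def delivery_probability(copies, paths, omission_rate, path_outage_rate):
    """Probability that at least one copy of a message arrives.

    Illustrative model only: copies are lost independently of each other,
    and a path outage destroys every copy sent along that path.
    """
    # A single path delivers if it is up and not every copy on it is dropped.
    one_path = (1.0 - path_outage_rate) * (1.0 - omission_rate ** copies)
    # The message is lost only if every path fails to deliver.
    return 1.0 - (1.0 - one_path) ** paths


# Omission faults only: extra copies on a single path help.
single_copy = delivery_probability(1, 1, 0.1, 0.0)
three_copies = delivery_probability(3, 1, 0.1, 0.0)

# The only path is down for the whole call: extra copies cannot help.
outage_one_path = delivery_probability(3, 1, 0.1, 1.0)

# Paths fail half of the time: a second path improves the odds.
one_flaky_path = delivery_probability(3, 1, 0.1, 0.5)
two_flaky_paths = delivery_probability(3, 2, 0.1, 0.5)
```

Under this model, faults that persist for the duration of a call cancel the benefit of sending several copies along one path, which is the situation where redundant paths matter.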
While this thesis presents an RPC mechanism designed specifically for actuator control over a GridStat connection, an alternative approach would be to implement a transport protocol enabling Real-time CORBA to communicate over GridStat. Where Ratatoskr is a pure RPC system, CORBA provides the advantages of a distributed object architecture, and compatibility with a large set of existing third party software components. Since Real-time CORBA extends the complex CORBA standard, it requires adherence to a set of standardized semantics. While some requirements are provided in [6] and [17], the desired functionality of a power-grid control system is still largely unmapped and could potentially gain from mechanisms not compatible with the CORBA standard. The more minimalistic Ratatoskr design allows for rapid experimentation with features such as pre- and post-conditions and fine-grained QoS semantics. Further, the communication subsystem of Ratatoskr can easily be adapted to carry Real-time CORBA traffic instead of Ratatoskr RPC calls, if Real-time CORBA is deemed desirable for a grid deployment.

Fault-tolerance in CORBA

The distributed object architecture of CORBA lends itself well to service replication. As the distributed object interface is decoupled from the underlying implementation and environment, an object interface can be replicated into several implementations running in separate environments with minimum impact on observed behavior. Several CORBA implementations provide replicated objects, [23, 25, 20]. A replicated distributed object scheme, coupled with a Real-time CORBA implementation, would provide timely delivery and fault tolerance. Such a scheme would still have to rely on the underlying network for network-level fault tolerance, and would not be able to reap the benefits of redundant path routing. Further, object replication has to rely on strong multicast guarantees for synchronization between replicas, which results in a high worst-case number of message rounds in the face of communication failures and thus scales badly with geographical distance.

2.3 Publish/Subscribe

The Publish/Subscribe middleware architecture centers around producers of information (publishers) and information consumers (subscribers). Publishers make information events available to a middleware network, and subscribers can request that events be forwarded to them by the network.
The network forwards only subscribed data and can often optimize delivery paths through multicast, conserving bandwidth, [3]. The information flow is one-way; subscribers make subscription requests to the middleware network itself rather than to the publishers, allowing a decoupling between data producers and consumers. Further, published events can be stored in the network until the subscribers are ready to consume them, allowing a decoupling between publishing time and delivery to the subscriber, [7].

Status Dissemination

Status dissemination is a specialization of the publish/subscribe paradigm where publishers maintain status variables, [8]. Status variables are published values of a given type that are updated by publishing status events. Status events are limited by a maximum rate, and these restrictions in publication rate and type allow for additional QoS semantics compared to publish-subscribe systems without such restrictions.

2.4 GridStat

This section presents an overview of GridStat's architecture, and details the design of modules relevant to Ratatoskr. The purpose of this overview is to provide a background for the rest of the thesis. A more complete introduction to GridStat can be found in [8] and [2].

Architecture

The GridStat architecture is separated into two main subsystems: the data plane, a middleware data bus where status updates supplied by publishers are forwarded to subscribers, and the management plane, a set of servers that manages system resources and organizes subscriptions by receiving subscription requests from subscribers and configuring the data plane to forward accordingly. GridStat uses two kinds of communication traffic: data traffic is always forwarded through the data plane message bus, while control traffic between GridStat entities can be sent over any middleware control mechanism. The current implementation of GridStat uses CORBA and Ratatoskr as control message mechanisms.
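The split between the two planes can be pictured with a short Python sketch in which a management-plane object configures a data-plane router, and the router then forwards only what has been subscribed to. This is an illustration under stated assumptions rather than GridStat code: the class and method names are hypothetical, a single router stands in for the whole data plane, and delivery is a local callback.

```python
class StatusRouter:
    """Data plane: forwards status events according to its forwarding table."""

    def __init__(self):
        self._table = {}    # variable id -> list of delivery callbacks

    def add_route(self, variable_id, deliver):
        self._table.setdefault(variable_id, []).append(deliver)

    def forward(self, variable_id, value):
        # Events nobody subscribed to are simply not forwarded.
        for deliver in self._table.get(variable_id, []):
            deliver(value)


class QoSBroker:
    """Management plane: accepts subscription requests and configures routers."""

    def __init__(self, router):
        self._router = router

    def request_subscription(self, variable_id, deliver):
        # Admission control (resource checks) is omitted from this sketch.
        self._router.add_route(variable_id, deliver)


router = StatusRouter()
broker = QoSBroker(router)
received = []

broker.request_subscription(17, received.append)    # subscriber to management plane
router.forward(17, 59.98)                           # publisher to data plane
router.forward(42, 1.0)                             # no subscription: not delivered
```

The subscriber never contacts the publisher in this flow, which reflects the decoupling between producers and consumers described under Publish/Subscribe.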

Forwarding in the data plane is performed by status routers, middleware routers placed throughout a wide area network. Status routers form an overlay network by forwarding status events from router to router. The status routers retain implementations of all protocols used in the wide area network, and may function as bridges between parts of the network that use different networking technologies or have separate addressing spaces. Network connections in the data plane (from publishers and subscribers to status routers, and between status routers) are represented as event channels that contain abstractions of the data forwarding properties required for resource management. Each publisher and subscriber has event channels to one or more status routers.

Whereas the data plane has a flat organization, the management plane consists of a hierarchy of servers called QoS brokers. QoS brokers in the lowest level of the hierarchy are leaf QoS brokers, and are the only QoS brokers that directly communicate with entities in the data plane. QoS brokers above the leaf level are called internal QoS brokers and act as the sole parent QoS broker of one or more child QoS brokers. All QoS brokers have a parent, with the exception of the root QoS broker, and leaf QoS brokers do not have child QoS brokers. Each QoS broker is associated with a set of entities in the data plane, the QoS broker's cloud. The data plane is divided up so that each status router belongs to the cloud of exactly one leaf QoS broker. Status routers that have event channels to the same publisher or subscriber must be in the same cloud, and publishers and subscribers belong to the same cloud as their status routers. The clouds of internal QoS brokers are defined as the union of the clouds of their children, and thus the cloud of the root broker is all entities in the data plane. Entities are named according to their relationship to the management plane hierarchy.
A GridStat element must have a name unique within the scope of its parent; its full name is the name within the scope with an added prefix of the parent's name. This hierarchy of clouds is meant to correspond to a natural organization of management domains in the power grid, such as levels of geographical areas.

As the data plane provides bounded delay and other QoS guarantees for subscription data, additional subscriptions must not overload network resources. The management plane administers the use of resources in the data plane, and so handles subscription requests. Subscription requests are made by the subscriber to its leaf QoS broker. If both the publisher and the subscriber of a new subscription are within the leaf-level QoS broker's cloud, the leaf-level QoS broker is responsible for verifying that the connection will not overload network resources and for updating the status routers with the new subscription. If the publisher and subscriber are in different leaf-level clouds, the subscription request is propagated up in the hierarchy to the first QoS broker that has both within its cloud.

Ratatoskr is built on top of GridStat subscription paths, and the most relevant GridStat modules in the context of this thesis are the publisher and the subscriber.

Publisher

A publisher is a GridStat entity in the form of a module residing in an application program for publishing data to a GridStat network. It retains two connections to each of its status routers: an event channel for forwarding published status updates, and a middleware control channel for control messages that the status router forwards to the management plane. The application can announce a new published variable through the module interface by providing a string name as identifier, a type, and the rate at which it is published. There is currently no policing of the maximum and minimum rates of publish updates. The management hierarchy returns a 32-bit integer for identifying the variable within the GridStat network, a variableid. The application may update the value of a status variable through the module interface by specifying the variableid and the new value.

The types of variables provide semantics for subscribed events, as well as functionality outside the context of this thesis. The current types are various primary types (integer, floating point, bool...) and a user defined type, which is treated as a simple byte array by GridStat. The user defined type contains semantics for division into further subtypes, defined by the application.
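The announce and update steps described for the publisher can be illustrated with a small sketch. It is not the GridStat module interface: the class name SketchPublisher, the method names announce, update and current_value, and the local counter standing in for identifiers assigned by the management hierarchy are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import count
from typing import Any, Dict


@dataclass
class PublishedVariable:
    name: str          # string identifier chosen by the application
    var_type: str      # e.g. "integer", "float", "bool", "user_defined"
    rate_hz: float     # rate at which the variable is published
    value: Any = None  # last published value


class SketchPublisher:
    """Toy stand-in for a publisher module; hypothetical, not the GridStat API."""

    def __init__(self) -> None:
        # Assumption: a local counter replaces IDs issued by the management plane.
        self._next_id = count(1)
        self._variables: Dict[int, PublishedVariable] = {}

    def announce(self, name: str, var_type: str, rate_hz: float) -> int:
        """Register a variable and return its identifier (kept within 32 bits)."""
        variable_id = next(self._next_id) & 0xFFFFFFFF
        self._variables[variable_id] = PublishedVariable(name, var_type, rate_hz)
        return variable_id

    def update(self, variable_id: int, value: Any) -> None:
        """Store a new value; a real publisher would forward it on its event channel."""
        var = self._variables[variable_id]  # raises KeyError for unknown identifiers
        if var.var_type == "user_defined" and not isinstance(value, (bytes, bytearray)):
            raise TypeError("user defined variables are carried as byte arrays")
        var.value = value

    def current_value(self, variable_id: int) -> Any:
        return self._variables[variable_id].value
```

In this sketch an application would call announce once per variable, keep the returned identifier, and pass it to every later update, mirroring the role the variableid plays in the text.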

Subscriber

The subscriber is a GridStat entity module used by applications to subscribe to data published over the GridStat network by a publisher. Similar to the publisher, the subscriber also retains two channels to each of its status routers: an event channel for receiving subscribed updates and a control channel for subscribing to or unsubscribing from status variables. To subscribe to a published status variable, the application passes the variable name, the name of the publisher, QoS parameters and a SubscriptionHolder, an object that stores the status value and is updated by the subscriber when it receives updated values from its status router. Applications can access the values directly through the SubscriptionHolder interface, or can specify a callback method that will be invoked when the SubscriptionHolder is updated. There are several implementations of SubscriptionHolders corresponding to the types of status variables, and applications can provide additional implementations for added functionality, or for semantics supporting subtypes of user defined variables.

GridStat allows subscribers to specify that subscription data should be sent over redundant paths. Subscriptions over redundant paths are sent through more than one path in the GridStat network, where, with the exception of Entry-point SRs, a status router or event channel present in one path is not present in any other path.
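The disjointness condition on redundant paths can be stated as a short check. The sketch below assumes a path is given as a sequence of element names (status routers and event channels) and that the entry-point SRs are known. The function name paths_are_redundant and this representation are illustrative assumptions, not part of GridStat.

```python
from typing import Iterable, Sequence, Set


def paths_are_redundant(paths: Sequence[Sequence[str]],
                        entry_points: Iterable[str]) -> bool:
    """Return True if no element other than an entry-point SR appears in two paths."""
    allowed_shared: Set[str] = set(entry_points)
    seen: Set[str] = set()
    for path in paths:
        # Entry-point SRs may be shared, so they are excluded from the comparison.
        elements = set(path) - allowed_shared
        if elements & seen:
            return False  # some status router or event channel is reused
        seen |= elements
    return True
```

For example, two paths that leave the same entry-point SR but traverse different intermediate routers and channels pass the check, while two paths that both cross the same intermediate status router do not.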

CHAPTER THREE

THE RATATOSKR RPC MECHANISM

GridStat's mission is to provide a complete communication framework for the power-grid. In addition to the existing publish-subscribe functionality, a standardized control-mechanism is needed to allow power-utilities to control field equipment through the GridStat infrastructure. Such a mechanism will have to accommodate timely execution and high fault tolerance due to the critical nature of grid operation. Ratatoskr is an RPC mechanism designed to run on top of GridStat's publish-subscribe system, utilizing the QoS mechanisms provided by GridStat. Built into the RPC semantics are pre- and post-conditions on calls, intended for predicates over GridStat published variables. Ratatoskr's intended primary use is for control-center operators and mechanisms to send control-messages to actuators in substations, either directly accessing actuators or through an intermediary RPC server that can communicate with actuators through legacy APIs. This chapter gives an overview of the features of Ratatoskr.

3.1 Definition of terms

The parts of the text regarding the transport protocol use terms as defined in [18]. Additional terms are defined below.

2WoPS transport protocol - 2-Way over Publish Subscribe. Communication protocol defining two-way communication over two GridStat one-way subscription paths.

2WoPS peer - An application connected to a GridStat framework that utilizes the 2WoPS protocol for two-way communication, using a GridStat publisher for sending data and a GridStat subscriber for receiving data.

Ratatoskr peer - A device connected to a GridStat framework that utilizes Ratatoskr RPC for communication.

Entry-point SR - The GridStat status-router a publisher or subscriber connects to. When used in relation to a 2WoPS peer, the entry-point SR signifies the edge status-router used to connect both the publisher and subscriber of the 2WoPS peer. The current implementation of GridStat allows publishers and subscribers to connect only to a single status router, while the architecture allows for multiple connections. The rest of this thesis considers only the case of a single entry-point SR per publisher or subscriber, as the exact semantics of multiple entry-point SRs are still undefined.

TSDU - Transport Services Data Unit, a chunk of data from an overlying application that is sent through a transport layer connection.

Transport protocol control message - Similar to a TSDU, but the data is for control of the 2WoPS protocol, not for application use.

TPDU - Transport Protocol Data Unit, a chunk of data from the transport layer that is sent over a network layer connection. In this context, GridStat pub/sub communication is seen as a network layer. A TPDU can be a TSDU with added transport layer headers, or data used exclusively for control information by the transport layer. Several TPDUs can duplicate the same TSDU, and a single TSDU can be spread over multiple TPDUs, although the latter is not implemented in the prototype (see section ).

NSDU, NPDU - Network Service Data Unit and Network Protocol Data Unit, similar to TSDU and TPDU but for the network layer (GridStat pub-sub). An NSDU is exactly the same data as a corresponding TPDU, but viewed in the context of the network protocol layer. An NSDU with an added network-layer header is an NPDU.
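The layering of these data units can be made concrete with a small encapsulation sketch. The header layouts below (a one-byte kind plus a four-byte sequence number for the transport layer, and a four-byte variable identifier for the network layer) are purely hypothetical and chosen for illustration. They are not the wire format of 2WoPS or GridStat.

```python
import struct

# Hypothetical header layouts, assumed only for this sketch.
TRANSPORT_HEADER = struct.Struct("!BI")  # kind (1 byte), sequence number (4 bytes)
NETWORK_HEADER = struct.Struct("!I")     # variable identifier (4 bytes)

KIND_DATA = 0     # the TPDU carries a TSDU
KIND_CONTROL = 1  # the TPDU carries a transport protocol control message


def make_tpdu(kind: int, seq: int, payload: bytes) -> bytes:
    """Wrap a TSDU (or control message) in a transport-layer header."""
    return TRANSPORT_HEADER.pack(kind, seq) + payload


def make_npdu(variable_id: int, nsdu: bytes) -> bytes:
    """Wrap an NSDU (the TPDU bytes, seen from the network layer) in a network header."""
    return NETWORK_HEADER.pack(variable_id) + nsdu


def parse_npdu(npdu: bytes):
    """Strip the network header and return (variable_id, nsdu)."""
    (variable_id,) = NETWORK_HEADER.unpack_from(npdu)
    return variable_id, npdu[NETWORK_HEADER.size:]


def parse_tpdu(tpdu: bytes):
    """Strip the transport header and return (kind, seq, payload)."""
    kind, seq = TRANSPORT_HEADER.unpack_from(tpdu)
    return kind, seq, tpdu[TRANSPORT_HEADER.size:]
```

Sending the same TSDU twice with the same sequence number yields two TPDUs that duplicate one TSDU, which is the situation the definition of TPDU refers to.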

3.2 Two-way Communication over a Publish-Subscribe Framework

GridStat is a publish-subscribe system. Publishers in the system make data in the form of status updates available to the GridStat framework. Subscribers may request subscriptions to these variables, and GridStat will forward subscribed information from publishers to subscribers according to QoS properties specified at subscription time. Communication is strictly one-way; subscribers have no way of sending information to publishers. RPC communication requires two-way communication, as procedure calls often return values to the client, and acknowledgments of successful calls are almost universally required even when the call has no return value.

To allow a two-way communication link to be established, Ratatoskr utilizes a transport protocol called the 2WoPS protocol on top of GridStat. The 2WoPS protocol achieves two-way communication by instantiating both a publisher and a subscriber behind a single interface. To set up a two-way data path, two 2WoPS peers each publish a data variable specific to the session, and subscribe to the other peer's corresponding variable. Data is sent over the connection by publishing a status update containing the data, and received by the other peer through the subscriber interface. The 2WoPS interface masks the publisher and subscriber behavior.

Using a layered approach to communication allows the 2WoPS protocol to be used for purposes other than Ratatoskr RPC traffic. For example, the 2WoPS protocol was used for control communication between QoS brokers in [1].

Figure 3.1 shows the relationship between the modules of Ratatoskr (light shade), the GridStat modules used by Ratatoskr (dark shade), and examples of potential other applications using GridStat or Ratatoskr modules (white). The example shows the architecture stack for a control center and a substation. The main intended use of Ratatoskr is illustrated by the control center control system using Ratatoskr RPC to execute control operations on an actuator in the substation. Other uses of the 2WoPS protocol may be to transport legacy control messages to actuators if the actuator API remains to be fully implemented for Ratatoskr. The publisher and subscriber used by the 2WoPS protocol may have other uses, such as sending sensor data from the substation to the control center, or publishing reports of power-grid state aggregated in the control center to be used by protection schemes in the substation. Finally, while GridStat requires control of the underlying network resources, network technologies that manage resource use may reserve bandwidth for uses outside GridStat, such as transferring video feeds from surveillance cameras in the substation.

Properties of the 2WoPS Protocol

The 2WoPS protocol is designed specifically for the Ratatoskr RPC. While this does not rule out other uses for two-way communication over GridStat, care should be taken in noting the properties of the protocol, as these differ from the most common transport protocols, TCP and UDP. Some suggested extensions to the protocol to enhance use for other applications can be found in section . This section gives a summary of the main properties of the 2WoPS protocol.

Connection oriented - This was a necessary design decision as the underlying GridStat communication is connection-oriented. The 2WoPS protocol interface provides methods to open and close a connection.

Controlled-loss - An adjustable ACK/resend scheme similar to the k XMIT scheme found in [21]. A TSDU will be retransmitted up to k times, where k is a user specified number. No ACK status is sent by the server on the k-th resend. This reduces the deadline for the sending process by the time for sending the ACK, at the expense of knowledge of the delivery status. It should be noted that while a missing ACK suggests that the message was not delivered, it cannot guarantee a failed delivery, as the message might have arrived while the ACK was lost. Because delivery status is unclear, the overlying RPC mechanism must still wait for a return from the server. When k is set to 0 the scheme has uncontrolled-loss properties. The controlled-loss scheme gives little indication of the success of a call, which might be impractical for non-RPC use, so a no-loss scheme is also provided.

Figure 3.1: Ratatoskr Module Stack

No-loss - An adjustable ACK/resend scheme. As with controlled-loss, TSDUs are retransmitted up to k times, but the no-loss scheme delivers an ACK even on the final send. For no-loss, delivery of a TSDU is uncertain only if each send attempt experiences faults, while for controlled-loss, delivery of a TSDU is uncertain even if the k-th send attempt experiences no faults. This gives weaker failure semantics for controlled-loss, and more so at a low k. Controlled-loss blocks for 2k - 1 trip-times per send and uses 2k - 1 TSDU-transfers of bandwidth, where no-loss blocks for 2k trip-times and uses 2k TSDU-transfers per send.

Timeliness - GridStat provides delivery guarantees for subscriptions. The delivery guarantees of the underlying subscriptions are used to calculate tight timeout values for ACK/resends, and delivery guarantees for TSDU sends.

Blocking - Execution of a sending thread is blocked until the send is completed. A send is completed either when delivery is confirmed by receiving an ACK from the receiver, when the k-th ACK times out for no-loss, or after the k-th send for controlled-loss. Multiple threads are still allowed to send in parallel.

Unordered delivery - No message ordering is provided. Received NSDUs containing TSDUs are delivered to the application in the order they were delivered to the 2WoPS protocol by GridStat.

No duplicates - TPDUs duplicating the same TSDU are filtered so the TSDU is delivered only once to the server application.

Error control - A simple cyclic redundancy check (CRC) is used to discard TPDUs containing bit errors.

Hierarchical naming - A naming scheme similar to the one for publishers and subscribers in GridStat is used. A 2WoPS peer is identified within its GridStat cloud by a string s with no spaces. The peer registers the publisher and subscriber used for communication with names based on this string: the publisher is named spub and the subscriber ssub. The names of the publishers and subscribers must be locally unique, that is, no other 2WoPS peer may have the same name within the leaf-level cloud of the entry-point, and since clouds have unique names the fully qualified name is globally unique. A leaf QoS broker stores the names of all elements in its cloud and prevents registration of locally non-unique names.

Message oriented - TSDUs are bounded by the maximum size of GridStat status updates, which is in turn bounded by an underlying transport protocol (UDP for the research prototype of GridStat).

Reliability Measures

A serious concern in any wide area network is that the number of components, geographical extent, and usage patterns of such networks inevitably lead to lowered reliability when compared to local area networks. This is especially apparent in the Internet, where most traffic uses the TCP transport protocol, which uses TPDU drops to indicate congestion so it can regulate bandwidth usage. While GridStat controls network traffic at the network edges to avoid network overload at least during normal operation, a GridStat deployment must be expected to share many of the loss properties of the Internet stemming from sources other than traffic overload. These include hardware failure, maintenance, line damage or short-term miscommunication between routers. A 2002 study on an Internet backbone found that, with respect to mean failure rate, the median link failed every ten days [16]. The mean failure had a duration of over one minute, and 10% of failures lasted over 20 minutes. Such failure patterns are acceptable in the Internet because routing protocols will discover link errors and reconfigure routing to direct traffic around the affected links in a matter of seconds, and because few Internet applications depend on high network reliability. Also, while network drop rates during transfer are negligible in the fiber and copper lines common in wide area networks today, GridStat is an overlay network and underlying physical network technologies might display other properties. Connecting remote substations to a utility network by fiber is expensive, and alternatives include microwave signaling, WiFi, power-line communications or satellite, all suffering from various forms of signal interference. The 2WoPS protocol provides several kinds of redundancy to overcome network failures.

Reliability Techniques in the 2WoPS Protocol

The 2WoPS protocol employs three techniques for overcoming network losses:

ACK/resend: allows specially marked TPDUs to be ACKed back to the sender, enabling the sender to resend the TPDU until it is confirmed successfully sent. ACK/resend is allowed for TPDUs containing TSDUs, enabling ACK/resend semantics on application messages. If an ACK is lost, the sender will not be aware of delivery success and will resend the TSDU, so redundant TSDUs must be filtered at the receiver. This technique guarantees successful delivery given an unlimited number of resends and an eventually-consistent network connection. Further, the technique uses a very limited amount of bandwidth to achieve fault tolerance. The main disadvantage of the technique is that the sender must wait a full RTT before a packet is confirmed lost and resend is commenced, and so the time for successful
