An Orthogonal and Fault-Tolerant Subsystem for High-Precision Clock Synchronization in CAN Networks *


GUILLERMO RODRÍGUEZ-NAVAS and JULIÁN PROENZA
Departament de Matemàtiques i Informàtica
Universitat de les Illes Balears
Ed. Anselm Turmeda, Campus UIB, Palma de Mallorca
SPAIN

Abstract: - Although the Controller Area Network (CAN) protocol is increasingly used for real-time critical applications, its original specification does not provide a clock synchronization service. In this paper we introduce the architecture of a clock subsystem that provides any CAN system with a clock synchronized with high precision. The main advantage of our subsystem is that, unlike previous solutions, it provides this service without replacing the circuitry or substantially changing the software of the nodes. For this reason, we consider our subsystem to be orthogonal to the rest of the network. Another advantage of our clock subsystem is the presence of specific fault tolerance mechanisms, which improve on those of previous solutions. As a result, our subsystem is able to tolerate its own faults without affecting the nodes of the system.

Key-Words: - Distributed embedded systems, Real time, Clock synchronization, Controller Area Network, Fault tolerance

1 Introduction

The Controller Area Network (CAN) protocol [5] defines a serial bus that is used in a wide range of applications, including automotive and industrial automation, because of its reliability, real-time performance [9], and low cost. A CAN network consists of a set of nodes that exchange messages following the CAN protocol in order to work cooperatively. As Fig. 1 shows, each node of a CAN network consists of three basic elements: a processor, typically a microcontroller, which executes the application software; a CAN controller, which implements most of the CAN protocol; and a transceiver, which simply adapts the transmission and reception signals to the communication medium.
An increasing number of CAN networks require their processors to have a synchronized clock. However, as the CAN protocol does not include such a service, it has to be provided by implementing a synchronization algorithm at the application layer. These synchronization algorithms are typically implemented in the software executed by the processor [3, 7, 10], but this kind of implementation has some disadvantages. First, it requires substantial modifications to the software of the processor, which have an inherent cost and complexity. Second, since the interface between the processor and the CAN controller has not been standardized yet, this software-implemented clock synchronization service is not completely independent of the CAN controller: if the CAN controller were replaced, the software would have to be substantially modified again. Finally, as the synchronization information is processed by software in the processor and not at the hardware level, there are significant latencies that prevent high precision from being achieved in the clock synchronization.

Figure 1. Architecture of a CAN network

* This work has been supported by the Spanish MCYT grant DPI C03-02, which is partially funded by the European Union FEDER program.

In addition to software implementations, the hardware implementation of the Time-Triggered CAN protocol (TTCAN) [2, 4] has been argued to be a solution for high-precision clock synchronization in CAN networks. TTCAN is a higher-layer extension of the CAN protocol that operates the network in a time-triggered mode and, because of this, incorporates a clock synchronization service. Nevertheless, the hardware implementation of the TTCAN protocol also has some problems that discourage its use in many CAN networks. First, it requires the replacement of the standard CAN controllers by TTCAN controllers, which implies that the software of the processor has to be substantially modified. Second, clock synchronization in TTCAN is only provided when the network operates in time-triggered mode. Therefore, this service is not compatible with most of the software already developed for standard CAN networks, which assumes an event-triggered operation mode.

From the above discussion, it can be concluded that all the current solutions for clock synchronization in CAN networks, both hardware and software, have important disadvantages. The work presented in this paper aims to overcome them by designing a subsystem that provides a high-precision clock service that is independent of the CAN controller used, and that implies only minor changes in the software of the processor.

The architecture of this subsystem consists of a set of additional hardware modules, which we have named clock units. As depicted in Fig. 2, each of these clock units is attached to a different node of the network. The function of the clock unit is to provide the processor of its node with a clock that is transparently synchronized with the clocks of the other clock units of the network. This clock is kept in a register that is mapped into the memory of the processor. Thus, the software of the processor only has to be slightly modified in order to read the value of this synchronized clock. In addition, the clock units communicate independently of their nodes' CAN controllers because they incorporate their own circuitry for sending and receiving CAN messages.
Moreover, they are connected to the network through their own transceivers (Fig. 2). Due to this orthogonality to the rest of the network, the entire set of clock units with their corresponding transceivers can be seen as an independent subsystem. This subsystem will be called the clock subsystem hereafter.

Figure 2. A node that incorporates the circuitry of our clock subsystem

Besides its orthogonality, another important property of the clock subsystem we have defined is its fault-tolerant behaviour. This property is achieved by implementing some mechanisms that will be described later on.

In the rest of the paper, we present the complete architecture of our clock subsystem. Section 2 explains the basic features of the clock unit. Section 3 describes the mechanisms that have been included in the clock units in order to achieve fault tolerance. The last section summarizes the work presented in this paper.

2 Basic features of the clock unit

As indicated above, the function of each clock unit is to provide its node's processor with a transparently synchronized clock. In order to simplify the hardware design of the clock unit, all the functions that it performs have been grouped into three modules. These modules are named global clock register, synchronization module and CAN module (Fig. 3). The function of each module is explained next.

The global clock register contains the value of the synchronized clock. This register is mapped into the memory of the processor, so that the application software can easily read it.

The synchronization module keeps the global clock register synchronized with the global clock registers of the other clock units. To perform that function, this module incorporates an algorithm for clock synchronization.
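As a rough illustration of what such a synchronization module has to compute, the sketch below models a slave in a master/slave scheme of the kind this paper adopts (described in the following paragraphs). It is only a software model written for this text, not the hardware design of the clock unit: the class `SlaveClock`, its fields and the use of floating-point ticks are all assumptions. Each time a reference message arrives, the slave pairs the master's sample with its own sample of the same SOF bit; from two consecutive pairs it can estimate its drift with respect to the master.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlaveClock:
    """Hypothetical software model of a slave's synchronization state."""

    rate: float = 1.0                   # estimated master ticks per local tick
    ref_master: Optional[float] = None  # master's sample at the last reference SOF
    ref_local: Optional[float] = None   # local sample taken at that same SOF

    def on_reference(self, master_sample: float, local_sample: float) -> None:
        """Handle a reference message carrying the master's SOF sample."""
        if self.ref_master is not None and local_sample != self.ref_local:
            # Drift estimate: elapsed master time over elapsed local time
            # between two consecutive reference messages.
            self.rate = (master_sample - self.ref_master) / (
                local_sample - self.ref_local
            )
        # Offset correction: adopt the master's view at this SOF.
        self.ref_master = master_sample
        self.ref_local = local_sample

    def now(self, local_counter: float) -> float:
        """Translate a raw local counter value into synchronized time."""
        if self.ref_master is None or self.ref_local is None:
            raise RuntimeError("no reference message received yet")
        return self.ref_master + self.rate * (local_counter - self.ref_local)


if __name__ == "__main__":
    # Local oscillator assumed 100 ppm fast and started 500 ticks ahead.
    slave = SlaveClock()
    slave.on_reference(0.0, 500.0)
    slave.on_reference(10_000.0, 10_501.0)
    print(slave.now(15_501.5))
```

In this model the offset is corrected after one reference message, while the drift estimate needs two, which matches the remark in section 3.3 that precision is only guaranteed again once two reference messages have been received.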
Although any algorithm could have been incorporated into this module, we decided to use an algorithm based on the one defined in the specification of the TTCAN protocol, because of its high precision and efficiency. However, we have made some modifications to this algorithm that improve the fault tolerance of the synchronization. At this point, only the basics of the synchronization algorithm used by TTCAN [2, 4] are described. The modifications that we have made to this algorithm are explained in section 3.

The algorithm for clock synchronization defined by TTCAN is based on a centralized synchronization scheme: one of the nodes (the time master) is assumed to have a correct view of the time, and the rest of the nodes (the slaves) simply accept its view. The mechanism for spreading the time view of the time master works as follows. Each node (including the time master) takes a sample of its clock at the sample point [5] of the SOF bit of any message. This sampling is done almost simultaneously by all the nodes, due to the in-bit response of the CAN protocol. Thanks to this simultaneity, each slave can compare its sample to that of the time master in order to know whether they are synchronized or not. To allow this comparison, the time master periodically sends, within the data field of a reference message, the sample that it has just taken at the SOF bit of the reference message itself. Thus, all the slaves can update their clocks in order to synchronize to the time master's clock and, moreover, they can calculate their drift with respect to the time master and correct it. As remarked in [2], this algorithm achieves a precision on the order of one bit time (e.g. on the order of 1 µs when CAN operates at 1 Mbps).

As TTCAN operates in a time-triggered mode, the automatic retransmission of erroneous messages is disabled in order to prevent a retransmitted message from interfering with the time slot of other nodes. However, the retransmission of erroneous messages is a standard feature of the CAN protocol that allows the nodes to recover from errors in the channel. Therefore, in order to make our solution compatible with any CAN network, the synchronization algorithm we have included in our clock module does assume that erroneous messages are automatically retransmitted. Since this difference fundamentally concerns the fault tolerance of the synchronization, it will be further discussed in section 3.

Figure 3. Structure of a clock unit (global clock register, synchronization module and CAN module)

The CAN module includes all the necessary circuitry to send and receive CAN messages. This makes each clock unit independent of its node's CAN controller. A standard CAN core (that is, the standard circuitry that implements the basic features of a CAN controller) cannot be used instead of our CAN module.
This is due to the fact that the algorithm for clock synchronization incorporated in the synchronization module requires some services that a standard CAN core does not provide. Specifically, two additional services are required: one to indicate the sample point of the first bit of any message, and another to allow the time master to write in the data field of the reference message while it is being transmitted. It is important to remark that, even though these new services are required, the circuitry of our CAN module has lower complexity than the circuitry of a standard CAN core. This is because the clock units only exchange a single message (the reference message), and therefore the circuitry for managing reception and transmission buffers can be substantially simplified.

3 Fault-tolerant synchronization

As explained in section 2, our clock subsystem uses a centralized synchronization scheme. In principle, only one clock unit performs the function of time master and the rest of the clock units are slaves. However, a disadvantage of this scheme is that the time master becomes a single point of failure of the clock subsystem, since whenever it has a fault the entire synchronization process stops working. To avoid this situation, we need to provide the clock subsystem with some mechanisms that allow the faults of the time master to be tolerated.

Since the time master is also a single point of failure in the synchronization algorithm of TTCAN, we initially considered including the mechanisms that this protocol provides to tolerate faults of the time master. However, an in-depth analysis of these mechanisms showed that they do not completely solve the issue of fault tolerance and, hence, they could not be directly used in our clock units. As a consequence, we decided to design some new mechanisms, inspired by the ones used by TTCAN, that actually provide our clock units with suitable fault tolerance.
In this section, we first present and analyze the mechanisms for fault tolerance defined by TTCAN. After that, we introduce the new mechanisms we have designed.

3.1 Tolerance to faults of the time master

In order to tolerate faults of the time master, TTCAN declares a number of nodes as replicated time masters (called spare time masters). The function of these replicated time masters is to substitute the main time master whenever it fails.

The spare time masters detect errors of the time master as follows. They know when the main time master has to send its reference message and, as this message cannot be delayed by others because of the time-triggered operation of TTCAN, they can also determine when it should be received. Thus, they consider that the main time master has suffered an error whenever the reference message is not received on time. Note that, since TTCAN does not support the automatic retransmission of erroneous messages, with this approach a channel error that happens during the transmission of the reference message can be incorrectly taken as a failure of the main time master.

Once a spare time master detects an error of the main time master (i.e. an omission of the reference message), it performs an error-recovery mechanism, which simply consists in transmitting its own reference message. In that way, the network can continue its normal operation. Since all the spare time masters are able to detect the omissions of the reference message, more than one time master may simultaneously try to send the reference message. However, in TTCAN each time master sends a message with a different identifier. Thus, the CAN arbitration mechanism [5] resolves this conflict by causing the time master with the highest-priority identifier to succeed in sending the reference message. After that, the spare time master that has won the arbitration becomes the main time master.

In order to tolerate faults of the time master, our clock subsystem includes some of the mechanisms used by TTCAN. First, we have included time master redundancy, since a number of clock units are declared as spare time masters. Second, we have included the error-recovery mechanism, since the spare time masters are also in charge of sending the reference message when the main time master fails. Nevertheless, we have found that the error-detection mechanism of TTCAN has two important problems and that, because of them, it cannot be incorporated into our clock subsystem without modification.

The first problem of the error detection in TTCAN is that the time masters are assumed to fail only by not sending the reference message. In other words, an omission failure semantics [1] is assumed for the time masters.
However, this assumption is not substantiated by the architecture of the TTCAN nodes, since they are not provided with any mechanism that prevents the time masters from suffering other kinds of failures, like performance failures or Byzantine failures (e.g. a faulty time master that takes the identity of the main time master and sends an absurd value of time).

The second problem of the error detection is related to the fact that only the errors of the main time master can be detected. Due to this, the actual availability of the spare time masters is unknown to the rest of the network.

To solve the problems presented above, we have incorporated new mechanisms into our clock subsystem. First, we have designed a hardware structure and a distributed protocol that actually substantiate the assumption of omission failure semantics, not only in the time masters but in every clock unit. These mechanisms are described in section 3.3. In addition, we have defined a new error-detection mechanism, which allows all the clock units to have a consistent view of the actual state of the entire set of time masters. This mechanism is presented next.

3.2 Detecting errors of the time masters

In our clock subsystem, strictly speaking only the errors of the main time master are detected. However, in practice we make all the time masters play the role of main time master one after the other, following a round-robin scheme. Therefore, the error of any time master can be detected by the entire set of clock units with a relatively short latency.

The detection of errors is performed as follows. The main time master has to send the reference message at a given instant, which all the clock units know. This reference message has the highest priority in the network. The spare time masters, in turn, have to send their reference messages a short time interval after that instant.
The length of this interval depends on the precision of the clock, as it should be long enough to guarantee that a spare time master cannot schedule the transmission of its reference message before the instant at which the main time master has to. If this condition is guaranteed, then whenever the reference message of a spare time master is received instead of that of the main time master, it can be stated that the latter has not sent its reference message. Note that, due to the omission failure semantics of the clock units (and hence of the time masters), this mechanism allows the clock units to become aware of all the faults of the main time master, since any fault will manifest as an omission of the reference message. Of course, when the reference message of the main time master is received, all the spare time masters abort the transmission of their reference messages in order to avoid the transmission of useless messages.

The consistency of this error-detection mechanism relies on the Atomic Broadcast property provided by the CAN protocol [5], which guarantees that a message is received either by all the nodes of the network or by none. This condition is enforced in our case by the fact that the clock units cannot fail in a way that implies a violation of the CAN protocol, since they have omission failure semantics. Although some contributions have shown that the CAN protocol does not actually provide Atomic Broadcast in certain scenarios, we assume that it is provided, as solutions to this problem have already been suggested [6, 8].
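The detection rule above can be sketched in a few lines. This is an illustrative model only: the names `spare_delay_is_safe` and `MasterRotation` are hypothetical, the precision is modelled as an upper bound on the difference between any two synchronized clocks, and the rotation is assumed to advance by one position after every round whether or not the main time master failed (the paper does not spell out that detail).

```python
from dataclasses import dataclass, field
from typing import List


def spare_delay_is_safe(delay: float, precision: float) -> bool:
    """A spare may run ahead of the main master by at most `precision`,
    so its extra delay must exceed that bound (assumed model)."""
    return delay > precision


@dataclass
class MasterRotation:
    """Hypothetical view, kept by every clock unit, of whose turn it is."""

    masters: List[int]                    # time master ids in round-robin order
    turn: int = 0                         # index of the current round
    omissions: List[int] = field(default_factory=list)

    @property
    def main(self) -> int:
        """Id of the time master expected to send in the current round."""
        return self.masters[self.turn % len(self.masters)]

    def on_reference(self, sender: int) -> bool:
        """Process the reference message of this round.

        Returns True if it reveals an omission of the main time master,
        i.e. the message came from a spare instead.
        """
        if sender not in self.masters:
            raise ValueError("sender is not a time master")
        expected = self.main
        failed = sender != expected
        if failed:
            self.omissions.append(expected)
        self.turn += 1                    # assumed: rotate after every round
        return failed
```

Because every clock unit receives the same reference message (Atomic Broadcast) and applies the same rule, all of them record the same omissions, which is what makes the detection consistent.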

Once our system has been provided with this mechanism for consistently detecting errors in all the time masters, a consistent membership service can be easily designed. The membership information kept by every clock unit is a penalty count of the errors that each time master has shown. Whenever an error of a time master is detected, its penalty count is increased by a given amount. Whenever a time master does not fail and sends its reference message, its penalty count is decreased by a smaller amount. If the penalty count of a time master reaches a certain threshold, it is consistently considered to have a permanent failure.

3.3 Omission failure semantics

In order to provide our clock units with omission failure semantics, we have designed two mechanisms. The first one is a hardware structure that detects any error caused by an internal fault in the clock unit and that, moreover, prevents those errors from spreading by disconnecting the faulty clock unit from the rest of the system. The second one is a distributed protocol that allows the clock units to recover from an internal error. Note that without this second mechanism the clock unit would have crash failure semantics, since a single fault would leave the clock unit disconnected indefinitely. The hardware structure is presented next. After that, the distributed protocol is described.

The new hardware structure consists of a duplicated clock unit with a comparison module that supervises its behaviour. This comparison module performs two different comparisons: one at the high level, between the global clock registers kept by each replica of the clock unit, and another at the low level, between the frames sent by each replica of the CAN module to the transceiver. Thus, whenever one of the replicas of the clock unit has a fault, it manifests as an error in one of these comparisons and is detected.
Moreover, when such an error is detected, the comparison module prevents the duplicated clock unit from transmitting any data to the network by disabling its transceiver.

To achieve omission failure semantics, a duplicated clock unit that has detected an error, and is therefore disabled, must have the opportunity to recover. To do this, the duplicated clock unit is provided with an internal recovery mechanism. This recovery mechanism re-establishes a correct internal state in the duplicated clock unit, when that is possible, and then enables the transceiver again. However, this recovery mechanism cannot guarantee that the membership information kept by the duplicated clock unit is correct (since it could have been corrupted by the fault) or consistent with the actual state of the system (since some reference messages could have been lost during the recovery). Therefore, we have designed an additional mechanism to obtain correct and consistent membership information from the other time masters. We initially considered two options for implementing this mechanism. Both are introduced next.

The first option consisted in using a mechanism for requesting information from other nodes that is natively provided by the CAN protocol: the Remote frame [5]. However, although this mechanism allows the time masters to recover in a short time, it was rejected because it introduces a sporadic message, whose transmission and reception would significantly increase the complexity of the clock unit. In addition, this mechanism would also complicate the scheduling of messages in the system.

The second option is the one incorporated in our subsystem. It consists in sending the membership information, together with the time value, within the data field of every reference message.
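To make the membership payload concrete, the sketch below keeps one penalty count per time master and packs the whole table into three bytes, using the figures the paper fixes later in this section (eight time masters, three bits per count). Everything else is an assumption made for illustration: the increase and decrease amounts, the threshold, the saturation at the maximum count, the bit ordering, and the function names.

```python
from typing import List

NUM_MASTERS = 8                 # from the paper
BITS = 3                        # from the paper: three bits per penalty count
MAX_COUNT = (1 << BITS) - 1     # largest value a three-bit count can hold (7)
INCREASE = 2                    # assumed penalty added on a detected error
DECREASE = 1                    # assumed (smaller) amount removed on success
THRESHOLD = MAX_COUNT           # assumed threshold for a permanent failure


def update(counts: List[int], master: int, failed: bool) -> List[int]:
    """Return a new table after one round involving `master`."""
    new = list(counts)
    if failed:
        new[master] = min(MAX_COUNT, new[master] + INCREASE)
    else:
        new[master] = max(0, new[master] - DECREASE)
    return new


def is_permanent(counts: List[int], master: int) -> bool:
    """True if `master` is considered to have a permanent failure."""
    return counts[master] >= THRESHOLD


def pack(counts: List[int]) -> bytes:
    """Pack eight three-bit counts into three bytes (master 0 first)."""
    if len(counts) != NUM_MASTERS:
        raise ValueError("expected one count per time master")
    word = 0
    for count in counts:
        if not 0 <= count <= MAX_COUNT:
            raise ValueError("penalty count does not fit in three bits")
        word = (word << BITS) | count
    return word.to_bytes(3, "big")


def unpack(data: bytes) -> List[int]:
    """Inverse of pack(): recover the eight counts from three bytes."""
    if len(data) != 3:
        raise ValueError("membership field must be three bytes long")
    word = int.from_bytes(data, "big")
    return [
        (word >> (BITS * (NUM_MASTERS - 1 - i))) & MAX_COUNT
        for i in range(NUM_MASTERS)
    ]
```

A recovering clock unit would simply replace its own table with `unpack(...)` of the field in the next reference message it receives. A spare time master that sends in place of the main one would, as described later in this section, call `update(counts, main, failed=True)` before packing.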
The main advantage of this mechanism is its simplicity, as the membership information is sent in a periodic message that the time masters already have to send, and therefore it does not require substantial changes in the clock units. Thus, after re-establishing its internal state, a recovering clock unit only needs to wait until the reception of the next reference message. Once this message is received, the consistency of the membership information is guaranteed again. Thanks to this consistency, a just-recovered clock unit that is a time master can know in which state the other clock units consider it to be. If it is not considered to have a permanent failure, it joins the round robin, so that all the clock units will detect that it has recovered as soon as it becomes the main time master and sends the reference message. On the contrary, if the rest of the time masters consider it to have a permanent failure, it does not join the round robin and no longer sends its reference message.

Since a recovering clock unit does not have consistent membership information until a reference message is received, it is not able to know the identity of the main time master. Therefore, if the main time master failed and the reference message of a spare time master were received, this clock unit would not be able to increase the corresponding penalty count. In order to avoid this situation, the spare time masters send the membership information with the penalty count of the main time master already increased. Note that the correctness of this mechanism relies on the fact that the reference message sent by a spare time master can only be received if the main time master has actually failed, as explained in section 3.2.

Sending the membership information within the reference message limits the number of time masters allowed in our clock subsystem. We have decided to use eight time masters, and to encode the penalty count of each of them with three bits. In this way, the membership information occupies three bytes in the data field of the reference message. The other bits can be used to send the time of the main time master and some control information.

It is important to remark that whenever a clock unit (either time master or slave) has a failure, the precision of the clock that it provides to the processor of its node cannot be guaranteed until the hardware recovery function has been executed and two reference messages have been received [2, 4]. Thus, it should be decided at the application level whether the processor is able to work during this time with a clock of degraded precision.

4 Summary

In this paper, the architecture of a subsystem that solves the problem of high-precision clock synchronization in event-triggered CAN networks has been introduced. This architecture consists of a set of additional modules, called clock units, which are attached to the nodes of the original network. Each clock unit is attached to a different node and provides the processor of this node with a clock transparently synchronized with the rest of the clock units.

Our solution is considered orthogonal, in contrast to previously suggested solutions, because it is independent of the CAN controller used by the nodes of the network, and because the processors of the nodes can use it without substantially changing the application software. Such orthogonality is desirable because it reduces the implementation cost.

Although our architecture is compatible with any synchronization algorithm, we have decided to implement an algorithm similar to the one defined by the TTCAN protocol, because of its high precision and efficiency.
This algorithm uses a centralized synchronization scheme, since a single node, called the time master, is in charge of maintaining the synchronization. Because of this, some mechanisms that provide tolerance to faults of this time master are required. A significant contribution of the present work is that, as the mechanisms defined by TTCAN do not fully provide this fault tolerance, we have defined new mechanisms that actually provide suitable tolerance to faults of the time master. As a consequence, our clock subsystem is able to tolerate its own faults without affecting the rest of the nodes and, therefore, it can be added to any CAN network without decreasing the global reliability of the system. Moreover, we have designed a consistent membership service, which guarantees that the actual availability of all the time masters is consistently known by the entire set of clock units.

References:

[1] F. Cristian, Questions to ask when designing or attempting to understand a fault-tolerant distributed system, Keynote Address in Proc. 3rd Brazilian Conference on Fault-tolerant Computing, Rio de Janeiro, Brazil.
[2] T. Führer, B. Müller, W. Dieterle, F. Hartwich, R. Hugel, M. Walther, Robert Bosch GmbH, Time Triggered Communication on CAN, Proceedings of the 7th International CAN Conference, Amsterdam, The Netherlands.
[3] M. Gergeleit and H. Streich, Implementing a Distributed High-resolution Real-time Clock using the CAN-bus, Proceedings of the 1st International CAN Conference, Mainz, Germany.
[4] F. Hartwich, B. Müller, Th. Führer, R. Hugel, Robert Bosch GmbH, CAN network with Time Triggered Communication, Proceedings of the 7th International CAN Conference, Amsterdam, The Netherlands.
[5] ISO, Road vehicles - Controller area network (CAN) - Part 1: Controller area network data link layer and medium access control.
[6] J. Proenza and J. Miro-Julia, MajorCAN: a Modification to the Controller Area Network Protocol to achieve Atomic Broadcast, IEEE Int. Workshop on Group Communications and Computations, Taipei, Taiwan.
[7] L. Rodrígues, M. Guimarães and J. Rufino, Fault-tolerant Clock Synchronization in CAN, Proceedings of the 19th IEEE Real-time Systems Symposium, Madrid, Spain.
[8] J. Rufino, P. Verissimo, G. Arroz, C. Almeida, and L. Rodrígues, Fault-tolerant broadcasts in CAN, Digest of papers, The 28th IEEE International Symposium on Fault-Tolerant Computing, Munich, Germany.
[9] K. Tindell and A. Burns, Guaranteeing Message Latencies on Controller Area Network (CAN), Proceedings of the 1st International CAN Conference, Mainz, Germany.
[10] K. Turski, A global time system for CAN networks, Proceedings of the 1st International CAN Conference, Mainz, Germany, 1994.
