Fault-Tolerant Parallel and Distributed Systems


by
DIMITER R. AVRESKY, Department of Electrical and Computer Engineering, Boston University, Boston, MA
and
DAVID R. KAELI, Department of Electrical and Computer Engineering, Northeastern University, Boston, MA
Springer Science+Business Media, LLC

ISBN ISBN (ebook) DOI /
Library of Congress Cataloging-in-Publication Data
A C.I.P. Catalogue record for this book is available from the Library of Congress.
Copyright 1998 by Springer Science+Business Media New York
Originally published by Kluwer Academic Publishers in 1998
Softcover reprint of the hardcover 1st edition 1998
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, Kluwer Springer Science+Business Media, LLC.
Printed on acid-free paper.

Contents

Preface

Part I: Fault-Tolerant Protocols
1. Comparing Synchronous and Asynchronous Group Communication
   F. Cristian
2. Using Static Total Causal Ordering Protocols to Achieve Ordered View Synchrony
   K.-Y. Siu and M. Iyer
3. A Fail-Aware Datagram Service
   C. Fetzer and F. Cristian

Part II: Fault-Tolerant Distributed Systems
4. Portable Checkpointing for Heterogeneous Architectures
   V. Strumpen and B. Ramkumar
5. A Checkpointing-Recovery Scheme for Domino-Free Distributed Systems
   F. Quaglia, B. Ciciani, and R. Baldoni
6. Overview of a Fault-Tolerant System
   A. Pruscino
7. An Efficient Recoverable DSM on a Network of Workstations: Design and Implementation
   A.-M. Kermarrec and C. Morin
8. Fault-Tolerant Issues of Local Area MultiProcessors (LAMP) Storage Subsystem
   Q. Li, E. Hong, and A. Tsukerman
9. Fault-Tolerance Issues in RDBMS on SCI-Based Local Area MultiProcessor (LAMP)
   Q. Li, A. Tsukerman, and E. Hong

Part III: Dependable Systems
10. Distributed Safety-Critical Systems
    P.J. Perrone and B.W. Johnson
11. Dependability and Other Challenges in the Collision Between Computing and Telecommunication
    Y. Levendel
12. A Unified Approach for the Synthesis of Scalable and Testable Embedded Architectures
    P.B. Bhat, C. Aktouf, V.K. Prasanna, S. Gupta, and M.A. Breuer
13. A Fault-Robust SPMD Architecture for 3D-TV Image Processing
    A. Chiari, B. Ciciani, and M. Romero

Part IV: Fault-Tolerant Parallel Systems
14. A Parallel Algorithm for Embedding Complete Binary Trees in Faulty Hypercubes
    S.B. Choi and A.K. Somani
15. Fault-Tolerant Broadcasting in a K-ary N-cube
    B. Broeg and B. Bose
16. Fault Isolation and Diagnosis in Multiprocessor Systems with Point-to-Point Communication Links
    K. Chakrabarty, M.G. Karpovsky, and L.B. Levitin
17. An Efficient Hardware Fault-Tolerant Technique
    S.H. Hosseini, O.A. Abulnaja, and K. Vairavan
18. Reliability Evaluation of a Task Under a Hardware Fault-Tolerant Technique
    O.A. Abulnaja, S.H. Hosseini, and K. Vairavan
19. Fault Tolerance Measures for m-ary n-dimensional Hypercubes Based on Forbidden Faulty Sets
    J. Wu and G. Guo
20. Dynamic Fault Recovery for Wormhole-Routed Two-Dimensional Meshes
    D.R. Avresky and C.M. Cunningham
21. Fault-Tolerant Dynamic Task Scheduling Based on Dataflow Graphs
    E. Maehle and F.-J. Markus
22. A Novel Replication Technique for Implementing Fault-Tolerant Parallel Software
    A. Cherif, M. Suzuki, and T. Katayama
23. User-Transparent Checkpointing and Restart for Parallel Computers
    B. Bieker and E. Maehle

Index

Preface

The most important use of computing in the future will be in the context of the global "digital convergence" where everything becomes digital and everything is inter-networked. Applications will be dominated by storage, search, retrieval, analysis, exchange and updating of information in a wide variety of forms. Heavy demands will be placed on systems by many simultaneous requests. And, fundamentally, all this shall be delivered at much higher levels of dependability, integrity and security.

Increasingly, large parallel computing systems and networks are providing unique challenges to industry and academia in dependable computing, especially because of the higher failure rates intrinsic to these systems. The challenge in the last part of this decade is to build a system that is both inexpensive and highly available. A machine cluster built of commodity hardware parts, with each node running an OS instance and a set of applications extended to be fault resilient, can satisfy the new stringent high-availability requirements.

The focus of this book is to present recent techniques and methods for implementing fault-tolerant parallel and distributed computing systems.

Section I, Fault-Tolerant Protocols, considers basic techniques for achieving fault tolerance in communication protocols for distributed systems, including synchronous and asynchronous group communication, static total causal ordering protocols, and a fail-aware datagram service that supports communication by time.

The paper "Comparing Synchronous and Asynchronous Group Communication" presents a common framework for describing synchronous and asynchronous group communication services and compares the properties that each can provide to simplify replicated programming.
Group communication services, such as membership and atomic broadcast, simplify the maintenance of state replica consistency despite random communication delays, failures and recoveries. In distributed systems, high service availability can be achieved by letting a group of servers replicate the service state; if some servers fail, the surviving ones know the service state and can continue to provide the service.
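
The replication idea described above can be illustrated with a small sketch. It is only a toy model under stated assumptions: the names Replica, Group, broadcast, fail and demo are hypothetical and do not come from any paper in this book, the network is assumed to be perfectly reliable, and a single shared sequence counter stands in for a real atomic broadcast protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One server holding a copy of the service state (a key-value store here)."""
    name: str
    state: dict = field(default_factory=dict)
    next_seq: int = 0  # sequence number of the next update this replica expects

    def deliver(self, seq: int, key: str, value: int) -> None:
        # Apply updates strictly in sequence order so that all replicas agree.
        assert seq == self.next_seq, "updates must arrive in total order"
        self.state[key] = value
        self.next_seq += 1


class Group:
    """Toy stand-in for membership plus atomic broadcast (hypothetical API)."""

    def __init__(self, members):
        self.members = list(members)  # current view of the group
        self.seq = 0                  # global order assigned to updates

    def broadcast(self, key: str, value: int) -> None:
        # Every current member receives the same update with the same number.
        for replica in self.members:
            replica.deliver(self.seq, key, value)
        self.seq += 1

    def fail(self, name: str) -> None:
        # Membership change: drop the failed server from the view.
        self.members = [r for r in self.members if r.name != name]


def demo():
    group = Group([Replica("a"), Replica("b"), Replica("c")])
    group.broadcast("x", 1)
    group.broadcast("y", 2)
    group.fail("b")           # one server crashes
    group.broadcast("x", 3)   # survivors keep providing the service
    return [dict(r.state) for r in group.members]
```

Calling demo() returns the states of the two surviving replicas, which are identical because both applied the same updates in the same order. A real group communication service must additionally cope with message delays, lost messages and concurrent view changes, which this sketch deliberately ignores.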

The paper "Using Static Total Causal Ordering Protocols to Achieve Ordered View Synchrony" describes a view-synchronous totally ordered message delivery protocol for a dynamic asynchronous process group in an asynchronous communication environment. The protocol can handle asynchronous process or link failures and also the simultaneous joining of multiple groups of processes.

The paper "A Fail-Aware Datagram Service" presents a fail-aware datagram service that supports communication by time: it delivers all messages whose computed one-way transmission delays are smaller than a given bound as "fast" and all other messages as "slow". The fail-aware datagram service is the foundation of all other fail-aware services, such as fail-aware clock synchronization, fail-aware membership and fail-aware atomic broadcast.

In Section II, Fault-Tolerant Distributed Systems, we consider different methods and approaches for achieving fault tolerance in distributed systems, such as portable checkpointing for heterogeneous architectures, a checkpointing-recovery scheme ensuring domino-freeness, dependable cluster systems, a recoverable distributed shared memory (DSM) on a network of workstations (NOW), and a fault-tolerant scalable coherent interface (SCI)-based local area multiprocessor.

An approach that enables a failed computation to be recovered on a different processor architecture is shown in the paper "Portable Checkpointing for Heterogeneous Architectures". Sequential C programs are compiled into fault-tolerant C programs, whose checkpoints can be migrated across heterogeneous networks and restarted on binary-incompatible architectures.

The paper "A Checkpointing-Recovery Scheme for Domino-Free Distributed Systems" presents a checkpointing-recovery scheme for distributed systems. The proposed checkpointing algorithm ensures the progression of the recovery line, reducing the number of checkpoints in comparison to previous proposals.
The goal is achieved by introducing an equivalence relation between local checkpoints of a process and by exploiting the process' event history.

A hardware architecture based on a cluster of commodity parts and a set of software cluster services that help in the design, implementation and deployment of fault-resilient software is described in the paper "Overview of a Fault-Tolerant System". Depending on the use of these services and mechanisms, the system can reach different levels of fault tolerance and reliability characteristics.

Networks of Workstations (NOW) have become a convenient and less expensive alternative to parallel architectures for the execution of long-running parallel applications. The paper "An Efficient Recoverable DSM on a Network of Workstations: Design and Implementation" presents the realization and performance evaluation of ICARE - a recoverable DSM (RDSM) associated with a process checkpointing mechanism. ICARE tolerates a single permanent node failure transparently to parallel applications, which continue their execution on the remaining nodes. A prototype of ICARE is fully operational on an ATM network of workstations running the CHORUS micro-kernel.

In the paper "Fault-Tolerant Issues of Local Area Multiprocessor (LAMP) Storage Subsystem" three main fault tolerance issues of the LAMP storage subsystem are discussed: system configurability for fault tolerance and performance, fast error detection and recovery, and fast logical volume reconstruction. Local Area MultiProcessor (LAMP) is a network of workstations with a shared physical memory. It uses low-latency and high-bandwidth interconnections and provides remote DMA support. The interconnection network of LAMP is based on the Scalable Coherent Interface (SCI, IEEE Std 1596), which provides cache-coherent, physically shared memory for multiprocessors via its bus-like point-to-point connections with high bandwidth and low latency.

The paper "Fault-Tolerance Issues in RDBMS on SCI-based Local Area Multiprocessor (LAMP)" explores the issues related to the implementation of database systems on LAMP, particularly the fault tolerance issues.

In Section III, Dependable Systems, we consider general models and features of distributed safety-critical systems using commercial off-the-shelf (COTS) components, service dependability in telecomputing systems constructed with off-the-shelf components offering scalability and graceful degradation, a scalable and testable heterogeneous embedded architecture based on COTS for high-end signal processing applications, and a fault-tolerant SPMD hierarchical architecture for real-time processing of video signals.

An overview of the problems encountered by those designing safety-critical systems, along with the fundamentals, definitions and concepts employed in their design, is presented in the paper "Distributed Safety-Critical Systems". A taxonomy that classifies the design solution space for safety-critical systems is presented.

The paper "Dependability and Other Challenges in the Collision between Computing and Telecommunication" describes a distributed system composed of off-the-shelf components which can deliver advanced telecommunication services.
It is pointed out that the main difficulty in realizing services using this approach lies in the need to create a robust, dependable system. The resources and their servers are heterogeneous and may be distributed locally or globally in the network. This architecture offers scalability and congestion management, and poses the significant challenge of overall service dependability.

A new concept, that of scalable and testable embedded systems, is introduced in the paper "A Unified Approach for the Synthesis of Scalable and Testable Embedded Architectures". Parallel heterogeneous architectures based on COTS (Commercial Off-The-Shelf) components are becoming increasingly attractive as computing platforms for high-end signal processing applications such as radar and sonar. In comparison with traditional custom VLSI designs, these architectures offer advantages of flexibility, high performance, rapid design time, easy upgradability, and low cost. The paper describes a unified approach for the synthesis of scalable architectures based on COTS components. The approach is illustrated through a concrete example of a signal processing application.

A fault-tolerant SPMD hierarchical architecture for real-time processing of video signals is introduced in the paper "A Fault-Robust SPMD Architecture for 3D-TV Image Processing". Fault-tolerant characteristics are evaluated by comparing the images produced by the system with and without faults in the architecture.

Section IV, Fault-Tolerant Parallel Systems, considers embedding complete binary trees into a faulty hypercube interconnection architecture, single-node broadcasting in a faulty k-ary n-cube, a software-implemented system-level testing technique for multiprocessor systems with dedicated communication links, reliable execution of tasks and concurrent diagnosis of faulty processors and links, conditional connectivity for the m-ary n-dimensional hypercube, on-line recovery from intermittent and permanent faults within the links and nodes in two-dimensional meshes, fault tolerance in parallel computers based on checkpointing, self-diagnosis and rollback recovery, a functional and attribute-based language for programming fault-tolerant applications, and user-transparent backward error recovery for message-passing systems.

A scheme that can be used recursively in parallel to map a complete binary tree into a hypercube interconnection architecture with some faulty nodes is proposed in the paper "A Parallel Algorithm for Embedding Complete Binary Trees in Faulty Hypercubes". Two algorithms are described: one for a fault-free hypercube and the other for a faulty hypercube. It is shown that the scheme has a low time complexity compared to that of the existing algorithms.

The paper "Fault-Tolerant Broadcasting in a K-ary N-cube" describes an algorithm for one-to-all broadcasting in a k-ary n-cube. The algorithm, called the Partner Fault-Tolerant Algorithm, is nonredundant and fault-tolerant, and broadcasts correctly given n-1 or fewer faults. The time complexity of the algorithm is given.

The paper "Fault Isolation and Diagnosis in Multiprocessor Systems with Point-to-Point Communication Links" presents an approach which combines distributed system-level testing with processor self-test, and ensures fault-free operation by disconnecting all faulty processors and links from the system.
The placement of monitors has been determined for several multiprocessor topologies including trees, hypercubes and meshes.

In the paper "An Efficient Hardware Fault-Tolerant Technique" it is shown that, based on an efficient hardware fault-tolerant technique, the reliable execution of tasks and concurrent diagnosis of faults can be accomplished while processors and communication channels are subject to failure. The paper "Reliability Evaluation of a Task under a Hardware Fault-Tolerant Technique" presents an efficient technique by which each task's reliability is increased when processors and communication channels are subject to failure.

The concept of a forbidden set is exploited in the paper "Fault Tolerance Measures for m-ary n-dimensional Hypercubes Based on Forbidden Faulty Sets" to achieve fault tolerance in hypercubes. In general, there are many ways to define a forbidden (feasible) faulty set depending on the topology of the system, application environment, statistical analysis of faulty patterns, and distribution of fault-free nodes.

An algorithm for detecting and compensating for intermittent and permanent faults within the links and nodes of parallel computers having an NxN two-dimensional mesh interconnection topology is described in the paper "Dynamic Fault Recovery for Wormhole-Routed Two-Dimensional Meshes".

A fully distributed algorithm for fault-tolerant scheduling is given in the paper "Fault-Tolerant Dynamic Task Scheduling Based on Dataflow Graphs". The main advantage of this algorithm is that fail-soft behavior (graceful degradation) is achieved in a user-transparent way. Another important aspect of this approach is that it is applicable to a wide variety of target machines, including message-passing architectures, workstation clusters or even shared memory machines.

A replication technique based on the FTAG computation model, and different novel mechanisms for recovery in case of failures, are presented in the paper "A Novel Replication Technique for Implementing Fault-Tolerant Parallel Software". FTAG is a functional and attribute-based language for programming fault-tolerant parallel applications.

User-transparent backward error recovery for message-passing systems is presented in the paper "User-Transparent Checkpointing and Restart for Parallel Computers".

This book contains selected and revised articles from the IEEE Fault-Tolerant Parallel and Distributed Systems (FTPDS'98) workshops held in Honolulu, Hawaii, 1996 and Geneva, Switzerland. In addition, several authors have been invited to submit papers. The selection process of the papers was greatly facilitated by the steadfast work of the program committee members and the reviewers, for which we are most grateful. We would like to extend a special thanks to the members of the Network Computing Laboratory, Department of Electrical and Computer Engineering at Boston University for their help.
