Introduction to the Service Availability Forum

Similar documents
Using OpenSAF for carrier grade High Availability

OpenSAF More than HA. Jonas Arndt. HP - Telecom Architect OpenSAF - TCC

ATCA, HPI, AIS open specifications for HA applications. Artem Kazakov SOKENDAI/KEK TIPP09

Evaluation of the SAF AIS implementations

A Framework for Testing AIS Implementations

高可用集群的开源解决方案 的发展与实现. 北京酷锐达信息技术有限公司 技术总监史应生

FAULT TOLERANT SYSTEMS

Fault tolerance and Reliability

GoAhead Software NDIA Systems Engineering 2010

Managing High-Availability and Elasticity in a Cluster. Environment

High Availability with the openais project. Prepared by: Steven Dake October 2005

Introduction to Service Availability Forum

Developing Microsoft Azure Solutions (70-532) Syllabus

Mission Critical Linux

Maximum Availability Architecture (MAA): Oracle E-Business Suite Release 12

Dependability tree 1

Experience the GRID Today with Oracle9i RAC

Fault Tolerance. The Three universe model

COMPARING ROBUSTNESS OF AIS-BASED MIDDLEWARE IMPLEMENTATIONS

Service Availability TM Forum Service Availability Interface

The openais project. Technomap. Prepared by Steven Dake January 2005

The Service Availability Forum Platform Interface

Virtualization and High-Availability

Utilization of Open-Source High Availability Middleware in Next Generation Telecom Services

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs

A SKY Computers White Paper

High Availability and Redundant Operation

Dep. Systems Requirements

CprE 458/558: Real-Time Systems. Lecture 17 Fault-tolerant design techniques

Quobyte The Data Center File System QUOBYTE INC.

Scalable and Flexible Software Platforms for High-Performance ECUs. Christoph Dietachmayr Sr. Engineering Manager, Elektrobit November 8, 2018

Automated Configuration Design and Analysis for Service. High-Availability

Middleware for Embedded Adaptive Dependability (MEAD)

An Empirical Study of High Availability in Stream Processing Systems

Tiered Storage and Caching Decisions

OL Connect Backup licenses

AN ADAPTIVE ALGORITHM FOR TOLERATING VALUE FAULTS AND CRASH FAILURES 1

Middleware and Distributed Systems. Fault Tolerance. Peter Tröger

Chapter 5: Distributed Systems: Fault Tolerance. Fall 2013 Jussi Kangasharju

CSCI 4717 Computer Architecture

Streaming data Model is opposite Queries are usually fixed and data are flows through the system.

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf

Intelligent Middleware. Smart Embedded Management Agent. Cloud. Remote Management and Analytics. July 2014 Markus Grebing Product Manager

Issues in Programming Language Design for Embedded RT Systems

Replication architecture

Swiss IT Pro SQL Server 2005 High Availability Options Agenda: - Availability Options/Comparison - High Availability Demo 08 August :45-20:00

Fault Tolerance for Highly Available Internet Services: Concept, Approaches, and Issues

Developing Microsoft Azure Solutions (70-532) Syllabus

Process Historian Administration SIMATIC. Process Historian V8.0 Update 1 Process Historian Administration. Basics 1. Hardware configuration 2

Aurora, RDS, or On-Prem, Which is right for you

Distributed Embedded Systems and realtime networks

Aerospace Software Engineering

Open Communication Architecture Forum

BERLIN. 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

A Model Driven Approach for the Generation of Configurations for Highly-Available Software Systems

High Availability through Warm-Standby Support in Sybase Replication Server A Whitepaper from Sybase, Inc.

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

MOVING FORWARD WITH FABRIC INTERFACES

High Availability Distributed (Micro-)services. Clemens Vasters Microsoft

Today: Distributed Middleware. Middleware

Skybot Scheduler Release Notes

Data Consistency with SPLICE Middleware. Leslie Madden Chad Offenbacker Naval Surface Warfare Center Dahlgren Division

ebay s Architectural Principles

MultiChipSat: an Innovative Spacecraft Bus Architecture. Alvar Saenz-Otero

Datacenter replication solution with quasardb

CDA 5140 Software Fault-tolerance. - however, reliability of the overall system is actually a product of the hardware, software, and human reliability

Amazon Aurora Relational databases reimagined.

HA solution with PXC-5.7 with ProxySQL. Ramesh Sivaraman Krunal Bauskar

Are AGs A Good Fit For Your Database? Doug Purnell

Clustering In A SAN For High Availability

Abstract /10/$26.00 c 2010 IEEE

GlassFish High Availability Overview

Disaster Recovery Solution Achieved by EXPRESSCLUSTER

Designing Fault-Tolerant Applications

HP NonStop Database Solution

Introduction to Database Services

Your Challenges, Our Solutions. The NAS and/or laptop storage are not enough; how can I upgrade once and expand all?

ebay Marketplace Architecture

Experience in Developing Model- Integrated Tools and Technologies for Large-Scale Fault Tolerant Real-Time Embedded Systems

Veeam Endpoint Backup

Storage. Hwansoo Han

Discretized Streams. An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters

Lowering Cost per Bit With 40G ATCA

The Walking Dead Michael Nitschinger

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.

Basic vs. Reliable Multicast

Distributed Systems 24. Fault Tolerance

TANDBERG Management Suite - Redundancy Configuration and Overview

Cycle Sharing Systems

GFS: The Google File System. Dr. Yingwu Zhu

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013

Developing Microsoft Azure Solutions (70-532) Syllabus

02 - Distributed Systems

Introduction to Software Fault Tolerance Techniques and Implementation. Presented By : Hoda Banki

Assignment 5. Georgia Koloniari

#techsummitch

02 - Distributed Systems

Providing Real-Time and Fault Tolerance for CORBA Applications

Taurus Super-S LCM. Dual-Bay RAID Storage Enclosure for two 3.5 Serial ATA Hard Drives. User Manual July 27, v1.2

Transcription:

. Introduction to the Service Availability Forum

Contents Introduction Quick AIS Specification overview AIS Dependability services AIS Communication services Programming model DEMO

Design of dependable services Two parts of functionality Business logic Implements the service Common functionality Communication, management, fault tolerance This is where SA Forum AIS comes into picture 5 Copyright 2004 Service Availability Forum, Inc. - Other names and brands are properties of their respective owners.

Construction of a service Define the functionality Plan the architecture Components, roles Integrate fault management Service state monitoring Management Plan recovery Communication between components 6 Copyright 2006 Service Availability Forum, Inc

Construction of a service Define the functionality Plan the architecture (AMF) Components, roles Integrate fault management (AMF) Service state monitoring Management Plan recovery (CKPT) Communication between components (MSG, EVT) 7 Copyright 2006 Service Availability Forum, Inc

The Transition COTS adoption is accelerating transition from vertical to horizontal industry model Vertically integrated platform by individual vendors Vendor A Vendor D Platform enabled by The COTS eco-system Applications Applications Applications Vendor D Proprietary Middleware Proprietary Middleware Integrated COTS Middleware Vendor C Proprietary Operating System Proprietary Operating System Carrier-Grade Operating System Vendor B Proprietary Hardware Proprietary Hardware Standard Hardware Vendor A Proprietary Platform Proprietary Platform COTS-Based Platform 8 Copyright 2007 Service Availability Forum, Inc. - Other names and brands are properties of their respective owners.

SMI (SA Forum) SA Forum Standard Interface Specifications Application Interface Specification (AIS) Hardware Platform Interface (HPI) Highly Available Applications AIS (SA Forum) Service Availability Middleware Other Middleware and Application Services Systems Management Interfaces (SMI) HPI (SA Forum) Carrier Grade Operating System Hardware Platform Communications Fabric 9 Copyright 2007 Service Availability Forum, Inc. - Other names and brands are properties of their respective owners.

SA Forum Members 10 Copyright 2007 Service Availability Forum, Inc. - Other names and brands are properties of their respective owners.

THE AIS SPECIFICATION Using SA Forum s specifications to build highly available services 11 Copyright 2004 Service Availability Forum, Inc. - Other names and brands are properties of their respective owners.

Application Interface Specification The Application Interface Specification (AIS) is a set of open standard interface specifications The AIS specifies the Application Programming Interface (API) for HA middleware services service entities and their behavior (life cycle, administrative operations, functionality) Example for specification/implementation HW interface: SATA HW implementation: manufactured mainboard HW user: the HDD (Samsung, Western Digital, Seagate, Hitachi ) 12 Copyright 2006 Service Availability Forum, Inc

Application Interface Specification Web server, Music broadcast CLM AMF IMM NTF LOG CKPT MSG EVT LCK Management HA Applications Other Middleware and Application Services SA Forum AIS APIs, HPI Middleware AIS Middleware Carrier Grade Operating System Boards, fans, power, etc. Managed Hardware Platform 13 Copyright 2006 Service Availability Forum, Inc

Application Interface Specification The AIS is divided into the following parts or areas Administration Software Management Framework Information Model Management Service Cluster Membership Service Notification Service Dependability Availability Management Framework Checkpoint Service Communication Event Service Message Service Miscellaneous Lock Service Log Service 14 Copyright 2004 Service Availability Forum, Inc. - Other names and brands are properties of their respective owners.

Example service - Message Queues Node U Node V Node W Process A MSG library Process B MSG library Process C MSG library samsgmessagesend() samsgmessagesend() Message Service Message Service Message Service Message m1 priority 0 sent to queue Q Message m2 priority 1 sent to queue Q Getting a message from a queue removes it from the queue Node Y Queue Q priority area 0 priority area 1 priority area 2 priority area 3 Process E MSG library samsgmessageget() samsgmessageget() Message Service Copyright 2006 Service Availability Forum, Inc 16

THE AIS SPECIFICATION Dependability Services 17 Copyright 2004 Service Availability Forum, Inc. - Other names and brands are properties of their respective owners.

SA Forum s approach to making applications HA Availability Management Framework principles Integrate best practices into the middleware, Provide means for the software developer to influence behavior Let the middleware control the application (like a Marionette)

SA Forum s approach to making applications HA Fault prevention Reduce the probability of system failure to an acceptably low value Fault tolerance Provide service in spite of faults Fault removal Diagnostics, monitoring, repair at development and runtime Fault forecasting / prediction Estimating failures and their effects E.g. software aging

Server Clustering Notions node link N1 N2 N4 cluster partition Functionalities of cluster middleware Error detection N3 N5 Error handling Notification Reconfiguration

Server Clustering 2. Categories of clusters HA clusters Improve the availability of services Load-balancing clusters Share the workload among the nodes High-Performance Clusters (HPC) Scientific computing Grid computing Many independent jobs

HW and SW based Fault Tolerance HW Redundant power supply SW Process replicas, failover, switchover Clusters hybrid solutions Process replicas on different nodes

AMF System Model Example Node U Node V Node W Node X Service Unit S1 Component C1 Component C2 Service Group SG1 Service Unit S2 Component C3 Component C4 Service Service Unit S3 Group SG2 Service Unit S4 Service Unit S5 Component C5 Component C6 Component C7 Service group SG1 supports a single service instance A, and service group SG2 supports two service instances B and D On behalf of A, service unit S1 is assigned the active HA state and service unit S2 is assigned the standby HA state Service unit S1 contains two components C1 and C2, and service unit S2 contains two components C3 and C4 Active Standby Active Standby Active Standby Service Instance A Service Instance B Service Instance D Similarly, for service group SG2 24 Copyright 2006 Service Availability Forum, Inc

The SA Forum Information Model

Fault Management Flow Chart The error is detected An alternate form of fault detection, including built-in diagnosis Hypothetical cause of error is located The defective component is removed from service The system is reconfigured or restarted to function properly A faulty system component is replaced

Error detection methods Mechanisms Interface check (e.g. illegal instruction, illegal parameter, insufficient access rights) Ad-hoc methods Acceptability testing Error checking codes Timing checks (watchdog) Diagnostic analysis (in idle time) Substitute back (integrate derive)

AMF error detection mechanisms Passive monitoring Using OS functionality Currently only crash of a process External active monitoring External entity is used to monitor Internal active monitoring Healthchecks Types Pull: AMF invoked Push: Component invoked Needs the change of the component

Planning recovery - checkpointing Define the variables to be saved State variables Normal operation Open checkpoint Save variables regularly On failure Read checkpoint Continue normal operation 29 Copyright 2004 Service Availability Forum, Inc. - Other names and brands are properties of their respective owners.

Checkpoint Service Example of checkpointing, and restoration from a checkpoint after a fault, for a collocated checkpoint Node U Service Unit S Active Component Service Group Node U Service Unit S Standby Active Component open write synchronize Active replica open read Standby Active replica Checkpoint Service Checkpoint C Section abc Section xyz Checkpoint C Section abc Section xyz 30 Copyright 2006 Service Availability Forum, Inc

Methods of recovery Goal: Recovery from errors before failures. Trivial methods Retry operation Restart the application (cold, warm) Restart the system Techniques Forward recovery Backward recovery Compensation

Service continuity Goal: maintain the service continuity Transient Permanent Type of error Related actions Ignore (recovery handles them) Reconfiguration Failover Switchover Fail-back Graceful degradation

Actions Failover The service is removed from the failed entity and assigned to a healthy entity Switchover The service is removed from a healthy entity to another healthy entity Fail-back The failed entity is repaired and service is assigned to it again Graceful degradation Service is carried on in a degraded state, probably with degraded functionality

System level error handling patterns Fail-silent If failed then no messages are sent out until repaired E.g. in distributed systems Fail-stop Stop operation on failure Normal Failing Silent E.g. train traffic control systems (turn all lights red if failure happens) Francisco V. Brasileiro et. al. Implementing Fail-Silent Nodes for Distributed Systems (1996 )

Requirements in today s systems Cost efficiency Flexibility Dependable services Highly Available Reliable Availability Outage duration per year 0.99999 ~5 mins 0.9999 ~52 mins 0.999 ~8 hrs 0.99 ~3 days

Dependability and performance vs. costs cost

FT vs. Cost International Space Station Advanced techniques Cost Cost of construction Algorithms Desktop Computer HW / SW redundancy Degree of fault tolerance

THE AIS SPECIFICATION Communication Services 42 Copyright 2004 Service Availability Forum, Inc. - Other names and brands are properties of their respective owners.

Application Interface Specification Web server, Music broadcast CLM AMF IMM NTF LOG CKPT MSG EVT LCK Management HA Applications Other Middleware and Application Services SA Forum AIS APIs, HPI Middleware AIS Middleware Carrier Grade Operating System Boards, fans, power, etc. Managed Hardware Platform 43 Copyright 2006 Service Availability Forum, Inc

Communication Message (MSG) Recommended for many to one scheme Example: collect sensor data Event (EVT) Many to Many (publish subscribe) scheme Example: Sentinels and villagers 44 Copyright 2004 Service Availability Forum, Inc. - Other names and brands are properties of their respective owners.

Message Service Message Service, Message Service (MSG) Library, and Message Queue Node U Node V Node W Process A MSG library Process B MSG library Process C MSG library samsgmessagesend() samsgmessagesend() Message Service Message Service Message Service Message m1 priority 0 sent to queue Q Message m2 priority 1 sent to queue Q Node Y Process E MSG library Getting a message from a queue removes it from the queue Message Queue Q priority area 0 priority area 1 priority area 2 priority area 3 samsgmessageget() samsgmessageget() Message Service Copyright 2006 Service Availability Forum, Inc 45

Event Service Example of the Event Service, with publishers and subscribers on two nodes Publisher Subscriber EVT library Node U Publisher EVT library Subscriber EVT library Subscriber EVT library Node V Publisher EVT library Subscriber EVT library EVT library Filter Filter Filter Filter Event Service Event Channel X Event Channel Y 46 Copyright 2006 Service Availability Forum, Inc

THE AIS SPECIFICATION Programming model 47 Copyright 2004 Service Availability Forum, Inc. - Other names and brands are properties of their respective owners.

Programming model Library life cycle Initialize, finalize, dispatch events Service specific APIs Callbacks Request functions 48 Copyright 2004 Service Availability Forum, Inc. - Other names and brands are properties of their respective owners.

Programming model - AMF Library life cycle saamfinitialize(), saamffinalize(), saamfdispatch() Service specific APIs Callbacks saamfcsisetcallbackt() Request functions saamfhastateget() 49 Copyright 2004 Service Availability Forum, Inc. - Other names and brands are properties of their respective owners.

Programming model how to use it? Initialize service handler saamfinitialize() Set callbacks Get selection object saamfselectionobjectget() Start main loop do { select(amfselectionobject, ) saamfdispatch() } while (1) 50 Copyright 2004 Service Availability Forum, Inc. - Other names and brands are properties of their respective owners.

Demo Highly Available / Fault Tolerant Music Broadcast Application

Architecture Internet Broadcast Server Icecast Input buffer Source Server 2 Source Server 1 Music stream source application Music stream source application Clients The source application is made HA

Online radio streaming system

AMF configuration Debian 1 Music Source App MSource SG Debian 2 Tolerated failures Component Node One communication channel MSource SU 1 MSource SU 2 Mcomp 1 MSource SI 1 Mcomp 2 MCSI

Where should I continue? MP3 streamer Middleware OS MP3 streamer Middleware OS Position STATE Active Faulty Standby Active

Acknowledgements András Kövi Zoltán Micskei István Majzik András Pataricza

References Service Availability Forum http://saforum.org Application Interface Specification http://www.saforum.org/specification/ais_information/ Education Material http://www.saforum.org/education/ Fault Tolerance Research Group (BME MIT) http://www.inf.mit.bme.hu/ftsrg/