
Data Consistency with SPLICE Middleware
Leslie Madden and Chad Offenbacker
Naval Surface Warfare Center, Dahlgren Division
6/30/2005

Background
- US Navy Combat Systems are evolving toward distributed systems and open standards, to:
  - Leverage commodity technology
  - Enable technology refresh while reducing cost
  - More quickly add new warfighting capability and capacity to meet threats
- Combat systems need to distribute data across applications while maintaining a degree of data consistency sufficient to:
  - Meet fault tolerance requirements
  - Meet failure recovery requirements
  - Support correct execution of distributed, real-time applications
- Historically, US Navy Programs of Record (PORs) have developed non-standards-based, custom solutions for maintaining distributed state data.
  - Full visibility into the solution gives developers a complete understanding of its consistency, fault tolerance, and failure recovery characteristics.
- Various US Navy and DoD thrusts encourage the use of standards-based solutions and COTS or open source products where feasible.
  - The DDS Persistence Profile potentially provides a standards-based solution to data consistency across distributed applications.

US Navy Surface Domain OA Computing Environment (OACE)
[Diagram: layered OACE architecture stack]
- Warfighting Applications & Common Services (domain unique; developer choice)
- Middleware (OMG standards)
- Operating Systems (IEEE POSIX standards)
- Mainstream COTS Processors
- Cable Plant & Layer 3 Switched/Routed LAN (TIA, IETF standards)
- Cross-cutting elements: Resource Management; common services such as Time, Nav, and DX/DR Distribution; Adaptation Frameworks
Key point: standards and middleware isolate applications from technology change.

DDS and Data Consistency
- The DDS specification provides a durability QoS to control the middleware's retention of distributed data for retransmission to newly created data readers:
  - Volatile: the service does not retain any data on behalf of the data writer; only existing data readers will receive data.
  - Transient-Local: the service will transmit data to new subscribers only if the original data writer is still alive. *
  - Transient: the service will retain and transmit data to new subscribers, even if the original data writer is no longer alive, as long as the service itself has not terminated. *
  - Persistent: data is kept on permanent media so it may outlive the service.
- Transient durability, combined with reliable delivery and careful use of key fields, appears to provide a standards-based mechanism for maintaining data consistency under application failure and restart conditions (see the sketch below).
* In-memory maintenance of data based on the history and resource-limits QoS
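As a concrete illustration of the durability and reliability settings discussed above, here is a minimal sketch using the classic OMG DDS C++ API. The helper name create_state_writer, the already-created publisher and topic arguments, and the header name are assumptions for illustration; exact identifiers vary slightly across DDS implementations.

```cpp
#include "ccpp_dds_dcps.h"   // classic DDS C++ API header; exact name varies by vendor

// Minimal sketch: create a DataWriter whose samples are retained by the
// middleware (transient durability) and delivered reliably, so a reader
// that restarts can be brought back up to date even if the writer is gone.
DDS::DataWriter_var create_state_writer(DDS::Publisher_ptr publisher,
                                        DDS::Topic_ptr     topic)
{
  DDS::DataWriterQos qos;
  publisher->get_default_datawriter_qos(qos);

  qos.durability.kind  = DDS::TRANSIENT_DURABILITY_QOS;   // "State QoS" in these tests
  qos.reliability.kind = DDS::RELIABLE_RELIABILITY_QOS;   // guaranteed delivery

  return publisher->create_datawriter(topic, qos,
                                      NULL,                 // no listener
                                      DDS::STATUS_MASK_NONE);
}
```

Substituting DDS::VOLATILE_DURABILITY_QOS for the durability kind yields the "Default QoS" configuration used in the test setup described later.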

Benchmark Objectives
- Develop a mechanism to empirically evaluate how consistently distributed data is maintained by DDS implementations
  - Source code analysis may not be feasible: commercial vendors may not be willing to provide source code, or may provide it only at high cost, and the cost of the analysis itself may be prohibitive.
  - Understanding the circumstances under which data consistency is and is not maintained will allow engineering of a system that meets fault tolerance and correctness requirements.
- Gain an understanding of the DDS Persistence Profile and how it might support US Navy Combat System data consistency needs
  - Need to understand differences between the DDS specification and the SPLICE implementation.
  - Characterize the performance and behavior of the SPLICE implementation.

Evaluation Goals
- Assess behavioral characteristics critical to highly reliable systems
  - Fault tolerance
  - Data consistency
- Determine performance and resource utilization characteristics: are they sufficient for all, or some subset, of Surface Navy Combat System applications?
  - Latency
  - Performance scalability
  - CPU and memory utilization
- Assess appropriate architecture patterns and usage strategies based on these characteristics; there are significant differences between SPLICE and other publish/subscribe middleware products
  - SPLICE provides an in-memory data store that is refreshed by published data
  - SPLICE provides a query-like mechanism for accessing data and joining stored data items, supported by the DDS Content-Subscription profile but not currently implemented by other products (see the sketch below)
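For reference, the query-style access standardized by the Content-Subscription profile can be exercised through a DDS QueryCondition, sketched below with the classic C++ API. The TrackState type, its track_id field, and the already-created typed reader are hypothetical; SPLICE's native query and join interface may differ in detail from this spec-level mechanism.

```cpp
// Hypothetical sketch: read only samples matching an SQL-like expression via
// a QueryCondition (DDS Content-Subscription profile). 'reader' is an
// already-created TrackStateDataReader; TrackState and track_id are made up.
DDS::StringSeq no_params;   // no query parameters used in this expression

DDS::QueryCondition_var query =
    reader->create_querycondition(DDS::ANY_SAMPLE_STATE,
                                  DDS::ANY_VIEW_STATE,
                                  DDS::ALIVE_INSTANCE_STATE,
                                  "track_id < 100",
                                  no_params);

TrackStateSeq      samples;
DDS::SampleInfoSeq infos;
if (reader->read_w_condition(samples, infos, DDS::LENGTH_UNLIMITED, query)
        == DDS::RETCODE_OK) {
  // ... inspect samples[i] alongside infos[i] ...
  reader->return_loan(samples, infos);
}
reader->delete_readcondition(query);
```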

SPLICE Architecture
[Diagram: two hosts connected over a network, each running a SPLICE daemon with a local Pub/Sub Service and data store. Applications on each host issue queries through the Publish/Subscribe API to the local service and its data store; the SPLICE daemons exchange data across the network.]

Challenges
- A test harness is required to capture externally observable behavior. The resulting data must be analyzable to draw conclusions about the degree of data consistency provided by the middleware.
- The test suite must be configurable to simulate a variety of system characteristics, e.g.:
  - Different numbers of data producers and consumers.
  - Different configurations of replicated data producers and consumers.
  - Differences in data processing (algorithms) that may affect consistency in replicated consumers, e.g. storing a value vs. maintaining a sum of values received.
  - The ability to inject different types of faults into data producers and consumers.
- The health and status of each test application must be monitored continuously so the time at which a fault occurs can be accurately determined.
- Every change to state data for each data producer and consumer must be monitored continuously (a hypothetical instrumentation record is sketched below).
- Because data inconsistency may occur infrequently, a mechanism is required to analyze large quantities of data from long periods of test execution.
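One way to meet the continuous health and state monitoring requirements is to have every producer and consumer emit a small instrumentation record on each state change (and as a periodic heartbeat) that the data collector logs for off-line analysis. The record below is purely hypothetical; its fields and names are assumptions, not part of the original harness description.

```cpp
#include <cstdint>
#include <string>

// Hypothetical instrumentation record emitted by each test process. The
// off-line analyzer correlates records by key and timestamp to detect
// divergence between replicas and to pinpoint when injected faults occur.
struct InstrumentationRecord {
  std::string   process_id;   // which publisher / subscriber replica
  std::string   host_name;    // node the process is running on
  std::uint64_t issue_number; // sequence number of the last issue processed
  std::int64_t  state_value;  // current state value held by the process
  std::uint64_t timestamp_ns; // local clock at the time of the state change
  bool          alive;        // heartbeat flag used by the status monitors
};
```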

Fault Tolerance Experiment
[Diagram: a State A Publisher (Host 1) and a State B Publisher (Host 2) publish to Subscriber Replicas 1-3 (Hosts 3-5). Each replica has a local subscriber, a status monitor, and a resultant state. Instrumentation from all hosts flows to a Data Collector and a Fault Injector, followed by off-line data analysis.]

Test Configuration
[Diagram: a Publisher on Host 1 publishes to Subscriber Replica 1 (Host 2) and Subscriber Replica 2 (Host 3); a replica can be restarted on Host 4, and faults are injected into the replicas.]
- The publisher increments a message state value with each issue.
- Replicas compute a new replica state with each issue, using the message state and the current replica state.
- Each replica publishes and reads its own state with each issue received (see the sketch below).
- Default QoS: Durability = Volatile; Reliability = Guaranteed
- State QoS: Durability = Transient; Reliability = Guaranteed
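To make the replica behavior concrete, here is a minimal sketch of the per-issue update each subscriber replica might perform. The running-sum update function and the type and field names are assumptions for illustration; the slides do not specify the exact computation.

```cpp
#include <cstdint>

// Hypothetical per-replica state update, applied once per received issue.
// The publisher increments 'message_state' on every issue; the replica folds
// it into its own state (here, a running sum) and then publishes the new
// replica state on the state topic so that peers and restarted replicas can
// observe it.
struct ReplicaState {
  std::uint64_t last_issue;    // sequence number of the last issue applied
  std::int64_t  replica_state; // accumulated state value
};

ReplicaState apply_issue(const ReplicaState& current,
                         std::uint64_t issue_number,
                         std::int64_t  message_state)
{
  ReplicaState next;
  next.last_issue    = issue_number;
  next.replica_state = current.replica_state + message_state;  // running sum
  return next;
}
```

If every replica starts from the same initial state and applies the same issues in the same order, their state values should track one another; gaps or divergence in the collected values are exactly the inconsistencies the benchmark looks for.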

SPLICE Data Consistency During Periodic Shutdown and Restart of Same-Node Subscriber Processes
[Graphs: state value over time for State QoS and Default QoS subscribers across all nodes.]
- Both State QoS and Default QoS subscriber processes execute concurrently on each node.
- Subscriber processes on each node shut down and restart at different rates, or not at all.
- Despite frequent shutdowns and restarts, the State QoS subscriber processes appear to maintain close state-value consistency across separate nodes over time.

SPLICE Data Consistency with Periodic Shutdown and Restart of Same-Node Subscriber Processes (continued)
[Graphs: zoomed views of state values at the beginning, middle, and end of the test run described above.]
- Counter-clockwise inspection of the graphs gives a much closer view of state-value consistency at the beginning, middle, and end of the aforementioned test run.
- The rate of change of each State QoS subscriber's state value remains consistent across the SPLICE nodes.

SPLICE Data Consistency During Shutdown of a Subscriber Node and Start of Its Alternate-Node Replica (Cold Start)
[Graphs: state values before and after the replica node cold start; the lower-right graph zooms in on the replica start-up.]
- The test required a cold start of the SPLICE daemon on the replica node before initiating the replica subscriber process (the daemon was started AFTER the original node was shut down).
- Human factors and the architecture contributed to a delayed subscriber replica start-up.
- The lower-right graph clearly shows the ability of the State QoS subscriber replica to quickly reach its appropriate state value upon start-up.

SPLICE Data Consistency During Shutdown of a Subscriber Node and Start of Its Alternate-Node Replica (Warm Start)
[Graphs: state values before and after the replica node warm start; the lower-right graph zooms in on the replica start-up.]
- The test required a warm start of the SPLICE daemon on the replica node before initiating the replica subscriber process (the daemon was started BEFORE the original node was shut down).
- Human factors and the architecture contributed to a delayed replica start-up time.
- The lower-right graph clearly shows the ability of the State QoS subscriber replica to quickly reach its appropriate state value upon start-up.

Preliminary Results
- State QoS (Transient durability, Guaranteed reliability)
  - SPLICE maintains consistency of state through multiple subscriber replica process faults and restarts, even when a replica is restarted on a previously unused host.
  - Issues transmitted prior to a subscriber replica restart may be lost if a subsequent issue with the same key field is transmitted in the interim; otherwise, all issues will be received (see the sketch below).
- Default QoS (Volatile durability, Guaranteed reliability)
  - SPLICE maintains consistency of state through multiple subscriber replica process faults and restarts when a replica is immediately restarted on the same host, but not when it is started on a previously unused host.
  - Issues transmitted prior to a subscriber replica restart may not be received by the restarted replica.
SPLICE Data Persistence Behaves As Advertised
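A plausible explanation for the lost intermediate issues, assuming the specification's default history settings: under transient durability the durability service retains samples per key instance according to a history policy, and with the default KEEP_LAST / depth 1 only the newest sample per key is kept for delivery to a restarted replica. The sketch below shows the relevant topic QoS in the classic DDS C++ API; the topic and type names are hypothetical, and whether SPLICE used exactly these defaults during the tests is an assumption.

```cpp
// Topic QoS controlling what the transient durability service retains for
// late-joining or restarted readers (classic OMG DDS C++ API).
DDS::TopicQos topic_qos;
participant->get_default_topic_qos(topic_qos);

topic_qos.durability.kind = DDS::TRANSIENT_DURABILITY_QOS;

// With KEEP_LAST and a depth of 1 (the spec defaults), only the newest sample
// per key instance survives. A replica restarting after issues N and N+1 were
// written for the same key is brought up to date with N+1 only; issue N is not
// redelivered. A larger history_depth (or KEEP_ALL) retains more samples.
topic_qos.durability_service.history_kind  = DDS::KEEP_LAST_HISTORY_QOS;
topic_qos.durability_service.history_depth = 1;

DDS::Topic_var topic =
    participant->create_topic("ReplicaState",      // topic name (hypothetical)
                              "ReplicaStateType",  // registered type name (hypothetical)
                              topic_qos,
                              NULL, DDS::STATUS_MASK_NONE);
```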

Next Steps
- Collect and analyze data from longer-duration executions
- Collect and analyze data from different types of faults (e.g. host faults, network faults)
- Analyze performance data for the ability to meet combat system fail-over requirements
- Assess other DDS implementations of the persistence profile as they become available
- Develop prototype applications to assess the applicability of the technology in a multi-middleware-technology, heterogeneous, system-scale environment