Tardigrade: Leveraging Lightweight Virtual Machines to Easily and Efficiently Construct Fault-Tolerant Services

Size: px
Start display at page:

Download "Tardigrade: Leveraging Lightweight Virtual Machines to Easily and Efficiently Construct Fault-Tolerant Services"

Transcription

1 Tardigrade: Leveraging Lightweight Virtual Machines to Easily and Efficiently Construct Fault-Tolerant Services Jacob R. Lorch Andrew Baumann Lisa Glendenning Dutch T. Meyer Andrew Warfield

2 Our goal: Turn existing binaries into faulttolerant services. Jay Lorch, Microsoft Research Tardigrade 2

3 Example: FDS Metadata Service FDS Metadata server FDS Cluster [Nightingale et al., OSDI 2012] Jay Lorch, Microsoft Research Tardigrade 3

4 Example: FDS Metadata Service FDS Metadata server Paxos leader election FDS Cluster [Nightingale et al., OSDI 2012] Jay Lorch, Microsoft Research Tardigrade 4

5 Techniques for making code fault-tolerant have limitations Use state machine replication library Explicitly persist state to reliable back-end Better: Transparently make the binary Requires development resources fault-tolerant Potential for oversight Non-determinism Failing to persist state Exposing non-persisted data Bugs in crash recovery Jay Lorch, Microsoft Research Tardigrade 5

6 Outline Motivation Background: Asynchronous VM replication Our solution: Lightweight VM replication Challenges and solutions Evaluation Jay Lorch, Microsoft Research Tardigrade 6

7 Outline Motivation Background: Asynchronous VM replication Our solution: Lightweight VM replication Challenges and solutions Evaluation Jay Lorch, Microsoft Research Tardigrade 7

8 Asynchronous virtual machine replication - Remus [Cully et al., NSDI 2008] Δ primary backup Primary can crash at any time; backup is always a bit behind. Jay Lorch, Microsoft Research Tardigrade 8

9 Asynchronous virtual machine replication - Remus [Cully et al., NSDI 2008] primary backup Output buffer Jay Lorch, Microsoft Research Tardigrade 9

10 Asynchronous virtual machine replication - Remus [Cully et al., NSDI 2008] Ack(Δ) primary backup Output buffer Jay Lorch, Microsoft Research Tardigrade 10

11 Latency of ping (ms) High VM activity can delay packets Baseline Safety Scan Search Indexer Update Deduplication Processes unrelated to the service 10 can balloon client-perceived latency. 1 50th quantile 95th quantile 99th quantile 99.9th quantile Jay Lorch, Microsoft Research Tardigrade 11

12 Outline Motivation Background: Asynchronous VM replication Our solution: Lightweight VM replication Challenges and solutions Evaluation Jay Lorch, Microsoft Research Tardigrade 12

13 Our solution: Use lightweight VMs instead Other processes Lightweight VM system examples Xax [Douceur et al., OSDI 2008] Native Client [Sehr et al., IEEE S&P 2009] Drawbridge [Porter et al., ASPLOS 2011] Embassies [Howell et al., NSDI 2013] Bascule [Baumann et al., Eurosys 2013] Service process Narrow API (e.g., ~45 calls in Bascule) LVM host Host OS Jay Lorch, Microsoft Research Tardigrade 13

14 Lightweight VMs can support unmodified binaries via a library OS Service process LVM host LVM API Jay Lorch, Microsoft Research Tardigrade 14

15 Lightweight VMs can support unmodified binaries via a library OS Service process Service binary Library OS OS API Bascule has a Windows LibOS and a Linux LibOS LVM host LVM API Jay Lorch, Microsoft Research Tardigrade 15

16 A lightweight VM is encapsulated by virtue of having a narrow interface Service process Service binary Library OS OS API LVM host LVM API Jay Lorch, Microsoft Research Tardigrade 16

17 Our approach: Checkpoint by interposing on existing LVM API Checkpoint Service process Service binary Library OS OS API Interposition using existing API means LVM and LibOS don t have to change Checkpointer LVM host LVM API LVM API Jay Lorch, Microsoft Research Tardigrade 17

18 [Cully et al., NSDI 2008] Asynchronous Virtual Machine Replication Lightweight Virtual Machine Replication Service Library OS Checkpointing Host Service Library OS Checkpointing Host primary backup primary backup Jay Lorch, Microsoft Research Tardigrade 18

19 [Cully et al., NSDI 2008] Asynchronous Virtual Machine Replication Our implementation of LVMR is called Tardigrade Lightweight Virtual Machine Replication Guest (service+os) Checkpointing Host Guest (service+os) Checkpointing Host primary backup primary backup Jay Lorch, Microsoft Research Tardigrade 19

20 Outline Motivation Background: Asynchronous VM replication Our solution: Lightweight VM replication Challenges and solutions Evaluation Jay Lorch, Microsoft Research Tardigrade 20

21 Practical LVMR poses challenges See paper for details Challenges Maintaining consistency across reconfigurations Achieving performance potential Checkpointing via an existing LVM API Solutions Vertical Paxos Incremental checkpointing, Lessons checkpoint for LVM capping, parallelism, API designers scaling send buffer size Quiescing, pre-checkpointing, enforcing determinism, terminating connections Jay Lorch, Microsoft Research Tardigrade 21

22 Checkpointing uses certain LVM API Feature Ability to track changed memory pages features Purpose Efficiently compute checkpoint deltas Ability to suspend and inspect other threads Determinism when API calls are replayed Host state either replayable or regeneratable Capture consistent snapshot Prevent divergence on failover Recreate host state on backup Jay Lorch, Microsoft Research Tardigrade 22

23 Features may not always be in LVM Feature Ability to track changed memory pages APIs Workaround Missing Ability ability to suspend to suspend and and inspect other other threads Non-determinism Determinism when when API calls API calls are are replayed Host state not either replayable or or regeneratable Use exceptions, precheckpointing Hide non-determinism Expose divergence as error condition Jay Lorch, Microsoft Research Tardigrade 23

24 To capture a checkpoint, we must quiesce and capture all threads state. Guest (service + library OS) Memory Checkpointing layer What if the API doesn t let a thread suspend and inspect another thread? Host primary Jay Lorch, Microsoft Research Tardigrade 24

25 We can use exceptions to quiesce guest threads Guest (service + library OS) Checkpoint Checkpointing layer Host primary Jay Lorch, Microsoft Research Tardigrade 25

26 Exception handler quiesces and captures each guest thread s state Guest (service + library OS) Checkpoint ExceptionHandler(, ) Memory Checkpointing layer Host primary Jay Lorch, Microsoft Research Tardigrade 26

27 Synchronous system calls complicate quiescence Guest (service + library OS) Checkpointing layer Host primary Jay Lorch, Microsoft Research Tardigrade 27

28 The wait system call is easy to deal with Guest (service + library OS) Checkpointing layer Host select() file descriptor list 0x1AC 0x3BB 0x907 time-to-checkpoint primary Jay Lorch, Microsoft Research Tardigrade 28

29 General synchronous system calls require pre-checkpointing Guest (service + library OS) Checkpointing layer Host primary Jay Lorch, Microsoft Research Tardigrade 29

30 API non-determinism undermines replay CreateSemaphore() returns descriptor 0xAAA CreateSemaphore() returns descriptor 0xBBB primary backup Jay Lorch, Microsoft Research Tardigrade 30

31 An indirection table can hide nondeterminism Guest (service + library OS) Guest (service + library OS) Checkpointing layer Guest descriptor Host descriptor 0x001 0xAAA 0x002 0x932 Host Checkpointing layer Guest descriptor Host descriptor 0x001 0xBBB 0x002 0x909 Host primary backup Jay Lorch, Microsoft Research Tardigrade 31

32 State external to guest needs to be replayable or regeneratable Guest (service + library OS) LVM API Checkpointing layer API provides sockets, not packets Host LVM API TCP session state Checkpointer can t capture TCP session state! primary backup Jay Lorch, Microsoft Research Tardigrade 32

33 System-specific modifications may be necessary Guest (service + library OS) Guest (service + library OS) TCP connections get dropped on a failover. Checkpointing layer Checkpointing layer Fixing this requires a major API change to make it use Host packets rather than sockets Host TCP session state primary backup Jay Lorch, Microsoft Research Tardigrade 33

34 Outline Motivation Background: Asynchronous VM replication Our solution: Lightweight VM replication Challenges and solutions Evaluation Jay Lorch, Microsoft Research Tardigrade 34

35 Latency of ping (ms) Effect of external processes - Remus Baseline Safety Scan Search Indexer Update Deduplication th quantile 95th quantile 99th quantile 99.9th quantile Jay Lorch, Microsoft Research Tardigrade 35

36 Latency (ms) Effect of external processes - Tardigrade Baseline Safety Scan Search Indexer Update Deduplication % 95% 99% 99.9% Quantile Jay Lorch, Microsoft Research Tardigrade 36

37 Latency (ms) Effect of external processes - Tardigrade 25 Baseline Safety Scan Search Indexer Update Deduplication % 95% 99% 99.9% Quantile Jay Lorch, Microsoft Research Tardigrade 37

38 CDF (%) Memory dirtying affects checkpoint latency No dirtying 10% of net b/w 20% of net b/w 30% of net b/w 40% of net b/w 50% of net b/w Latency (ms) Jay Lorch, Microsoft Research Tardigrade 38

39 CDF (%) FDS metadata service Metadata server initially idle Cluster operating normally Checkpoint delta average size: 0.9 MB Cluster starting up Checkpoint delta average size: 1.8 MB Checkpoint interval (ms) Jay Lorch, Microsoft Research Tardigrade 39

40 CDF (%) ZKLite, a simple non-fault-tolerant Java implementation of the Zookeeper API Client request latency (ms) Jay Lorch, Microsoft Research Tardigrade 40

41 Conclusions No changes to binaries needed, making deployment simple Replicating processes rather than VMs substantially Lightweight reduces worst-case VM Examples replication latencyof good is targets: practical for making Metadata existing services Lightweight virtual machine API Coordination designers should services service consider binaries effect on fault-tolerant replication Niche web services Reasonable performance if memory dirtying rate and load are low Jay Lorch, Microsoft Research Tardigrade 41

Services in the Virtualization Plane. Andrew Warfield Adjunct Professor, UBC Technical Director, Citrix Systems

Services in the Virtualization Plane. Andrew Warfield Adjunct Professor, UBC Technical Director, Citrix Systems Services in the Virtualization Plane Andrew Warfield Adjunct Professor, UBC Technical Director, Citrix Systems The Virtualization Plane Applications Applications OS Physical Machine 20ms 20ms in in the

More information

Research on Availability of Virtual Machine Hot Standby based on Double Shadow Page Tables

Research on Availability of Virtual Machine Hot Standby based on Double Shadow Page Tables International Conference on Computer, Networks and Communication Engineering (ICCNCE 2013) Research on Availability of Virtual Machine Hot Standby based on Double Shadow Page Tables Zhiyun Zheng, Huiling

More information

GFS: The Google File System. Dr. Yingwu Zhu

GFS: The Google File System. Dr. Yingwu Zhu GFS: The Google File System Dr. Yingwu Zhu Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one big CPU More storage, CPU required than one PC can

More information

Application Fault Tolerance Using Continuous Checkpoint/Restart

Application Fault Tolerance Using Continuous Checkpoint/Restart Application Fault Tolerance Using Continuous Checkpoint/Restart Tomoki Sekiyama Linux Technology Center Yokohama Research Laboratory Hitachi Ltd. Outline 1. Overview of Application Fault Tolerance and

More information

Applications of Paxos Algorithm

Applications of Paxos Algorithm Applications of Paxos Algorithm Gurkan Solmaz COP 6938 - Cloud Computing - Fall 2012 Department of Electrical Engineering and Computer Science University of Central Florida - Orlando, FL Oct 15, 2012 1

More information

SAND: A Fault-Tolerant Streaming Architecture for Network Traffic Analytics

SAND: A Fault-Tolerant Streaming Architecture for Network Traffic Analytics 1 SAND: A Fault-Tolerant Streaming Architecture for Network Traffic Analytics Qin Liu, John C.S. Lui 1 Cheng He, Lujia Pan, Wei Fan, Yunlong Shi 2 1 The Chinese University of Hong Kong 2 Huawei Noah s

More information

G Xen and Nooks. Robert Grimm New York University

G Xen and Nooks. Robert Grimm New York University G22.3250-001 Xen and Nooks Robert Grimm New York University Agenda! Altogether now: The three questions! The (gory) details of Xen! We already covered Disco, so let s focus on the details! Nooks! The grand

More information

ZooKeeper. Table of contents

ZooKeeper. Table of contents by Table of contents 1 ZooKeeper: A Distributed Coordination Service for Distributed Applications... 2 1.1 Design Goals... 2 1.2 Data model and the hierarchical namespace... 3 1.3 Nodes and ephemeral nodes...

More information

ExpressCluster X 2.1 Update Information 16 June 2009

ExpressCluster X 2.1 Update Information 16 June 2009 2.1 Update Information 16 June 2009 NEC Corporation Release Date 16 June, 2009 Release Outline Product Name 2.1 2.1 is a minor version up for 2.0. 2.0 license can be continuously used for 2.1 (Except for

More information

A Distributed System Case Study: Apache Kafka. High throughput messaging for diverse consumers

A Distributed System Case Study: Apache Kafka. High throughput messaging for diverse consumers A Distributed System Case Study: Apache Kafka High throughput messaging for diverse consumers As always, this is not a tutorial Some of the concepts may no longer be part of the current system or implemented

More information

GFS: The Google File System

GFS: The Google File System GFS: The Google File System Brad Karp UCL Computer Science CS GZ03 / M030 24 th October 2014 Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one

More information

VM Migration, Containers (Lecture 12, cs262a)

VM Migration, Containers (Lecture 12, cs262a) VM Migration, Containers (Lecture 12, cs262a) Ali Ghodsi and Ion Stoica, UC Berkeley February 28, 2018 (Based in part on http://web.eecs.umich.edu/~mosharaf/slides/eecs582/w16/021516-junchenglivemigration.pptx)

More information

Virtualization for Desktop Grid Clients

Virtualization for Desktop Grid Clients Virtualization for Desktop Grid Clients Marosi Attila Csaba atisu@sztaki.hu BOINC Workshop 09, Barcelona, Spain, 23/10/2009 Using Virtual Machines in Desktop Grid Clients for Application Sandboxing! Joint

More information

Introduction to MySQL InnoDB Cluster

Introduction to MySQL InnoDB Cluster 1 / 148 2 / 148 3 / 148 Introduction to MySQL InnoDB Cluster MySQL High Availability made easy Percona Live Europe - Dublin 2017 Frédéric Descamps - MySQL Community Manager - Oracle 4 / 148 Safe Harbor

More information

Eleos: Exit-Less OS Services for SGX Enclaves

Eleos: Exit-Less OS Services for SGX Enclaves Eleos: Exit-Less OS Services for SGX Enclaves Meni Orenbach Marina Minkin Pavel Lifshits Mark Silberstein Accelerated Computing Systems Lab Haifa, Israel What do we do? Improve performance: I/O intensive

More information

PhxSQL: A High-Availability & Strong-Consistency MySQL Cluster. Ming

PhxSQL: A High-Availability & Strong-Consistency MySQL Cluster. Ming PhxSQL: A High-Availability & Strong-Consistency MySQL Cluster Ming CHEN@WeChat Why PhxSQL Highly expected features for MySql cluster Availability and consistency in MySQL cluster Master-slaves replication

More information

Transparent TCP Recovery

Transparent TCP Recovery Transparent Recovery with Chain Replication Robert Burgess Ken Birman Robert Broberg Rick Payne Robbert van Renesse October 26, 2009 Motivation Us: Motivation Them: Client Motivation There is a connection...

More information

EXPLODE: a Lightweight, General System for Finding Serious Storage System Errors. Junfeng Yang, Can Sar, Dawson Engler Stanford University

EXPLODE: a Lightweight, General System for Finding Serious Storage System Errors. Junfeng Yang, Can Sar, Dawson Engler Stanford University EXPLODE: a Lightweight, General System for Finding Serious Storage System Errors Junfeng Yang, Can Sar, Dawson Engler Stanford University Why check storage systems? Storage system errors are among the

More information

Abstract. Introduction

Abstract. Introduction Highly Available In-Memory Metadata Filesystem using Viewstamped Replication (https://github.com/pkvijay/metadr) Pradeep Kumar Vijay, Pedro Ulises Cuevas Berrueco Stanford cs244b-distributed Systems Abstract

More information

SimpleChubby: a simple distributed lock service

SimpleChubby: a simple distributed lock service SimpleChubby: a simple distributed lock service Jing Pu, Mingyu Gao, Hang Qu 1 Introduction We implement a distributed lock service called SimpleChubby similar to the original Google Chubby lock service[1].

More information

Deterministic Process Groups in

Deterministic Process Groups in Deterministic Process Groups in Tom Bergan Nicholas Hunt, Luis Ceze, Steven D. Gribble University of Washington A Nondeterministic Program global x=0 Thread 1 Thread 2 t := x x := t + 1 t := x x := t +

More information

NPTEL Course Jan K. Gopinath Indian Institute of Science

NPTEL Course Jan K. Gopinath Indian Institute of Science Storage Systems NPTEL Course Jan 2012 (Lecture 39) K. Gopinath Indian Institute of Science Google File System Non-Posix scalable distr file system for large distr dataintensive applications performance,

More information

Better Security with Virtual Machines

Better Security with Virtual Machines Better Security with Virtual Machines VMware Security Seminar Cambridge, 2006 Agenda VMware Evolution Virtual machine Server architecture Virtual infrastructure Looking forward VMware s security vision

More information

Zhang Chen Zhang Chen Copyright 2017 FUJITSU LIMITED

Zhang Chen Zhang Chen Copyright 2017 FUJITSU LIMITED Introduce Introduction And And Status Status Update Update About About COLO COLO FT FT Zhang Chen Zhang Chen Agenda Background Introduction

More information

Composing OS Extensions Safely and Efficiently with Bascule

Composing OS Extensions Safely and Efficiently with Bascule Composing OS Extensions Safely and Efficiently with Bascule Andrew Baumann Dongyoon Lee Pedro Fonseca Lisa Glendenning Jacob R. Lorch Barry Bond Reuben Olinsky Galen C. Hunt Microsoft Research University

More information

Fault Tolerance for Highly Available Internet Services: Concept, Approaches, and Issues

Fault Tolerance for Highly Available Internet Services: Concept, Approaches, and Issues Fault Tolerance for Highly Available Internet Services: Concept, Approaches, and Issues By Narjess Ayari, Denis Barbaron, Laurent Lefevre and Pascale primet Presented by Mingyu Liu Outlines 1.Introduction

More information

DISTRIBUTED SYSTEMS [COMP9243] Lecture 9b: Distributed File Systems INTRODUCTION. Transparency: Flexibility: Slide 1. Slide 3.

DISTRIBUTED SYSTEMS [COMP9243] Lecture 9b: Distributed File Systems INTRODUCTION. Transparency: Flexibility: Slide 1. Slide 3. CHALLENGES Transparency: Slide 1 DISTRIBUTED SYSTEMS [COMP9243] Lecture 9b: Distributed File Systems ➀ Introduction ➁ NFS (Network File System) ➂ AFS (Andrew File System) & Coda ➃ GFS (Google File System)

More information

Paxos Replicated State Machines as the Basis of a High- Performance Data Store

Paxos Replicated State Machines as the Basis of a High- Performance Data Store Paxos Replicated State Machines as the Basis of a High- Performance Data Store William J. Bolosky, Dexter Bradshaw, Randolph B. Haagens, Norbert P. Kusters and Peng Li March 30, 2011 Q: How to build a

More information

PushyDB. Jeff Chan, Kenny Lam, Nils Molina, Oliver Song {jeffchan, kennylam, molina,

PushyDB. Jeff Chan, Kenny Lam, Nils Molina, Oliver Song {jeffchan, kennylam, molina, PushyDB Jeff Chan, Kenny Lam, Nils Molina, Oliver Song {jeffchan, kennylam, molina, osong}@mit.edu https://github.com/jeffchan/6.824 1. Abstract PushyDB provides a more fully featured database that exposes

More information

iscsi Target Usage Guide December 15, 2017

iscsi Target Usage Guide December 15, 2017 December 15, 2017 1 Table of Contents 1. Native VMware Availability Options for vsan 1.1.Native VMware Availability Options for vsan 1.2.Application Clustering Solutions 1.3.Third party solutions 2. Security

More information

Advanced Memory Management

Advanced Memory Management Advanced Memory Management Main Points Applications of memory management What can we do with ability to trap on memory references to individual pages? File systems and persistent storage Goals Abstractions

More information

Current Topics in OS Research. So, what s hot?

Current Topics in OS Research. So, what s hot? Current Topics in OS Research COMP7840 OSDI Current OS Research 0 So, what s hot? Operating systems have been around for a long time in many forms for different types of devices It is normally general

More information

Intra-cluster Replication for Apache Kafka. Jun Rao

Intra-cluster Replication for Apache Kafka. Jun Rao Intra-cluster Replication for Apache Kafka Jun Rao About myself Engineer at LinkedIn since 2010 Worked on Apache Kafka and Cassandra Database researcher at IBM Outline Overview of Kafka Kafka architecture

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

vbuckets: The Core Enabling Mechanism for Couchbase Server Data Distribution (aka Auto-Sharding )

vbuckets: The Core Enabling Mechanism for Couchbase Server Data Distribution (aka Auto-Sharding ) vbuckets: The Core Enabling Mechanism for Data Distribution (aka Auto-Sharding ) Table of Contents vbucket Defined 3 key-vbucket-server ping illustrated 4 vbuckets in a world of s 5 TCP ports Deployment

More information

MySQL HA Solutions Selecting the best approach to protect access to your data

MySQL HA Solutions Selecting the best approach to protect access to your data MySQL HA Solutions Selecting the best approach to protect access to your data Sastry Vedantam sastry.vedantam@oracle.com February 2015 Copyright 2015, Oracle and/or its affiliates. All rights reserved

More information

Distributed Filesystem

Distributed Filesystem Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the

More information

Distributed Systems Final Exam

Distributed Systems Final Exam 15-440 Distributed Systems Final Exam Name: Andrew: ID December 12, 2011 Please write your name and Andrew ID above before starting this exam. This exam has 14 pages, including this title page. Please

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

OS Extensibility: SPIN and Exokernels. Robert Grimm New York University

OS Extensibility: SPIN and Exokernels. Robert Grimm New York University OS Extensibility: SPIN and Exokernels Robert Grimm New York University The Three Questions What is the problem? What is new or different? What are the contributions and limitations? OS Abstraction Barrier

More information

DMTCP: Fixing the Single Point of Failure of the ROS Master

DMTCP: Fixing the Single Point of Failure of the ROS Master DMTCP: Fixing the Single Point of Failure of the ROS Master Tw i n k l e J a i n j a i n. t @ h u s k y. n e u. e d u G e n e C o o p e r m a n g e n e @ c c s. n e u. e d u C o l l e g e o f C o m p u

More information

Intuitive distributed algorithms. with F#

Intuitive distributed algorithms. with F# Intuitive distributed algorithms with F# Natallia Dzenisenka Alena Hall @nata_dzen @lenadroid A tour of a variety of intuitivedistributed algorithms used in practical distributed systems. and how to prototype

More information

CARP: Handling silent data errors and site failures in an integrated program and storage replication mechanism

CARP: Handling silent data errors and site failures in an integrated program and storage replication mechanism CARP: Handling silent data errors and site failures in an integrated program and storage replication mechanism Lanyue Lu, Prasenjit Sarkar, Dinesh Subhraveti, Soumitra Sarkar, Mark Seaman, Reshu Jain,

More information

Xen and the Art of Virtualization. CSE-291 (Cloud Computing) Fall 2016

Xen and the Art of Virtualization. CSE-291 (Cloud Computing) Fall 2016 Xen and the Art of Virtualization CSE-291 (Cloud Computing) Fall 2016 Why Virtualization? Share resources among many uses Allow heterogeneity in environments Allow differences in host and guest Provide

More information

Checkpointing with DMTCP and MVAPICH2 for Supercomputing. Kapil Arya. Mesosphere, Inc. & Northeastern University

Checkpointing with DMTCP and MVAPICH2 for Supercomputing. Kapil Arya. Mesosphere, Inc. & Northeastern University MVAPICH Users Group 2016 Kapil Arya Checkpointing with DMTCP and MVAPICH2 for Supercomputing Kapil Arya Mesosphere, Inc. & Northeastern University DMTCP Developer Apache Mesos Committer kapil@mesosphere.io

More information

Distributed Systems Exam 1 Review Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed Systems Exam 1 Review Paul Krzyzanowski. Rutgers University. Fall 2016 Distributed Systems 2015 Exam 1 Review Paul Krzyzanowski Rutgers University Fall 2016 1 Question 1 Why did the use of reference counting for remote objects prove to be impractical? Explain. It s not fault

More information

Fault Tolerant Java Virtual Machine. Roy Friedman and Alon Kama Technion Haifa, Israel

Fault Tolerant Java Virtual Machine. Roy Friedman and Alon Kama Technion Haifa, Israel Fault Tolerant Java Virtual Machine Roy Friedman and Alon Kama Technion Haifa, Israel Objective Create framework for transparent fault-tolerance Support legacy applications Intended for long-lived, highly

More information

CA485 Ray Walshe Google File System

CA485 Ray Walshe Google File System Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage

More information

Caching and reliability

Caching and reliability Caching and reliability Block cache Vs. Latency ~10 ns 1~ ms Access unit Byte (word) Sector Capacity Gigabytes Terabytes Price Expensive Cheap Caching disk contents in RAM Hit ratio h : probability of

More information

Performance and Forgiveness. June 23, 2008 Margo Seltzer Harvard University School of Engineering and Applied Sciences

Performance and Forgiveness. June 23, 2008 Margo Seltzer Harvard University School of Engineering and Applied Sciences Performance and Forgiveness June 23, 2008 Margo Seltzer Harvard University School of Engineering and Applied Sciences Margo Seltzer Architect Outline A consistency primer Techniques and costs of consistency

More information

Creating the Fastest Possible Backups Using VMware Consolidated Backup. A Design Blueprint

Creating the Fastest Possible Backups Using VMware Consolidated Backup. A Design Blueprint Creating the Fastest Possible Backups Using VMware Consolidated Backup A Design Blueprint George Winter Technical Product Manager NetBackup Symantec Corporation Agenda Overview NetBackup for VMware and

More information

Background. IBM sold expensive mainframes to large organizations. Monitor sits between one or more OSes and HW

Background. IBM sold expensive mainframes to large organizations. Monitor sits between one or more OSes and HW Virtual Machines Background IBM sold expensive mainframes to large organizations Some wanted to run different OSes at the same time (because applications were developed on old OSes) Solution: IBM developed

More information

Mehmet.Gonullu@Veeam.com Veeam Portfolio - Agentless backup and replication for VMware and Hyper-V - Scalable, powerful, easy-to-use, affordable - Storage agnostic High-speed Recovery instant VM recovery

More information

A Crash Course In Wide Area Data Replication. Jacob Farmer, CTO, Cambridge Computer

A Crash Course In Wide Area Data Replication. Jacob Farmer, CTO, Cambridge Computer A Crash Course In Wide Area Data Replication Jacob Farmer, CTO, Cambridge Computer SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA. Member companies and individuals

More information

RemusDB: transparent high availability for database systems

RemusDB: transparent high availability for database systems The VLDB Journal (213) 22:29 45 DOI 1.17/s778-12-294-6 SPECIAL ISSUE PAPER RemusDB: transparent high availability for database systems Umar Farooq Minhas Shriram Rajagopalan Brendan Cully Ashraf Aboulnaga

More information

ProxySQL - GTID Consistent Reads. Adaptive query routing based on GTID tracking

ProxySQL - GTID Consistent Reads. Adaptive query routing based on GTID tracking ProxySQL - GTID Consistent Reads Adaptive query routing based on GTID tracking Introduction Rene Cannao Founder of ProxySQL MySQL DBA Introduction Nick Vyzas ProxySQL Committer MySQL DBA What is ProxySQL?

More information

MEMBRANE: OPERATING SYSTEM SUPPORT FOR RESTARTABLE FILE SYSTEMS

MEMBRANE: OPERATING SYSTEM SUPPORT FOR RESTARTABLE FILE SYSTEMS Department of Computer Science Institute of System Architecture, Operating Systems Group MEMBRANE: OPERATING SYSTEM SUPPORT FOR RESTARTABLE FILE SYSTEMS SWAMINATHAN SUNDARARAMAN, SRIRAM SUBRAMANIAN, ABHISHEK

More information

LITE Kernel RDMA. Support for Datacenter Applications. Shin-Yeh Tsai, Yiying Zhang

LITE Kernel RDMA. Support for Datacenter Applications. Shin-Yeh Tsai, Yiying Zhang LITE Kernel RDMA Support for Datacenter Applications Shin-Yeh Tsai, Yiying Zhang Time 2 Berkeley Socket Userspace Kernel Hardware Time 1983 2 Berkeley Socket TCP Offload engine Arrakis & mtcp IX Userspace

More information

RDMA Migration and RDMA Fault Tolerance for QEMU

RDMA Migration and RDMA Fault Tolerance for QEMU RDMA Migration and RDMA Fault Tolerance for QEMU Michael R. Hines IBM mrhines@us.ibm.com http://wiki.qemu.org/features/rdmalivemigration http://wiki.qemu.org/features/microcheckpointing 1 Migration design

More information

Status Update About COLO (COLO: COarse-grain LOck-stepping Virtual Machines for Non-stop Service)

Status Update About COLO (COLO: COarse-grain LOck-stepping Virtual Machines for Non-stop Service) Status Update About COLO (COLO: COarse-grain LOck-stepping Virtual Machines for Non-stop Service) eddie.dong@intel.com arei.gonglei@huawei.com yanghy@cn.fujitsu.com Agenda Background Introduction Of COLO

More information

IsoStack Highly Efficient Network Processing on Dedicated Cores

IsoStack Highly Efficient Network Processing on Dedicated Cores IsoStack Highly Efficient Network Processing on Dedicated Cores Leah Shalev Eran Borovik, Julian Satran, Muli Ben-Yehuda Outline Motivation IsoStack architecture Prototype TCP/IP over 10GE on a single

More information

Asynchronous Events on Linux

Asynchronous Events on Linux Asynchronous Events on Linux Frederic.Rossi@Ericsson.CA Open System Lab Systems Research June 25, 2002 Ericsson Research Canada Introduction Linux performs well as a general purpose OS but doesn t satisfy

More information

Abstractions for Practical Virtual Machine Replay. Anton Burtsev, David Johnson, Mike Hibler, Eric Eride, John Regehr University of Utah

Abstractions for Practical Virtual Machine Replay. Anton Burtsev, David Johnson, Mike Hibler, Eric Eride, John Regehr University of Utah Abstractions for Practical Virtual Machine Replay Anton Burtsev, David Johnson, Mike Hibler, Eric Eride, John Regehr University of Utah 2 3 Number of systems supporting replay: 0 Determinism 4 CPU is deterministic

More information

Hedvig as backup target for Veeam

Hedvig as backup target for Veeam Hedvig as backup target for Veeam Solution Whitepaper Version 1.0 April 2018 Table of contents Executive overview... 3 Introduction... 3 Solution components... 4 Hedvig... 4 Hedvig Virtual Disk (vdisk)...

More information

ZEST Snapshot Service. A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1

ZEST Snapshot Service. A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1 ZEST Snapshot Service A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1 Design Motivation To optimize science utilization of the machine Maximize

More information

Xen Community Update. Ian Pratt, Citrix Systems and Chairman of Xen.org

Xen Community Update. Ian Pratt, Citrix Systems and Chairman of Xen.org Xen Community Update Ian Pratt, Citrix Systems and Chairman of Xen.org 1 Outline Project Status Xen Client Initiative Xen Cloud Platform New Xen 4.0 Features 2 Announcement The Xen Advisory Board is excited

More information

IO-Lite: A Unified I/O Buffering and Caching System

IO-Lite: A Unified I/O Buffering and Caching System IO-Lite: A Unified I/O Buffering and Caching System Vivek S. Pai, Peter Druschel and Willy Zwaenepoel Rice University (Presented by Chuanpeng Li) 2005-4-25 CS458 Presentation 1 IO-Lite Motivation Network

More information

Microservices, Messaging and Science Gateways. Review microservices for science gateways and then discuss messaging systems.

Microservices, Messaging and Science Gateways. Review microservices for science gateways and then discuss messaging systems. Microservices, Messaging and Science Gateways Review microservices for science gateways and then discuss messaging systems. Micro- Services Distributed Systems DevOps The Gateway Octopus Diagram Browser

More information

Engineering Fault-Tolerant TCP/IP servers using FT-TCP. Dmitrii Zagorodnov University of California San Diego

Engineering Fault-Tolerant TCP/IP servers using FT-TCP. Dmitrii Zagorodnov University of California San Diego Engineering Fault-Tolerant TCP/IP servers using FT-TCP Dmitrii Zagorodnov University of California San Diego Motivation Reliable network services are desirable but costly! Extra and/or specialized hardware

More information

DRBD 9. Lars Ellenberg. Linux Storage Replication. LINBIT HA Solutions GmbH Vienna, Austria

DRBD 9. Lars Ellenberg. Linux Storage Replication. LINBIT HA Solutions GmbH Vienna, Austria DRBD 9 Linux Storage Replication Lars Ellenberg LINBIT HA Solutions GmbH Vienna, Austria What this talk is about What is replication Why block level replication Why replication What do we have to deal

More information

MySQL High Availability. Michael Messina Senior Managing Consultant, Rolta-AdvizeX /

MySQL High Availability. Michael Messina Senior Managing Consultant, Rolta-AdvizeX / MySQL High Availability Michael Messina Senior Managing Consultant, Rolta-AdvizeX mmessina@advizex.com / mike.messina@rolta.com Introduction Michael Messina Senior Managing Consultant Rolta-AdvizeX, Working

More information

Mobile Operating Systems Lesson 04 PalmOS Part 2

Mobile Operating Systems Lesson 04 PalmOS Part 2 Mobile Operating Systems Lesson 04 PalmOS Part 2 Oxford University Press 2007. All rights reserved. 1 PalmOS Memory Support Assumes that there is a 256 MB memory card(s) The card RAM, ROM, and flash memories

More information

The Design and Evaluation of a Practical System for Fault-Tolerant Virtual Machines

The Design and Evaluation of a Practical System for Fault-Tolerant Virtual Machines The Design and Evaluation of a Practical System for Fault-Tolerant Virtual Machines Daniel J. Scales (VMware) scales@vmware.com Mike Nelson (VMware) mnelson@vmware.com Ganesh Venkitachalam (VMware) ganesh@vmware.com

More information

The Google File System (GFS)

The Google File System (GFS) 1 The Google File System (GFS) CS60002: Distributed Systems Antonio Bruto da Costa Ph.D. Student, Formal Methods Lab, Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur 2 Design constraints

More information

Lightweight Live Migration for High Availability Cluster Service

Lightweight Live Migration for High Availability Cluster Service Lightweight Live Migration for High Availability Cluster Service Bo Jiang 1, Binoy Ravindran 1, and Changsoo Kim 2 1 ECE Dept., Virginia Tech {bjiang,binoy}@vt.edu 2 ETRI, Daejeon, South Korea cskim7@etri.re.kr

More information

Adaptive Cluster Computing using JavaSpaces

Adaptive Cluster Computing using JavaSpaces Adaptive Cluster Computing using JavaSpaces Jyoti Batheja and Manish Parashar The Applied Software Systems Lab. ECE Department, Rutgers University Outline Background Introduction Related Work Summary of

More information

Status Update About COLO FT

Status Update About COLO FT Status Update About COLO FT www.huawei.com Hailiang Zhang (Huawei) Randy Han (Huawei) Agenda Introduce COarse-grain LOck-stepping COLO Design and Technology Details Current Status Of COLO In KVM Further

More information

Enhancing the Performance of High Availability Lightweight Live Migration

Enhancing the Performance of High Availability Lightweight Live Migration Enhancing the Performance of High Availability Lightweight Live Migration Peng Lu 1, Binoy Ravindran 1, and Changsoo Kim 2 1 ECE Dept., Virginia Tech {lvpeng,binoy}@vt.edu 2 ETRI, Daejeon, South Korea

More information

Outline. INF3190:Distributed Systems - Examples. Last week: Definitions Transparencies Challenges&pitfalls Architecturalstyles

Outline. INF3190:Distributed Systems - Examples. Last week: Definitions Transparencies Challenges&pitfalls Architecturalstyles INF3190:Distributed Systems - Examples Thomas Plagemann & Roman Vitenberg Outline Last week: Definitions Transparencies Challenges&pitfalls Architecturalstyles Today: Examples Googel File System (Thomas)

More information

Remote Procedure Call (RPC) and Transparency

Remote Procedure Call (RPC) and Transparency Remote Procedure Call (RPC) and Transparency Brad Karp UCL Computer Science CS GZ03 / M030 10 th October 2014 Transparency in Distributed Systems Programmers accustomed to writing code for a single box

More information

Distributed System. Gang Wu. Spring,2018

Distributed System. Gang Wu. Spring,2018 Distributed System Gang Wu Spring,2018 Lecture4:Failure& Fault-tolerant Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the

More information

Georgia Institute of Technology ECE6102 4/20/2009 David Colvin, Jimmy Vuong

Georgia Institute of Technology ECE6102 4/20/2009 David Colvin, Jimmy Vuong Georgia Institute of Technology ECE6102 4/20/2009 David Colvin, Jimmy Vuong Relatively recent; still applicable today GFS: Google s storage platform for the generation and processing of data used by services

More information

Lecture XIII: Replication-II

Lecture XIII: Replication-II Lecture XIII: Replication-II CMPT 401 Summer 2007 Dr. Alexandra Fedorova Outline Google File System A real replicated file system Paxos Harp A consensus algorithm used in real systems A replicated research

More information

Cycle Sharing Systems

Cycle Sharing Systems Cycle Sharing Systems Jagadeesh Dyaberi Dependable Computing Systems Lab Purdue University 10/31/2005 1 Introduction Design of Program Security Communication Architecture Implementation Conclusion Outline

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google* 정학수, 최주영 1 Outline Introduction Design Overview System Interactions Master Operation Fault Tolerance and Diagnosis Conclusions

More information

INTRODUCTION TO XTREMIO METADATA-AWARE REPLICATION

INTRODUCTION TO XTREMIO METADATA-AWARE REPLICATION Installing and Configuring the DM-MPIO WHITE PAPER INTRODUCTION TO XTREMIO METADATA-AWARE REPLICATION Abstract This white paper introduces XtremIO replication on X2 platforms. XtremIO replication leverages

More information

How VoltDB does Transactions

How VoltDB does Transactions TEHNIL NOTE How does Transactions Note to readers: Some content about read-only transaction coordination in this document is out of date due to changes made in v6.4 as a result of Jepsen testing. Updates

More information

Petal and Frangipani

Petal and Frangipani Petal and Frangipani Petal/Frangipani NFS NAS Frangipani SAN Petal Petal/Frangipani Untrusted OS-agnostic NFS FS semantics Sharing/coordination Frangipani Disk aggregation ( bricks ) Filesystem-agnostic

More information

Map-Reduce. Marco Mura 2010 March, 31th

Map-Reduce. Marco Mura 2010 March, 31th Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of

More information

Data Storage Revolution

Data Storage Revolution Data Storage Revolution Relational Databases Object Storage (put/get) Dynamo PNUTS CouchDB MemcacheDB Cassandra Speed Scalability Availability Throughput No Complexity Eventual Consistency Write Request

More information

CSE543 - Computer and Network Security Module: Virtualization

CSE543 - Computer and Network Security Module: Virtualization CSE543 - Computer and Network Security Module: Virtualization Professor Trent Jaeger CSE543 - Introduction to Computer and Network Security 1 1 Operating System Quandary Q: What is the primary goal of

More information

Operating Systems (2INC0) 2018/19. Introduction (01) Dr. Tanir Ozcelebi. Courtesy of Prof. Dr. Johan Lukkien. System Architecture and Networking Group

Operating Systems (2INC0) 2018/19. Introduction (01) Dr. Tanir Ozcelebi. Courtesy of Prof. Dr. Johan Lukkien. System Architecture and Networking Group Operating Systems (2INC0) 20/19 Introduction (01) Dr. Courtesy of Prof. Dr. Johan Lukkien System Architecture and Networking Group Course Overview Introduction to operating systems Processes, threads and

More information

status Emmanuel Cecchet

status Emmanuel Cecchet status Emmanuel Cecchet c-jdbc@objectweb.org JOnAS developer workshop http://www.objectweb.org - c-jdbc@objectweb.org 1-23/02/2004 Outline Overview Advanced concepts Query caching Horizontal scalability

More information

CLOUD-SCALE FILE SYSTEMS

CLOUD-SCALE FILE SYSTEMS Data Management in the Cloud CLOUD-SCALE FILE SYSTEMS 92 Google File System (GFS) Designing a file system for the Cloud design assumptions design choices Architecture GFS Master GFS Chunkservers GFS Clients

More information

VMware vsphere 5.5 Professional Bootcamp

VMware vsphere 5.5 Professional Bootcamp VMware vsphere 5.5 Professional Bootcamp Course Overview Course Objectives Cont. VMware vsphere 5.5 Professional Bootcamp is our most popular proprietary 5 Day course with more hands-on labs (100+) and

More information

Cross Data Center Replication in Apache Solr. Anvi Jain, Software Engineer II Amrit Sarkar, Search Engineer

Cross Data Center Replication in Apache Solr. Anvi Jain, Software Engineer II Amrit Sarkar, Search Engineer Cross Data Center Replication in Apache Solr Anvi Jain, Software Engineer II Amrit Sarkar, Search Engineer Who are we? Based in Bedford, MA. Offices all around the world Progress tools and platforms enable

More information

Operating Systems 2010/2011

Operating Systems 2010/2011 Operating Systems 2010/2011 Introduction Johan Lukkien 1 Agenda OS: place in the system Some common notions Motivation & OS tasks Extra-functional requirements Course overview Read chapters 1 + 2 2 A computer

More information

Aurora, RDS, or On-Prem, Which is right for you

Aurora, RDS, or On-Prem, Which is right for you Aurora, RDS, or On-Prem, Which is right for you Kathy Gibbs Database Specialist TAM Katgibbs@amazon.com Santa Clara, California April 23th 25th, 2018 Agenda RDS Aurora EC2 On-Premise Wrap-up/Recommendation

More information

QEMU Backup. Maxim Nestratov, Virtuozzo Vladimir Sementsov-Ogievskiy, Virtuozzo

QEMU Backup. Maxim Nestratov, Virtuozzo Vladimir Sementsov-Ogievskiy, Virtuozzo QEMU Backup Maxim Nestratov, Virtuozzo Vladimir Sementsov-Ogievskiy, Virtuozzo QEMU Backup Vladimir Sementsov-Ogievskiy, Virtuozzo Full featured backup Online backup Fast Not very invasive for the guest

More information

An Empirical Study of High Availability in Stream Processing Systems

An Empirical Study of High Availability in Stream Processing Systems An Empirical Study of High Availability in Stream Processing Systems Yu Gu, Zhe Zhang, Fan Ye, Hao Yang, Minkyong Kim, Hui Lei, Zhen Liu Stream Processing Model software operators (PEs) Ω Unexpected machine

More information