Failures in Distributed Systems

Similar documents
Heading Off Correlated Failures through Independence-as-a-Service

Cloud Computing. Ennan Zhai. Computer Science at Yale University

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

Replications and Consensus

SAMC: Sema+c- Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems

CPSC 426/526. P2P Lookup Service. Ennan Zhai. Computer Science Department Yale University

HOW TO PLAN & EXECUTE A SUCCESSFUL CLOUD MIGRATION

TROPIC: Transactional Resource Orchestration Platform In the Cloud

Principal Solutions Architect. Architecting in the Cloud

DISTRIBUTED SYSTEMS [COMP9243] Lecture 8a: Cloud Computing WHAT IS CLOUD COMPUTING? 2. Slide 3. Slide 1. Why is it called Cloud?

CPSC 426/526. P2P Lookup Service. Ennan Zhai. Computer Science Department Yale University

Welcome to the New Era of Cloud Computing

Servers fail, who cares? (Answer: I do, sort of) Gregg Ulrich, #netflixcloud #cassandra12

Document Sub Title. Yotpo. Technical Overview 07/18/ Yotpo

AWS_SOA-C00 Exam. Volume: 758 Questions

Distributed Systems Homework 1 (6 problems)

White Paper Amazon Aurora A Fast, Affordable and Powerful RDBMS

Programming model and implementation for processing and. Programs can be automatically parallelized and executed on a large cluster of machines

Cryptographic Systems

PracticeDump. Free Practice Dumps - Unlimited Free Access of practice exam

BERLIN. 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

Introduction to data centers

There Is More Consensus in Egalitarian Parliaments

Moving to the Cloud: Making It Happen With MarkLogic

Amazon Aurora Deep Dive

AWS: Basic Architecture Session SUNEY SHARMA Solutions Architect: AWS

How we built a highly scalable Machine Learning platform using Apache Mesos

CISC 7610 Lecture 2b The beginnings of NoSQL

Lessons learned while automating MySQL in the AWS cloud. Stephane Combaudon DB Engineer - Slice

Mechanisms and Architectures for Tail-Tolerant System Operations in Cloud

Choosing a MySQL HA Solution Today. Choosing the best solution among a myriad of options

Designing Fault-Tolerant Applications

SpecPaxos. James Connolly && Harrison Davis

CS 425 / ECE 428 Distributed Systems Fall 2017

Designing Distributed Systems using Approximate Synchrony in Data Center Networks

KillTest *KIJGT 3WCNKV[ $GVVGT 5GTXKEG Q&A NZZV ]]] QORRZKYZ IUS =K ULLKX LXKK [VJGZK YKX\OIK LUX UTK _KGX

About Intellipaat. About the Course. Why Take This Course?

Intuitive distributed algorithms. with F#

ZooKeeper Atomic Broadcast

Aurora, RDS, or On-Prem, Which is right for you

Simple testing can prevent most critical failures -- An analysis of production failures in distributed data-intensive systems

MIGRATING SAP WORKLOADS TO AWS CLOUD

Amazon Aurora Deep Dive

Consistency in Distributed Systems

Highly Available Database Architectures in AWS. Santa Clara, California April 23th 25th, 2018 Mike Benshoof, Technical Account Manager, Percona

Getting Started with Amazon EC2 and Amazon SQS

Distributed Systems COMP 212. Lecture 18 Othon Michail

Project Presentation

UseNet and Gossip Protocol

Troubleshooting VoIP in Converged Networks

Amazon Web Services. Block 402, 4 th Floor, Saptagiri Towers, Above Pantaloons, Begumpet Main Road, Hyderabad Telangana India

Distributed Systems CS6421

MapReduce & BigTable

Microservices at Netflix Scale. First Principles, Tradeoffs, Lessons Learned Ruslan

CIT 668: System Architecture. Amazon Web Services

Deploy. A step-by-step guide to successfully deploying your new app with the FileMaker Platform

Exam Distributed Systems

Hosting DesktopNow in Amazon Web Services. Ivanti DesktopNow powered by AppSense

Applications of Paxos Algorithm

AlwaysOn Availability Groups: Backups, Restores, and CHECKDB

WHITEPAPER AMAZON ELB: Your Master Key to a Secure, Cost-Efficient and Scalable Cloud.

Write On Aws. Aws Tools For Windows Powershell User Guide using the aws tools for windows powershell (p. 19) this section includes information about

Security and Compliance at Mavenlink

RA-GRS, 130 replication support, ZRS, 130

CS / Cloud Computing. Recitation 9 October 22 nd and 25 th, 2013

CLUSTERING HIVEMQ. Building highly available, horizontally scalable MQTT Broker Clusters

Amazon Aurora Deep Dive

Hadoop/MapReduce Computing Paradigm

Course Information. Course Basics. Course Format. Arvind Krishnamurthy. Instructor: Arvind Krishnamurthy. TA: Naveen Sharma

PROTECT YOUR DATA FROM MALWARE AND ENSURE BUSINESS CONTINUITY ON THE CLOUD WITH NAVLINK MANAGED AMAZON WEB SERVICES MANAGED AWS

Pega 7 Webinar Series

Cloud & container monitoring , Lars Michelsen Check_MK Conference #4

2013 AWS Worldwide Public Sector Summit Washington, D.C.

Introduction to Database Services

Case Study Virtual Expo. Virtual Expo, a leader in online B2B exhibitions, used Varnish Plus Cloud to build and enhance their own in-house CDN

Persistent Storage with Docker in production - Which solution and why?

Abstractions for Model Checking SDN Controllers. Divjyot Sethi, Srinivas Narayana, Prof. Sharad Malik Princeton University

Security of End User based Cloud Services Sang Young

Introduction to Cloud Computing

Data Protection Done Right with Dell EMC and AWS

MapR Enterprise Hadoop

Hyperconverged Cloud Architecture with OpenNebula and StorPool

Training on Amazon AWS Cloud Computing. Course Content

High Availability and Disaster Recovery. Cherry Lin, Jonathan Quinn

How to backup Ceph at scale. FOSDEM, Brussels,

vsphere Replication for Disaster Recovery to Cloud

Amazon Web Services (AWS) Solutions Architect Intermediate Level Course Content

CLOUD COMPUTING. A public cloud sells services to anyone on the Internet. The cloud infrastructure is made available to

Embracing Failure. Fault Injec,on and Service Resilience at Ne6lix. Josh Evans Director of Opera,ons Engineering, Ne6lix

Amazon Web Services Training. Training Topics:

Everything You Need to Know About MySQL Group Replication

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf

Distributed Systems. 10. Consensus: Paxos. Paul Krzyzanowski. Rutgers University. Fall 2017

Unlimited Scalability in the Cloud A Case Study of Migration to Amazon DynamoDB

MySQL Enterprise High Availability

Container Orchestration on Amazon Web Services. Arun

Protecting remote site data SvSAN clustering - failure scenarios

AWS Solutions Architect Associate (SAA-C01) Sample Exam Questions

Automated Upgrades of PostgreSQL Clusters in Cloud

Fault-Tolerant Computer System Design ECE 695/CS 590. Putting it All Together

Transcription:

CPSC 426/526 Failures in Distributed Systems Ennan Zhai Computer Science Department Yale University

Recall: Lec-14 In lec-14, we learned: - Difference between privacy and anonymity - Anonymous communications - Case study: Dissent [OSDI 12]

Lecture Roadmap Failures in Distributed Systems Root Causes How to handle failures Case Study: Catastrophic Failures

Failures in Distributed Systems What is the meaning of failure: - Your system does not function correctly - Your system implementation does not meet your spec - What is your intuition when you say some code has bugs?

Why we care about failures? Data Center Outages Generate Big Losses Downtime in a data center can cost an average of $740,357 per incident -- cloud outage cost report in 2016.

Failures in Distributed Systems Different aspects of system failures: - Reliability: Operation & job failures, data loss & corruption - Performance: Traffic overloaded, connection delay - Availability: Node and cluster downtime - Consistency: Permanent inconsistent replicas - Scalability: Function incorrectly in large-scale clusters -......

Failures in Distributed Systems Different aspects of system failures: - Reliability: Operation & job failures, data loss & corruption - Performance: Traffic overloaded, connection delay - Availability: Node and cluster downtime - Consistency: Permanent inconsistent replicas - Scalability: Function incorrectly in large-scale clusters -......

Reliability & Availability Correlated failures resulting from EBS due to bugs in one EBS server Summary of the October 22, 2012 AWS Service Event in the US-East Region We d like to share more about the service event that occurred on Monday, October 22nd in the US-East Region. We have now completed the analysis of the events that affected AWS customers, and we want to describe what happened, our understanding of how customers were affected, and what we are doing to prevent a similar issue from occurring in the future. The Primary Event and the Impact to Amazon Elastic Block Store (EBS) and Amazon Elastic Compute Cloud (EC2)

Reliability & Availability Elastic Compute Cloud (EC2) Elastic Block Store (EBS)

Reliability & Availability...... Elastic Block Store (EBS)

Reliability & Availability VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4............ Elastic Block Store (EBS)

Netflix VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4............ EBS Cluster1 EBS Cluster2

Netflix VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4............ EBS Cluster1 EBS Cluster2

Netflix VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4............ EBS Cluster1 EBS Cluster2

Netflix VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4............ EBS Cluster1 EBS Cluster2

Global Outages

A more tricky example

Video App Cloud Provider A Cloud Provider B

Video App Cloud Provider A Cloud Provider B Third-party infrastructure components

Video App Cloud Provider A Cloud Provider B Third-party infrastructure components ISP Router A ISP Router B ISP Router C

Video App Cloud Provider A Cloud Provider B Third-party infrastructure components ISP Router A ISP Router B ISP Router C Power Source

Video App Cloud Provider A Cloud Provider B Third-party infrastructure components ISP Router A ISP Router B ISP Router C Power Source

Video App Cloud Provider A Cloud Provider B Third-party infrastructure components ISP Router A ISP Router B ISP Router C Power Source

Video App Cloud Provider A Cloud Provider B Cloud providers do not usually share Third-party infrastructure components information about their dependencies ISP Router A ISP Router B ISP Router C Power Source

Failures in Distributed Systems Different aspects of system failures: - Reliability: Operation & job failures, data loss & corruption - Performance: Traffic overloaded, connection delay - Availability: Node and cluster downtime - Consistency: Permanent inconsistent replicas - Scalability: Function incorrectly in large-scale clusters -......

Failures in Distributed Systems Different aspects of system failures: - Reliability: Operation & job failures, data loss & corruption - Performance: Traffic overloaded, connection delay - Availability: Node and cluster downtime - Consistency: Permanent inconsistent replicas - Scalability: Function incorrectly in large-scale clusters -......

Consistency Failures ZooKeeper (synchronization service) Issue #335. 1. Nodes A, B, C start (w/ latex txid: 10) 2. B becomes leader 3. B crashes 4. C becomes leader 5. C commits new txid-value pair (11, X) 6. A crashes, before committing the new txid 11 7. C loses quorum and C crashes 8. A and B are back online after C crashes 9. A becomes leader 10. A's commits new txid-value pair (11, Y) 11. C is back online after A's new tx commit 12. C announce to B (11, X) 13. B replies diff starting with tx 12 14. Inconsistency: A has (11, Y), C has (11, X) PERMANENT INCONSISTENT REPLICA

Failures in Distributed Systems Different aspects of system failures: - Reliability: Operation & job failures, data loss & corruption - Performance: Traffic overloaded, connection delay - Availability: Node and cluster downtime - Consistency: Permanent inconsistent replicas - Scalability: Function incorrectly in large-scale clusters -......

Scalability Failures Large data In HBase Tens of minutes Insufficient lookup opera;on R1 R R2 R100K R3

Lecture Roadmap Failures in Distributed Systems Root Causes How to handle failures Case Study: Catastrophic Failures

Root Causes Customer Environment 25% 9% 31% Misconfiguration Hardware Bugs 15% 20% 5 Storage Systems, 2011

Root Causes Customer Environment 25% 9% 31% Misconfiguration Hardware Bugs 15% 20% More Cloud Systems 5 Storage Systems, 2011

Root-cause types: Logic Logic (29%) Many domain-specific issues 48

Root-cause types: Error handling Logic Error handling (18%) 49

Example

Root-cause types: Op/miza/on Logic Error handling Op#miza#on (15%) 55

Root-cause types: Configura3on Logic Error handling Op3miza3on Configura)on (14%) 56

Example Configuration File Specify parameters MySQL

Example

Example

Root-cause types: Race Race (12%) < 50% local concurrency bugs Buggy thread interleaving Tons of work > 50% distributed concurrency bugs Reordering of messages, crashes, Cmeouts More work is needed SAMC [OSDI 14] 69

Root-cause types: Hang Hang (4%) Classical deadlock Un-served jobs, stalled opera8ons, Root causes? How to detect them? 70

Root-cause types: Space Space (4%) Big data + leak = Big leak Clean-up opera:ons must be flawless. 72

Root-case types: Load Load (4%) Happen when systems face high request load Relates to QoS and admission control 73

Lecture Roadmap Failures in Distributed Systems Root Causes How to handle failures Case Study: Catastrophic Failures

Failure Handling in Distributed Systems

Failure Handling in Distributed Systems Service initialization Changing network paths Upgrading software components Outage Service Runtime

Failure Handling in Distributed Systems Post-Failure Troubleshooting Failure Prevention 1. Diagnosis tools 2. Accountability 3. Provenance 4....... Service initialization Changing network paths Upgrading software components Outage Service Runtime

Failure Handling in Distributed Systems Post-Failure Troubleshooting Failure Prevention Failure Tolerance 1. Diagnosis tools 2. Accountability 3. Provenance 4....... Service initialization Changing network paths Upgrading software components Outage Service Runtime

Lecture Roadmap Failures in Distributed Systems Root Causes How to handle failures Case Study: Catastrophic Failures

Next Lecture In the lec-16, I will learn: - Misconfigurations in distributed systems - How important they are - How to handle misconfiguration