Failures in Distributed Systems

Size: px
Start display at page:

Download "Failures in Distributed Systems"

Transcription

1 CPSC 426/526 Failures in Distributed Systems Ennan Zhai Computer Science Department Yale University

2 Recall: Lec-14 In lec-14, we learned: - Difference between privacy and anonymity - Anonymous communications - Case study: Dissent [OSDI 12]

3 Lecture Roadmap Failures in Distributed Systems Root Causes How to handle failures Case Study: Catastrophic Failures

4 Failures in Distributed Systems What is the meaning of failure: - Your system does not function correctly - Your system implementation does not meet your spec - What is your intuition when you say some code has bugs?

5 Why we care about failures? Data Center Outages Generate Big Losses Downtime in a data center can cost an average of $740,357 per incident -- cloud outage cost report in 2016.

6 Failures in Distributed Systems Different aspects of system failures: - Reliability: Operation & job failures, data loss & corruption - Performance: Traffic overloaded, connection delay - Availability: Node and cluster downtime - Consistency: Permanent inconsistent replicas - Scalability: Function incorrectly in large-scale clusters

7 Failures in Distributed Systems Different aspects of system failures: - Reliability: Operation & job failures, data loss & corruption - Performance: Traffic overloaded, connection delay - Availability: Node and cluster downtime - Consistency: Permanent inconsistent replicas - Scalability: Function incorrectly in large-scale clusters

8 Reliability & Availability Correlated failures resulting from EBS due to bugs in one EBS server Summary of the October 22, 2012 AWS Service Event in the US-East Region We d like to share more about the service event that occurred on Monday, October 22nd in the US-East Region. We have now completed the analysis of the events that affected AWS customers, and we want to describe what happened, our understanding of how customers were affected, and what we are doing to prevent a similar issue from occurring in the future. The Primary Event and the Impact to Amazon Elastic Block Store (EBS) and Amazon Elastic Compute Cloud (EC2)

9 Reliability & Availability Elastic Compute Cloud (EC2) Elastic Block Store (EBS)

10 Reliability & Availability Elastic Block Store (EBS)

11 Reliability & Availability VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM Elastic Block Store (EBS)

12 Netflix VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM EBS Cluster1 EBS Cluster2

13 Netflix VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM EBS Cluster1 EBS Cluster2

14 Netflix VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM EBS Cluster1 EBS Cluster2

15 Netflix VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM EBS Cluster1 EBS Cluster2

16 Global Outages

17 A more tricky example

18

19 Video App Cloud Provider A Cloud Provider B

20 Video App Cloud Provider A Cloud Provider B Third-party infrastructure components

21 Video App Cloud Provider A Cloud Provider B Third-party infrastructure components ISP Router A ISP Router B ISP Router C

22 Video App Cloud Provider A Cloud Provider B Third-party infrastructure components ISP Router A ISP Router B ISP Router C Power Source

23 Video App Cloud Provider A Cloud Provider B Third-party infrastructure components ISP Router A ISP Router B ISP Router C Power Source

24 Video App Cloud Provider A Cloud Provider B Third-party infrastructure components ISP Router A ISP Router B ISP Router C Power Source

25 Video App Cloud Provider A Cloud Provider B Cloud providers do not usually share Third-party infrastructure components information about their dependencies ISP Router A ISP Router B ISP Router C Power Source

26 Failures in Distributed Systems Different aspects of system failures: - Reliability: Operation & job failures, data loss & corruption - Performance: Traffic overloaded, connection delay - Availability: Node and cluster downtime - Consistency: Permanent inconsistent replicas - Scalability: Function incorrectly in large-scale clusters

27 Failures in Distributed Systems Different aspects of system failures: - Reliability: Operation & job failures, data loss & corruption - Performance: Traffic overloaded, connection delay - Availability: Node and cluster downtime - Consistency: Permanent inconsistent replicas - Scalability: Function incorrectly in large-scale clusters

28 Consistency Failures ZooKeeper (synchronization service) Issue # Nodes A, B, C start (w/ latex txid: 10) 2. B becomes leader 3. B crashes 4. C becomes leader 5. C commits new txid-value pair (11, X) 6. A crashes, before committing the new txid C loses quorum and C crashes 8. A and B are back online after C crashes 9. A becomes leader 10. A's commits new txid-value pair (11, Y) 11. C is back online after A's new tx commit 12. C announce to B (11, X) 13. B replies diff starting with tx Inconsistency: A has (11, Y), C has (11, X) PERMANENT INCONSISTENT REPLICA

29 Failures in Distributed Systems Different aspects of system failures: - Reliability: Operation & job failures, data loss & corruption - Performance: Traffic overloaded, connection delay - Availability: Node and cluster downtime - Consistency: Permanent inconsistent replicas - Scalability: Function incorrectly in large-scale clusters

30 Scalability Failures Large data In HBase Tens of minutes Insufficient lookup opera;on R1 R R2 R100K R3

31 Lecture Roadmap Failures in Distributed Systems Root Causes How to handle failures Case Study: Catastrophic Failures

32 Root Causes Customer Environment 25% 9% 31% Misconfiguration Hardware Bugs 15% 20% 5 Storage Systems, 2011

33 Root Causes Customer Environment 25% 9% 31% Misconfiguration Hardware Bugs 15% 20% More Cloud Systems 5 Storage Systems, 2011

34 Root-cause types: Logic Logic (29%) Many domain-specific issues 48

35 Root-cause types: Error handling Logic Error handling (18%) 49

36 Example

37 Root-cause types: Op/miza/on Logic Error handling Op#miza#on (15%) 55

38 Root-cause types: Configura3on Logic Error handling Op3miza3on Configura)on (14%) 56

39 Example Configuration File Specify parameters MySQL

40 Example

41 Example

42 Root-cause types: Race Race (12%) < 50% local concurrency bugs Buggy thread interleaving Tons of work > 50% distributed concurrency bugs Reordering of messages, crashes, Cmeouts More work is needed SAMC [OSDI 14] 69

43 Root-cause types: Hang Hang (4%) Classical deadlock Un-served jobs, stalled opera8ons, Root causes? How to detect them? 70

44 Root-cause types: Space Space (4%) Big data + leak = Big leak Clean-up opera:ons must be flawless. 72

45 Root-case types: Load Load (4%) Happen when systems face high request load Relates to QoS and admission control 73

46 Lecture Roadmap Failures in Distributed Systems Root Causes How to handle failures Case Study: Catastrophic Failures

47 Failure Handling in Distributed Systems

48 Failure Handling in Distributed Systems Service initialization Changing network paths Upgrading software components Outage Service Runtime

49 Failure Handling in Distributed Systems Post-Failure Troubleshooting Failure Prevention 1. Diagnosis tools 2. Accountability 3. Provenance Service initialization Changing network paths Upgrading software components Outage Service Runtime

50 Failure Handling in Distributed Systems Post-Failure Troubleshooting Failure Prevention Failure Tolerance 1. Diagnosis tools 2. Accountability 3. Provenance Service initialization Changing network paths Upgrading software components Outage Service Runtime

51 Lecture Roadmap Failures in Distributed Systems Root Causes How to handle failures Case Study: Catastrophic Failures

52 Next Lecture In the lec-16, I will learn: - Misconfigurations in distributed systems - How important they are - How to handle misconfiguration

Heading Off Correlated Failures through Independence-as-a-Service

Heading Off Correlated Failures through Independence-as-a-Service Heading Off Correlated Failures through Independence-as-a-Service Ennan Zhai 1 Ruichuan Chen 2, David Isaac Wolinsky 1, Bryan Ford 1 1 Yale University 2 Bell Labs/Alcatel-Lucent Background Cloud services

More information

Cloud Computing. Ennan Zhai. Computer Science at Yale University

Cloud Computing. Ennan Zhai. Computer Science at Yale University Cloud Computing Ennan Zhai Computer Science at Yale University ennan.zhai@yale.edu About Final Project About Final Project Important dates before demo session: - Oct 31: Proposal v1.0 - Nov 7: Source code

More information

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network

More information

Replications and Consensus

Replications and Consensus CPSC 426/526 Replications and Consensus Ennan Zhai Computer Science Department Yale University Recall: Lec-8 and 9 In the lec-8 and 9, we learned: - Cloud storage and data processing - File system: Google

More information

SAMC: Sema+c- Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems

SAMC: Sema+c- Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems SAMC: Sema+c- Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems Tanakorn Leesatapornwongsa, Mingzhe Hao, Pallavi Joshi *, Jeffrey F. Lukman, and Haryadi S. Gunawi * 1 Internet Services

More information

CPSC 426/526. P2P Lookup Service. Ennan Zhai. Computer Science Department Yale University

CPSC 426/526. P2P Lookup Service. Ennan Zhai. Computer Science Department Yale University CPSC / PP Lookup Service Ennan Zhai Computer Science Department Yale University Recall: Lec- Network basics: - OSI model and how Internet works - Socket APIs red PP network (Gnutella, KaZaA, etc.) UseNet

More information

HOW TO PLAN & EXECUTE A SUCCESSFUL CLOUD MIGRATION

HOW TO PLAN & EXECUTE A SUCCESSFUL CLOUD MIGRATION HOW TO PLAN & EXECUTE A SUCCESSFUL CLOUD MIGRATION Steve Bertoldi, Solutions Director, MarkLogic Agenda Cloud computing and on premise issues Comparison of traditional vs cloud architecture Review of use

More information

TROPIC: Transactional Resource Orchestration Platform In the Cloud

TROPIC: Transactional Resource Orchestration Platform In the Cloud TROPIC: Transactional Resource Orchestration Platform In the Cloud Changbin Liu, Yun Mao*, Xu Chen*, Mary Fernandez*, Boon Thau Loo, Jacobus Van der Merwe* * netdb.cis.upenn.edu/dmf 1 Motivation Infrastructure

More information

Principal Solutions Architect. Architecting in the Cloud

Principal Solutions Architect. Architecting in the Cloud Matt Tavis Principal Solutions Architect Architecting in the Cloud Cloud Best Practices Whitepaper Prescriptive guidance to Cloud Architects Just Search for Cloud Best Practices to find the link ttp://media.amazonwebservices.co

More information

DISTRIBUTED SYSTEMS [COMP9243] Lecture 8a: Cloud Computing WHAT IS CLOUD COMPUTING? 2. Slide 3. Slide 1. Why is it called Cloud?

DISTRIBUTED SYSTEMS [COMP9243] Lecture 8a: Cloud Computing WHAT IS CLOUD COMPUTING? 2. Slide 3. Slide 1. Why is it called Cloud? DISTRIBUTED SYSTEMS [COMP9243] Lecture 8a: Cloud Computing Slide 1 Slide 3 ➀ What is Cloud Computing? ➁ X as a Service ➂ Key Challenges ➃ Developing for the Cloud Why is it called Cloud? services provided

More information

CPSC 426/526. P2P Lookup Service. Ennan Zhai. Computer Science Department Yale University

CPSC 426/526. P2P Lookup Service. Ennan Zhai. Computer Science Department Yale University CPSC 4/5 PP Lookup Service Ennan Zhai Computer Science Department Yale University Recall: Lec- Network basics: - OSI model and how Internet works - Socket APIs red PP network (Gnutella, KaZaA, etc.) UseNet

More information

Welcome to the New Era of Cloud Computing

Welcome to the New Era of Cloud Computing Welcome to the New Era of Cloud Computing Aaron Kimball The web is replacing the desktop 1 SDKs & toolkits are there What about the backend? Image: Wikipedia user Calyponte 2 Two key concepts Processing

More information

Servers fail, who cares? (Answer: I do, sort of) Gregg Ulrich, #netflixcloud #cassandra12

Servers fail, who cares? (Answer: I do, sort of) Gregg Ulrich, #netflixcloud #cassandra12 Servers fail, who cares? (Answer: I do, sort of) Gregg Ulrich, Netflix @eatupmartha #netflixcloud #cassandra12 1 June 29, 2012 2 3 4 [1] 5 From the Netflix tech blog: Cassandra, our distributed cloud persistence

More information

Document Sub Title. Yotpo. Technical Overview 07/18/ Yotpo

Document Sub Title. Yotpo. Technical Overview 07/18/ Yotpo Document Sub Title Yotpo Technical Overview 07/18/2016 2015 Yotpo Contents Introduction... 3 Yotpo Architecture... 4 Yotpo Back Office (or B2B)... 4 Yotpo On-Site Presence... 4 Technologies... 5 Real-Time

More information

AWS_SOA-C00 Exam. Volume: 758 Questions

AWS_SOA-C00 Exam. Volume: 758 Questions Volume: 758 Questions Question: 1 A user has created photo editing software and hosted it on EC2. The software accepts requests from the user about the photo format and resolution and sends a message to

More information

Distributed Systems Homework 1 (6 problems)

Distributed Systems Homework 1 (6 problems) 15-440 Distributed Systems Homework 1 (6 problems) Due: November 30, 11:59 PM via electronic handin Hand in to Autolab in PDF format November 29, 2011 1. You have set up a fault-tolerant banking service.

More information

White Paper Amazon Aurora A Fast, Affordable and Powerful RDBMS

White Paper Amazon Aurora A Fast, Affordable and Powerful RDBMS White Paper Amazon Aurora A Fast, Affordable and Powerful RDBMS TABLE OF CONTENTS Introduction 3 Multi-Tenant Logging and Storage Layer with Service-Oriented Architecture 3 High Availability with Self-Healing

More information

Programming model and implementation for processing and. Programs can be automatically parallelized and executed on a large cluster of machines

Programming model and implementation for processing and. Programs can be automatically parallelized and executed on a large cluster of machines A programming model in Cloud: MapReduce Programming model and implementation for processing and generating large data sets Users specify a map function to generate a set of intermediate key/value pairs

More information

Cryptographic Systems

Cryptographic Systems CPSC 426/526 Cryptographic Systems Ennan Zhai Computer Science Department Yale University Recall: Lec-10 In lec-10, we learned: - Consistency models - Two-phase commit - Consensus - Paxos Lecture Roadmap

More information

PracticeDump. Free Practice Dumps - Unlimited Free Access of practice exam

PracticeDump.   Free Practice Dumps - Unlimited Free Access of practice exam PracticeDump http://www.practicedump.com Free Practice Dumps - Unlimited Free Access of practice exam Exam : AWS-Developer Title : AWS Certified Developer - Associate Vendor : Amazon Version : DEMO Get

More information

BERLIN. 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

BERLIN. 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved BERLIN 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved Amazon Aurora: Amazon s New Relational Database Engine Carlos Conde Technology Evangelist @caarlco 2015, Amazon Web Services,

More information

Introduction to data centers

Introduction to data centers Introduction to data centers Paolo Giaccone Notes for the class on Switching technologies for data centers Politecnico di Torino December 2017 Cloud computing Section 1 Cloud computing Giaccone (Politecnico

More information

There Is More Consensus in Egalitarian Parliaments

There Is More Consensus in Egalitarian Parliaments There Is More Consensus in Egalitarian Parliaments Iulian Moraru, David Andersen, Michael Kaminsky Carnegie Mellon University Intel Labs Fault tolerance Redundancy State Machine Replication 3 State Machine

More information

Moving to the Cloud: Making It Happen With MarkLogic

Moving to the Cloud: Making It Happen With MarkLogic Moving to the Cloud: Making It Happen With MarkLogic Alex Bleasdale, Manager, Support, MarkLogic COPYRIGHT 20 April 2017 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED. Agenda 1. MarkLogic in the cloud 2.

More information

Amazon Aurora Deep Dive

Amazon Aurora Deep Dive Amazon Aurora Deep Dive Anurag Gupta VP, Big Data Amazon Web Services April, 2016 Up Buffer Quorum 100K to Less Proactive 1/10 15 caches Custom, Shared 6-way Peer than read writes/second Automated Pay

More information

AWS: Basic Architecture Session SUNEY SHARMA Solutions Architect: AWS

AWS: Basic Architecture Session SUNEY SHARMA Solutions Architect: AWS AWS: Basic Architecture Session SUNEY SHARMA Solutions Architect: AWS suneys@amazon.com AWS Core Infrastructure and Services Traditional Infrastructure Amazon Web Services Security Security Firewalls ACLs

More information

How we built a highly scalable Machine Learning platform using Apache Mesos

How we built a highly scalable Machine Learning platform using Apache Mesos How we built a highly scalable Machine Learning platform using Apache Mesos Daniel Sârbe Development Manager, BigData and Cloud Machine Translation @ SDL Co-founder of BigData/DataScience Meetup Cluj,

More information

CISC 7610 Lecture 2b The beginnings of NoSQL

CISC 7610 Lecture 2b The beginnings of NoSQL CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone

More information

Lessons learned while automating MySQL in the AWS cloud. Stephane Combaudon DB Engineer - Slice

Lessons learned while automating MySQL in the AWS cloud. Stephane Combaudon DB Engineer - Slice Lessons learned while automating MySQL in the AWS cloud Stephane Combaudon DB Engineer - Slice Our environment 5 DB stacks Data volume ranging from 30GB to 2TB+. Master + N slaves for each stack. Master

More information

Mechanisms and Architectures for Tail-Tolerant System Operations in Cloud

Mechanisms and Architectures for Tail-Tolerant System Operations in Cloud Mechanisms and Architectures for Tail-Tolerant System Operations in Cloud Qinghua Lu 1,2, Liming Zhu 2, Xiwei Xu 2, Len Bass 2, Shanshan Li 1, Weishan Zhang 1, Ning Wang 1 1 College of Computer and Communication

More information

Choosing a MySQL HA Solution Today. Choosing the best solution among a myriad of options

Choosing a MySQL HA Solution Today. Choosing the best solution among a myriad of options Choosing a MySQL HA Solution Today Choosing the best solution among a myriad of options Questions...Questions...Questions??? How to zero in on the right solution You can t hit a target if you don t have

More information

Designing Fault-Tolerant Applications

Designing Fault-Tolerant Applications Designing Fault-Tolerant Applications Miles Ward Enterprise Solutions Architect Building Fault-Tolerant Applications on AWS White paper published last year Sharing best practices We d like to hear your

More information

SpecPaxos. James Connolly && Harrison Davis

SpecPaxos. James Connolly && Harrison Davis SpecPaxos James Connolly && Harrison Davis Overview Background Fast Paxos Traditional Paxos Implementations Data Centers Mostly-Ordered-Multicast Network layer Speculative Paxos Protocol Application layer

More information

CS 425 / ECE 428 Distributed Systems Fall 2017

CS 425 / ECE 428 Distributed Systems Fall 2017 CS 425 / ECE 428 Distributed Systems Fall 2017 Indranil Gupta (Indy) Nov 7, 2017 Lecture 21: Replication Control All slides IG Server-side Focus Concurrency Control = how to coordinate multiple concurrent

More information

Designing Distributed Systems using Approximate Synchrony in Data Center Networks

Designing Distributed Systems using Approximate Synchrony in Data Center Networks Designing Distributed Systems using Approximate Synchrony in Data Center Networks Dan R. K. Ports Jialin Li Naveen Kr. Sharma Vincent Liu Arvind Krishnamurthy University of Washington CSE Today s most

More information

KillTest *KIJGT 3WCNKV[ $GVVGT 5GTXKEG Q&A NZZV ]]] QORRZKYZ IUS =K ULLKX LXKK [VJGZK YKX\OIK LUX UTK _KGX

KillTest *KIJGT 3WCNKV[ $GVVGT 5GTXKEG Q&A NZZV ]]] QORRZKYZ IUS =K ULLKX LXKK [VJGZK YKX\OIK LUX UTK _KGX KillTest Q&A Exam : AWS-SysOps Title : AWS Certified SysOps Administrator Associate Version : Demo 1 / 4 1.A user has created photo editing software and hosted it on EC2. The software accepts requests

More information

About Intellipaat. About the Course. Why Take This Course?

About Intellipaat. About the Course. Why Take This Course? About Intellipaat Intellipaat is a fast growing professional training provider that is offering training in over 150 most sought-after tools and technologies. We have a learner base of 600,000 in over

More information

Intuitive distributed algorithms. with F#

Intuitive distributed algorithms. with F# Intuitive distributed algorithms with F# Natallia Dzenisenka Alena Hall @nata_dzen @lenadroid A tour of a variety of intuitivedistributed algorithms used in practical distributed systems. and how to prototype

More information

ZooKeeper Atomic Broadcast

ZooKeeper Atomic Broadcast ZooKeeper Atomic Broadcast The heart of the ZooKeeper coordination service Benjamin Reed, Flavio Junqueira Yahoo! Research ZooKeeper Service Transforms a request into an idempotent transaction Request

More information

Aurora, RDS, or On-Prem, Which is right for you

Aurora, RDS, or On-Prem, Which is right for you Aurora, RDS, or On-Prem, Which is right for you Kathy Gibbs Database Specialist TAM Katgibbs@amazon.com Santa Clara, California April 23th 25th, 2018 Agenda RDS Aurora EC2 On-Premise Wrap-up/Recommendation

More information

Simple testing can prevent most critical failures -- An analysis of production failures in distributed data-intensive systems

Simple testing can prevent most critical failures -- An analysis of production failures in distributed data-intensive systems Simple testing can prevent most critical failures -- An analysis of production failures in distributed data-intensive systems Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Rodrigues, Xu Zhao, Yongle Zhang,

More information

MIGRATING SAP WORKLOADS TO AWS CLOUD

MIGRATING SAP WORKLOADS TO AWS CLOUD MIGRATING SAP WORKLOADS TO AWS CLOUD Cloud is an increasingly credible and powerful infrastructure alternative for critical business applications. It s a great way to avoid capital expenses and maintenance

More information

Amazon Aurora Deep Dive

Amazon Aurora Deep Dive Amazon Aurora Deep Dive Enterprise-class database for the cloud Damián Arregui, Solutions Architect, AWS October 27 th, 2016 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Enterprise

More information

Consistency in Distributed Systems

Consistency in Distributed Systems Consistency in Distributed Systems Recall the fundamental DS properties DS may be large in scale and widely distributed 1. concurrent execution of components 2. independent failure modes 3. transmission

More information

Highly Available Database Architectures in AWS. Santa Clara, California April 23th 25th, 2018 Mike Benshoof, Technical Account Manager, Percona

Highly Available Database Architectures in AWS. Santa Clara, California April 23th 25th, 2018 Mike Benshoof, Technical Account Manager, Percona Highly Available Database Architectures in AWS Santa Clara, California April 23th 25th, 2018 Mike Benshoof, Technical Account Manager, Percona Hello, Percona Live Attendees! What this talk is meant to

More information

Getting Started with Amazon EC2 and Amazon SQS

Getting Started with Amazon EC2 and Amazon SQS Getting Started with Amazon EC2 and Amazon SQS Building Scalable, Reliable Amazon EC2 Applications with Amazon SQS Overview Amazon Elastic Compute Cloud (EC2) is a web service that provides resizable compute

More information

Distributed Systems COMP 212. Lecture 18 Othon Michail

Distributed Systems COMP 212. Lecture 18 Othon Michail Distributed Systems COMP 212 Lecture 18 Othon Michail Virtualisation & Cloud Computing 2/27 Protection rings It s all about protection rings in modern processors Hardware mechanism to protect data and

More information

Project Presentation

Project Presentation Project Presentation Saad Arif Dept. of Electrical Engineering and Computer Science University of Central Florida - Orlando, FL November 7, 2013 1 Introduction 1 Introduction 2 Gallery 1 Introduction 2

More information

UseNet and Gossip Protocol

UseNet and Gossip Protocol CPSC 426/526 UseNet and Gossip Protocol Ennan Zhai Computer Science Department Yale University Recall: Lec-1 Understanding: - Distributed systems vs. decentralized systems - Why we need both? red P2P network

More information

Troubleshooting VoIP in Converged Networks

Troubleshooting VoIP in Converged Networks Troubleshooting VoIP in Converged Networks Terry Slattery Principal Consultant CCIE #1026 1 Objective Provide examples of common problems Troubleshooting tips What to monitor Remediation Tips you can use

More information

Amazon Web Services. Block 402, 4 th Floor, Saptagiri Towers, Above Pantaloons, Begumpet Main Road, Hyderabad Telangana India

Amazon Web Services. Block 402, 4 th Floor, Saptagiri Towers, Above Pantaloons, Begumpet Main Road, Hyderabad Telangana India (AWS) Overview: AWS is a cloud service from Amazon, which provides services in the form of building blocks, these building blocks can be used to create and deploy various types of application in the cloud.

More information

Distributed Systems CS6421

Distributed Systems CS6421 Distributed Systems CS6421 Intro to Distributed Systems and the Cloud Prof. Tim Wood v I teach: Software Engineering, Operating Systems, Sr. Design I like: distributed systems, networks, building cool

More information

MapReduce & BigTable

MapReduce & BigTable CPSC 426/526 MapReduce & BigTable Ennan Zhai Computer Science Department Yale University Lecture Roadmap Cloud Computing Overview Challenges in the Clouds Distributed File Systems: GFS Data Process & Analysis:

More information

Microservices at Netflix Scale. First Principles, Tradeoffs, Lessons Learned Ruslan

Microservices at Netflix Scale. First Principles, Tradeoffs, Lessons Learned Ruslan Microservices at Netflix Scale First Principles, Tradeoffs, Lessons Learned Ruslan Meshenberg @rusmeshenberg Microservices: all benefits, no costs? Netflix is the world s leading Internet television network

More information

CIT 668: System Architecture. Amazon Web Services

CIT 668: System Architecture. Amazon Web Services CIT 668: System Architecture Amazon Web Services Topics 1. AWS Global Infrastructure 2. Foundation Services 1. Compute 2. Storage 3. Database 4. Network 3. AWS Economics Amazon Services Architecture Regions

More information

Deploy. A step-by-step guide to successfully deploying your new app with the FileMaker Platform

Deploy. A step-by-step guide to successfully deploying your new app with the FileMaker Platform Deploy A step-by-step guide to successfully deploying your new app with the FileMaker Platform Share your custom app with your team! Now that you ve used the Plan Guide to define your custom app requirements,

More information

Exam Distributed Systems

Exam Distributed Systems Exam Distributed Systems 5 February 2010, 9:00am 12:00pm Part 2 Prof. R. Wattenhofer Family Name, First Name:..................................................... ETH Student ID Number:.....................................................

More information

Hosting DesktopNow in Amazon Web Services. Ivanti DesktopNow powered by AppSense

Hosting DesktopNow in Amazon Web Services. Ivanti DesktopNow powered by AppSense Hosting DesktopNow in Amazon Web Services Ivanti DesktopNow powered by AppSense Contents Purpose of this Document... 3 Overview... 3 1 Non load balanced Amazon Web Services Environment... 4 Amazon Web

More information

Applications of Paxos Algorithm

Applications of Paxos Algorithm Applications of Paxos Algorithm Gurkan Solmaz COP 6938 - Cloud Computing - Fall 2012 Department of Electrical Engineering and Computer Science University of Central Florida - Orlando, FL Oct 15, 2012 1

More information

AlwaysOn Availability Groups: Backups, Restores, and CHECKDB

AlwaysOn Availability Groups: Backups, Restores, and CHECKDB AlwaysOn Availability Groups: Backups, Restores, and CHECKDB www.brentozar.com sp_blitz sp_blitzfirst email newsletter videos SQL Critical Care 2016 Brent Ozar Unlimited. All rights reserved. 1 What I

More information

WHITEPAPER AMAZON ELB: Your Master Key to a Secure, Cost-Efficient and Scalable Cloud.

WHITEPAPER AMAZON ELB: Your Master Key to a Secure, Cost-Efficient and Scalable Cloud. WHITEPAPER AMAZON ELB: Your Master Key to a Secure, Cost-Efficient and Scalable Cloud www.cloudcheckr.com TABLE OF CONTENTS Overview 3 What Is ELB? 3 How ELB Works 4 Classic Load Balancer 5 Application

More information

Write On Aws. Aws Tools For Windows Powershell User Guide using the aws tools for windows powershell (p. 19) this section includes information about

Write On Aws. Aws Tools For Windows Powershell User Guide using the aws tools for windows powershell (p. 19) this section includes information about We have made it easy for you to find a PDF Ebooks without any digging. And by having access to our ebooks online or by storing it on your computer, you have convenient answers with write on aws. To get

More information

Security and Compliance at Mavenlink

Security and Compliance at Mavenlink Security and Compliance at Mavenlink Table of Contents Introduction....3 Application Security....4....4....5 Infrastructure Security....8....8....8....9 Data Security.... 10....10....10 Infrastructure

More information

RA-GRS, 130 replication support, ZRS, 130

RA-GRS, 130 replication support, ZRS, 130 Index A, B Agile approach advantages, 168 continuous software delivery, 167 definition, 167 disadvantages, 169 sprints, 167 168 Amazon Web Services (AWS) failure, 88 CloudTrail Service, 21 CloudWatch Service,

More information

CS / Cloud Computing. Recitation 9 October 22 nd and 25 th, 2013

CS / Cloud Computing. Recitation 9 October 22 nd and 25 th, 2013 CS15-319 / 15-619 Cloud Computing Recitation 9 October 22 nd and 25 th, 2013 Announcements Encounter a general bug: Post on Piazza Encounter a grading bug: Post Privately on Piazza Don t ask if my answer

More information

CLUSTERING HIVEMQ. Building highly available, horizontally scalable MQTT Broker Clusters

CLUSTERING HIVEMQ. Building highly available, horizontally scalable MQTT Broker Clusters CLUSTERING HIVEMQ Building highly available, horizontally scalable MQTT Broker Clusters 12/2016 About this document MQTT is based on a publish/subscribe architecture that decouples MQTT clients and uses

More information

Amazon Aurora Deep Dive

Amazon Aurora Deep Dive Amazon Aurora Deep Dive Kevin Jernigan, Sr. Product Manager Amazon Aurora PostgreSQL Amazon RDS for PostgreSQL May 18, 2017 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

Course Information. Course Basics. Course Format. Arvind Krishnamurthy. Instructor: Arvind Krishnamurthy. TA: Naveen Sharma

Course Information. Course Basics. Course Format. Arvind Krishnamurthy. Instructor: Arvind Krishnamurthy. TA: Naveen Sharma Course Information CSE 550: Introduction to Computer Systems Research Arvind Krishnamurthy Instructor: Arvind Krishnamurthy Interests: distributed systems, networks, operating systems, security Email,

More information

PROTECT YOUR DATA FROM MALWARE AND ENSURE BUSINESS CONTINUITY ON THE CLOUD WITH NAVLINK MANAGED AMAZON WEB SERVICES MANAGED AWS

PROTECT YOUR DATA FROM MALWARE AND ENSURE BUSINESS CONTINUITY ON THE CLOUD WITH NAVLINK MANAGED AMAZON WEB SERVICES MANAGED AWS PROTECT YOUR DATA FROM MALWARE AND ENSURE BUSINESS CONTINUITY ON THE CLOUD WITH NAVLINK MANAGED AMAZON WEB SERVICES MANAGED AWS Improved performance Faster go-to-market Better security In today s disruptive

More information

Pega 7 Webinar Series

Pega 7 Webinar Series Pega 7 Webinar Series Achieve the Highest Levels of Availability, Scalability and Performance Mark Replogle Ken Schwarz October 16, 2013 Pega 7 Webinar Series 01 The Power to Simplify Pega 7 Overview Thursday,

More information

Cloud & container monitoring , Lars Michelsen Check_MK Conference #4

Cloud & container monitoring , Lars Michelsen Check_MK Conference #4 Cloud & container monitoring 04.05.2018, Lars Michelsen Some cloud definitions Applications Data Runtime Middleware O/S Virtualization Servers Storage Networking Software-as-a-Service (SaaS) Applications

More information

2013 AWS Worldwide Public Sector Summit Washington, D.C.

2013 AWS Worldwide Public Sector Summit Washington, D.C. 2013 AWS Worldwide Public Sector Summit Washington, D.C. EMR for Fun and for Profit Ben Butler Sr. Manager, Big Data butlerb@amazon.com @bensbutler Overview 1. What is big data? 2. What is AWS Elastic

More information

Introduction to Database Services

Introduction to Database Services Introduction to Database Services Shaun Pearce AWS Solutions Architect 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved Today s agenda Why managed database services? A non-relational

More information

Case Study Virtual Expo. Virtual Expo, a leader in online B2B exhibitions, used Varnish Plus Cloud to build and enhance their own in-house CDN

Case Study Virtual Expo. Virtual Expo, a leader in online B2B exhibitions, used Varnish Plus Cloud to build and enhance their own in-house CDN Case Study Virtual Expo Virtual Expo, a leader in online B2B exhibitions, used Varnish Plus Cloud to build and enhance their own in-house CDN Varnish Plus Cloud on AWS allows flexibility and permits getting

More information

Persistent Storage with Docker in production - Which solution and why?

Persistent Storage with Docker in production - Which solution and why? Persistent Storage with Docker in production - Which solution and why? Cheryl Hung 2013-2017 StorageOS Ltd. All rights reserved. Cheryl 2013-2017 StorageOS Ltd. All rights reserved. 2 Why do I need storage?

More information

Abstractions for Model Checking SDN Controllers. Divjyot Sethi, Srinivas Narayana, Prof. Sharad Malik Princeton University

Abstractions for Model Checking SDN Controllers. Divjyot Sethi, Srinivas Narayana, Prof. Sharad Malik Princeton University Abstractions for Model Checking SDN s Divjyot Sethi, Srinivas Narayana, Prof. Sharad Malik Princeton University Traditional Networking Swt 1 Swt 2 Talk OSPF, RIP, BGP, etc. Swt 3 Challenges: - Difficult

More information

Security of End User based Cloud Services Sang Young

Security of End User based Cloud Services Sang Young Security of End User based Cloud Services Sang Young Chairman, Mobile SIG Professional Information Security Association sang.young@pisa.org.hk Cloud Services you can choose Social Media Business Applications

More information

Introduction to Cloud Computing

Introduction to Cloud Computing You will learn how to: Build and deploy cloud applications and develop an effective implementation strategy Leverage cloud vendors Amazon EC2 and Amazon S3 Exploit Software as a Service (SaaS) to optimize

More information

Data Protection Done Right with Dell EMC and AWS

Data Protection Done Right with Dell EMC and AWS Data Protection Done Right with Dell EMC and AWS There is a growing need for superior data protection Introduction If you re like most businesses, your data is continually growing in speed and scale, and

More information

MapR Enterprise Hadoop

MapR Enterprise Hadoop 2014 MapR Technologies 2014 MapR Technologies 1 MapR Enterprise Hadoop Top Ranked Cloud Leaders 500+ Customers 2014 MapR Technologies 2 Key MapR Advantage Partners Business Services APPLICATIONS & OS ANALYTICS

More information

Hyperconverged Cloud Architecture with OpenNebula and StorPool

Hyperconverged Cloud Architecture with OpenNebula and StorPool Hyperconverged Cloud Architecture with OpenNebula and StorPool Version 1.0, January 2018 Abstract The Hyperconverged Cloud Architecture with OpenNebula and StorPool is a blueprint to aid IT architects,

More information

Training on Amazon AWS Cloud Computing. Course Content

Training on Amazon AWS Cloud Computing. Course Content Training on Amazon AWS Cloud Computing Course Content 15 Amazon Web Services (AWS) Cloud Computing 1) Introduction to cloud computing Introduction to Cloud Computing Why Cloud Computing? Benefits of Cloud

More information

High Availability and Disaster Recovery. Cherry Lin, Jonathan Quinn

High Availability and Disaster Recovery. Cherry Lin, Jonathan Quinn High Availability and Disaster Recovery Cherry Lin, Jonathan Quinn Managing the Twin Risks to your Operations Data Loss Down Time The Three Approaches Backups High Availability Disaster Recovery Geographic

More information

How to backup Ceph at scale. FOSDEM, Brussels,

How to backup Ceph at scale. FOSDEM, Brussels, How to backup Ceph at scale FOSDEM, Brussels, 2018.02.04 About me Bartłomiej Święcki OVH Wrocław, PL Current job: More Ceph awesomeness Speedlight Ceph intro Open-source Network storage Scalable Reliable

More information

vsphere Replication for Disaster Recovery to Cloud

vsphere Replication for Disaster Recovery to Cloud vsphere Replication for Disaster Recovery to Cloud vsphere Replication 5.6 This document supports the version of each product listed and supports all subsequent versions until the document is replaced

More information

Amazon Web Services (AWS) Solutions Architect Intermediate Level Course Content

Amazon Web Services (AWS) Solutions Architect Intermediate Level Course Content Amazon Web Services (AWS) Solutions Architect Intermediate Level Course Content Introduction to Cloud Computing A Short history Client Server Computing Concepts Challenges with Distributed Computing Introduction

More information

CLOUD COMPUTING. A public cloud sells services to anyone on the Internet. The cloud infrastructure is made available to

CLOUD COMPUTING. A public cloud sells services to anyone on the Internet. The cloud infrastructure is made available to CLOUD COMPUTING In the simplest terms, cloud computing means storing and accessing data and programs over the Internet instead of your computer's hard drive. The cloud is just a metaphor for the Internet.

More information

Embracing Failure. Fault Injec,on and Service Resilience at Ne6lix. Josh Evans Director of Opera,ons Engineering, Ne6lix

Embracing Failure. Fault Injec,on and Service Resilience at Ne6lix. Josh Evans Director of Opera,ons Engineering, Ne6lix Embracing Failure Fault Injec,on and Service Resilience at Ne6lix Josh Evans Director of Opera,ons Engineering, Ne6lix Josh Evans 24 years in technology Tech support, Tools, Test Automa,on, IT & QA Management

More information

Amazon Web Services Training. Training Topics:

Amazon Web Services Training. Training Topics: Amazon Web Services Training Training Topics: SECTION1: INTRODUCTION TO CLOUD COMPUTING A Short history Client Server Computing Concepts Challenges with Distributed Computing Introduction to Cloud Computing

More information

Everything You Need to Know About MySQL Group Replication

Everything You Need to Know About MySQL Group Replication Everything You Need to Know About MySQL Group Replication Luís Soares (luis.soares@oracle.com) Principal Software Engineer, MySQL Replication Lead Copyright 2017, Oracle and/or its affiliates. All rights

More information

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf Distributed systems Lecture 6: distributed transactions, elections, consensus and replication Malte Schwarzkopf Last time Saw how we can build ordered multicast Messages between processes in a group Need

More information

Distributed Systems. 10. Consensus: Paxos. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. 10. Consensus: Paxos. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems 10. Consensus: Paxos Paul Krzyzanowski Rutgers University Fall 2017 1 Consensus Goal Allow a group of processes to agree on a result All processes must agree on the same value The value

More information

Unlimited Scalability in the Cloud A Case Study of Migration to Amazon DynamoDB

Unlimited Scalability in the Cloud A Case Study of Migration to Amazon DynamoDB Unlimited Scalability in the Cloud A Case Study of Migration to Amazon DynamoDB Steve Saporta CTO, SpinCar Mar 19, 2016 SpinCar When a web-based business grows... More customers = more transactions More

More information

MySQL Enterprise High Availability

MySQL Enterprise High Availability MySQL Enterprise High Availability A Reference Guide A MySQL White Paper 2018, Oracle Corporation and/or its affiliates Table of Contents MySQL Enterprise High Availability... 1 A Reference Guide... 1

More information

Container Orchestration on Amazon Web Services. Arun

Container Orchestration on Amazon Web Services. Arun Container Orchestration on Amazon Web Services Arun Gupta, @arungupta Docker Workflow Development using Docker Docker Community Edition Docker for Mac/Windows/Linux Monthly edge and quarterly stable

More information

Protecting remote site data SvSAN clustering - failure scenarios

Protecting remote site data SvSAN clustering - failure scenarios White paper Protecting remote site data SvSN clustering - failure scenarios Service availability and data integrity are key metrics for enterprises that run business critical applications at multiple remote

More information

AWS Solutions Architect Associate (SAA-C01) Sample Exam Questions

AWS Solutions Architect Associate (SAA-C01) Sample Exam Questions 1) A company is storing an access key (access key ID and secret access key) in a text file on a custom AMI. The company uses the access key to access DynamoDB tables from instances created from the AMI.

More information

Automated Upgrades of PostgreSQL Clusters in Cloud

Automated Upgrades of PostgreSQL Clusters in Cloud Gülçin Yıldırım Automated Upgrades of PostgreSQL Clusters in Cloud 1 select * from me; Site Reliability Engineer @ 2ndQuadrant Board Member @ PostgreSQL Europe MSc Comp. & Systems Eng. @ Tallinn University

More information

Fault-Tolerant Computer System Design ECE 695/CS 590. Putting it All Together

Fault-Tolerant Computer System Design ECE 695/CS 590. Putting it All Together Fault-Tolerant Computer System Design ECE 695/CS 590 Putting it All Together Saurabh Bagchi ECE/CS Purdue University ECE 695/CS 590 1 Outline Looking at some practical systems that integrate multiple techniques

More information