Servers fail, who cares? (Answer: I do, sort of) Gregg Ulrich, #netflixcloud #cassandra12


Servers fail, who cares? (Answer: I do, sort of)
Gregg Ulrich, Netflix
@eatupmartha #netflixcloud #cassandra12

June 29, 2012

[1]

From the Netflix tech blog: "Cassandra, our distributed cloud persistence store which is distributed across all zones and regions, dealt with the loss of one third of its regional nodes without any loss of data or availability." [2]

Topics
* Cassandra at Netflix
* Constructing clusters in AWS with Priam
* Resiliency
* Observations on AWS, Cassandra and AWS/Cassandra
* Monitoring and maintenances
* References

Cassandra by the numbers
41          Number of production clusters
13          Number of multi-region clusters
4           Max regions, one cluster
90          Total TB of data across all clusters
621         Number of Cassandra nodes
72/34       Largest Cassandra cluster (nodes/data in TB)
80k/250k    Max read/writes per second on a single cluster
3 *         Size of Operations team
* We are hiring DevOps and Developers. Stop by our booth!

Netflix Deployed on AWS
Content: Content Management, EC2 Encoding, S3 Petabytes, CDNs, ISPs, Terabits, Customers
Logs: S3 Terabytes, EMR, Hive & Pig, Business Intelligence
Play: DRM, CDN routing, Bookmarks, Logging
WWW: Sign-Up, Search, Movie Choosing, Ratings
API: Metadata, Device Configuration, TV Movie Choosing, Social Facebook
CS: International CS lookup, Diagnostics & Actions, Customer Call Log, CS Analytics

Constructing clusters in AWS with Priam
* Tomcat webapp for Cassandra administration
* Token management
* Full and incremental backups
* JMX metrics collection
* cassandra.yaml configuration
* REST API for most nodetool commands
* AWS Security Groups for multi-region clusters
* Open sourced, available on github [3]

Constructing a cluster in AWS
A. AWS Terminology

Address          DC       Rack  Status  State   Load       Owns    Token
###.##.##.###    eu-west  1a    Up      Normal  108.97 GB  16.67%
###.##.#.##      us-east  1e    Up      Normal  103.72 GB  0.00%
##.###.###.###   eu-west  1b    Up      Normal  104.82 GB  16.67%
##.##.##.###     us-east  1c    Up      Normal  111.87 GB  0.00%
##.###.##.###    eu-west  1c    Up      Normal  95.51 GB   16.67%
##.##.##.##      us-east  1d    Up      Normal  105.85 GB  0.00%
##.###.##.###    eu-west  1a    Up      Normal  91.25 GB   16.67%
###.##.##.###    us-east  1e    Up      Normal  102.71 GB  0.00%
##.###.###.###   eu-west  1b    Up      Normal  101.87 GB  16.67%
##.##.###.##     us-east  1c    Up      Normal  102.83 GB  0.00%
###.##.###.##    eu-west  1c    Up      Normal  96.66 GB   16.67%
##.##.##.###     us-east  1d    Up      Normal  99.68 GB   0.00%

Callouts on the nodetool ring output: Region, Instance, Availability Zone (AZ)
* Autoscaling Groups: ASGs do not map directly to nodetool ring output, but are used to define the cluster (# of instances, AZs, etc.).
* Amazon machine image: Image loaded on to an AWS instance; all packages needed to run an application.
* Security Group: Defines access control between ASGs.

Constructing a cluster in AWS
B. Cassandra Configuration

App = cass_cluster

                    ASG #1        ASG #2        ASG #3
Availability Zone   A             B             C
Region              us-east       us-east       us-east
Instance count      6             6             6
Instance type       m2.4xlarge    m2.4xlarge    m2.4xlarge

* Multi-region clusters have the same configuration in each region. Just repeat what you see here!
* APP is not an AWS entity, but one that we use internally to denote a service. This is part of asgard [4], our open-sourced cloud application web interface.
* Full and incremental backups to local-region S3 via Priam.
* External full backups to an alternate region, saved for 30 days.

Constructing a cluster in AWS
C. Putting it all together

The AMI contains the OS, base Netflix packages, Cassandra and Priam.

Priam runs on each node and will:
* Assign tokens to each node, alternating (1) the AZs around the ring (2).
* Perform nightly snapshot backup to S3
* Perform incremental SSTable backups to S3
* Bootstrap replacement nodes to use vacated tokens
* Collect JMX metrics for our monitoring systems
* REST API calls to most nodetool functions

(1) Alternate availability zones (a, b, c) around the ring to ensure data is written to multiple data centers.
(2) Survive the loss of a data center by ensuring that we only lose one node from each replication set.

[Diagram: ring of nodes labelled by availability zone (A, B, C); component labels Cassandra, Priam, Tomcat, S3]
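The token assignment described above can be sketched roughly as follows. This is not Priam's actual algorithm: it assumes the RandomPartitioner token space (0 to 2^127), evenly spaced tokens, and replicas placed on consecutive nodes clockwise around the ring. The names `build_ring` and `replica_zones` are hypothetical.

```python
from itertools import cycle

RING_SIZE = 2 ** 127  # assumed: RandomPartitioner token space


def build_ring(zones, nodes_per_zone):
    """Return (token, zone) pairs: evenly spaced tokens, zones alternating."""
    total = len(zones) * nodes_per_zone
    zone_cycle = cycle(zones)
    return [(i * RING_SIZE // total, next(zone_cycle)) for i in range(total)]


def replica_zones(ring, start, rf):
    """Zones of the rf consecutive nodes (clockwise) holding one token range."""
    return [ring[(start + k) % len(ring)][1] for k in range(rf)]


# Three zones with six instances each, as in the configuration slide.
ring = build_ring(["a", "b", "c"], nodes_per_zone=6)

# Every replication set of three consecutive nodes spans all three zones,
# so losing one zone removes exactly one replica from each set.
for start in range(len(ring)):
    assert sorted(replica_zones(ring, start, rf=3)) == ["a", "b", "c"]
```

Because the zones alternate, the check holds for every starting position, including the sets that wrap around the end of the ring.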

Resiliency - Instance
* RF=AZ=3
* Cassandra bootstrapping works really well
* Replace nodes immediately
* Repair often

Resiliency - One availability zone
* RF=AZ=3
* Alternating AZs ensures that each AZ has a full replica of data
* Provision cluster to run at 2/3 capacity
* Ride out a zone outage; do not move to another zone
* Bootstrap one node at a time
* Repair after recovery
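One way to read the "2/3 capacity" guideline is as simple arithmetic, sketched here. It assumes load is spread evenly across zones and shifts evenly to the surviving zones after an outage; both function names are hypothetical.

```python
def load_factor_after_zone_loss(zones, zones_lost=1):
    """Multiplier on per-node load once traffic shifts to surviving zones."""
    surviving = zones - zones_lost
    if surviving <= 0:
        raise ValueError("no zones left to serve traffic")
    return zones / surviving


def max_safe_utilization(zones, zones_lost=1):
    """Highest steady-state utilization that still fits after the loss."""
    return 1 / load_factor_after_zone_loss(zones, zones_lost)


print(load_factor_after_zone_loss(3))       # 1.5
print(round(max_safe_utilization(3), 3))    # 0.667
```

With three zones, each surviving node takes 1.5 times its normal load, so a cluster running at or below two thirds of its capacity can ride out the loss of one zone.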

What happened on June 29th?
During outage
* All Cassandra instances in us-east-1a were inaccessible
* nodetool ring showed all nodes as DOWN!
* Monitoring other AZs to ensure availability
Recovery (power restored to us-east-1a)
* Majority of instances rejoined the cluster without issue
* Majority of remainder required a reboot to fix
* Remainder of nodes needed to be replaced, one at a time

Resiliency - Multiple availability zones
* Outage; can no longer satisfy quorum
* Restore from backup and repair
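The quorum arithmetic behind this slide can be sketched as below, assuming one replica per availability zone (RF=AZ=3) and the usual quorum size of floor(RF/2) + 1. The function names are hypothetical.

```python
def quorum(rf):
    """Replicas that must respond for a QUORUM read or write."""
    return rf // 2 + 1


def can_satisfy_quorum(rf, zones_down):
    """Assumes one replica per zone, so each lost zone costs one replica."""
    replicas_up = rf - zones_down
    return replicas_up >= quorum(rf)


print(quorum(3))                   # 2
print(can_satisfy_quorum(3, 1))    # True: one zone down, two replicas left
print(can_satisfy_quorum(3, 2))    # False: two zones down, one replica left
```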

Resiliency - Region
* Connectivity loss between regions: operate as island clusters until service restored
* Repair data between regions
* If an entire region disappears, watch DVDs instead

Observations: AWS
* Ephemeral drive performance is better than EBS
* S3-backed AMIs help us weather EBS outages
* Instances seldom die on their own
* Use as many availability zones as you can afford
* Understand how AWS launches instances
* I/O is constrained in most AWS instance types
    * Repairs are very I/O intensive
    * Large size-tiered compactions can impact latency
* SSDs [5] are game changers [6]

Observations: Cassandra
* A slow node is worse than a down node
* Cold cache increases load and kills latency
* Use whatever dials you can find in an emergency
    * Remove node from coordinator list
    * Compaction throttling
    * Min/max compaction thresholds
    * Enable/disable gossip
* Leveled compaction performance is very promising
* 1.1.x and 1.2.x should address some big issues
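Several of the emergency dials listed above map onto nodetool subcommands. The sketch below only builds the command lines and does not run them. The keyspace and table names and the numeric values are placeholders, `emergency_dials` is a hypothetical helper, and the exact subcommand arguments should be checked against `nodetool help` for the Cassandra version in use. Removing a node from the coordinator list is a client-side change and is not covered here.

```python
def emergency_dials(keyspace, table, throughput_mb=8,
                    min_threshold=4, max_threshold=32, gossip=True):
    """Build (but do not run) nodetool command lines for the listed dials."""
    return [
        # Compaction throttling, in MB/s
        ["nodetool", "setcompactionthroughput", str(throughput_mb)],
        # Min/max compaction thresholds for one table
        ["nodetool", "setcompactionthreshold", keyspace, table,
         str(min_threshold), str(max_threshold)],
        # Enable/disable gossip on the local node
        ["nodetool", "enablegossip" if gossip else "disablegossip"],
    ]


# Placeholder names; print the commands for review before running any of them.
for argv in emergency_dials("my_keyspace", "my_table", gossip=False):
    print(" ".join(argv))
```

Returning argument lists rather than executing them keeps the sketch safe to run anywhere and leaves the decision to apply each dial with the operator.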

Monitoring
Actionable
* Hardware and network issues
* Cluster consistency
Cumulative trends
Informational
* Schema changes
* Log file errors/exceptions
* Recent restarts

Dashboards - identify anomalies

Maintenances
* Repair clusters regularly
* Run off-line major compactions to avoid latency
    * SSDs will make this unnecessary
* Always replace nodes when they fail
* Periodically replace all nodes in the cluster
* Upgrade to new versions
    * Binary (rpm) for major upgrades or emergencies
    * Rolling AMI push over time

References
1. A bad night: Netflix and Instagram go down amid Amazon Web Services outage (theverge.com)
2. Lessons Netflix learned from AWS Storm (techblog.netflix.com)
3. github / Netflix / priam (github.com)
4. github / Netflix / asgard (github.com)
5. Announcing High I/O Instances for Amazon (aws.amazon.com)
6. Benchmarking High Performance I/O with SSD for Cassandra on AWS (techblog.netflix.com)