Servers fail, who cares? (Answer: I do, sort of)
Gregg Ulrich, Netflix
@eatupmartha #netflixcloud #cassandra12
June 29, 2012
[Screenshot of outage press coverage [1]]
From the Netflix tech blog: "Cassandra, our distributed cloud persistence store which is distributed across all zones and regions, dealt with the loss of one third of its regional nodes without any loss of data or availability." [2]
Topics
- Cassandra at Netflix
- Constructing clusters in AWS with Priam
- Resiliency
- Observations on AWS, Cassandra and AWS/Cassandra
- Monitoring and maintenances
- References
Cassandra by the numbers

41          Number of production clusters
13          Number of multi-region clusters
4           Max regions, one cluster
90          Total TB of data across all clusters
621         Number of Cassandra nodes
72 / 34     Largest Cassandra cluster (nodes / data in TB)
80k / 250k  Max reads / writes per second on a single cluster
3*          Size of operations team

* We are hiring DevOps and developers. Stop by our booth!
Netflix deployed on AWS
[Architecture diagram: Content (content management, DRM, encoding on EC2/EMR; terabytes in S3; CDN routing), Logs (petabytes in S3; Hive & Pig; business intelligence), Play (sign-up, metadata, device configuration, diagnostics & actions), WWW (search, movie choosing), API (bookmarks, ratings, logging, social/Facebook), CS (international CS lookup, customer call log, CS analytics), fronted by CDNs and ISPs delivering terabits to customers]
Constructing clusters in AWS with Priam

Priam is a Tomcat webapp for Cassandra administration:
- Token management
- Full and incremental backups
- JMX metrics collection
- cassandra.yaml configuration
- REST API for most nodetool commands
- AWS Security Groups for multi-region clusters

Open sourced, available on github [3]. (A sketch of driving its REST API follows below.)
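As a taste of what "REST API for most nodetool commands" means in practice, here is a minimal Python sketch of querying the local Priam instance. The endpoint paths and port below are assumptions for illustration only; the actual routes live in the Priam source on github [3].

```python
import requests

# Hypothetical Priam base URL -- the real port and route names are
# defined by the Priam webapp [3]; these are illustrative assumptions.
PRIAM = "http://localhost:8080/Priam/REST/v1"

def get_token():
    """Ask the local Priam which ring token it assigned to this node."""
    return requests.get(f"{PRIAM}/cassconfig/get_token", timeout=10).text

def trigger_snapshot():
    """Kick off a full snapshot backup to S3 via Priam."""
    resp = requests.get(f"{PRIAM}/backup/do_snapshot", timeout=10)
    resp.raise_for_status()
    return resp.text
```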
Constructing a cluster in AWS (A): AWS terminology

- Instance: a single virtual server in EC2.
- Availability Zone (AZ): an isolated data center within an AWS region.
- Amazon Machine Image (AMI): the image loaded onto an AWS instance; contains all packages needed to run an application.
- Autoscaling Group (ASG): defines the cluster (number of instances, AZs, etc.). ASGs do not map directly to nodetool ring output.
- Security Group: defines access control between ASGs.

Example nodetool ring output for a two-region cluster (token column omitted; the lopsided Owns column is an artifact of how nodetool attributes ownership to interleaved, offset multi-region tokens):

Address          DC       Rack  Status  State   Load       Owns
###.##.##.###    eu-west  1a    Up      Normal  108.97 GB  16.67%
###.##.#.##      us-east  1e    Up      Normal  103.72 GB   0.00%
##.###.###.###   eu-west  1b    Up      Normal  104.82 GB  16.67%
##.##.##.###     us-east  1c    Up      Normal  111.87 GB   0.00%
##.###.##.###    eu-west  1c    Up      Normal   95.51 GB  16.67%
##.##.##.##      us-east  1d    Up      Normal  105.85 GB   0.00%
##.###.##.###    eu-west  1a    Up      Normal   91.25 GB  16.67%
###.##.##.###    us-east  1e    Up      Normal  102.71 GB   0.00%
##.###.###.###   eu-west  1b    Up      Normal  101.87 GB  16.67%
##.##.###.##     us-east  1c    Up      Normal  102.83 GB   0.00%
###.##.###.##    eu-west  1c    Up      Normal   96.66 GB  16.67%
##.##.##.###     us-east  1d    Up      Normal   99.68 GB   0.00%
Constructing a cluster in AWS (B): Cassandra configuration

Multi-region clusters have the same configuration in each region; just repeat what you see here.

App = cass_cluster. (APP is not an AWS entity, but one that we use internally to denote a service. It is part of asgard [4], our open-sourced cloud application web interface.)

                    ASG #1       ASG #2       ASG #3
Availability zone   A            B            C
Region              us-east      us-east      us-east
Instance count      6            6            6
Instance type       m2.4xlarge   m2.4xlarge   m2.4xlarge

Full and incremental backups go to local-region S3 via Priam; external full backups go to an alternate region and are saved for 30 days. (A scripted sketch of this ASG layout follows below.)
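The one-ASG-per-AZ layout above is easy to script. A minimal sketch using boto3 (which postdates this deck; the launch configuration name and group names are assumptions for illustration):

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# One ASG per availability zone, six instances each, mirroring the
# cass_cluster layout above. "cass_cluster-lc" is an assumed,
# pre-existing launch configuration (AMI + m2.4xlarge + security group).
for zone in ("us-east-1a", "us-east-1b", "us-east-1c"):
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName=f"cass_cluster-{zone}",
        LaunchConfigurationName="cass_cluster-lc",
        MinSize=6,
        MaxSize=6,
        DesiredCapacity=6,
        AvailabilityZones=[zone],
    )
```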
Constructing a cluster in AWS (C): Putting it all together

The AMI contains the OS, base Netflix packages, Cassandra, and Priam (running under Tomcat).

[Ring diagram: nodes alternate availability zones A, B, C around the ring]

Priam runs on each node and will:
- Assign tokens to each node, alternating availability zones (a, b, c) around the ring, so that data is written to multiple data centers and the cluster survives the loss of a data center by losing only one node from each replication set (see the sketch below)
- Perform nightly snapshot backups to S3
- Perform incremental SSTable backups to S3
- Bootstrap replacement nodes to use vacated tokens
- Collect JMX metrics for our monitoring systems
- Serve REST API calls to most nodetool functions
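The alternating-AZ token layout falls out of a little arithmetic: split the ring evenly among nodes, hand out positions to AZs round-robin, and offset each region so that regions interleave. The sketch below is a simplified stand-in for Priam's actual token assignment, with an illustrative region offset:

```python
# Evenly spaced tokens on the RandomPartitioner ring (0 .. 2**127),
# assigned round-robin across availability zones so that each
# replication set (RF=3) spans three different AZs.
RING = 2 ** 127

def tokens(num_nodes, zones=("a", "b", "c"), region_offset=0):
    """Simplified stand-in for Priam's token assignment."""
    step = RING // num_nodes
    return [(zones[i % len(zones)], (i * step + region_offset) % RING)
            for i in range(num_nodes)]

# Each region gets a small offset so that multi-region clusters
# interleave rather than collide -- the offset value is illustrative.
for az, token in tokens(6, region_offset=0):
    print(az, token)
```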
Resiliency - Instance

- RF = AZ = 3
- Cassandra bootstrapping works really well
- Replace nodes immediately
- Repair often
Resiliency - One availability zone

- RF = AZ = 3: alternating AZs ensures that each AZ has a full replica of the data
- Provision the cluster to run at 2/3 capacity
- Ride out a zone outage; do not move to another zone
- Bootstrap one node at a time
- Repair after recovery

(A back-of-the-envelope quorum check follows below.)
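Why a single-AZ outage is survivable falls out of the numbers: with RF=3 and replicas alternating across three AZs, losing one AZ removes exactly one replica of every row, so quorum (2 of 3) is still satisfiable; losing two AZs is not survivable, which is the multi-AZ case on the next slide. A quick sketch:

```python
def quorum(rf):
    """Replicas required for a quorum read/write."""
    return rf // 2 + 1

def survives_zone_loss(rf, zones, zones_down):
    """With replicas spread evenly across zones, losing `zones_down`
    zones costs rf * zones_down / zones replicas of each row."""
    replicas_left = rf - (rf * zones_down) // zones
    return replicas_left >= quorum(rf)

assert survives_zone_loss(rf=3, zones=3, zones_down=1)      # ride it out
assert not survives_zone_loss(rf=3, zones=3, zones_down=2)  # quorum lost
```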
What happened on June 29th?

During the outage:
- All Cassandra instances in us-east-1a were inaccessible
- nodetool ring showed all of those nodes as DOWN (see the sketch below)
- Monitored the other AZs to ensure availability

Recovery (power restored to us-east-1a):
- The majority of instances rejoined the cluster without issue
- Most of the remainder required a reboot to rejoin
- The rest of the nodes had to be replaced, one at a time
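During an event like this, the first question is which nodes are actually down. One quick way to answer it is to parse `nodetool ring`; output format varies by Cassandra version, so the column positions assumed below match the 1.x-era output shown earlier:

```python
import subprocess

def down_nodes():
    """List addresses that `nodetool ring` reports as Down.
    Assumes the 1.x-era column layout (address first, then DC, rack,
    status, ...); adjust for your Cassandra version."""
    out = subprocess.run(["nodetool", "ring"],
                         capture_output=True, text=True, check=True).stdout
    return [line.split()[0]                 # first column is the address
            for line in out.splitlines()
            if "Down" in line.split()]

print("\n".join(down_nodes()))
```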
Resiliency - Multiple availability zones

- Outage: the cluster can no longer satisfy quorum
- Restore from backup and repair
Resiliency - Region

- On connectivity loss between regions, operate as island clusters until service is restored
- Repair data between regions afterward
- If an entire region disappears, watch DVDs instead
Observations: AWS

- Ephemeral drive performance is better than EBS
- S3-backed AMIs help us weather EBS outages
- Instances seldom die on their own
- Use as many availability zones as you can afford
- Understand how AWS launches instances
- I/O is constrained in most AWS instance types
  - Repairs are very I/O intensive
  - Large size-tiered compactions can impact latency
- SSDs [5] are game changers [6]
Observations: Cassandra

- A slow node is worse than a down node
- A cold cache increases load and kills latency
- Use whatever dials you can find in an emergency (see the sketch below):
  - Remove the node from the coordinator list
  - Compaction throttling
  - Min/max compaction thresholds
  - Enable/disable gossip
- Leveled compaction performance is very promising
- 1.1.x and 1.2.x should address some big issues
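A minimal sketch of pulling those dials from a script. These nodetool subcommands (setcompactionthroughput, setcompactionthreshold, disablegossip/enablegossip) exist in 1.x-era Cassandra, but treat the values, keyspace, and column family names as illustrative placeholders, not recommendations:

```python
import subprocess

def nt(*args):
    """Run a nodetool subcommand against the local node."""
    subprocess.run(["nodetool", *args], check=True)

# Emergency dials -- all values below are illustrative.
nt("setcompactionthroughput", "8")                  # throttle compaction I/O (MB/s)
nt("setcompactionthreshold", "ks", "cf", "4", "8")  # min/max SSTables per compaction
nt("disablegossip")                                 # take the node out of the ring
# ... let the node cool down, then:
nt("enablegossip")
```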
Monitoring

Actionable:
- Hardware and network issues
- Cluster consistency
- Cumulative trends

Informational:
- Schema changes
- Log file errors/exceptions
- Recent restarts
Dashboards - identify anomalies
[Dashboard screenshots]
Maintenances

- Repair clusters regularly
- Run offline major compactions to avoid latency impact (SSDs will make this unnecessary)
- Always replace nodes when they fail
- Periodically replace all nodes in the cluster (see the sketch below)
- Upgrade to new versions:
  - Binary (rpm) install for major upgrades or emergencies
  - Rolling AMI push over time
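A sketch of the "periodically replace all nodes" loop, leaning on Priam to bootstrap each replacement onto the vacated token. The `cluster.nodes()`, `terminate()`, and `is_healthy()` helpers are hypothetical; a real procedure needs per-step health checks and operator sign-off:

```python
import time

def rolling_replace(cluster, pause_minutes=60):
    """Replace every node in a cluster, one at a time.
    `cluster` is a hypothetical wrapper; the ASG launches each
    replacement instance and Priam assigns it the vacated token."""
    for node in cluster.nodes():
        node.terminate()                 # ASG launches a replacement
        while not cluster.is_healthy():  # wait until all nodes are Up/Normal
            time.sleep(60)
        time.sleep(pause_minutes * 60)   # let caches warm before the next one
```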
References

1. "A bad night: Netflix and Instagram go down amid Amazon Web Services outage" (theverge.com)
2. "Lessons Netflix learned from the AWS Storm" (techblog.netflix.com)
3. github / Netflix / priam (github.com)
4. github / Netflix / asgard (github.com)
5. "Announcing High I/O Instances for Amazon" (aws.amazon.com)
6. "Benchmarking High Performance I/O with SSD for Cassandra on AWS" (techblog.netflix.com)