Embracing Failure: Fault Injection and Service Resilience at Netflix. Josh Evans, Director of Operations Engineering, Netflix


Josh Evans. 24 years in technology: tech support, tools, test automation, IT & QA management. Time at Netflix: ~15 years, in e-commerce, streaming, tools, services, operations. Current role: Director of Operations Engineering.

Netflix Ecosystem: 48 million members, 41 countries; > 1 billion hours per month; > 1000 device types; 3 AWS Regions, hundreds of services; Hundreds of thousands of requests/second; Partner-provided services (Xbox Live, PSN); CDN serving petabytes of data at terabits/second. [Diagram labels: Static Content, Akamai, ISP, AWS/Netflix Control Plane, Netflix CDN, Service Partners.]

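Our Focus is on Quality and Velocity. [Figure: Availability vs. Rate of Change. Y-axis: availability in nines, labeled with the corresponding downtime per year: 99.9999% (6 nines), 31.5 seconds; 99.999% (5), 5.26 minutes; 99.99% (4), 52.56 minutes; 99.9% (3), 8.76 hours; 99% (2), 3.65 days; 90% (1), 36.5 days. X-axis: rate of change, with ticks at 0, 1, 10, 100, 1000.]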

We Seek 99.99% Availability for Starts. [Figure: the same Availability vs. Rate of Change chart; 99.99% (4 nines) corresponds to 52.56 minutes of downtime per year.]
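The downtime figures attached to each availability level follow from simple arithmetic. A minimal sketch, assuming a 365-day year (the function name is illustrative, not from the talk):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Seconds per year a service may be down at the given availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for pct in (90, 99, 99.9, 99.99, 99.999, 99.9999):
    seconds = allowed_downtime_seconds(pct)
    print(f"{pct}% -> {seconds:.1f} s = {seconds / 60:.2f} min = {seconds / 86400:.2f} days")
```

At 99.99% this works out to about 52.56 minutes per year, and at 99% to about 3.65 days.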

Our goal is to shift the curve. [Figure: Availability vs. Rate of Change chart, annotated with: Engineering, Operations, Best Practices, Continuous Improvement.]

Availability means that members can: sign up; activate a device; browse; watch

What keeps us up at night

Failures happen all the time: Disks fail; Power goes, and your generator fails; Software bugs; Human error. Failure is unavoidable.

We design for failure: Exception handling; Auto-scaling clusters; Redundancy; Fault tolerance and isolation; Fallbacks and degraded experiences. Protect the customer from failures.

Is that enough?

No. How do we know if we've succeeded? Does the system work as designed? Is it as resilient as we believe? How do we prevent drifting into failure?

We test for failure: Unit testing; Integration testing; Stress/load testing; Simulation matrices

Testing increases confidence, but is that enough?

Testing distributed systems is hard: Massive, changing data sets; Web-scale traffic; Complex interactions and information flows; Asynchronous requests; 3rd-party services. All while innovating and improving our service.

What if we regularly inject failures into our systems under controlled circumstances?

Embracing Failure in Production: Don't wait for random failures; Cause failure to validate resiliency; Test design assumptions by stressing them; Remove uncertainty by forcing failures regularly

Two Key Concepts

Auto-Scaling: Virtual instance clusters that scale and shrink with traffic; Reactive and predictive mechanisms; Auto-replacement of bad instances

Circuit Breakers
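The slide names the concept without detail. As a rough illustration of the general circuit-breaker pattern (not Netflix's implementation; the class, its thresholds, and the injectable clock are all assumptions made for this sketch), a breaker counts consecutive failures, fails fast to a fallback while open, and lets a trial call through after a timeout:

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: fail fast without calling the dependency
            # Timeout elapsed: half-open, so let this one call through as a trial.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = self.clock()  # open (or re-open after a failed trial)
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Returning a fallback rather than raising mirrors the "fallbacks and degraded experiences" idea from earlier in the talk: the caller gets a degraded answer instead of waiting on a dependency that is known to be failing.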

An Instance Fails: Monkey loose in your DC; Run during business hours; Instances fail all the time. What we learned: State is problematic; Auto-replacement works; Surviving a single instance failure is not enough.
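The instance-failure exercise above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the Simian Army's code: the cluster map, the instance IDs, and the Monday-to-Friday 09:00-15:00 window are hypothetical, and nothing is actually terminated.

```python
import random
from datetime import datetime


def in_business_hours(now: datetime) -> bool:
    """Assumed window: Monday to Friday, 09:00-15:00 local time."""
    return now.weekday() < 5 and 9 <= now.hour < 15


def pick_victims(clusters: dict, now: datetime, rng: random.Random) -> dict:
    """Choose at most one instance per cluster to terminate.

    Returns a mapping of cluster name to the chosen instance ID, or an
    empty mapping outside business hours, when engineers are not around
    to observe and respond.
    """
    if not in_business_hours(now):
        return {}
    return {
        name: rng.choice(instances)
        for name, instances in clusters.items()
        if instances  # skip empty clusters
    }


# Hypothetical clusters and instance IDs, for illustration only.
clusters = {"api": ["i-a1", "i-a2", "i-a3"], "playback": ["i-b1", "i-b2"]}
victims = pick_victims(clusters, datetime(2014, 6, 3, 10, 30), random.Random(42))
print(victims)
```

In a real system the chosen IDs would be handed to the cloud provider's terminate call, and auto-replacement would be expected to bring the cluster back to size.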

A Data Center Fails. Chaos Gorilla: Simulate an availability zone outage; 3-zone configuration; Eliminate one zone; Ensure that others can handle the load and nothing breaks
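Whether the remaining zones "can handle the load" is, at its simplest, a capacity check. The sketch below assumes the lost zone's traffic is redistributed evenly across the survivors and that load and capacity are measured in the same arbitrary units; both are simplifications chosen for illustration, and the function name is hypothetical.

```python
def survives_zone_loss(zone_load: dict, zone_capacity: dict, lost_zone: str) -> bool:
    """True if every surviving zone can absorb an even share of the lost zone's load."""
    survivors = [zone for zone in zone_load if zone != lost_zone]
    if not survivors:
        return False
    extra_per_zone = zone_load[lost_zone] / len(survivors)
    return all(
        zone_load[zone] + extra_per_zone <= zone_capacity[zone]
        for zone in survivors
    )


capacity = {"AZ1": 1000, "AZ2": 1000, "AZ3": 1000}
print(survives_zone_loss({"AZ1": 600, "AZ2": 600, "AZ3": 600}, capacity, "AZ1"))
print(survives_zone_loss({"AZ1": 700, "AZ2": 700, "AZ3": 700}, capacity, "AZ1"))
```

With three equally loaded zones, each zone has to run at no more than two thirds of its capacity to absorb the loss of one, which is why the first case passes and the second does not.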

What we encountered

What we learned: Large-scale events are hard to simulate; Hundreds of clusters, thousands of instances; Rapidly shifting traffic is error-prone; LBs must expire connections quickly; Lingering connections to caches must be addressed; Not all clusters pre-scaled for additional load

What we learned: Hidden assumptions & configurations; Some apps not configured for cross-zone calls; Mismatched timeouts, fallbacks prevented fail-over; REST client preservation mode prevented fail-over; Cassandra works as expected

Regrouping: From zone outage to zone evacuation; Carefully deregistered instances; Staged traffic shifts; Resuming true outage simulations soon

Regions Fail. Chaos Kong. [Diagram labels: Customer Device; Geo-located; and, shown twice: Regional Load Balancers, Zuul Traffic Shaping/Routing, AZ1, AZ2, AZ3, Data (x3).]

What we learned: It works! Disable predictive auto-scaling; Use instance counts from previous day

Room for Improvement: Not a true regional outage simulation; Staged migration; No split brain

Not everything fails completely. Latency Monkey: Simulate latent service calls; Inject arbitrary latency and errors at the service level; Observe for effects

Service Architecture. [Diagram labels: Device, Internet, ELB, Zuul, Edge, Service A, Service B, Service C, AWS.]

Latency Monkey. [Diagram labels: Device, Internet, ELB, Zuul, Edge, Service A, Service B, Service C.] Server-side URI filters: All requests; URI pattern match; Percentage of requests. Arbitrary delays or responses.
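The filter described on this slide (match a URI pattern, affect a percentage of requests, then delay or replace the response) can be sketched as a wrapper around a request handler. This is an illustration only, not Latency Monkey's code: `FaultRule`, `apply_fault`, and the `(status, body)` handler convention are assumptions made for the example.

```python
import random
import re
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class FaultRule:
    uri_pattern: str                    # regular expression matched against the URI
    percentage: float                   # share of matching requests to affect, 0-100
    delay_seconds: float = 0.0          # injected latency
    error_status: Optional[int] = None  # if set, replace the response with this status


def apply_fault(
    rule: FaultRule,
    uri: str,
    handler: Callable[[str], Tuple[int, str]],
    rng: Callable[[], float] = random.random,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Run handler(uri), injecting a delay and/or an error if the rule fires."""
    if re.search(rule.uri_pattern, uri) and rng() * 100 < rule.percentage:
        if rule.delay_seconds > 0:
            sleep(rule.delay_seconds)
        if rule.error_status is not None:
            return rule.error_status, "injected failure"
    return handler(uri)
```

A rule such as `FaultRule(uri_pattern=r"^/api/ratings", percentage=5, delay_seconds=2.0)` would, under these assumptions, slow down roughly one in twenty matching requests and leave the rest untouched; the path is a made-up example.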

What we learned: Startup resiliency is an issue; Service owners don't know all dependencies; Fallbacks can fail too; Second-order effects not easily tested; Dependencies change over time; Holistic view is necessary; Some teams opt out

Fault Injection Testing (FIT): Request-level simulations. [Diagram labels: Device, Internet, ELB, Zuul, Edge, Service A, Service B, Service C; Device or Account Override?]
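One way to read "request-level simulations" with a "device or account override" is that a request is tagged at the edge when it comes from a targeted device or account, and downstream call sites consult that tag before calling a dependency. The sketch below illustrates that reading only; it is not the FIT implementation, and `FailureScenario`, `tag_request`, `call_service`, and the request dictionary layout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureScenario:
    name: str
    failing_services: frozenset          # services that should appear to fail
    account_ids: frozenset = frozenset() # accounts the scenario applies to
    device_ids: frozenset = frozenset()  # devices the scenario applies to


def tag_request(request: dict, scenarios: list) -> dict:
    """At the edge: attach the set of services to fail for this request."""
    failing = set()
    for scenario in scenarios:
        if (request.get("account_id") in scenario.account_ids
                or request.get("device_id") in scenario.device_ids):
            failing |= scenario.failing_services
    return {**request, "fit_failures": frozenset(failing)}


def call_service(request: dict, service: str, real_call, fallback):
    """At an injection point: simulate failure only for tagged requests."""
    if service in request.get("fit_failures", frozenset()):
        return fallback()
    return real_call()
```

Because the failure travels with the request, only the chosen test accounts or devices see the degraded experience, which is what makes it plausible to run such tests continuously in production.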

Benefits: Confidence building for Latency Monkey testing; Continuous resilience testing in test and production; Testing of minimum viable service, fallbacks; Device resilience evaluation

Device Resiliency Matrix

Is that it? Fault injection isn't enough: Bad code/deployments; Configuration mishaps; Byzantine failures; Memory leaks; Performance degradation

A Multi-Faceted Approach: Continuous Build, Delivery, Deployment; Test Environments, Infrastructure, Coverage; Automated Canary Analysis; Staged Configuration Changes; Crisis Response Tooling & Operations; Real-time Analytics, Detection, Alerting; Operational Insight (Dashboards, Reports); Performance and Efficiency Engagements

It's also about people and culture

Technical Culture: You build it, you run it; Each failure is an opportunity to learn; Blameless incident reviews; Commitment to continuous improvement

Context and Collaboration. Context engages partners: Data and root causes; Global vs. local; Urgent vs. important; Long-term vision. Collaboration yields better solutions and buy-in.

The Simian Army is part of the Netflix open source cloud platform: http://netflix.github.com

Josh Evans Director of Operations Engineering, Netflix jevans@netflix.com, @josh_evans_nflx