Designing Services for Resilience Experiments: Lessons from Netflix. Nora Jones, Senior Chaos

Similar documents
Security Precognition: Chaos Engineering in Incident Response

MSB to Support for Carrier Grade ONAP Microservice Architecture. Huabing Zhao, PTL of MSB Project, ZTE

Application Resilience Engineering and Operations at Netflix. Ben Software Engineer on API Platform at Netflix

Four times Microservices: REST, Kubernetes, UI Integration, Async. Eberhard Fellow

Microprofile Fault Tolerance. Emily Jiang 1.0,

Microservices Smaller is Better? Eberhard Wolff Freelance consultant & trainer

Microservices at Netflix Scale. First Principles, Tradeoffs, Lessons Learned Ruslan

Hands-On: Hystrix. Best practices & pitfalls

Resilience Validation

Starting the Avalanche:

The Road to Istio: How IBM, Google and Lyft Joined Forces to Simplify Microservices

Microservices mit Java, Spring Boot & Spring Cloud. Eberhard Wolff

Managing your microservices with Kubernetes and Istio. Craig Box

Microservices. GCPUG Tokyo Kubernetes Engine

Microservices Implementations not only with Java. Eberhard Wolff Fellow

Embracing Failure. Fault Injec,on and Service Resilience at Ne6lix. Josh Evans Director of Opera,ons Engineering, Ne6lix

Service Mesh with Istio on Kubernetes. Dmitry Burlea Software FlixCharter

MICROSERVICES I PRAKTIKEN från tröga monoliter till en arkitektur för kortare ledtider, högre skalbarhet och ökad feltolerans

Technical Brief. A Checklist for Every API Call. Managing the Complete API Lifecycle

CADEC 2016 MICROSERVICES AND DOCKER CONTAINERS MAGNUS LARSSON

Eclipse Incubator. - ASLv2 License.

A Cloud Gateway - A Large Scale Company s First Line of Defense. Mikey Cohen Manager - Edge Gateway Netflix

Eclipse MicroProfile with Thorntail (formerly WildFly Swarm)

Architecting for Scale

Eclipse MicroProfile: Accelerating the adoption of Java Microservices

Chaos Engineering: Why the world needs more resilient

Open Java EE and Eclipse MicroProfile - A New Java Landscape for Cloud Native Apps

How Netflix Leverages Multiple Regions to Increase Availability: Isthmus and Active-Active Case Study

Anti-fragile Cloud Architectures. Agim Emruli - mimacom

BARCELONA. 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

The Day the DNS Died

September 15th, Finagle + Java. A love story (

Microservices with Red Hat. JBoss Fuse

Cloud Native Architecture 300. Copyright 2014 Pivotal. All rights reserved.

Morgan Bruce Paulo A. Pereira

Teach your (micro)services speak Protocol Buffers with grpc.

Microservices stress-free and without increased heart-attack risk

An Update on Security and Emergency Preparedness Standards for Utilities

Use of KNN for the Netflix Prize Ted Hong, Dimitris Tsamis Stanford University

Service Computation 2018 February 18 22, Keynote Speech on Microservices A modern, agile approach to SOA

PREPARING FOR DISASTER

Introduction to reactive programming. Jonas Chapuis, Ph.D.

i2i Systems Self-Service Support Portal

What is Spring Cloud

ONAP Micro-service Design Improvement. Manoj Nair, NetCracker Technologies

Huge Codebases Application Monitoring with Hystrix

Advanced Clojure Microservices. Tobias Bayer Hamburg,

Elastic Load Balancing

Going Reactive. Reactive Microservices based on Vert.x. JavaLand Kristian Kottke

Large-scale Game Messaging in Erlang at IMVU

What Building Multiple Scalable DC/OS Deployments Taught Me about Running Stateful Services on DC/OS

Lesson 2: Using the Number Line to Model the Addition of Integers

Lessons Learned: Deploying Microservices Software Product in Customer Environments Mark Galpin, Solution Architect, JFrog, Inc.

Survivable Remote Site Telephony Overview, page 1 Survivable Remote Site Telephony Configuration Task Flow, page 1 SRST Restrictions, page 6

Service Mesh and Related Microservice Technologies in ONAP

The Evolution of a Data Project

Hystrix Python Documentation

CS 6343: CLOUD COMPUTING Term Project

Creating your Virtual Data Centre

Microservices without the Servers: AWS Lambda in Action

Best Practices for Alert Tuning. This white paper will provide best practices for alert tuning to ensure two related outcomes:

MODERN APPLICATION ARCHITECTURE DEMO. Wanja Pernath EMEA Partner Enablement Manager, Middleware & OpenShift

LoanPro Software Staging Release

Service Mesh and Microservices Networking

Making Non-Distributed Databases, Distributed. Ioannis Papapanagiotou, PhD Shailesh Birari

Error Handling Strategy

A Comparision of Service Mesh Options

Cloud Computing design patterns blueprints

This document (including, without limitation, any product roadmap or statement of direction data) illustrates the planned testing, release and

c360 Multiple Forms User Guide Microsoft Dynamics CRM 4.0 Compatible

Deep Dive into Cloud Native Open Source with NetflixOSS

Everything You Ever Wanted To Know About Resource Scheduling... Almost

A T O O L K I T T O B U I L D D I S T R I B U T E D R E A C T I V E S Y S T E M S

Handling Microservices with Kubernetes - Basic Info

Monitoring Serverless Architectures in AWS

ISTIO 1.0 INTRODUCTION & OVERVIEW OpenShift Commons Briefing Brian redbeard Harrington Product Manager, Istio

0-1 Million in 46 Days Scaling a Facebook Application in Rails

Servers fail, who cares? (Answer: I do, sort of) Gregg Ulrich, #netflixcloud #cassandra12

Engineering Distributed Systems

Engineering Distributed Systems. experiences, lessons, and suggestions

In-cluster Open Source Testing Framework

Monitoring Java in Docker at CDK

SAA-C01. AWS Solutions Architect Associate. Exam Summary Syllabus Questions

How can you implement this through a script that a scheduling daemon runs daily on the application servers?

Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter

Distributed API Management in a Hybrid Cloud Environment

Russ McRee Bryan Casper

LEAN & MEAN - GO MICROSERVICES WITH DOCKER SWARM AND SPRING CLOUD

CISCO EXAM QUESTIONS & ANSWERS

Build a system health check for Db2 using IBM Machine Learning for z/os

The Fallacies of Distributed Computing: What if the Network Fails?

COPYRIGHTED MATERIAL CONTENTS PART I: THE PRINCIPLES AND PRACTICES OF DOMAIN DRIVEN DESIGN

Resilience, Service- Discovery and Z- D Deployment. Bodo Junglas, York Xylander

Table of Contents 1. Software and Hardware Requirements

Implementing BCM Frameworks. Monday 19 November Aidan O Brien Head of Resilience and Security National Australia Group Europe

Yahoo Search ATS Plugins. Daniel Morilha and Scott Beardsley

Why Microsoft s head is in the clouds and what it means to you.

Failure Comes in Flavors 2: Stability Patterns

P17 System Testing Monday, September 24, 2007

High Availability Distributed (Micro-)services. Clemens Vasters Microsoft

Transcription:

Designing Services for Resilience Experiments: Lessons from Netflix Nora Jones, Senior Chaos Engineer @nora_js

Designing Services for Resilience Experiments: Lessons from Netflix Nora Jones, Senior Chaos Engineer @nora_js

So, how can teams design services for resilience testing? Failure Injection Enabled

So, how can teams design services for resilience testing? Failure Injection Enabled RPC enabled

So, how can teams design services for resilience testing? Failure Injection Enabled RPC enabled Fallback Paths And ways to discover them

So, how can teams design services for resilience testing? Failure Injection Enabled RPC enabled Fallback Paths And ways to discover them Proper monitoring Key business metrics to look for

So, how can teams design services for resilience testing? Failure Injection Enabled RPC enabled Fallback Paths And ways to discover them Proper monitoring Key business metrics to look for Proper timeouts And ways to discover them

Known Ways to Increase Confidence in Resilience

Known Ways to Increase Confidence in Resilience Unit Tests

Known Ways to Increase Confidence in Resilience Integration Tests

New Ways to Increase Confidence in Resilience Chaos Experiments

SPS: Key Business Metric

Chaos Engineering: Netflix s ChAP 100% API Personalization

Chaos Engineering: Netflix s ChAP Gateway 98% API 1% API Control Personalization

Chaos Engineering: Netflix s ChAP Gateway 98% API 1% API Control Personalization

Chaos Engineering: Netflix s ChAP Gateway 98% API 1% API Control 1% API Exp Personalization

Chaos Engineering: Netflix s ChAP Gateway 98% API 1% API Control 1% API Exp Personalization

Monitoring

Monitoring SHORTED

1. Have Failure Injection Testing Enabled.

Sample Failure Injection Library https://github.com/norajones/failureinjectionlibrary

Types of Chaos Failures

Types of Chaos Failures

Criteria&API

Automating Creation of Chaos Experiments

2. Have Good Monitoring in Place for Configuration Changes.

Have Good Monitoring in Place RPC Enabled

Have Good Monitoring in Place RPC Enabled Associated Hystrix Commands

Have Good Monitoring in Place RPC Enabled Associated Hystrix Commands Associated Fallbacks

Have Good Monitoring in Place RPC Enabled Associated Hystrix Commands Associated Fallbacks Timeouts

Have Good Monitoring in Place RPC Enabled Associated Hystrix Commands Associated Fallbacks Timeouts Retries

Have Good Monitoring in Place RPC Enabled Associated Hystrix Commands Associated Fallbacks Timeouts Retries All in One Place!

RPC/Ribbon Java library managing REST clients to/from different services Fast failing/fallback capability

RPC/Ribbon Timeouts

RPC Timeouts At what point does the service give up?

Retries Immediately retrying a failure after an operation is not usually a great idea.

Retries Understand the logic between your timeouts and your retries.

Circuit Breakers/Fallback Paths

Hystrix Commands/Fallback Paths If your service is non-critical, ensure that there are fallback paths in place.

Fallback Strategies Static Content Cache Fallback Service

Fallback Strategies Know what your fallback strategy is and how to get that information.

3.Ensure Synergy between Hystrix Timeouts, RPC timeouts, and retry logic.

ChAP s Monocle

ChAP s Monocle

ChAP s Monocle

There isn t always money in microservices

Criticality Score

Criticality Score RPS Stats Range bucket * number of retries * number of Hystrix Commands = Criticality Score

Criticality Score RPS Stats Range bucket * number of retries * number of Hystrix Commands = Criticality Score

Criticality Score RPS Stats Range bucket * number of retries * number of Hystrix Commands = Criticality Score

Criticality Score RPS Stats Range bucket * number of retries * number of Hystrix Commands = Criticality Score

Chaos Success Stories

We ran a chaos experiment which verifies that our fallback path works and it successfully caught a issue in the fallback path and the issue was resolved before it resulted in any availability incident!

While [failing calls] we discovered an increase in license requests for the experiment cluster even though fallbacks were all successful...

While [failing calls] we discovered an increase in license requests for the experiment cluster even though fallbacks were all successful....this likely means that whoever was consuming the fallback was retrying the call, causing an increase in license requests.

Don t lose sight of your company s customers.

@nora_js Takeaways Designing for resiliency testability is a shared responsibility. Configuration changes can cause outages. Have explicit monitoring in place on antipatterns in configuration changes.

Questions? @nora_js