Take Risks But Don't Be Stupid! Patrick Eaton, PhD

Transcription:

Take Risks But Don't Be Stupid! Patrick Eaton, PhD preaton@google.com

Take Risks But Don't Be Stupid! Patrick R. Eaton, PhD patrick@stackdriver.com

Stackdriver
A hosted service providing intelligent monitoring to help SaaS companies innovate more by reducing the burden of day-to-day operations.
- Cloud-native and cloud-aware
- Designed for complex distributed applications
- Founded August 2012 by Izzy Azeri and Dan Belcher
- Team of ~25, based in Boston
- Acquired by Google in May 2014

Some Software Cultures Avoid Risks
- Long release cycles
- Long QA cycles
- Lots of process
- High cost for mistakes
(Diagram label: Release Process)

DevOps Movement Embraces Risk
Risk-taking is a foundational principle. Kim, Behr, and Spafford call it the Third Way.
- Experiment; take risks and learn from failure.
- Use practice and repetition to achieve mastery.
(source: itrevolution.com)

Risk-Taking Requires Judgement
- Balance risk and reward
- Take risks to push boundaries
- Retreat when you cross into the danger zone
(Credit: Adam Von Gerichten)

Goals
- A healthy view of risk-taking
- How to design systems so that the impact of failures can be managed
- Examples from Stackdriver of cost-conscious experimentation
(source: kabuki00.pinger.pl)

Are You Ready for Some Football?
- Super Bowl XLVII, February 3, 2013: the "Blackout Bowl"
- Baltimore Ravens vs. San Francisco 49ers
- Won by Ravens, 34-31
(source: cnn.com)

Strategies for Fault Mitigation
James Hamilton, Vice President and Distinguished Engineer at Amazon Web Services, blogged "The Power Failure Seen Around the World" (http://bit.ly/1tbgbpy):
"As when looking at any system faults, the tools we have to mitigate the impact are: 1) avoid the fault entirely, 2) protect against the fault with redundancy, 3) minimize the impact of the fault through small fault zones, and 4) minimize the impact through fast recovery."
(source: cnn.com)

Cloud Fault Domains
Fault domain: a group of resources that share a single point of failure. Resources in different fault domains fail independently.
- Instance: a single virtual resource.
- Zone: a sub-collection of resources in a region, typically a data center.
- Region: a geographic area, often comprising multiple data centers.
- (Provider: viable alternatives are emerging.)
(source: stackdriver.com)

The Four Hamiltons
Framework for Fault Mitigation in the Cloud (High Scalability, http://bit.ly/1lp817l)
- Cross Hamilton's mitigation strategies with cloud fault domains.
- Guide debate of approach and trade-offs for handling component failures.
(Diagram: a grid crossing the strategies Avoid It, Mask It, Bound It, and Fix It Fast with the fault domains Instance, Zone, and Region; axis labels: Customer Impact, Size.)
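
One way to use the grid as a design checklist is to keep it as a small data structure and ask which fault domains have no mitigation recorded yet. The sketch below is only an illustration: the function names and the two example notes are assumptions, not part of the talk.

```python
from itertools import product

# The two axes named on the slide.
STRATEGIES = ["Avoid It", "Mask It", "Bound It", "Fix It Fast"]
FAULT_DOMAINS = ["Instance", "Zone", "Region"]


def empty_matrix():
    """Return one entry per (fault domain, strategy) pair, each holding a list of notes."""
    return {(d, s): [] for d, s in product(FAULT_DOMAINS, STRATEGIES)}


def unaddressed_domains(matrix):
    """Return fault domains that have no technique recorded under any strategy."""
    return [d for d in FAULT_DOMAINS
            if not any(matrix[(d, s)] for s in STRATEGIES)]


# Hypothetical entries, for illustration only.
matrix = empty_matrix()
matrix[("Instance", "Mask It")].append("pool of workers handling similar work")
matrix[("Zone", "Bound It")].append("each cell is composed of instances in a single zone")
```

With these two entries, `unaddressed_domains(matrix)` points at Region as the domain still lacking a plan, which is the kind of gap the framework is meant to surface in a design debate.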

Avoid It!
Formerly, enterprise-grade (expensive) hardware. Now, solid architecture and good software engineering.
Techniques:
- Write good code. Test it thoroughly.
- Use high-quality software components (web servers, databases, etc.).
- Let someone else do it. Use hosted or managed services that do not fail. Our favorites include AWS RDS, AWS ELB, AWS SQS.
(source: onthesnow.com)

Bound It!
Minimize scope of the failure to reduce customer impact.
Techniques:
- Limit impact by sharding.
- Degrade gracefully.
- Architect different subsystems/features to be independent.
- Browse without search, download without upload, use cached results.
(source: cnn.com)
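
A minimal sketch of two of these techniques, sharding and serving cached results when a backend fails. The function names, the hash choice, and the response shape are assumptions made for illustration; the talk does not describe how Stackdriver implements either.

```python
import hashlib


def shard_for(customer_id: str, num_shards: int) -> int:
    """Map a customer to a shard with a stable hash.

    If one shard fails, only the customers mapped to it are affected.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def search_with_fallback(query, live_search, cache):
    """Try the live backend; on any failure, serve cached results marked stale."""
    try:
        results = live_search(query)
        cache[query] = results  # remember the last good answer
        return {"results": results, "stale": False}
    except Exception:
        # Degrade gracefully: an old (or empty) answer instead of an error page.
        return {"results": cache.get(query, []), "stale": True}
```

The `stale` flag lets the caller tell users that they are seeing cached data, so the degradation is visible rather than silent.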

Mask It!
Use redundancy or replication to avoid customer impact.
Techniques:
- Use pools of peers/workers handling similar work.
- Master/slave, primary/secondary, with automatic failover.
- Clustering, quorums, gossip, peer-to-peer routing.
(source: http://ucrtoday.ucr.edu/3827)
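
Automatic failover can be sketched as trying redundant replicas in order until one answers. This is a simplified illustration with hypothetical names; real failover also needs health checks, timeouts, and care around writes, none of which are shown here.

```python
class AllReplicasFailed(Exception):
    """Raised when every replica raised an error for the request."""


def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful response.

    `replicas` is a list of callables (for example primary first, then
    secondaries). The caller never sees a failure unless all of them fail.
    """
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:
            errors.append(exc)  # keep the error and move on to the next replica
    raise AllReplicasFailed(errors)
```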

Fix It Fast!
Don't rely on this strategy; you are doing it wrong!
Techniques:
- Revert code.
- Provision and deploy new resources.
- Restore from replicas or back-ups.
- Implement documented recovery procedures.
- Practice!!!
(source: dailymail.co.uk)

Switching Gears
- A healthy view of risk-taking
- The Four Hamiltons framework for designing robust architectures
- Examples from Stackdriver of cost-conscious experimentation
(source: teamamp.org)

About the Stackdriver Infrastructure
Key components:
- Data collection: querying cloud provider APIs
- Ingest pipeline: archiving/indexing billions of messages daily
- Alerting subsystem: evaluating user-defined policies
- Batch processing: aggregation and analysis
- UI: powerful graphing and visualization capabilities
- Custom automation framework
Technology: Django, Angular, Python, Cassandra, ElasticSearch, MySQL, RabbitMQ, Puppet
Heavy use of hosted services: ELB, RDS, SQS, and SNS
Several hundred instances running in AWS.
~50 deployable units, pushing dozens of releases per day.

Stackdriver Ingest Pipeline
Purpose: Take data off the wire and get it where it needs to go.
Performed by a set of cooperating components:
- Messaging with RabbitMQ
- Archive to S3
- Drive the custom alerting pipeline
- Index to Cassandra, ElasticSearch
Designed/built to tolerate instance failure:
- Strongly decoupled
- Multiple points for buffering
(Diagram labels: Message Validation, Message Broker, Indexing, Alerting, Archiving)

Scaling the Ingest Pipeline
A cell is...
- the set of components needed to process a single message,
- the unit of scaling,
- independent from other cells,
- composed of instances in a single zone (tolerates zone failures).
Much automation supports cell-based design.
Data sinks (C*, ES, S3) handle full load.
(Diagram label: Load Balancer)
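
The cell idea can be sketched as a load balancer that spreads messages over whichever cells are currently healthy, so losing a zone only removes the cells in that zone. The `Cell` fields and the round-robin policy below are assumptions for illustration, not a description of Stackdriver's load balancer.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Cell:
    name: str
    zone: str          # every instance of a cell lives in this one zone
    healthy: bool = True


def route(messages, cells):
    """Assign messages round-robin to healthy cells only."""
    healthy = [c for c in cells if c.healthy]
    if not healthy:
        raise RuntimeError("no healthy cells")
    return [(msg, cell.name) for msg, cell in zip(messages, cycle(healthy))]
```

Because each cell is independent, adding capacity means adding another `Cell` to the list, and a zone outage is handled by marking that zone's cells unhealthy.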

Innovate Ingest at Scale
- Must continue to build, debug, fix, maintain, and enhance the running pipeline.
- Big data problem characterized by the 3Vs: variety, volume, velocity.
- But resources are scarce: money, time, dev resources, ops overhead.
- Cannot simply deploy one of everything in a test environment.
(source: lovethesepics.com)

Pipeline Testing for Variety
- Expose test environment to full variety of data.
- Replay raw data stored in archive.
(Diagram labels: Test/Dev, Production)
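
Replaying archived data can be sketched as a loop that feeds stored raw messages into the test pipeline's entry point. Here `archive` is any iterable of raw messages and `ingest` is whatever callable the test environment exposes; both are placeholders, and reading from the real archive (S3 in the talk) is not shown.

```python
def replay(archive, ingest, limit=None):
    """Feed archived raw messages into `ingest`; return how many were replayed.

    `limit` optionally caps the number of messages, which keeps a test run
    cheap while still drawing on real production data.
    """
    count = 0
    for raw in archive:
        if limit is not None and count >= limit:
            break
        ingest(raw)
        count += 1
    return count
```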

Pipeline Testing for Velocity
- Expose a single test cell to the load of a production cell at line speeds.
- Federate traffic from the message broker in one production cell to the test cell.
(Diagram labels: Test/Dev, Production)

Pipeline Testing for Volume
- Expose downstream components to full system load.
- Add another consumer of the message broker in each cell.
(Diagram labels: New Cassandra and indexer, Production)
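
Adding a second consumer works because each consumer reads from its own queue, so the experimental component sees the full stream without taking messages away from production. The toy in-memory broker below only illustrates that fan-out behavior; it is not RabbitMQ's API, and the class and queue names are hypothetical.

```python
from collections import deque


class FanoutBroker:
    """Toy stand-in for a message broker: every bound queue gets its own copy of each message."""

    def __init__(self):
        self.queues = {}

    def bind(self, name):
        """Create a named queue for a consumer and return it."""
        self.queues[name] = deque()
        return self.queues[name]

    def publish(self, message):
        """Deliver the message to every bound queue independently."""
        for queue in self.queues.values():
            queue.append(message)
```

A slow or broken experimental consumer only backs up its own queue, which is what keeps this kind of test low-risk for the production path.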

Challenges
- Access control
  - Components in test account have only read-only access to data
  - Cross-account IAM
  - Manage access to relational data
- Need to access config from prod
  - Copy any mutable config
- Automation
(source: clubofthewaves.com)

Conclusions
- Risk-taking is an important strategy for innovation, but requires cultural support.
- Good system design is a safety net that helps protect you when experiments fail.
- Use production systems and data to perform high-fidelity tests at low cost.

Thank You! Questions? Patrick R. Eaton, PhD patrick@stackdriver.com