@joerg_schad Nightmares of a Container Orchestration System

Similar documents
Container 2.0. Container: check! But what about persistent data, big data or fast data?!

Scale your Docker containers with Mesos

CONTINUOUS DELIVERY WITH MESOS, DC/OS AND JENKINS

利用 Mesos 打造高延展性 Container 環境. Frank, Microsoft MTC

CONTINUOUS DELIVERY WITH DC/OS AND JENKINS

@unterstein #bedcon. Operating microservices with Apache Mesos and DC/OS

Building a Data-Friendly Platform for a Data- Driven Future

Building/Running Distributed Systems with Apache Mesos

Using DC/OS for Continuous Delivery

Mesosphere and Percona Server for MongoDB. Jeff Sandstrom, Product Manager (Percona) Ravi Yadav, Tech. Partnerships Lead (Mesosphere)

Mesosphere and Percona Server for MongoDB. Peter Schwaller, Senior Director Server Eng. (Percona) Taco Scargo, Senior Solution Engineer (Mesosphere)

SCALING LIKE TWITTER WITH APACHE MESOS

Advantages of using DC/OS Azure infrastructure and the implementation architecture Bill of materials used to construct DC/OS and the ACS clusters

Deploying Applications on DC/OS

The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017.

The Emergence of the Datacenter Developer. Tobi Knaup, Co-Founder & CTO at

Distributed Data on Distributed Infrastructure. Claudius Weinberger & Kunal Kusoorkar, ArangoDB Jörg Schad, Mesosphere

Spark and Flink running scalable in Kubernetes Frank Conrad

AGILE DEVELOPMENT AND PAAS USING THE MESOSPHERE DCOS

Advanced Continuous Delivery Strategies for Containerized Applications Using DC/OS

Windows Azure Services - At Different Levels

Issues Fixed in DC/OS

Sunil Shah SECURE, FLEXIBLE CONTINUOUS DELIVERY PIPELINES WITH GITLAB AND DC/OS Mesosphere, Inc. All Rights Reserved.

Mesosphere and the Enterprise: Run Your Applications on Apache Mesos. Steve Wong Open Source Engineer {code} by Dell

UPGRADING A MESOS CLUSTER

MANAGING MESOS, DOCKER, AND CHRONOS WITH PUPPET

Building A Better Test Platform:

Introduction to Mesos and the Datacenter Operating System

EsgynDB Enterprise 2.0 Platform Reference Architecture

SAMPLE CHAPTER IN ACTION. Roger Ignazio. FOREWORD BY Florian Leibert MANNING

Scalable Streaming Analytics

Exam : Implementing Microsoft Azure Infrastructure Solutions

Microservices at Netflix Scale. First Principles, Tradeoffs, Lessons Learned Ruslan

Beyond 1001 Dedicated Data Service Instances

Building and Running a Solr-as-a-Service SHAI ERERA IBM

A Whirlwind Tour of Apache Mesos

Note: Isolation guarantees among subnets depend on your firewall policies.

Deep Dive Amazon Kinesis. Ian Meyers, Principal Solution Architect - Amazon Web Services

DC/OS Metrics. (formerly known as Project Ambrose) Application and Resource Metrics in DC/OS Enterprise. Nick Parker at..

Best Practices for Developing & Deploying Java Applications with Docker

Container Orchestration on Amazon Web Services. Arun

InterSystems Cloud Manager & Containers for InterSystems Technologies. Luca Ravazzolo Product Manager

Kontejneri u Azureu uz pomoć Kubernetesa što i kako? Tomislav Tipurić Partner Technology Strategist Microsoft

Architecting for Failure in a Containerized World. Tom Faulhaber Infolace

From Zero to Production Our Mesos Journey. Chien Huey DevOps Engineer XO Group, Inc.

Fault Domains in Mesos. Vinod Kone

Index. Raul Estrada and Isaac Ruiz 2016 R. Estrada and I. Ruiz, Big Data SMACK, DOI /

Cloud & container monitoring , Lars Michelsen Check_MK Conference #4

How can you implement this through a script that a scheduling daemon runs daily on the application servers?

FROM MONOLITH TO DOCKER DISTRIBUTED APPLICATIONS

Zabbix on a Clouds. Another approach to a building a fault-resilient, scalable monitoring platform

Document Sub Title. Yotpo. Technical Overview 07/18/ Yotpo

Enabling Low-friction Continuous Deployment via Workflows and Containers

How to Keep UP Through Digital Transformation with Next-Generation App Development

MapR Enterprise Hadoop

Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver

Cloud + Big Data Putting it all Together

Next-Generation Cloud Platform

Building Durable Real-time Data Pipeline

Orchestration Ownage: Exploiting Container-Centric Datacenter Platforms

Azure Webinar. Resilient Solutions March Sander van den Hoven Principal Technical Evangelist Microsoft

This tutorial will give you a quick start with Consul and make you comfortable with its various components.

Docker 101 Workshop. Eric Smalling - Solution Architect, Docker

A Generic Microservice Architecture for Environmental Data Management

Deploy Like A Boss Oliver Nicholas

Containers and the Evolution of Computing

Network Function Virtualization over Open DC/OS Yung-Han Chen

Table of Contents 1.1. Introduction. Overview of vsphere Integrated Containers 1.2

STATE OF MODERN APPLICATIONS IN THE CLOUD

Containerization Dockers / Mesospere. Arno Keller HPE

How Microsoft Built MySQL, PostgreSQL and MariaDB for the Cloud. Santa Clara, California April 23th 25th, 2018

Migrating to Cassandra in the Cloud, the Netflix Way

Personal Statement. Skillset I MongoDB / Cassandra / Redis / CouchDB. My name is Dale-Kurt Murray. I'm a Solutiof

Create an application with local persistent volumes

Google Cloud Platform for Systems Operations Professionals (CPO200) Course Agenda

Design Patterns for Large- Scale Data Management. Robert Hodges OSCON 2013

POWERING THE INTERNET WITH APACHE MESOS

Azure Certification BootCamp for Exam (Developer)

SECURING A MARATHON INSTALLATION 2016

We are ready to serve Latest IT Trends, Are you ready to learn? New Batches Info

Best practices for OpenShift high-availability deployment field experience

Become a MongoDB Replica Set Expert in Under 5 Minutes:

Servers fail, who cares? (Answer: I do, sort of) Gregg Ulrich, #netflixcloud #cassandra12

Table of Contents 1.1. Overview. Containers, Docker, Registries vsphere Integrated Containers Engine

Scaling Marketplaces at Thumbtack QCon SF 2017

Cloud I - Introduction

Cisco Tetration Analytics

CockroachDB on DC/OS. Ben Darnell, CTO, Cockroach Labs

Integrate MATLAB Analytics into Enterprise Applications

Making Non-Distributed Databases, Distributed. Ioannis Papapanagiotou, PhD Shailesh Birari

OpenStack Magnum Pike and the CERN cloud. Spyros

Esper EQC. Horizontal Scale-Out for Complex Event Processing

/ Cloud Computing. Recitation 5 February 14th, 2017

Armon HASHICORP

Jupyter and Spark on Mesos: Best Practices. June 21 st, 2017

Namenode HA. Sanjay Radia - Hortonworks

Scaling Up Performance Benchmarking

This document provides instructions for upgrading a DC/OS cluster.

Logging, Monitoring, and Alerting

Think Small to Scale Big

Transcription:

@joerg_schad Nightmares of a Container Orchestration System 2017 Mesosphere, Inc. All Rights Reserved. 1

Jörg Schad Distributed Systems Engineer @joerg_schad Jan Repnak Support Engineer/ Solution Architect @jrx 2017 Mesosphere, Inc. All Rights Reserved. 2

3AM 2017 Mesosphere, Inc. All Rights Reserved. 3

Scope of Jan s nightmares Modern App Components Big Data + Analytics Microservices (in containers) Engines Functions & Logic Streaming Batch Machine Learning Search Time Series SQL / NoSQL Analytics Databases Datacenter Operating System (DC/OS) Distributed Systems Kernel (Apache Mesos) Any Infrastructure (Physical, Virtual, Cloud) 2017 Mesosphere, Inc. All Rights Reserved. 4

Backup State Why should we have backup? Backup state Services Cluster 2017 Mesosphere, Inc. All Rights Reserved. 5

Immutable Container Images Use tagged container images Keep tagged images immutable! 2017 Mesosphere, Inc. All Rights Reserved. 6

Private Container Registries Dockerhub works great for our test cluster Use tagged container images Keep tagged images immutable! Use a private container registry! 2017 Mesosphere, Inc. All Rights Reserved. 7

Repeatable Container Builds `docker commit` is great* Use repeatable builds for images Including FROM clause Keep images minimal Multistage build From scratch 2017 Mesosphere, Inc. All Rights Reserved. 8

UI Deployments Use (Marathon) endpoints for deployments! Version (and track) your app definitions! 2017 Mesosphere, Inc. All Rights Reserved. 9

Disk Usage All our disk are full Docker and logs are great in filling up disk space! Images Container Cleanup docker instances and images! docker prune https://github.com/spotify/docker-gc Monitor available disk space! 2017 Mesosphere, Inc. All Rights Reserved. 10

Resource Constraints 32MB are sufficient for Docker container Memory constraints are hard limits Consider overhead (e.g., Java) Difficult to approximate Monitor 2017 Mesosphere, Inc. All Rights Reserved. 11

Zookeeper Cluster Size Our Zookeeper Cluster has 4 nodes, that is better than 3, or? Zookeeper quorum (i.e., #Masters) should be odd! Production 5 is optimal! 2017 Mesosphere, Inc. All Rights Reserved. 12

Health Checks What are health checks? Specify Health checks carefully Different options Mesos vs Marathon, Command vs HTTP Impacts Load-Balancers and restarts Readiness checks 2017 Mesosphere, Inc. All Rights Reserved. 13

NoSQL Datastores We replaced out Postgres instance with Cassandra, and now we get stale results Consider the semantics of your datastore! ACID vs Base Model your data and queries accordingly! 2017 Mesosphere, Inc. All Rights Reserved. 14

Removing Stateful Frameworks Follow the uninstall instructions! Reservations and Zookeeper state! state.json 2017 Mesosphere, Inc. All Rights Reserved. 15

Container vs VMs We just replaced all our VM instances by containers* Be aware of different isolation semantics! 2017 Mesosphere, Inc. All Rights Reserved. 16

Write Once Run Any Where The (Java) container was running fine in testing Java (<9) not groups aware # threads for GC Set default values carefully 2017 Mesosphere, Inc. All Rights Reserved. 17

Mesos Modules To solve this problem, our team quickly developed this really cool Mesos Module Mesos Modules can be tricky! Monitoring and Debugging 2017 Mesosphere, Inc. All Rights Reserved. 18

Linux Distributions We are using *obscure Linux distribution* for our (DC/ OS) cluster If possible use tested distributions! Especially for DC/OS! 2017 Mesosphere, Inc. All Rights Reserved. 19

Services on the same node We are running *distributed Database* outside Mesos on the same cluster Be careful when running services outside Mesos but on the same cluster! Adjust resources accordingly! 2017 Mesosphere, Inc. All Rights Reserved. 20

Spreading out Master Nodes We are running our cluster across different AWS regions.. Be careful when distributing Master nodes across high latency links! Different AWS AZ ok, different region probably not! 2017 Mesosphere, Inc. All Rights Reserved. 21

Agent Attributes attributes='rack:abc;zone:west; os:centos5;level:10;keys:[1000-1500]' We changed the agent attributes for running cluster Set agent attributes when starting an agent! Do not change for running agents! 2017 Mesosphere, Inc. All Rights Reserved. 22

Cluster Upgrades We upgraded our cluster * Check state before Follow upgrade instructions! Automation Remember Backup! 2017 Mesosphere, Inc. All Rights Reserved. 23

UPGRADE PROCEDURE Before upgrading 1. Make sure cluster is healthy! 2. Perform backup a. ZK b. Replicated logs c. other state 3. Review release notes 4. Generate install bundle a. Validate versions Agent Executor Task Framework Scheduler LEADER STANDBY STANDBY Agent Executor Task ZK ZK ZK 2017 Mesosphere, Inc. All Rights Reserved. 24

UPGRADE PROCEDURE 1. Master rolling upgrade a. Start with standby b. Uninstall DC/OS c. Install new DC/OS 2. Agent rolling upgrade 3. Framework upgrades Agent Framework Scheduler ZK ZK LEADER STANDBY STANDBY Agent ZK Executor Task Executor Task 2017 Mesosphere, Inc. All Rights Reserved. 25

UPGRADE PROCEDURE Framework ZK 1. Master rolling upgrade 2. Agent rolling upgrade a. Uninstall DC/OS b. Install new DC/OS 3. Framework upgrades Scheduler ZK LEADER STANDBY STANDBY ZK Agent Executor Task Agent Executor Task 2017 Mesosphere, Inc. All Rights Reserved. 26

UPGRADE PROCEDURE Framework ZK 1. Master rolling upgrade 2. Agent rolling upgrade 3. Framework upgrades a. Orthogonal to DC/OS b. Ensure changes don t affect existing apps Agent Executor Task Scheduler LEADER STANDBY STANDBY Agent Executor Task ZK ZK 2017 Mesosphere, Inc. All Rights Reserved. 27

Software Upgrades We have automatic updates enabled for Docker Follow upgrade instructions! Backup! Explicit control of versions! 2017 Mesosphere, Inc. All Rights Reserved. 28

Day 2 Operations Our POC app is deployed in our production environment, time for vacation Keep it running! Day 2 Operations is the actually challenging part! 2017 Mesosphere, Inc. All Rights Reserved. 29

Day 2 Operations Configuration Updates (ex: Scaling, reconfiguration) Binary Upgrades Cluster Maintenance (ex: Backup, Restore, Restart) Monitor progress of operations Debug any runtime blockages 2017 Mesosphere, Inc. All Rights Reserved. 30

METRICS Measurements captured to determine health and performance of cluster - How utilized is the cluster? - Are resources being optimally used? - Is the system performing better or worse over time? - Are there bottlenecks in the system? - What is the response time of applications? 2017 Mesosphere, Inc. All Rights Reserved. 31

DC/OS METRIC SOURCES Mesos metrics Resource, frameworks, masters, agents, tasks, system, events Container Metrics CPU, mem, disk, network Application Metrics QPS, latency, response time, hits, active users, errors App App App Container Container Container Mesos OS 2017 Mesosphere, Inc. All Rights Reserved. 32

Production Checklist 2017 Mesosphere, Inc. All Rights Reserved. 33

MESOS CHECKLIST Monitor both Masters and Agents for flapping (i.e., continuously restarting). This can be accomplished by using the `uptime` metric. Monitor the rate of changes in terminal task states, including TASK_FAILED, TASK_LOST, and TASK_KILLED 2017 Mesosphere, Inc. All Rights Reserved. 34

MESOS MASTER CHECKLIST Use five master instances in production. Three is sufficient for HA in staging/test Place masters on separate racks, if possible Secure the teardown endpoints to prevent accidental framework removal. 2017 Mesosphere, Inc. All Rights Reserved. 35

MESOS AGENT CHECKLIST Set agent attributes before you run anything on the cluster. Once an agent is started, changing the attributes may break recovery of running tasks in the event of a restart. See also https:// issues.apache.org/jira/browse/mesos-1739. Explicitly set the resources on the nodes to leave capacity for other services running there outside of Mesos control. For example, HDFS processes running alongside Mesos. 2017 Mesosphere, Inc. All Rights Reserved. 36

ZOOKEEPER CHECKLIST Run with security and ACLs, see the `--zk=` and `--master=` flags on the master and slaves respectively. If you do enable ACLs, they must be enabled before nodes are created in ZK. Backup ZooKeeper snapshots and log at regular intervals. - Guano or zkconfig.py (Want Snapshots + Transaction Log) Marathon, Chronos, and other frameworks store state in ZK. The first Marathon should store state in the same ZK as Mesos master. Userland apps should NOT store state in the ZK cluster shared by Mesos and Marathon. Examples of userland apps include Storm, service discovery tools, and additional instances of Marathon and Chronos. 2017 Mesosphere, Inc. All Rights Reserved. 37

ZOOKEEPER CHECKLIST Monitor ZK's JVM metrics, such as heap usage, GC pause times, and full-collection frequency. Monitor ZK for: number of client connections, total number of znodes, size of znodes (min, max, avg, 99% percentile), and read/write performance metrics 2017 Mesosphere, Inc. All Rights Reserved. 38

@joerg_schad @dcos ANY QUESTIONS? @dcos chat.dcos.io users@dcos.io /groups/8295652 /dcos /dcos/examples /dcos/demos 2017 Mesosphere, Inc. All Rights Reserved.