PGConf.Russia 2019, Moscow. Alexander Kukushkin

Similar documents
Pump up your elephants with Patroni. PGDay.IT 2018 Lazise

Highway to Hell or Stairway to Cloud?

PostgreSQL migration from AWS RDS to EC2

POSTGRESQL ON AWS: TIPS & TRICKS (AND HORROR STORIES) ALEXANDER KUKUSHKIN. PostgresConf US

Become a MongoDB Replica Set Expert in Under 5 Minutes:

High Availability- Disaster Recovery 101

High Availability- Disaster Recovery 101

High Availability and Automatic Failover in PostgreSQL using Open Source Solutions

Database Backup and Recovery Best Practices. Manjot Singh, Data & Infrastrustructure Architect

Disaster Recovery and Mitigation: Is your business prepared when disaster hits?

Business Continuity and Disaster Recovery. Ed Crowley Ch 12

Still All on One Server: Perforce at Scale

White paper High Availability - Solutions and Implementations

SQL Server Virtualization 201

WHITE PAPER. Header Title. Side Bar Copy. Header Title 5 Reasons to Consider Disaster Recovery as a Service for IBM i WHITEPAPER

POSTGRESQL ON AWS: TIPS & TRICKS (AND HORROR STORIES) ALEXANDER KUKUSHKIN

New England Data Camp v2.0 It is all about the data! Caregroup Healthcare System. Ayad Shammout Lead Technical DBA

Configuring and Deploying a Private Cloud DURATION: Days

Designing Modern Apps Using New Capabilities in Microsoft Azure SQL Database. Bill Gibson, Principal Program Manager, SQL Database

Postgres Cluster and Multimaster

Disaster Recovery Planning

Hystax. Live Migration and Disaster Recovery. Hystax B.V. Copyright

Azure Webinar. Resilient Solutions March Sander van den Hoven Principal Technical Evangelist Microsoft

Understanding High Availability options for PostgreSQL

HowTo DR. Josh Berkus PostgreSQL Experts SCALE 2014

HowTo DR. Josh Berkus PostgreSQL Experts pgcon 2014

DISASTER RECOVERY TESTING, YOUR EXCUSES, AND HOW TO WIN

FOSDEM 2018 Brussels, Belgium. Magnus Hagander

Efficiently Backing up Terabytes of Data with pgbackrest

AlwaysOn Availability Groups: Backups, Restores, and CHECKDB

HIGH-AVAILABILITY & D/R OPTIONS FOR MICROSOFT SQL SERVER

Scale out Read Only Workload by sharing data files of InnoDB. Zhai weixiang Alibaba Cloud

arcserve r16.5 Hybrid data protection

High Availability for Postgres using OpenSource tools. By Jobin Augustine & HariKrishna

OL Connect Backup licenses

MySQL Architecture Design Patterns for Performance, Scalability, and Availability

Disaster Recovery Options

Scaling MongoDB. Percona Webinar - Wed October 18th 11:00 AM PDT Adamo Tonete MongoDB Senior Service Technical Service Engineer.

Document Sub Title. Yotpo. Technical Overview 07/18/ Yotpo

Backup vs. Business Continuity: Using RTO to Better Plan for Your Business

Maintain Data Control and Work Productivity

` 2017 CloudEndure 1

Backup and Disaster Recovery: DIY or Buy? Presented by: Stanley Louissaint

Elevate the Conversation: Put IT Resilience into Practice for Cloud Service Providers

High Availability & Disaster Recovery. Witt Mathot

Postgres in Amazon RDS. Denish Patel Lead Database Architect

The Right Choice for DR: Data Guard, Stretch Clusters, or Remote Mirroring. Ashish Ray Group Product Manager Oracle Corporation

Solution Pack. Managed Services Virtual Private Cloud Managed Database Service Selections and Prerequisites

How CloudEndure Disaster Recovery Works

White Paper: Backup vs. Business Continuity. Backup vs. Business Continuity: Using RTO to Better Plan for Your Business

SQL Server Availability Groups

How CloudEndure Works

Zero Downtime Migrations

How CloudEndure Disaster Recovery Works

PostgreSQL Replication 2.0

Virtual Disaster Recovery

SAP HANA Disaster Recovery with Asynchronous Storage Replication

POSTGRESQL ON AWS: TIPS & TRICKS (AND HORROR STORIES) ALEXANDER KUKUSHKIN. PGConf.EU 2017, Warsaw

Are AGs A Good Fit For Your Database? Doug Purnell

Windows Clustering 101

Copyright 2012 EMC Corporation. All rights reserved.

Exam : Implementing a Cloud Based Infrastructure

Configuration Guide for Veeam Backup & Replication with the HPE Hyper Converged 250 System

MSc, Computer & Systems TalTech. Writes on 2ndQuadrant blog From Turkey Lives in

Wong Tze Chuan General Manager. Gadget Wearable Tech (M) Sdn Bhd

Step into the future. HP Storage Summit Converged storage for the next era of IT

Which technology to choose in AWS?

How CloudEndure Works

Disaster Recovery Is A Business Strategy

4 Criteria of Intelligent Business Continuity

Deploy. A step-by-step guide to successfully deploying your new app with the FileMaker Platform

EMC GLOBAL DATA PROTECTION INDEX STUDY KEY RESULTS & FINDINGS FOR THE USA

ECE Engineering Robust Server Software. Spring 2018

Aurora, RDS, or On-Prem, Which is right for you

Chapter 1. Storage Concepts. CommVault Concepts & Design Strategies:

Recovery at a Click - where to be in 18 months

Business Resiliency in the Cloud: Reality or Hype?

Where s Your Third Copy?

Identifying Workloads for the Cloud

Welcome! Considering a Warm Disaster Recovery Site?

What is the Future of PostgreSQL?

Breaking PostgreSQL at Scale. Christophe Pettus PostgreSQL Experts FOSDEM 2019

SAN for Business Continuity

BUSINESS CONTINUITY: THE PROFIT SCENARIO

Real-time Protection for Microsoft Hyper-V

HOW TO PLAN & EXECUTE A SUCCESSFUL CLOUD MIGRATION

The Data Protection Rule and Hybrid Cloud Backup

Efficiently Backing up Terabytes of Data with pgbackrest. David Steele

Chapter 11. SnapProtect Technology

MySQL HA Solutions. Keeping it simple, kinda! By: Chris Schneider MySQL Architect Ning.com

Untangling the PostgreSQL upgrade

Module 4 STORAGE NETWORK BACKUP & RECOVERY

Backup and Restore Strategies

EMC GLOBAL DATA PROTECTION INDEX KEY FINDINGS & RESULTS FOR HONG KONG

InterSystems High Availability Solutions

CLOUDVPS SERVICE LEVEL AGREEMENT CLASSIC VPS

Build a Better Disaster Recovery Plan to Improve RTO & RPO Lubomyr Salamakha

HIGH AVAILABILITY AND DISASTER RECOVERY FOR IMDG VLADIMIR KOMAROV, MIKHAIL GORELOV SBERBANK OF RUSSIA

EMC GLOBAL DATA PROTECTION INDEX KEY FINDINGS & RESULTS FOR UAE

Running MySQL on AWS. Michael Coburn Wednesday, April 15th, 2015

Transcription:

Типичные ошибки при построении высокодоступных кластеров и как их избежать PGConf.Russia 2019, Moscow Alexander Kukushkin 06-02-2018

ABOUT ME Alexander Kukushkin Database Engineer @ZalandoTech The Patroni guy alexander.kukushkin@zalando.de Twitter: @cyberdemn 2

WE BRING FASHION TO PEOPLE IN 17 COUNTRIES 17 markets 7 fulfillment centers 23 million active customers 4.5 billion net sales 2017 200 million visits per month 15,000 employees in Europe 3

FACTS & FIGURES > 300 databases on premise > 650 clusters in the Cloud (AWS) 4

AGENDA What is High Availability? Disaster recovery Automatic failover done right Examples of real incidents What HA will not solve? Wrap it up 5

What is High Availability?

Availability 7

Causes of Downtime 8 Scheduled downtime (often excluded from availability) Hardware/BIOS/Firmware upgrade Software update Unscheduled downtime Datacenter failure (natural disasters, fire, power outage) Network splits Hardware failure (CPU, network card, disk controller, disk) Software/Data corruption (Bugs in application/os code) User error (rm -fr $PGDATA, DROP/TRUNCATE table, UPDATE/DELETE without WHERE clause)

Downtime Availability 9 Year Month Week Day 99% ( Two nines ) 3.65 d 7.31 h 1.68 h 14.4 m 99.9% ( Three nines ) 8.77 h 43.83 m 10.08 m 1.44 m 99.95% ( Three and a half nines ) 4.38 h 21.92 m 5.04 m 43.2 s 99.99% ( Four nines ) 52.6 m 4.38 m 1.01 m 8.64 s 99.999% ( Five nines ) 5.26 m 26.3 s 6.05 s 864 ms 99.9999% ( Six nines ) 31.56 s 2.63 s 604.8 ms 86.4 ms 99.99999% ( Seven nines ) 3.16 s 262.98 ms 60.48 ms 864 μs

What is HA anyway? No Official Definition appears to exist! Wikipedia: High availability (HA) is a characteristic of a system, which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period. 10

SLA, SLI, and SLO 11 A Service-Level Agreement (SLA) is an agreement between a service provider and a client. Type of service to be provided Desired performance level (especially availability, reliability and responsiveness) Monitoring process and service level reporting Steps for reporting issues Response and issue resolution time-frame A Service-Level Indicator (SLI) is a measure of the service level provided by a service provider to a customer Availability Latency Throughput A Service-Level Objective (SLO) is a key element of SLA; a goal that service provider wants to reach

Causes of Unscheduled Downtime 12 Hardware failure Automatic failover Network splits Datacenter failure Software failure/data corruption Disaster recovery User error

What is High Availability? Disaster recovery Automatic failover done right Examples of real incidents What HA will not solve? Wrap it up 13

Disaster recovery 14 Involves a set of policies, tools and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster Recovery point objective (RPO) and recovery time objective (RTO) are two important measurements in disaster recovery and downtime A recovery point objective (RPO) is defined by business continuity planning. It is the maximum targeted period in which data (transactions) might be lost from an IT service due to a major incident The recovery time objective (RTO) is the targeted duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity

Disaster recovery $ Data recovery price Data loss price High Availability price Service downtime price https://en.wikipedia.org/wiki/file:rpo_rto_example_converted.png 15

RPO, RTO & PostgreSQL 16 Automatic failover won t help to backup and restore data Enable backups and log archiving archive_timeout - how often postgres should archive WALs pg_receivewal Recovery from the backup might take hours Consider having a delayed replica (recovery_min_apply_delay) if RTO is higher than 15 minutes, you don t need automatic failover! Unless you are running hundreds of clusters synchronous replication - to prevent data loss during failover

Sub-second Automatic Failover In general it is possible, but VERY expensive This is a price for complexity of such system Complexity is often decreasing availability The more elements a system has, the more reliable each element has to be 17 Trade-off between the speed of failure detection and false positives

High Availability and Disaster Recovery Need Each Other!

What is High Availability? Disaster recovery Automatic failover done right Examples of real incidents What HA will not solve? Wrap it up 19

Multimaster? PostgreSQL XC/XL Data nodes + Coordinators + 2PC + GTM(SPOF) BDR logical replication + conflict resolution Postgres Pro Enterprise (proprietary) 20 eventual consistency logical replication + E3PC

A good HA system Quorum Helps to deal with network splits Requires at least 3 nodes Fencing Watchdog 21 Make sure the old primary is unaccessible. STONITH! Primary should not run if supervising HA process failed

No Quorum and no Fencing health check Primary wal stream 22 Standby

No Quorum and no Fencing Primary Primary https://github.com/masahikosawada/pg_keeper 23

Witness node is making decisions Primary wal stream health check eck h c h t l hea witness 24 Standby

Witness node dies Primary wal stream witness 25 Standby

Witness and no Fencing Primary wal stream health check eck h c h t l hea witness 26 Standby

Witness and no Fencing Primary Primary eck h c h t l hea witness 27

Automatic failover done right Primary I am Standby the ged? n a h er c Lead lea der Quorum 28

What is High Availability? Disaster recovery Automatic failover done right Examples of real incidents What HA will not solve? Wrap it up 29

Learn your HA system GoCardless Incident 30 Sync VIP Primary Async Pacemaker Pacemaker Pacemaker

GoCardless Incident Failed raid controller on the primary Primary was manually terminating (hope on auto failover) Auto Failover didn t work due to coincident crash of postgres on sync replica 31 Spend 1h30m trying to trigger a failover using Pacemaker! Manually promoted sync replica Total outage 1h50m

GoCardless Lessons HA systems usually are quite complex Running them is similar to flying modern airplane Mostly autopilot But sometimes it fails You need to know how to fly manually Learn your HA system 32 Try to break it and fix afterwards

Resource Planning GitHub Incident US West US East pages pages Latency 60ms replica jobs replica primary notifications replica notifications replica 33 github.com

GitHub Incident Network split due to network maintenance Automatic failover from East to West Coast datacenter Applications from East are slow due to latency between East and West Switchback to East wasn t possible due to a few seconds of writes which were not replicated 34 Rebuild of all replicas in the East took nearly 16 hours Total time of incident 24h11m

GitHub Lessons Avoid doing cross-region failover if you don t have 100% resources symmetry 35 MySQL can t do pg_rewind :)

Broken Disaster Recovery procedures GitLab Incident 36 Primary-Replica setup (no automatic failover) Increased database load on the primary resulted in replica falling behind WAL segment needed for replica was recycled by primary A few attempts to rebuild replica with pg_basebackup (--checkpoint=spread) rm -fr $PGDATA on the primary! (human error) Three different backups were done only once a day (no WAL archiving) pg_dump was always failing due to major version mismatch! Azure disk snapshots were disabled for database servers! LVM snapshots were working and periodically tested by restoring them to staging Incident happened nearly 24 hours after the last snapshot was taken! Luckily, someone manually created the snapshot 6 hours before the incident Recovery from LVM snapshot took longer than 18 hours

GitLab Lessons RPO and RTO were not set or not adequate to their business needs Daily snapshots only and no WAL archiving (RPO = 24 hours) Streaming replication can t be used for Disaster Recovery 37 Unless it is a delayed replica Runbooks can t replace fire-drills Backups must be monitored and tested

What is High Availability? Disaster recovery Automatic failover done right Examples of real incidents What HA will not solve? Wrap it up 38

Monitoring HA doesn t solve all problems with postgres, it won't cover: Hardware errors, CPU load, Memory, etc... Disk space for $PGDATA, tablespaces and pg_wal autovacuums, checkpoints Tables and indexes bloat Queries performance etc... Depending on RPO you maybe don t need HA at all, but monitoring is a must 39 Don t forget to monitor your HA system!

Everything must be monitored Monitoring High Availability 40 Disaster Recovery

Configuration tuning OS configuration tuning Huge pages, shared memory, semaphores, overcommit, etc... PostgreSQL configuration tuning shared_buffers, max_wal_size, checkpoint completion_target, random_page_cost, etc... 41 HA won t do it for you!

What is High Availability? Disaster recovery Automatic failover done right Examples of real incidents What HA will not solve? Wrap it up 42

Wrap it up Always start with Disaster Recovery planning Define RPO and RTO 43 Depending on RTO you maybe don t need HA Build the Availability you need, not the Availability you want Test everything High Availability system Backups! Do regular fire-drills

Thank you! Questions?