PostgreSQL migration from AWS RDS to EC2
Technology lover
Worked as Software Engineer, Team Lead, DevOps, DBA, Data Analyst
Sr. Tech Architect at Coverfox
Email me at mistryhitul007@gmail.com
Tweet me at @hitul007
"Everything is possible, but it takes time" - Hitul Mistry
Database Evolution
Self hosted database
AWS RDS
Why we migrated from Database as a Service to a self hosted database
Challenges in migration
How we planned the migration and points to note
Current Architecture
Demo
Simple GUI
HA, Multi-AZ
Encryption, Backup, Recovery, Disaster recovery, Security, Compliance
SLAs
Performance optimizations by self
GUI for version upgrades
AWS RDS
PostgreSQL DB functionality is similar to stock PostgreSQL
Cannot install extensions other than those provided
Cannot replicate to a self hosted server
Cannot install a custom logical decoding plugin
Cannot install custom data types
Upgrades can be done with a few clicks in the GUI

Self hosted PostgreSQL
DB functionality is PostgreSQL's original behaviour
100% control over functionality
Upgrades need to be done manually
AWS RDS
HA, Fault Tolerance, Disaster recovery and Backups can be set up with a few GUI clicks
You still have to monitor the common conditions under which PostgreSQL can crash, such as a full disk or high CPU usage
Postgresql will be auto restarted, disk is full.
SLA

Self hosted PostgreSQL
HA, Fault Tolerance, Disaster recovery and Backups have to be implemented by configuring PostgreSQL yourself
We have to monitor every threat that can occur and be ready to fix it
Upgrades are done by ourselves
SLA
AWS RDS
Everything is available in the GUI
PostgreSQL usage knowledge and some architectural knowledge required
AWS-controlled performance tuning; not use-case dependent

Self hosted PostgreSQL
PostgreSQL expert knowledge required
Everything needs to be done by setting up configs
AWS RDS
A few performance parameters can be tuned via the GUI
Cannot use custom hardware for performance

Self hosted PostgreSQL
Performance can be tuned via the basic PostgreSQL config and also other parameters such as kernel, OS, disk etc.
A lot of performance parameters are available to tune as per the application
AWS RDS
To identify a fault in PostgreSQL you are given a GUI where all the PostgreSQL logs are printed
New version upgrades can be done with a few clicks
Cannot go deeper than what the Database as a Service provides

Self hosted PostgreSQL
Can see the PostgreSQL logs directly
Can go as deep as you want
AWS RDS
Operating environment cannot be changed

Self hosted PostgreSQL
It can be moved to any operating environment
Cost to scale vertically or horizontally on AWS is high
Many open-source plugins required by the application cannot be installed on RDS
A new logical decoding plugin for replication or other use-cases is not supported by RDS
AWS takes time to upgrade to the latest Postgres versions
Almost zero downtime server upgrades are possible with self hosting
Database auto scaling
Performance tuning as per application needs
AWS RDS Cost (On Demand)
Instance Type    CPU    RAM (GiB)    Pricing/Yr
m4.2xlarge         8           32    $ 9014.04
m4.4xlarge        16           64    $ 18045.6
m4.10xlarge       40          160    $ 45122.76

AWS EC2 Cost (On Demand)
Instance Type    CPU    RAM (GiB)    Pricing/Yr
m4.2xlarge         8           32    $ 3679.2
m4.4xlarge        16           64    $ 7358.4
m4.10xlarge       40          160    $ 18396
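The gap between the two price lists is easier to read as a ratio. The following is a minimal Python sketch that uses only the yearly on-demand figures from the slide above; the dictionary and function names are illustrative and not part of the talk.

```python
# Yearly on-demand prices in USD, copied from the cost slide above.
RDS_PER_YEAR = {"m4.2xlarge": 9014.04, "m4.4xlarge": 18045.60, "m4.10xlarge": 45122.76}
EC2_PER_YEAR = {"m4.2xlarge": 3679.20, "m4.4xlarge": 7358.40, "m4.10xlarge": 18396.00}


def rds_to_ec2_ratio(instance_type: str) -> float:
    """Return how many times the RDS price exceeds the EC2 price."""
    return RDS_PER_YEAR[instance_type] / EC2_PER_YEAR[instance_type]


def yearly_difference(instance_type: str) -> float:
    """Return the yearly price difference (RDS minus EC2) in USD."""
    return RDS_PER_YEAR[instance_type] - EC2_PER_YEAR[instance_type]


if __name__ == "__main__":
    for name in RDS_PER_YEAR:
        print(
            f"{name}: RDS/EC2 = {rds_to_ec2_ratio(name):.2f}x, "
            f"difference = ${yearly_difference(name):,.2f}/yr"
        )
```

On these figures every listed size comes out at roughly 2.45 times the EC2 price. The comparison covers only the instance prices quoted on the slide.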
If we buy a Multi-AZ setup, the cost doubles
Reserved instances can save 12-64% of the cost
Rack servers and other cloud infrastructures can be used to cut costs
Zero Downtime Upgrades
Migration required extra hands, but self-hosted maintenance has not increased the load on the DevOps team!
RDS supports limited plugins. Just now they added wal2json.

Database Operation
INSERT INTO data(data) VALUES('1');
INSERT INTO data(data) VALUES('2');

Format (inserts)
table public.data: INSERT: id[integer]:1 data[text]:'1'
table public.data: INSERT: id[integer]:2 data[text]:'2'
BEGIN 89283
table public.core_tracker: UPDATE: id[bigint]:63899671 session_key[text]:'w84fhz6c8b5jpc1ufesnegbxrfmnehh8' user_id[bigint]:23573 extra[text]:'{"h":100,"no_show":true}' created[timestamp with time zone]:'2018-01-05 16:03:23.654652+05:30' fd_id[integer]:null
COMMIT 89283
BEGIN 89285
table public.core_tracker: UPDATE: id[bigint]:63899671 session_key[text]:'w84fhz6c8b5jpc1ufesnegbxrfmnehh8' user_id[bigint]:23573 extra[text]:'{"h":100,"no_show":true,"hello-world":{"sfs":"sdf\"2''3"}}' created[timestamp with time zone]:'2018-01-05 16:03:23.654652+05:30' fd_id[integer]:null
COMMIT 89285
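To give a feel for what parsing this text output involves, here is a minimal Python sketch that turns one change line of the shape shown above into a dictionary. It is an illustration only, not the parser used in the migration: the function name and regular expressions are assumptions, and it deliberately ignores array types (whose names contain brackets) and values that span several lines.

```python
import re

# "table <schema.table>: <OPERATION>: <columns...>"
ROW_RE = re.compile(r"^table (?P<table>\S+): (?P<op>INSERT|UPDATE|DELETE): (?P<cols>.*)$")
# "<name>[<type>]:<value>", where value is a quoted string ('' escapes a quote)
# or a bare token such as a number or null.
COL_RE = re.compile(r"(?P<name>\w+)\[(?P<type>[^\]]+)\]:(?P<value>'(?:[^']|'')*'|\S+)")


def parse_change(line):
    """Parse one change line; return None for BEGIN/COMMIT and other lines."""
    match = ROW_RE.match(line.strip())
    if match is None:
        return None
    columns = {}
    for col in COL_RE.finditer(match.group("cols")):
        raw = col.group("value")
        if raw == "null":
            value = None
        elif raw.startswith("'"):
            # Strip the surrounding quotes and undo the doubled-quote escape.
            value = raw[1:-1].replace("''", "'")
        else:
            value = raw
        columns[col.group("name")] = (col.group("type"), value)
    return {"table": match.group("table"), "op": match.group("op"), "columns": columns}


if __name__ == "__main__":
    print(parse_change("table public.data: INSERT: id[integer]:1 data[text]:'1'"))
```

Every value comes back as text (or None for null), so a consumer still has to convert each one according to the reported type.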
5M unique quotes a month
45M unique quotes from insurance companies
5GB write on DB and logs combined
Downtime allowed
Daily: 8.6s
Weekly: 1m 0.5s
Monthly: 4m 23.0s
Yearly: 52m 35.7s
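These figures match a 99.99% availability target, the SLA named on the next slide. The short Python sketch below reproduces them, assuming a year of 365.2425 days and a month of one twelfth of that; those period lengths and the function names are assumptions chosen to match the slide, not something the talk states.

```python
DAY_SECONDS = 86400
YEAR_SECONDS = 365.2425 * DAY_SECONDS  # assumed average year length

PERIODS = {
    "Daily": DAY_SECONDS,
    "Weekly": 7 * DAY_SECONDS,
    "Monthly": YEAR_SECONDS / 12,
    "Yearly": YEAR_SECONDS,
}


def downtime_budget_seconds(availability_percent, period_seconds):
    """Seconds of downtime allowed in a period at the given availability."""
    return period_seconds * (1 - availability_percent / 100.0)


def format_duration(seconds):
    """Render seconds as 'Xm Y.Ys', or 'Y.Ys' when under a minute."""
    minutes, secs = divmod(seconds, 60)
    if minutes:
        return f"{int(minutes)}m {secs:.1f}s"
    return f"{secs:.1f}s"


if __name__ == "__main__":
    for label, period in PERIODS.items():
        print(f"{label}: {format_duration(downtime_budget_seconds(99.99, period))}")
```

Changing the first argument to 99.9 or 99.999 shows how quickly the budget grows or shrinks with each extra nine.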
Almost zero downtime
Stable database after migration, behaving the same as the old one
SLA: 99.99%
Shared disk failover
File system level
Transaction log (WAL)
Trigger based
Statement based
Most reliable
We cannot access pg_hba.conf
You don't have enough permission to execute pg_start_backup
PgBadger
CURRENT_TIMESTAMP, random and sequences will be affected
Huge change in codebase
PostgreSQL utility to create a dump and restore it
Time consuming
Functions, indexes and constraints are not migrated
JSON is treated as CLOB and values are truncated
Varchar and character varying values are truncated
DDL is ignored
Does not replicate partitioned tables
Does not replicate DDL
Output is tough to parse and is not standard
Documentation and support were really limited
If the tool fails, the whole replication fails
All tables must have a modified date
All tables must have a primary key, but some tables had non-numeric primary keys
MongoDB + GoLang + Trigger based replication + pg_listen + pg_notify
Disable foreign key validations on the self hosted PostgreSQL DB
Create triggers on the AWS RDS database
Take a backup of the PostgreSQL RDS
MongoDB Schema
{
  "table_name": "schema_name.table_name",
  "primary_key": "{primary_key_value}",
  "created_at": "timestamp",
  "operation": "insert/update/delete"
}
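As a rough illustration of the record shape above, the following Python sketch builds one such change document. The talk's replication tool was written in Go; the function name, the ISO-8601 timestamp format and the choice to store the primary key as a string are all assumptions made for this example.

```python
from datetime import datetime, timezone

VALID_OPERATIONS = {"insert", "update", "delete"}


def make_change_record(schema_name, table_name, primary_key, operation, created_at=None):
    """Build a change document with the four fields of the schema above."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation!r}")
    if created_at is None:
        created_at = datetime.now(timezone.utc)
    return {
        "table_name": f"{schema_name}.{table_name}",
        # Stored as text so that non-numeric primary keys fit the same field.
        "primary_key": str(primary_key),
        "created_at": created_at.isoformat(),
        "operation": operation,
    }


if __name__ == "__main__":
    print(make_change_record("public", "core_tracker", 63899671, "update"))
```

The record carries only the key of the changed row, not the row data, so a consumer would read the current row from the source database before applying the change.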
Reset sequences
Enable foreign key validations
Stop the AWS RDS instance
Run basic sanity scripts that verify data on a sample
Stop the website so that it can be opened by internal users only
Run QA tests
Take a backup of AWS RDS
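The "reset sequence" step matters because rows copied with explicit ids do not advance the sequences on the new server. Below is a minimal Python sketch that generates one reset statement per table, using PostgreSQL's setval and pg_get_serial_sequence functions. The table and column names are placeholders, and the sketch assumes trusted, simple identifiers (it does no quoting).

```python
def reset_sequence_sql(table, column="id"):
    """Return SQL that moves the column's sequence to the current MAX value.

    For an empty table the sequence is set so that the next value is 1.
    """
    return (
        f"SELECT setval(pg_get_serial_sequence('{table}', '{column}'), "
        f"COALESCE(MAX({column}), 1), MAX({column}) IS NOT NULL) FROM {table};"
    )


if __name__ == "__main__":
    # Placeholder table list; a real run would enumerate the migrated tables.
    for table_name in ["public.data", "public.core_tracker"]:
        print(reset_sequence_sql(table_name))
```

The third argument to setval marks whether the value has already been used, which is what lets the same statement handle both empty and populated tables.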
How much downtime can be accepted? (SLA)
What is the worst thing that can happen?
Which services can be affected?
How soon can we recover?
How much data will be lost, and how much can be recovered?
What will the long term ROI be?
What data will be affected?
The plan should be written as steps to execute
Execute the plan once in a sandbox environment before going live
The plan should include a rollback strategy
CPU
Disk type
Disk size
Connections
Future plans and traffic
RAID
Power usage
Network
Buffer size
Kernel parameters
Work memory
Checkpoint
Our rollback strategy was similar
Logical decoding replication for 2 weeks
Write deployment scripts to set up the database
Script as much as you can
Sanity scripts
QA team scripts and approval
High availability
Fault tolerance
Disaster recovery
Backup & Recovery
Hardware & Software updates
Security
Monitoring
Testing
Rollback
Compliance
Promote master
pg_ctl promote -D /data-dir-path/

Add cluster to PostgreSQL
rm -rf /data-dir-path/ && pg_basebackup

Run pg_rewind after promote
pg_rewind -D /data-dir-path --source-server=... host=...
Service discovery is the automatic detection of devices and services offered by these devices on a computer network. - Wikipedia
Service Discovery
Health Checking
KV store
Multi Data Center
Fork of Governor
Developed at Zalando
Used with Consul, ZooKeeper, Etcd
Image: http://aisaac.io
service/coverfox/optime/leader       Leader node name
service/coverfox/members/master-a    Member of cluster
service/coverfox/members/master-b    Member of cluster
Errors
Data corruption
System failure (including hardware failure)
Human error
Natural disaster
Tool for disaster recovery, backups and recovery by 2ndQuadrant
Remote backup
Remote restore
WAL logs recovery
Barman Backup
/usr/bin/barman backup --jobs 6 mumbai-master-a

Barman restore db
barman recover --target-time "2017-12-15 22:22:00" --remote-ssh-command "ssh postgres@x.x.x.x" mumbai-master-c 20171214T190201 /pg-data-dir-path/ -j 10
What to monitor?
RAM/CPU usage
Disk I/O
Process info
Bandwidth
Vacuum running
DB space
Active connections
Active transactions
Open files
Replication diff
PgBouncer client connections
PgBouncer stats
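One way to put a number on the "replication diff" item is the byte distance between the primary's and the replica's WAL positions. The Python sketch below assumes the positions arrive in PostgreSQL's textual LSN form, two hexadecimal numbers separated by a slash; the function names and sample values are made up for the example, and fetching the positions from the servers is left out.

```python
def lsn_to_int(lsn):
    """Convert a textual LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the upper 32 bits, the part after the lower 32.
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn, replica_lsn):
    """Return how many bytes of WAL the replica is behind the primary."""
    return lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn)


if __name__ == "__main__":
    # Sample values only; real positions would be read from each server.
    print(replication_lag_bytes("16/B374D848", "16/B374D800"))
```

A value that keeps growing is the signal to alert on; a small, fluctuating number is normal under write load.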
Grafana
InfluxDB
Twilio
Slack