Lessons learned while automating MySQL in the AWS cloud

Stephane Combaudon
DB Engineer - Slice
Our environment

5 DB stacks:
- Data volume ranging from 30GB to 2TB+.
- Master + N slaves for each stack.
- Master handles all application traffic.
- Specialized slaves (backups, reports, custom jobs).

Stacks are duplicated in several dimensions:
- Regions (US, JP)
- Environment (QA, Staging, Prod)
Problems we wanted to fix

Hosted in the AWS cloud, but relying on a 3rd-party vendor for DB automation.

The 3rd-party vendor became a liability:
- Expensive
- Automation only works with MySQL 5.5
- Security issues
- Failover unavailable
Our goals

Create our own MySQL automation!

Instance lifecycle:
- DBA/SA people create an instance from a template.
- Software gets provisioned automatically.
- Data gets provisioned automatically.
- Replication (if slave) starts automatically.

Bonus: add the ability to fail over to a slave easily.

How can we get there?
Technical Solution Overview

- Creating instances from a template: CloudFormation
- Installing software: Chef
- Data provisioning: Galera? Custom scripts?
- High availability: Galera? MHA?
CloudFormation

Provides a way to manage AWS resources through templates (infrastructure as code).

A CloudFormation template:
- Is a JSON file.
- Describes the configuration of your resources.

Pro: any AWS resource can be described.
Con: the learning curve is steep.
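To make the template idea concrete, here is a minimal hypothetical fragment (resource names, the AMI ID, and the instance type are placeholders, not the actual stack definition) showing how an Auto Scaling Group of MySQL slaves could be described in JSON:

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Illustrative sketch only: one Auto Scaling Group of MySQL slaves",
  "Resources": {
    "MySQLSlaveGroup": {
      "Type": "AWS::AutoScaling::AutoScalingGroup",
      "Properties": {
        "LaunchConfigurationName": { "Ref": "MySQLSlaveLaunchConfig" },
        "MinSize": "1",
        "MaxSize": "4",
        "AvailabilityZones": { "Fn::GetAZs": "" }
      }
    },
    "MySQLSlaveLaunchConfig": {
      "Type": "AWS::AutoScaling::LaunchConfiguration",
      "Properties": {
        "ImageId": "ami-12345678",
        "InstanceType": "m4.large"
      }
    }
  }
}
```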
AWS Components - 1st try

[Diagram: master and slaves 1..N in an Autoscaling Group; the MHA Manager on a standalone EC2 instance]
Data Provisioning

Galera:
- Natively solves data provisioning + the HA issue.
- But not a good fit for all our workloads + app changes needed.

Let's write a custom provisioning script!
- For a master: do nothing. We only create a master for a new (empty) stack.
- For a slave: restore the latest available backup, then start replication.

But how will servers know whether they're a master or a slave?
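The master/slave branching above can be sketched as follows. This is a hypothetical illustration, not the actual script; the function and action names are invented for clarity:

```python
# Sketch of the bootstrap-time provisioning decision described above.
# Action names are illustrative placeholders, not the real scripts.

def provisioning_steps(replication_role):
    """Return the ordered provisioning actions for a new instance."""
    if replication_role == "master":
        # A master is only created for a brand-new (empty) stack:
        # nothing to restore, nothing to replicate.
        return []
    # A slave restores the latest available backup, then starts replicating.
    return ["restore_latest_backup", "start_replication"]

print(provisioning_steps("master"))  # -> []
print(provisioning_steps("slave"))   # -> ['restore_latest_backup', 'start_replication']
```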
MHA

Automated vs semi-automated mode:
- The app is not ready for automatic MySQL failover.
- Semi-automated mode is chosen: master failure detection is manual, slave promotion is a single command.

MHA requirement: the MHA configuration needs to know the exact instances of the replication topology.
Back to AWS components

CloudFormation allows you to add dependencies between components:
- Create the MHA Manager.
- Add the IP of the MHA Manager to a file on the MySQL servers when they are created by CloudFormation.
- During MySQL bootstrap, add the IP of the MySQL server to the MHA config.

But there's a catch: if the MHA Manager goes down, we lose our failover ability.
AWS Components - 2nd try

[Diagram: master and slaves 1..N in an Autoscaling Group; the MHA Manager now in its own Autoscaling Group of 1 instance]
Back to AWS components again

- CloudFormation is no longer able to know the IP of the MHA Manager in advance.
- Therefore MySQL servers can no longer register themselves in the MHA config file.
- Once again, we need service discovery.
Service Discovery - 1

No such service was available in our infrastructure. We tried several options:
- ZooKeeper, etcd: yet another infrastructure to manage.
- DynamoDB: race conditions.

In the end, the AWS API seemed a strong enough option.
Service Discovery - 2

- All components in a CF stack share a tag (aws:cloudformation:stack-name).
- Within a CF stack, the names of the ASGs are predictable.
- We can then find the IP address of all instances within a specific ASG.
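The discovery logic above can be sketched as a filter over instance metadata. In production this data would come from the EC2 API (e.g. boto3's describe_instances); here a canned, describe_instances-shaped structure stands in so the logic is visible, and all stack/ASG names are invented:

```python
# Sketch: find the private IPs of all instances that belong to a given
# CloudFormation stack and Auto Scaling Group, using the tags AWS sets
# automatically (aws:cloudformation:stack-name, aws:autoscaling:groupName).

def instances_in_asg(instances, stack_name, asg_name):
    """Return private IPs of instances tagged with the given CF stack and ASG."""
    ips = []
    for inst in instances:
        tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
        if (tags.get("aws:cloudformation:stack-name") == stack_name
                and tags.get("aws:autoscaling:groupName") == asg_name):
            ips.append(inst["PrivateIpAddress"])
    return ips

# Canned sample data shaped like an EC2 describe_instances response.
sample = [
    {"PrivateIpAddress": "172.25.2.73",
     "Tags": [{"Key": "aws:cloudformation:stack-name", "Value": "mysql-prod-us"},
              {"Key": "aws:autoscaling:groupName", "Value": "mysql-prod-us-ASG1"}]},
    {"PrivateIpAddress": "172.25.2.74",
     "Tags": [{"Key": "aws:cloudformation:stack-name", "Value": "mysql-prod-us"},
              {"Key": "aws:autoscaling:groupName", "Value": "mysql-prod-us-ASG2"}]},
]
print(instances_in_asg(sample, "mysql-prod-us", "mysql-prod-us-ASG1"))  # -> ['172.25.2.73']
```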
Back to MHA config

- Instances are now able to register themselves when bootstrapping.
- Upon instance termination, the MHA config needs to be updated.
- Hooks can be added to run a custom script, but they are not very fast.

What else can we do?
Another MHA problem - 1

MHA command lines are not very user friendly:

mha@manager$ masterha_master_switch --conf=/etc/mha.conf --master_state=alive --new_master_host=172.25.2.73 --orig_master_is_new_slave --interactive=0

We built a wrapper script:
- Simpler options
- Autocompletion

root@manager# db_ha promote --new_master=172.25.2.73
Another MHA problem - 2

Wait, couldn't we also sync the MHA conf when running this script? Yes, of course!

The MHA conf is synced on demand by this script:
- Ensures the conf is always up to date when we need it.
- No more need to care about MySQL instance termination.
Another MHA problem - 3

So far, so good, but...
- Some of our slaves are not suitable at all to become master.
- We want no_master=1 in the MHA config for these servers.
- The MHA Manager just knows a bunch of MySQL servers; how can it add the no_master flag?

We need to refine our AWS components diagram again.
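The registration step can be sketched as rendering one server section of the manager's config file. The section layout follows MHA's config format ([serverN] blocks with hostname=, and no_master=1 for non-promotable slaves); the helper itself and the IPs are illustrative, not the actual sync script:

```python
# Sketch: render a server section for the MHA manager's config file when a
# MySQL instance registers itself at bootstrap. Slaves that must never be
# promoted get the no_master=1 flag.

def mha_server_section(index, ip, no_master=False):
    """Return one [serverN] block for mha.conf."""
    lines = [f"[server{index}]", f"hostname={ip}"]
    if no_master:
        lines.append("no_master=1")
    return "\n".join(lines)

print(mha_server_section(1, "172.25.2.73"))
print(mha_server_section(2, "172.25.2.74", no_master=True))
```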
AWS Components - 3rd try

[Diagram: master and slaves 1..N split into ASG1 (Potential Masters) and ASG2 (Slaves only); the MHA Manager in its own Autoscaling Group of 1 instance]
Recap so far

At this point:
- We can create an arbitrary number of MySQL servers.
- The MHA config is synced automatically.
- Any node (MySQL or Manager) that fails is rebuilt automatically thanks to ASGs.

All good? Not exactly...
Back to Data Provisioning

We have separate code paths for masters and slaves. But how do we know if a new instance is a master or a slave?

Let's use the AWS API again:
- If the instance is part of ASG2: slave.
- If the instance is part of ASG1: the 1st instance is the master, the others are slaves.

We add a replication_role tag to each instance.
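The decision rule above can be sketched as a small function. This is illustrative only: in production the ASG membership and launch position would again come from the AWS API, and the launch_order parameter is an invented stand-in for "first instance created in the group":

```python
# Sketch of the master-or-slave decision described above.

def replication_role(asg_name, launch_order):
    """Decide the role of a new instance from its ASG and launch position."""
    if asg_name == "ASG2":
        return "slave"  # ASG2 only ever holds slaves
    if asg_name == "ASG1":
        # The 1st instance of ASG1 becomes the master; later ones are slaves.
        return "master" if launch_order == 0 else "slave"
    raise ValueError(f"unknown ASG: {asg_name}")

print(replication_role("ASG1", 0))  # -> master
print(replication_role("ASG1", 2))  # -> slave
print(replication_role("ASG2", 0))  # -> slave
```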
Backups - XtraBackup vs EBS snapshots

EBS snapshots:
- Simple to use and super fast (incremental backups).

XtraBackup:
- Very complex, super slow.
- Incremental backups are difficult.

Let's use EBS snapshots then? Well, not so fast...
Backups vs Restores

EBS snapshots are great for backups, not for restores: data is lazily loaded from S3, i.e. warmup takes forever.

Example on our write-heaviest cluster:
- Restore + replication catchup with XtraBackup: 8-9 hours.
- Same with EBS snapshots: I gave up after 2 days.
Backup Script

- XtraBackup takes a full backup.
- The backup is uploaded to S3.
- Frequency of backups is stack-dependent (configuration file in S3).
- Tags are added on backup servers: timestamp and status of the latest backup, progress bar if a backup is being taken.
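The scheduling logic implied above (stack-dependent frequency from the S3 config file, timestamp from the last-backup tag) can be sketched like this. Function name, tag format, and frequencies are assumptions for illustration, not the real script:

```python
# Sketch: decide whether a backup is due, given the timestamp tag left by
# the previous run and the stack-dependent frequency from the config file.

from datetime import datetime, timedelta

def backup_due(last_backup_tag, frequency_hours, now):
    """True if the last successful backup is older than the configured frequency."""
    if last_backup_tag is None:
        return True  # never backed up yet
    last = datetime.strptime(last_backup_tag, "%Y-%m-%dT%H:%M:%S")
    return now - last >= timedelta(hours=frequency_hours)

now = datetime(2016, 4, 20, 12, 0, 0)
print(backup_due("2016-04-19T06:00:00", 24, now))  # -> True (30h old)
print(backup_due("2016-04-20T02:00:00", 24, now))  # -> False (10h old)
print(backup_due(None, 24, now))                   # -> True (no backup yet)
```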
Roadmap

- Migration to 5.7: automation already supports both 5.5 and 5.7.
- Better monitoring of errors on restores.
- Integration with PMM: implemented but broken.
- Realtime binlog streaming to Elastic Filesystem: implemented but broken.
- Group Replication instead of MHA.
The end

Thanks for attending!!

Questions/comments: stephane@slice.com