Preventing and Resolving MySQL Downtime
Jervin Real, Michael Coburn
Percona
About Us
- Jervin Real, Technical Services Manager
  - Engineer Engineering Engineers
  - APAC
- Michael Coburn, Principal Technical Account Manager
  - Responsible for managing the technical relationship with Percona's highest-revenue customers
What is Downtime?
- When your Application is completely unavailable
- When your Application is in a degraded state
- Whenever your boss says so :)
Why Prevent Downtime?
- Your business loses money when the Application is down
- You and your team's reputation suffers
Agenda
- Real world adventures
- Problems
- Solutions
- Prevention
- Putting them all together
I Had a Crash On You
I Had a Crash On You (1): Page Corruption
I Had a Crash On You (1): Page Corruption > About
- Bad disk sectors, not monitored or checked
- Page corruption at the disk level
- Server crashes when reading the corrupted page from disk
- Keeps crashing :(
I Had a Crash On You (1): Page Corruption > Solutions
- Percona Server, we tried: innodb_corrupt_table_action = salvage
  - Worked!
  - Dropped table, recreated - application back online
- Worst case:
  - innodb_force_recovery > 0
  - Data Recovery
I Had a Crash On You (2): Assertion > About
- Running 5.6.11, early adopter, InnoDB FULLTEXT
- Upgrade to 5.6.18, MySQL crashed
- Data was unusable - bug#72079
I Had a Crash On You (2): Assertion > Solutions
- Downgrade and restore from backup
- Re-execute upgrade to avoid the bug
I Had a Crash On You (1): Page Corruption > Prevention
- innodb_corrupt_table_action = salvage / warn
- pt-table-checksum
  - Regularly read through all your data and check for errors in the error log
- RAID card health checks
  - Can vary by vendor
- SMART checks
- Be vigilant for disk-level errors
Nobody's Watching
Nobody's Watching (1): Nobody Cared > About
- Percona XtraDB Cluster, 3 nodes
- A few months ago, node 3 went down due to a conflict, but nobody noticed
- A few hours ago, node 2 was killed by the OOM killer and the cluster lost quorum
- EVERYBODY NOTICED!
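The quorum loss in this story follows from simple vote counting. Below is a minimal sketch, assuming every node carries one vote and that a group needs strictly more than half of the votes of the last known membership. The function name `has_quorum` is hypothetical and is not part of any Galera or Percona XtraDB Cluster API.

```python
def has_quorum(reachable_nodes: int, last_known_members: int) -> bool:
    """Return True if the reachable nodes form a strict majority.

    Assumes every node carries exactly one vote (equal weights).
    """
    if last_known_members <= 0:
        raise ValueError("last_known_members must be positive")
    return 2 * reachable_nodes > last_known_members


# Healthy 3-node cluster: all three nodes see each other.
print(has_quorum(3, 3))  # True

# Node 3 drops out: the two survivors are 2 of 3, still a majority.
print(has_quorum(2, 3))  # True

# Membership is now 2 nodes. Node 2 is killed abruptly:
# the last node is 1 of 2, which is not a strict majority.
print(has_quorum(1, 2))  # False
```

The unnoticed loss of node 3 is what made the later OOM kill fatal: a two-node cluster cannot survive the abrupt loss of either member.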
Nobody's Watching (1): Nobody Cared > Solutions
- Bootstrap the remaining node
  - SET GLOBAL wsrep_provider_options='pc.bootstrap=1';
- SST the second and third nodes
- Define wsrep_notify_cmd temporarily
- Implement better alerting
Nobody's Watching (2): Dropped the Bomb > About
- New sysadmin received a disk space alert
- du -hx --max-depth=1 /
- /var has lots of data
- find /var/ -size +5G -exec rm -rf {} \;
- Bam, ibdata1 gone!
- Restart maintenance occurred later in the day...
Nobody's Watching (2): Dropped the Bomb > Solutions
- Restore from backup
- Really, they were lucky!
Nobody's Watching: Prevention
- Percona Monitoring Plugins
  - pmp-check-mysql-deleted-files
  - pmp-check-mysql-status
  - pmp-check-mysql-innodb
- Define a wsrep_notify_cmd script executable by the mysql user
  - Triggered on node state changes
- Take backups, and alert on failure
- Don't restart the server after an accidental delete - file handles are still open!
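The advice about not restarting rests on POSIX file semantics: removing a file's name does not free its data while a process still holds it open. Below is a minimal sketch of that behaviour on a scratch temporary file, not a MySQL data file. It assumes a Linux or other POSIX system, and the helper name `read_after_unlink` is hypothetical.

```python
import os
import tempfile


def read_after_unlink(payload: bytes) -> bytes:
    """Write payload to a temp file, delete the file by name,
    then read it back through the still-open descriptor."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)                    # like the accidental rm
        assert not os.path.exists(path)    # the name is gone...
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, len(payload))   # ...but the data is still readable
    finally:
        os.close(fd)                       # closing the handle frees the data for good


recovered = read_after_unlink(b"pretend this is ibdata1")
print(recovered)
```

A restart closes every handle mysqld holds, which is the equivalent of the `os.close` call here: after that point the deleted data is no longer reachable.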
Self Induced Pain
Self Induced Pain (1): Query Cache
Waiting for query cache lock
root# ~> pt-sift /var/lib/pt-stalk/...
--processlist--
    State
    226
     90  Waiting for query cache lock
      4  Sending data
      4  Master has sent all binlog to slave; waiting for binlog to be updated
      2  init
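The processlist summary above is essentially a count of threads per State. Below is a rough sketch of that aggregation, assuming the states have already been collected into a list of strings (for example from SHOW PROCESSLIST). `summarize_states` is a hypothetical helper and is not part of Percona Toolkit.

```python
from collections import Counter


def summarize_states(states):
    """Return (count, state) pairs, most frequent first."""
    counts = Counter(states)
    # Sort by descending count, then by state name for a stable order.
    return sorted(((n, s) for s, n in counts.items()),
                  key=lambda pair: (-pair[0], pair[1]))


# Sample input modelled on the counts shown above.
sample = (["Waiting for query cache lock"] * 90
          + ["Sending data"] * 4
          + ["init"] * 2)

for count, state in summarize_states(sample):
    print(f"{count:5d}  {state}")
```

When one state dominates the summary the way the query cache lock does here, that state is usually the first place to look.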
Self Induced Pain (1): Query Cache > About
- Global mutex
- Point of contention
- Especially on hot dataset/table
- More so with a large QC
Self Induced Pain (1): Query Cache > Solutions
- Set it to a small size to reduce performance overhead
- Disable it completely to avoid contention
- Hint offending queries to skip the query cache, i.e. SELECT SQL_NO_CACHE
Self Induced Pain (2): Buffer Pool Dump/Restore
- Dumps buffer pool page list to disk
- Reloads buffer pool based on this list at startup
- Meant to help speed up buffer pool warmup
Self Induced Pain (2): Buffer Pool Dump/Restore > About
- Maintenance restart, buffer pool dump and restore enabled
- Yay! Expecting everything to go well.
- 30 minutes in, performance still really bad, IO thrashing
- Large buffer pool, busy read/write
Self Induced Pain (2): Buffer Pool Dump/Restore > Solutions
- Extend your maintenance period to let the server warm up if possible; otherwise the warmup reads and production traffic will contend on IO
- RAID1 of 2 SATA disks is not a license to use buffer pool warmup on 240GB of buffer pool
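A back-of-the-envelope estimate shows why 240GB on two SATA disks hurts. Below is a sketch assuming a sustained read throughput that you would measure on your own hardware. The 100 MB/s figure is an illustrative assumption, not a number from this incident, and `warmup_minutes` is a hypothetical helper.

```python
def warmup_minutes(buffer_pool_gb: float, read_mb_per_s: float) -> float:
    """Estimate minutes needed to read buffer_pool_gb from disk
    at a sustained read_mb_per_s (1 GB = 1024 MB)."""
    if read_mb_per_s <= 0:
        raise ValueError("read_mb_per_s must be positive")
    return buffer_pool_gb * 1024 / read_mb_per_s / 60


# 240GB buffer pool, assumed 100 MB/s of sustained reads
print(round(warmup_minutes(240, 100), 1))  # 41.0
```

This is a best case: the reload issues largely random reads and competes with production traffic, so the real warmup can take considerably longer.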
Self-Induced Pain Prevention
- Percona Toolkit
  - pt-stalk
  - pt-sift
  - pt-kill
- Disable OOM killer
- Configure appropriate disk scheduler
- Check the error log for "Buffer pool load complete"
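The error log check can be scripted. Below is a minimal sketch that scans log text for the completion message. The exact wording varies by version (MySQL 5.6 logs a line containing "Buffer pool(s) load completed at"), so the pattern is deliberately loose and should be adjusted against your own log. The helper name and the sample excerpt are hypothetical.

```python
import re

# Loose pattern: matches "Buffer pool load complete" as well as
# "Buffer pool(s) load completed at ...".
LOAD_DONE = re.compile(r"Buffer pool(\(s\))? load complete", re.IGNORECASE)


def buffer_pool_load_finished(log_text: str) -> bool:
    """Return True if any line in log_text reports a finished buffer pool load."""
    return any(LOAD_DONE.search(line) for line in log_text.splitlines())


# Hypothetical log excerpt, for illustration only.
sample_log = """\
InnoDB: Buffer pool(s) load started
InnoDB: Buffer pool(s) load completed at 140101 12:00:00
"""
print(buffer_pool_load_finished(sample_log))  # True
```

Holding traffic back until this returns True is one way to avoid the IO contention described in the buffer pool story.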
MySQL, MySQL! What Have Suffereth Ye Thee?
MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > About
- Slow queries
- Connections build up
- Slow response times
- Long running transactions
- Stop the World scenario
MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > About
--innodb--
    txns: 486xACTIVE (28s) 994xnot (0s) 227xLOCK WAIT (25844s)
    0 queries inside InnoDB, 0 queries in queue
    Main thread: sleeping, pending reads 0, writes 28, flush 1
    Log: lsn = 2147483647, chkp = 2147483647, chkp age = 210625191
MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > About
---TRANSACTION 230207990, ACTIVE 13779 sec fetching rows
mysql tables in use 1, locked 1
80337 lock struct(s), heap size 8271400, 10979242 row lock(s)
MySQL thread id 671621, OS thread handle 0x7fe03528a700, query id 37505085 localhost magento Sending data
SELECT `sales_flat_quote_item`.* FROM `sales_flat_quote_item` LIMIT 376 OFFSET 491056
MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > Solutions
- KILL long running trx
- pt-kill for persistent long running trx
- Deploy immediate code changes to disable the offending code
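Choosing what to KILL comes down to a threshold on transaction age. Below is a sketch assuming you already have (thread id, active seconds) pairs, for example gathered from the InnoDB status output. `kill_statements` and the 600-second threshold are hypothetical. pt-kill does this kind of matching with far more safeguards.

```python
def kill_statements(transactions, max_active_seconds):
    """Build KILL statements for transactions active longer than the threshold.

    transactions: iterable of (mysql_thread_id, active_seconds) pairs.
    """
    return [f"KILL {thread_id};"
            for thread_id, active_seconds in transactions
            if active_seconds > max_active_seconds]


# Thread 671621 is the 13779-second transaction from the status output above;
# the second entry is made up for contrast.
trx = [(671621, 13779), (700001, 28)]
for stmt in kill_statements(trx, max_active_seconds=600):
    print(stmt)  # KILL 671621;
```

Generating the statements rather than executing them leaves room for a human to review the list before anything is killed.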
MySQL, MySQL! What Have Suffereth Ye Thee? (2): CPU Load > About
- MySQL is still responding
- Contention on all sorts of mutexes is killing latency
  - trx_sys->mutex
  - block->lock
  - lock_sys->mutex
  - lock_sys->wait_mutex
- Service impact means lost income
MySQL, MySQL! What Have Suffereth Ye Thee? (2): CPU Load > Solutions
- innodb_thread_concurrency > 0
MySQL, MySQL! What Have Suffereth Ye Thee? (3): CPU Load > About
Opening tables, Closing tables
--processlist--
    State
    578  Opening tables
     32  closing tables
MySQL, MySQL! What Have Suffereth Ye Thee? (3): CPU Load > About
- Contention on LOCK_open mutex
- Risk of negative scalability
MySQL, MySQL! What Have Suffereth Ye Thee? (3): CPU Load > Solutions
- Tune table_open_cache/table_definition_cache
- table_open_cache_instances (5.6+)
- Shard either logically/horizontally, or run multiple mysql instances, to reduce the number of objects per instance
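One common way to judge whether the table cache is too small is to watch how fast the Opened_tables status counter grows. Below is a sketch assuming two samples of that counter taken a known number of seconds apart (for example via SHOW GLOBAL STATUS). The helper name and the sample numbers are illustrative assumptions, not official guidance.

```python
def opened_tables_per_second(first: int, second: int, interval_s: float) -> float:
    """Rate of table opens between two samples of the Opened_tables counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (second - first) / interval_s


# Illustrative numbers only: 1500 opens in 60 seconds.
rate = opened_tables_per_second(120_000, 121_500, 60)
print(rate)  # 25.0
```

A rate that stays high under steady traffic suggests tables are being evicted and reopened, which is when raising table_open_cache is worth testing.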
MySQL, MySQL! What Have Suffereth Ye Thee? (2,3): Prevention
- pt-kill --log
- MySQL Server Configuration
  a. Remember to tune innodb_thread_concurrency (default is 0)
  b. table_open_cache + table_open_cache_instances
- Application Stack Configuration (Schema Design)
  a. Single tenant per schema
  b. Multiple tenants per schema (each table has client_id column)
  c. All tenants in one schema
Wizard of OS (1): Disk Performance
- Disk performance problems cascade to MySQL and on to the application
Wizard of OS (1): Disk Performance > About
- Slow writes, binlogs, redo logs, syncs
- Transactions stalling on COMMIT, updating, inserting
- Replication getting delayed if the node is a slave
- Translates to latency
Wizard of OS (1): Disk Performance > Solutions
- Check for a RAID controller running in write-through mode
- Could also be a bad disk!
Wizard of OS (2): Swapping
- Swapping heavily, with a significant amount of RAM free
Wizard of OS (2): Swapping > About
- Swapping induces a significant amount of IO
- Swapping in and out of disk is mighty expensive
- Affects MySQL in magnificent ways
- Swap Insanity!
Wizard of OS (2): Swapping > Solutions
- NUMA Interleave
- Percona Server is NUMA configurable
  - numa_interleave
  - flush_caches
- Check numastat - perl check_numa.pl
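Swapping while RAM is free typically means one NUMA node has run out of memory while another still has plenty. Below is a sketch of that check, assuming per-node free memory in MB has already been collected (for example from numastat output). `starved_nodes`, the sample figures, and the 1024 MB threshold are all hypothetical.

```python
def starved_nodes(free_mb_by_node, min_free_mb):
    """Return NUMA nodes whose free memory is below min_free_mb,
    but only if at least one other node still has that much free."""
    low = sorted(node for node, free in free_mb_by_node.items()
                 if free < min_free_mb)
    has_headroom = any(free >= min_free_mb
                       for free in free_mb_by_node.values())
    return low if has_headroom else []


# Made-up sample: node 0 is nearly full while node 1 has plenty free.
print(starved_nodes({0: 150, 1: 30000}, min_free_mb=1024))  # [0]
```

An empty result with low free memory everywhere points to a plain memory shortage rather than a NUMA imbalance.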
Wizard of OS: Prevention
- Tune:
  - vm.swappiness
  - NUMA policy
  - disk scheduler
  - mount options appropriately (ext4, xfs) (nobarrier, noatime)
- pt-heartbeat - monitor replication delay
Percona Server Features
- Enable InnoDB Buffer Pool warming
- Enable userstat for table & index statistics
- Enable verbose slow log
- Enable Query Response Time plugin
Thank You!
- Jervin Real, jervin.real@percona.com, Technical Services Manager, APAC
- Michael Coburn, michael.coburn@percona.com, Principal Technical Account Manager, USA