Strategies for MySQL performance and fault tolerance

You already have complexity in your application tier -- keep it there and relegate MySQL to a simple, durable record store. For fault tolerance, the strategies and pitfalls of various replication/cluster configurations are discussed.

Ryan Mitchell <rjm@tcl.net>
Telecom Logic, LLC.
Reproduction prohibited without consent from author.

Table of Contents

Preface
Design a simple data model
Propose a replication and fail-over scheme
Failure modes
Audit your queries
Specific example of MySQL setup and hardware
Client logic for fail-over and load-balancing
Additional advice for high availability
Conclusion

Preface

This article will help you achieve performance, scalability, and fault tolerance for your database-driven application by looking at design choices for your application layer, data models, load-balancing and fail-over strategies.

Recent (c. 2011) releases and distributions of MySQL server come with reasonable default settings that will give you good performance on modern hardware for most applications out there in the wild, with the exception of innodb_buffer_pool_size, which is set needlessly low given the abundance of memory most servers have. From that starting point, there are many websites that provide loads of information on tuning the myriad MySQL and OS variables, as well as strategies for building your web-scale application. One of my favorites is [link missing].

It's easy and fun to get lost delving into the depths of performance tuning. But there's a good chance it's unneeded and you should really just get on with programming the rest of your application. So, when should you dive in? Don't optimize too early; however, do read this article and put a lot of design effort into the application layers and data models that will enable scalability and performance.

This article assumes a basic familiarity with MySQL replication modes and n-tier application design.
Design a simple data model

The most important advice for enabling database performance and scalability is to keep your data model and queries as simple as possible. Your application tiers are already complex; keep the complex data models and operations there, where they belong. Take a page from the NoSQL camp and try to relegate your database to a simple, durable record store.

Avoid complex table relationships (don't try to mirror your class hierarchy, or attempt complex mapping with an ORM system). Avoid stored procedures and triggers, except in trivial cases to enforce basic data sanity. These are design goals that are easier stated than realized, not hard requirements, but they will pay dividends when your application needs to scale and perform at Facebook levels.

Why you want a simplified database tier is easy to justify: you're offloading work from your database engine, and a simple system is easier to understand and hence to manage and optimize. How you get to that optimum will be harder, especially if you have an existing spaghetti monster of a data model in your application. If it's early in the design cycle, fight for simplicity everywhere you can. Don't overengineer, and don't fall into the trap of using design patterns just for the sake of using them.

The main strategy is this: concentrate the code that maps your application data model to a simpler one in a data access object (DAO) layer. If it's ugly code, fine, you can clean it up later, but you'll be happy to have it in just one place.

When designing class hierarchies, think about how you can map objects into simple records-with-properties that can be handled by external systems. You may already be forced into this step because you have a public RESTful/JSON service allowing access to services and record manipulation. Or you are using an external caching system that requires you to serialize your objects in a format that other processes can make use of. In those cases, JSON is a good format because you are using a type system that is directly mappable to SQL data types; now you can write a database schema that simply stores those JSON attributes.
The complexity in your code is often unavoidable; for various reasons (examples given above) you will need a simpler and externalizable data model. Use this to your advantage in the database tier and forget about how "Enterprise" object persistence should be done. Your DAO system allows you to deal with this mapping in a single place.

Once you have this working, you have the proper place in your code to add caching and take advantage of all the copious memory your servers have. Caching is a non-trivial endeavor when you have to take into account transactions and high concurrency; there are now lots of good products and tutorials available to help you on your way.

[Figure: partial system diagram illustrating the layer that maps your application data model into something simpler that external systems can digest.]
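The mapping described above can be sketched as follows. This is a minimal illustration, not the author's implementation: the Customer and Address classes, the column names, and the CustomerDao class are all hypothetical. It shows a nested application object being flattened into a record of scalar attributes, each of which maps onto a plain SQL column and onto a JSON value.

```python
import json
from dataclasses import dataclass


@dataclass
class Address:
    # Hypothetical nested application object.
    city: str
    country: str


@dataclass
class Customer:
    # Hypothetical application object, richer than a table row.
    customer_id: int
    name: str
    address: Address
    active: bool = True


class CustomerDao:
    """The single place where application objects become flat records."""

    @staticmethod
    def to_record(customer: Customer) -> dict:
        # Every value is a scalar that maps to both a JSON type and a
        # simple SQL column type (INT, VARCHAR, BOOLEAN).
        return {
            "customer_id": customer.customer_id,
            "name": customer.name,
            "active": customer.active,
            "address_city": customer.address.city,
            "address_country": customer.address.country,
        }

    @staticmethod
    def from_record(record: dict) -> Customer:
        # Inverse mapping: rebuild the nested object from the flat record.
        return Customer(
            customer_id=record["customer_id"],
            name=record["name"],
            address=Address(record["address_city"], record["address_country"]),
            active=record["active"],
        )


customer = Customer(1, "Ada", Address("London", "UK"))
record = CustomerDao.to_record(customer)
# The same flat record can go to a cache or REST client as JSON.
payload = json.dumps(record, sort_keys=True)
```

Because the record is flat, the same dictionary can feed an INSERT statement, a cache entry, or a JSON response without further translation.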

The most important feature you can add to a DAO system outside of caching, for database performance, is concurrency control. The primary reason to manage concurrency at the DAO level is that a disk-based database system doesn't scale with concurrent queries the same way your CPU does when crunching through L1-cached code and data. You could configure MySQL to buffer your entire data set, but for the sake of argument let's assume that it doesn't, or doesn't 100% of the time. Whereas 10 CPU-bound threads may each run at 1/10th the speed when multitasking on a single core, that kind of concurrent disk access may grind your database to a crawl. For example, a simple query to update a single record may take 20ms on average to complete, but if a long-running OLAP query comes along (thrashing buffers and disk read/write heads), your previous query could take much longer than 40ms, the time you'd expect in the ideal case.

[Figure: simplified picture illustrating that your application, using CPU resources, is better suited to handle concurrency than an RDBMS dealing with limited disk I/O.]

Coding your DAO system to limit concurrency allows you to ensure the maximum (or fair) performance for the greatest number of requests, and to avoid the limitations of the RDBMS's and operating system's I/O schedulers. Study the Actor model for the basic design idea; limited-size thread pools and message queues are the key components. Depending on who is requesting, you can prioritize queries as needed. This gives you some built-in protection against a denial-of-service attack designed to overload your resources.

Side note on concurrency and user interfaces: perform operations asynchronously as much as possible. If the user/client doesn't absolutely need to hang around for a transaction to finish, don't force them to. Put requests into priority queues and let your DAO system manage them as it sees fit.
Many of these ideas could be implemented in a feature-rich RDBMS, but since you're implementing many of them in your application anyway, don't. (If your application is simple enough for a 2-tier architecture, then this article is overkill.)
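The concurrency control described above (a limited-size thread pool fed by a priority queue) might look roughly like the sketch below. The class name BoundedDao, the priority numbering, and the use of zero-argument callables to represent database operations are assumptions made for illustration; a real DAO would run actual queries inside the workers.

```python
import itertools
import queue
import threading
from concurrent.futures import Future


class BoundedDao:
    """Runs database operations on a small, fixed pool of worker threads.

    At most `workers` operations reach the database at once; the rest
    wait in a priority queue (lower priority number runs first).
    """

    def __init__(self, workers: int = 4):
        self._queue = queue.PriorityQueue()
        # Sequence number breaks ties so equal priorities stay FIFO and
        # the callables themselves are never compared.
        self._sequence = itertools.count()
        for _ in range(workers):
            threading.Thread(target=self._work, daemon=True).start()

    def submit(self, priority: int, operation) -> Future:
        """Queue a zero-argument callable and return a Future for its result."""
        future = Future()
        self._queue.put((priority, next(self._sequence), operation, future))
        return future

    def _work(self) -> None:
        while True:
            _, _, operation, future = self._queue.get()
            try:
                future.set_result(operation())
            except Exception as exc:
                # Hand the failure back to the caller instead of killing the worker.
                future.set_exception(exc)
```

A caller that does not need the answer immediately can submit the operation and carry on, which matches the advice on asynchronous user interfaces; a caller that does need it calls result() on the returned Future. Interactive updates could be submitted with a low priority number and report queries with a high one.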

Propose a replication and fail-over scheme

Now that your data model is as simple as possible, you need to think about how your software will query that data in the database. The types of queries performed, access patterns, and transactional requirements strongly shape your options for clustering and tuning.

Scenario #1: hard transactional and consistency requirements, low read/write ratio

When the software commits a transaction, data must absolutely not be lost, and immediate subsequent reads from any database server in the cluster must provide consistent results. In this case you will have limited write concurrency available. In fact, the more replicated database servers you have, the slower the system becomes, since transactions need to complete everywhere to ensure consistency (you will need the MySQL Cluster product if you truly need synchronous replication). Even if a single server is powerful enough to handle the load, you still need to invest in a replicated secondary server to use in case of fail-over. This could be a banking application, or it could describe a subset of your system.

Scenario #2: eventual consistency is tolerated, good read/write ratio

You want committed writes to make it to disk 99.99% of the time, but are willing to compromise some data safety for performance. In this case, read-only database slaves can be added as needed to handle the load. Multiple masters can be used to load balance writes as well. After a write on one of the masters, the cluster eventually becomes consistent: most of the time within a second, but sometimes much later, and this is not a major problem.

The starting point for your replication and fail-over scheme will be a simple master-master 2-node cluster, using MySQL replication. This setup benefits either scenario, and is the simplest setup that allows you to easily fail over from one server to the other as needed.
The following advice and use cases apply:

(1) Perform all or most transactional (OLTP) queries, such as simple record updates and reads, on one of the servers; call this server the primary, and strive to keep it up and running all year long.

(2) Perform longer-running OLAP-style queries on the other server (call this server the secondary), which will be unloaded most of the time. This prevents interfering with the disks and buffers on the primary. OLAP queries typically come from report generators, or scripts that do batch updates, and don't require the up-to-the-millisecond latest data from the primary.

(3) Perform backups and other disk-intensive maintenance jobs using the secondary server, which you can take down for full backups as needed.

(4) If the master goes down, or is overloaded, clients can start using the secondary immediately; this is in contrast to a hot standby server, which needs to switch modes to become the master (at which point the original master will no longer be available).

A note on primary keys when using multiple masters: MySQL has a great feature that allows each node in a cluster of servers to generate auto_increment values that are unique across the cluster. See the documentation for the configuration properties auto_increment_increment and auto_increment_offset. If you will not use auto_increment columns in your schema, you must devise a scheme to generate unique primary keys no matter which server you are connected to. For example, you could use Java's UUID generators. Without taking this into consideration, you may run into duplicate primary key collisions when multiple clients use multiple servers for inserting new rows.

If your data usage patterns are such that you have many more reads than writes, and eventual consistency is tolerated, then you can easily add more replicated nodes to your MySQL cluster. Given that you already have two read/write nodes, you will probably only need to add read-only slaves.
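To make the earlier note on primary keys concrete, the sketch below models how two masters avoid collisions. MySQL documents auto_increment values as following the pattern offset + N × increment; the function name and the choice of two nodes with offsets 1 and 2 are assumptions for illustration. The second function shows the UUID alternative, using Python's uuid module in place of the Java generators mentioned in the text.

```python
import uuid


def auto_increment_values(offset: int, increment: int, count: int) -> list:
    """First `count` values a node hands out: offset + N * increment."""
    return [offset + n * increment for n in range(count)]


# Two masters, both with auto_increment_increment = 2, and with
# auto_increment_offset = 1 on one node and 2 on the other.
node_a = auto_increment_values(offset=1, increment=2, count=5)
node_b = auto_increment_values(offset=2, increment=2, count=5)

# The two sequences interleave and never share a value.
assert not set(node_a) & set(node_b)


def new_primary_key() -> str:
    """Alternative without auto_increment: a random UUID generated by the
    client, unique no matter which server receives the insert."""
    return str(uuid.uuid4())
```

With this layout one node issues only odd keys and the other only even keys, so rows inserted on either master replicate to the other without key conflicts. Setting the increment to the largest number of masters you expect to run leaves room to add nodes later.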

Keep in mind that every write that occurs on a master also needs to be written to the read-only slaves, so the slaves still need the performance to handle a lot of writes, though the load on the read-only slaves will often be less, especially when binary logging is disabled.

If you absolutely need more write performance than a single node can handle, even after good hardware choices and server tuning, there are still options. Look at which tables and disks are getting hammered, then set up another physical disk and move physical tables around to balance the I/O. Next, if needed, try to partition your application into one database that handles your non-critical queries and another database that handles the critical ones. Chances are there are only limited data sets and queries where your hard transactional requirements are needed. You will need to rely on explicit two-phase commit to coordinate transactions across databases.

If you still exceed the performance available in a single server for write transactions, consider using an in-memory database that uses a networked cluster of replicated caches to ensure durability.

[Figure: diagram illustrating how I/O and processing can be broken up.]

Failure modes

To augment the previous section, we will look at the most common types of failure you are likely to encounter, and how your architecture choices help or hinder you.

(1) Your secondary master goes down because you took it down for maintenance, such as when you need to do a full offline backup. There are ways to do full backups without needing to stop the MySQL server, but many sites just stop the server for a few minutes while all the data files are tarballed away somewhere safe. The same applies to any other planned maintenance window; just assume that many times during the year one of your servers will need to be taken offline. The multi-master design handles this case perfectly well. Your primary master stays up the vast majority of the time, and your application and its performance will usually not be affected, since they depend largely on the primary master. Planned maintenance should be done during times of low load.

(2) One of your MySQL nodes is degraded or inoperable, but otherwise stays up. For example, a disk drive fills up, and MySQL refuses to service any queries until disk space is freed. Repairing the degraded node is often a manual procedure, and the problem could take some time to discover and deal with. It is very important that your monitoring scripts can detect degraded states, and not just dead services. Below, in the section on client logic, we discuss monitoring scripts and fail-over techniques.

(3) Replication on one or more slaves stops due to some logical error. For example, replication has gotten out of sync, and the secondary master and read-only slaves do not have consistent data with respect to each other. Depending on how badly your nodes are out of sync, you may need to rebuild the slaves from the primary master's data. This is a time-consuming and probably manual process. Replication status between nodes is clearly something you will want to look at with your monitoring scripts.

(4) Human or application error: an evil or otherwise erroneous SQL statement corrupts or deletes a lot of data.
This is the worst kind of failure, because it can be indistinguishable from normal database operations, and the problem gets replicated to all of your nodes and eventually onto backup files. (Which, incidentally, is why you want to keep multiple backup archives from various points in time.) There is nothing in this architecture or any other that protects you from this type of event. The way to protect yourself is to have the master's binary log files ready and available to help do a point-in-time recovery.

(5) Your website and application get hammered, either by hordes of adoring fans or by a denial-of-service (DoS) attack. There is no easy solution to a massive DoS attack, but certainly the whole point of your architecture is to help you scale to handle the Facebook type of loads you hope will come. With proper load balancing and available slave nodes, in addition to effective application-level caching as described above, you should be able to weather large storms. If you have huge, unpredictable spikes in load, consider deploying your database cluster in a virtual server farm, where you can spin up new read-only slaves quickly.

(6) You have some kind of hardware failure. When choosing your hardware in the first place, if you value your time, spend money on high-quality hardware with lots of redundancy. We have had an excellent track record of minimal hardware failures using HP Proliant DL380 servers for databases. Use serial attached SCSI (SAS) drives, and always install them in mirrored pairs. With continuous SNMP monitoring of the hardware, you can usually catch and replace a failed disk drive before its mirror also fails. Less frequent than disk drive failures are failures of fans and power supplies. Again, these are installed redundantly and can be replaced before they cause you any downtime. The multi-master and multiple read-only slave setup protects you against the most common types of hardware failures.
It is unlikely that both masters will encounter hardware failures simultaneously, unless the cause is a major external event that affects the entire rack in your data center (but in that case your application and everything else is dead anyway).

Audit your queries

Even though you should be querying the database as little as possible, having followed the advice above regarding data modeling and application-level caching and processing, you will still need to put on your DBA hat and look at the patterns of queries being sent to the database so you can refine your tuning and physical partitioning. Know about, and apply, the traditional performance techniques: audit your queries to make sure you're using indexes when appropriate, employ schema denormalization to improve performance, and take careful measures to ensure that transactions aren't holding locks longer than they need to. However, to hammer in the point, your best strategy starts with maintaining a simple data model in the database, so that auditing and tuning queries is a simple task.

A primary product of analyzing your queries and data access patterns is determining how you can partition some of your data to optimize available I/O across multiple disks. This could mean splitting up a huge logical table into separate physical ones, or arranging your tables across multiple disks so querying against one won't impact the performance of querying others. For some heavily used tables, consider giving them their own dedicated drive. Place log files on separate drives. If you are running MySQL on Linux, install atop to look at the resource usage of all your disk drives during heavy loads; this tool comes in handy for determining where your I/O or CPU bottlenecks are.

Another product of query analysis is determining whether many concurrent queries are causing cache/buffer misses and resulting in an inordinate amount of disk I/O for other queries. If this is the case, and you've already maximized use of MySQL InnoDB buffer space, then go back to your code and limit concurrency as explained above.
Be aware that a common performance problem is caused by a small number of queries that scan large tables and thrash the buffers, so other queries that really should be using the buffers are slowed down. See the article here on the subject of innodb_old_blocks_time for a way to mitigate this problem.

Specific example of MySQL setup and hardware

Incorporating the strategies and advice above for the database tier, here is a specific example showing key configuration elements.

Hardware for read/write masters: HP Proliant DL380 with dual E5440 Xeon processors and 16GB of RAM. Use a SCSI array controller with a battery-backed write cache, and 8 disks configured as 4 mirrored pairs to be used as follows:

drive pair #1: operating system, MySQL temporary tables and InnoDB transaction (redo) logs (innodb_log_group_home_dir), plus temporary space to make copies of binary logs.
drive pair #2: binary log files (log_bin).
drive pair #3: InnoDB files for certain critical tables that are accessed and/or scanned frequently; I/O access that would otherwise impact performance of other tables were they to share the same disk.
drive pair #4: the remainder of the InnoDB tables and all other MySQL database files.

Use nearly all available RAM for the InnoDB buffer pool (innodb_buffer_pool_size), assuming this server is dedicated to MySQL. Set innodb_old_blocks_time and innodb_old_blocks_pct so OLAP table scans don't kill the benefits of your large InnoDB buffer pool. To avoid double buffering, set innodb_flush_method=O_DIRECT. On the primary, innodb_flush_log_at_trx_commit is set to 1. On the secondary, the value is relaxed to 2.

Hardware for read-only slaves: HP Proliant DL360. Same configuration as above with fewer disks, e.g.:

drive pair #1: operating system, all MySQL database files and InnoDB tables except the critical ones.
drive pair #2: critical InnoDB table files.

Client logic for fail-over and load-balancing

Having multiple read/write masters and other slaves to load balance between requires complex logic for clients to decide which servers to use for which queries, and which servers are alive and happy. You could write your own database code on the client to handle this or, more likely, rely on the services provided by the application server you're using. Implementing all of the complex logic needed to make things reliable may quickly exceed the capabilities of your app container, or the amount of code you're willing to cram into each client. Since you will probably have multiple clients on different platforms using different languages, it makes more sense to concentrate this logic in a load balancer that performs intelligent monitoring of your MySQL servers.

The load balancer haproxy, coupled with some intelligent monitoring scripts, can do a great job of transparently providing an always-on connection to nodes in your MySQL cluster. It's not a trivial solution, but it is better than the alternatives. Keep in mind the following:

(1) Write monitoring scripts to keep haproxy apprised of the status of your database nodes. Simply checking that the MySQL service is up is not sufficient; it is often the case that MySQL will accept connections but is slow in servicing them.
MySQL could be deadlocked with competing transactions trying to commit, or it could be overloaded and very slow due to I/O contention, or replication could be slow or badly delayed. All of these cases are things you need to check for explicitly; then you can ensure that haproxy will direct your clients to the best possible server.

(2) Your clients must be configured not to hold onto connections for too long. haproxy balances TCP connections, and only does this when your client makes a new connection to the database. You still don't want your clients to open and close database connections after every query, but limit connection reuse to under a minute or so.

(3) Configure haproxy to group masters and slaves into port groups so clients, knowing which type of query they are about to execute (OLAP, OLTP, read-only), can select the preferred group.

[Figure: illustration of some of these concepts.]
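The health checks in point (1) could be reduced to a decision function like the sketch below. One common arrangement is for haproxy to poll a small HTTP endpoint and treat a non-200 answer as a failed check; this sketch covers only the decision, not the endpoint or the MySQL connection. The function name, both thresholds, and the idea of timing a trivial probe query are assumptions for illustration; the dictionary keys are column names from MySQL's SHOW SLAVE STATUS output.

```python
from typing import Mapping, Optional, Tuple

MAX_LAG_SECONDS = 30      # hypothetical threshold; tune for your application
MAX_PROBE_SECONDS = 0.5   # hypothetical threshold for a trivial probe query


def node_health(
    slave_status: Optional[Mapping[str, object]],
    probe_seconds: Optional[float],
) -> Tuple[int, str]:
    """Return (http_status, reason) for the load balancer's check.

    slave_status: one row of SHOW SLAVE STATUS as a dict, or None if the
                  node does not replicate from anything.
    probe_seconds: how long a trivial probe query took, or None if it
                   failed or timed out.
    """
    if probe_seconds is None:
        return 503, "probe query failed or timed out"
    if probe_seconds > MAX_PROBE_SECONDS:
        # The service accepts connections but is slow in servicing them.
        return 503, f"probe query slow: {probe_seconds:.2f}s"
    if slave_status is not None:
        io_ok = slave_status.get("Slave_IO_Running") == "Yes"
        sql_ok = slave_status.get("Slave_SQL_Running") == "Yes"
        if not (io_ok and sql_ok):
            return 503, "replication stopped"
        lag = slave_status.get("Seconds_Behind_Master")
        if lag is None:
            return 503, "replication lag unknown"
        if int(lag) > MAX_LAG_SECONDS:
            return 503, f"replication lag {lag}s"
    return 200, "ok"
```

A monitoring script would gather the inputs on each node, call this function, and answer the load balancer with the resulting status code. A fuller check might also detect a full disk or long-running lock waits, which this sketch leaves out.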

Additional advice for high availability

The primary way to achieve high availability for any engineering system is to expect failure and then deal with it gracefully.

When (not if) your primary master experiences a major failure, you must be prepared to get it back up quickly, and you must have practiced this before the first time it happens. Assume you need to do a point-in-time recovery from an older backup and archived binlogs. How quickly do you think you can make this happen? Here are a few questions (without answers!) to address:

How many binlogs do you need to roll forward through? Do you really know how long that will take? (Hint: it could take all day, depending on what's in those binlogs and when the last full backup was completed.)

Are your backups and binlogs (and backups of the binlogs) readily accessible? If you need to untar some large backup files, do you have a fast and spacious place to do that?

If the secondary master's replication gets out of sync with the primary, or one of the read-only slaves needs to be rebuilt, do you know how to do this quickly?

While you are performing one of these types of recoveries, do your haproxy monitoring scripts know what's going on, and can they rebalance the load accordingly?

Rogue queries that corrupt data may go undetected for a very long time, and will corrupt your backups along with everything else; do you keep old backups so you can roll back to a point before the problem started?

Conclusion

This article devoted almost more space to software architecture than to MySQL design, and that captures the essential point. High complexity already exists, unavoidably, in your application tiers; avoiding a new layer of complexity in the database allows you to get great performance and scalability out of a minimal number of servers. Avoid database performance problems by not creating them in the first place: aim to use your database as a simple and reliable read/write store for records, and leave data and processing complexity in the application tiers. If you can follow that advice, or at least get 80% of the way towards that ideal, then the servers and MySQL configuration strategies mentioned here will probably be overkill.
