PostgreSQL to MySQL: A DBA's Perspective. Patrick King (@mr_mustash)
Yelp's Mission: Connecting people with great local businesses.
My Database Experience: Started using Postgres 7 years ago, on Postgres 8.4 (released in 2009), running a 50+ TB OLAP database. Started using MySQL 11 months, 22 days, and 2 hours ago, my first time using another RDBMS in a professional setting.
Topics I'll Cover Today: Replication, Schema Changes, Query Plans, Indexing
Topics I Won't Cover Today
There is more that unites us than divides us.
MySQL at Yelp: A monolithic LAMP stack dating back to 2004, with features and data being moved out of the monolith and into services. Hundreds of databases/schemas. 15+ schema changes each week, achieved using pt-online-schema-change (pt-osc). 400 engineers, 100 interns, and 4 DBAs.
MySQL at Yelp: MySQL 5.6 with statement-based replication. Replication trees up to 5 nodes deep, with one "intermediate master" per datacenter. Vertical sharding at the database level; no horizontal sharding of data across multiple machines. No site downtime allowed for database maintenance, so re-mastering of a database cluster happens online.
MySQL at Yelp: Surprises. No physical sharding or partitioning? (The largest single table is 4B+ rows.) The number of schema changes we do each week. The nested replication hierarchy. MySQL replication in general.
Postgres at Yelp Used by both Eat24 and Yelp Reservations Postgres 9.5 and 9.6 Monolithic data, very few services or service databases
Replication
Replication: Postgres. Streaming replication: replicas are byte-for-byte copies of the master database, and are fully read-only. hot_standby_feedback tells the master which rows replica queries still need, so vacuum doesn't remove them. The Write-Ahead Log (WAL) is used for both replication and crash recovery.
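On Postgres 9.x, setting up a hot standby is mostly configuration. A minimal sketch (hostname and replication user are hypothetical; recovery.conf was later merged into postgresql.conf in version 12):

```
# postgresql.conf on the replica
hot_standby = on
hot_standby_feedback = on

# recovery.conf on the replica (Postgres 9.x)
standby_mode = 'on'
primary_conninfo = 'host=pg-master.example.com port=5432 user=replicator'
```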
Replication: MySQL. Statement-based replication: each insert/update/delete is logged in the binary logs after it is committed. Replicas pull changes from the binary logs and run the same SQL statement. There is no other communication between master and replica, which allows for awesome architecture designs where replicas have partial data or different indexes.
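Because replication is just replayed SQL, you can inspect it directly on the master. A sketch; the binlog file name is an assumption (use SHOW MASTER STATUS to find yours):

```sql
-- Confirm statement-based logging, then peek at the replicated statements
SHOW VARIABLES LIKE 'binlog_format';
SHOW BINLOG EVENTS IN 'mysql-bin.000001' LIMIT 10;
```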
Replication: Lessons Learned. MySQL replicas only receive a transaction after it has been committed on the master. Long-running statements (like an UPDATE with a non-indexed WHERE clause) can take forever on the master, and then take forever again on each replica.
Replication: Lessons Learned. MySQL statement-based vs. Postgres replication delay. Frequent cause in MySQL: large inserts/updates/deletes on the master database being re-run by every replica. Frequent cause in Postgres: long-running selects on the replica locking rows/tables that need to be updated.
Replication: Lessons Learned. Long-running transactions on the replica in Postgres cause the master to slow down, due to hot_standby_feedback. Long-running transactions on the replica in MySQL cause replication delay on that replica.
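Either way, the first step is measuring the delay. A sketch of the usual checks, one per system, run on a replica:

```sql
-- MySQL: look at Seconds_Behind_Master in the output
SHOW SLAVE STATUS\G

-- Postgres: time since the last replayed transaction
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;
```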
Schema Changes
Schema Changes: Postgres. Most changes can be performed with minimal table locking or replication concerns. This is because Postgres uses WAL replication, so on-disk changes are shipped over to the replicas while they're happening on the master.
Schema Changes: Postgres. Exceptions to this include: adding a column with a default value, changing a column type, and adding an index (use CREATE INDEX CONCURRENTLY instead).
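For the index case, the concurrent build looks like this (the index name and column here are illustrative, reusing the doctors example table):

```sql
-- Builds the index without blocking writes; note it can't run inside a
-- transaction, and it leaves an INVALID index behind if it fails partway.
CREATE INDEX CONCURRENTLY doctors_hire_date_idx ON doctors (hire_date);
```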
Schema Changes: MySQL. Tools like pt-online-schema-change or gh-ost are required for safe changes during online operations. MySQL does have some online schema changes, but we chose to use pt-osc for tables over 100MB.
Schema Changes: MySQL. This is especially true when using statement-based replication, as the ALTER statement will only be shipped to replicas after it completes on the master. See Jenni Snyder's PL16 talk "Let Robots Manage your Schema (without destroying all humans)".
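A sketch of a typical pt-osc invocation (the database, table, and column here are hypothetical). It copies rows into a shadow table in chunks, keeps the copy in sync with triggers, then swaps it in with an atomic RENAME TABLE:

```shell
pt-online-schema-change \
  --alter "ADD COLUMN specialty VARCHAR(64)" \
  D=test,t=doctors \
  --execute
```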
Schema Changes: Lessons Learned There's no one correct way to do schema changes Pick the tool and method that are best for your environment
Query Plans
create table species_groups (
    id_no serial PRIMARY KEY,
    species varchar(64) NOT NULL
);

create table doctors (
    id_no serial PRIMARY KEY,
    name varchar(64) NOT NULL,
    hire_date date NOT NULL,
    termination_date date
);

create table doctors_species_groups (
    doctor_no integer,
    species_groups_no integer
);
select t1.name, t3.species
FROM doctors t1
INNER JOIN doctors_species_groups t2 ON t1.id_no = t2.doctor_no
INNER JOIN species_groups t3 ON t2.species_groups_no = t3.id_no;
pking@[local]:5432 [test]=# explain select t1.name, t3.species FROM doctors t1 INNER JOIN doctors_species_groups t2 ON t1.id_no = t2.doctor_no INNER JOIN species_groups t3 on t2.species_groups_no = t3.id_no;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Hash Join  (cost=24.34..51.24 rows=143 width=152)
   Hash Cond: (t2.species_groups_no = t3.id_no)
   ->  Hash Join  (cost=4.22..29.15 rows=143 width=10)
         Hash Cond: (t1.id_no = t2.doctor_no)
         ->  Seq Scan on doctors t1  (cost=0.00..16.00 rows=1000 width=10)
         ->  Hash  (cost=2.43..2.43 rows=143 width=8)
               ->  Seq Scan on doctors_species_groups t2  (cost=0.00..2.43 rows=143 width=8)
   ->  Hash  (cost=14.50..14.50 rows=450 width=150)
         ->  Seq Scan on species_groups t3  (cost=0.00..14.50 rows=450 width=150)
(9 rows)

Time: 2.681 ms
pking@[local]:5432 [test]=# explain analyze select t1.name, t3.species FROM doctors t1 INNER JOIN doctors_species_groups t2 ON t1.id_no = t2.doctor_no INNER JOIN species_groups t3 on t2.species_groups_no = t3.id_no;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Hash Join  (cost=24.34..51.24 rows=143 width=152) (actual time=0.354..0.354 rows=0 loops=1)
   Hash Cond: (t2.species_groups_no = t3.id_no)
   ->  Hash Join  (cost=4.22..29.15 rows=143 width=10) (actual time=0.059..0.330 rows=143 loops=1)
         Hash Cond: (t1.id_no = t2.doctor_no)
         ->  Seq Scan on doctors t1  (cost=0.00..16.00 rows=1000 width=10) (actual time=0.010..0.120 rows=1000 loops=1)
         ->  Hash  (cost=2.43..2.43 rows=143 width=8) (actual time=0.041..0.041 rows=143 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 14kB
               ->  Seq Scan on doctors_species_groups t2  (cost=0.00..2.43 rows=143 width=8) (actual time=0.007..0.019 rows=143 loops=1)
   ->  Hash  (cost=14.50..14.50 rows=450 width=150) (actual time=0.007..0.007 rows=6 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on species_groups t3  (cost=0.00..14.50 rows=450 width=150) (actual time=0.003..0.005 rows=6 loops=1)
 Planning time: 0.191 ms
 Execution time: 0.376 ms
(13 rows)
pking@[local]:5432 [test]=# explain (analyze, buffers) select t1.name, t3.species FROM doctors t1 INNER JOIN doctors_species_groups t2 ON t1.id_no = t2.doctor_no INNER JOIN species_groups t3 on t2.species_groups_no = t3.id_no;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Hash Join  (cost=24.34..51.24 rows=143 width=152) (actual time=0.306..0.306 rows=0 loops=1)
   Hash Cond: (t2.species_groups_no = t3.id_no)
   Buffers: shared hit=8
   ->  Hash Join  (cost=4.22..29.15 rows=143 width=10) (actual time=0.050..0.277 rows=143 loops=1)
         Hash Cond: (t1.id_no = t2.doctor_no)
         Buffers: shared hit=7
         ->  Seq Scan on doctors t1  (cost=0.00..16.00 rows=1000 width=10) (actual time=0.007..0.105 rows=1000 loops=1)
               Buffers: shared hit=6
         ->  Hash  (cost=2.43..2.43 rows=143 width=8) (actual time=0.036..0.036 rows=143 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 14kB
               Buffers: shared hit=1
               ->  Seq Scan on doctors_species_groups t2  (cost=0.00..2.43 rows=143 width=8) (actual time=0.004..0.014 rows=143 loops=1)
                     Buffers: shared hit=1
   ->  Hash  (cost=14.50..14.50 rows=450 width=150) (actual time=0.007..0.007 rows=6 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         Buffers: shared hit=1
mysql> explain select t1.name, t3.species FROM doctors t1 INNER JOIN doctors_species_groups t2 ON t1.id_no = t2.doctor_no INNER JOIN species_groups t3 on t2.species_groups_no = t3.id_no;
+----+-------------+-------+------------+--------+---------------+---------+---------+---------------------------+------+----------+-------------+
| id | select_type | table | partitions | type   | possible_keys | key     | key_len | ref                       | rows | filtered | Extra       |
+----+-------------+-------+------------+--------+---------------+---------+---------+---------------------------+------+----------+-------------+
|  1 | SIMPLE      | t2    | NULL       | ALL    | NULL          | NULL    | NULL    | NULL                      |  100 |   100.00 | Using where |
|  1 | SIMPLE      | t3    | NULL       | eq_ref | PRIMARY,id_no | PRIMARY | 8       | test.t2.species_groups_no |    1 |   100.00 | Using where |
|  1 | SIMPLE      | t1    | NULL       | eq_ref | PRIMARY,id_no | PRIMARY | 8       | test.t2.doctor_no         |    1 |   100.00 | Using where |
+----+-------------+-------+------------+--------+---------------+---------+---------+---------------------------+------+----------+-------------+
3 rows in set, 1 warning (0.00 sec)
mysql> explain FORMAT=JSON select t1.name, t3.species FROM doctors t1 INNER JOIN doctors_species_groups t2 ON t1.id_no = t2.doctor_no INNER JOIN species_groups t3 on t2.species_groups_no = t3.id_no;
{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "261.00"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "t2",
          "access_type": "ALL",
          "rows_examined_per_scan": 100,
          "rows_produced_per_join": 100,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "1.00",
            "eval_cost": "20.00",
            "prefix_cost": "21.00",
            "data_read_per_join": "1K"
          },
          "used_columns": [
            "doctor_no",
            "species_groups_no"
          ],
          "attached_condition": "((`test`.`t2`.`species_groups_no` is not null) and (`test`.`t2`.`doctor_no` is not null))"
        }
      },
      {
        "table": {
          "table_name": "t3",
          "access_type": "eq_ref",
          "possible_keys": [
            "PRIMARY",
            "id_no"
          ],
          "key": "PRIMARY",
          "used_key_parts": [
            "id_no"
          ],
          "key_length": "8",
          "ref": [
            "test.t2.species_groups_no"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 100,
          "filtered": "100.00",
Query Plans: Lessons Learned. Learning to read query plans correctly is hard for any database. MySQL: Baron Schwartz's "EXPLAIN Demystified". Postgres: Josh Berkus's "Explain Explained".
Index Types and Indexing Strategies
Indexes: Postgres. B-tree, GiST, SP-GiST, GIN, BRIN, Hash
Indexes: Postgres. Partial (and functional) indexing: useful to speed up particularly complex queries by indexing only the rows matching the WHERE clause of a given query:

CREATE INDEX order_not_completed
    ON orders USING btree (restaurant_id, creation_date)
    WHERE ((paid = 0) AND (payment_id IS NULL));
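Postgres can also index arbitrary expressions, which is the functional side of this. A sketch, assuming queries filter case-insensitively on the doctors example table:

```sql
-- Lets WHERE lower(name) = '...' use an index instead of a seq scan
CREATE INDEX doctors_lower_name_idx ON doctors (lower(name));
```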
Indexes: Postgres. You will often find more indexes than there are columns in a table. Postgres is already optimized for rewriting data all the time, which is why the cost of maintaining so many indexes isn't cumbersome.
Indexes: MySQL. InnoDB has clustered indexing: the table is stored in primary-key order, and every secondary index stores a copy of the primary key. Long primary keys are therefore bad and affect performance on all indexes.
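A sketch of what that implies for table design (table and column names are illustrative): a compact surrogate primary key keeps every secondary index small:

```sql
CREATE TABLE doctors (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,  -- 8 bytes, not a 36-char UUID string
    name VARCHAR(64) NOT NULL,
    INDEX name_idx (name)   -- stores (name, id); a long PK would bloat this too
) ENGINE=InnoDB;
```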
Community. Postgres doesn't have an equivalent of MySQL Utilities or the Percona Toolkit, or the volume of tooling a GitHub search for MySQL turns up. While there are big Postgres consulting companies, there is no one company driving the major changes. There is no official bug tracker in Postgres; almost all communication happens on the official Postgres mailing lists.
Things I Miss from Postgres: Flexible indexing, transactional DDL, in-database online schema changes, WAL-style replication
Things I Wish Postgres Had from MySQL: Sub-millisecond primary-key selects on large tables, community support, replication flexibility
Questions?
fb.com/yelpengineers @YelpEngineering engineeringblog.yelp.com github.com/yelp
We're Hiring! www.yelp.com/careers/