Percona XtraDB Cluster Tutorial


1 Percona XtraDB Cluster Tutorial

2 Table of Contents
1. Overview & Setup
2. Migrating from M/S
3. Galera Replication Internals
4. Online Schema Changes
5. Application HA
6. Monitoring Galera
7. Incremental State Transfer
8. Node Failure Scenarios
9. Avoiding SST
10. Tuning Replication

3 XtraDB Cluster Tutorial OVERVIEW

4 Resources

5 Standard Replication
Replication is a mechanism for recording a series of changes on one MySQL server and applying the same changes to one or more other MySQL servers. The source is the "master" and its replica is a "slave."

6 Galera Replication
In Galera, the dataset is synchronized across all servers. You can write to any node; every node stays in sync.

7 Our Current Setup
Single master node; two slave nodes. node1 will also run sysbench, a sample application simulating activity.
(*) Your node IP addresses will probably be different.

8 Verify Connectivity and Setup
Connect to all 3 nodes. Use separate terminal windows; we will be switching between them frequently. Password is vagrant.
ssh -p <node1 port> vagrant@localhost
ssh -p <node2 port> vagrant@localhost
ssh -p <node3 port> vagrant@localhost
Verify MySQL is running. Verify replication is running:
SHOW SLAVE STATUS\G

9 Run Sample Application
Start the application and verify:
node1# run_sysbench_oltp.sh
Is replication flowing to all the nodes? Observe the sysbench transaction rate.

10 Checking Consistency
Run a pt-table-checksum from node1:
node1# pt-table-checksum -d test
Check for differences:
node3> SELECT db, tbl, SUM(this_cnt) AS numrows, COUNT(*) AS chunks
       FROM percona.checksums
       WHERE master_crc <> this_crc
       GROUP BY db, tbl;
See all the results:
node3> SELECT * FROM percona.checksums;
Are there any differences between node1 and its slaves? If so, why?

11 Monitoring Replication Lag
Start a heartbeat on node1:
node1# pt-heartbeat --update --database percona \
       --create-table --daemonize
Monitor the heartbeat on node2:
node2# pt-heartbeat --monitor --database percona \
       --master-server-id=1
Stop the heartbeat on node1 and see how that affects the monitor. STOP SLAVE on node2 for a while and see how that affects the monitor.

12 XtraDB Cluster Tutorial MIGRATING FROM MASTER/SLAVE

13 What is Our Plan?
We have an application working on node1. We want to migrate our master/slave setup to a PXC cluster with minimal downtime. We will build a cluster from one slave and then add the other servers to it one at a time.

14 Upgrade node3
node3# systemctl stop mysql
node3# yum swap -- remove Percona-Server-shared-57 Percona-Server-server-57 \
       -- install Percona-XtraDB-Cluster-shared-57 Percona-XtraDB-Cluster-server-57
node3# systemctl start mysql
node3# mysql_upgrade
node3# mysql -e "SHOW SLAVE STATUS\G"
node3# systemctl stop mysql   -- For the next steps
The Percona XtraDB Cluster server and client packages are drop-in replacements for Percona Server and even community MySQL.

15 Configure node3 for PXC
[mysqld]
# Leave existing settings and add these
binlog_format = ROW
# galera settings
wsrep_provider = /usr/lib64/libgalera_smm.so
wsrep_cluster_name = mycluster
wsrep_cluster_address = gcomm://node1,node2,node3
wsrep_node_name = node3
wsrep_node_address = <node3 IP>
wsrep_sst_auth = sst:secret
innodb_autoinc_lock_mode = 2
innodb_locks_unsafe_for_binlog = ON
Start node3 with:
node3# systemctl start mysql@bootstrap

16 Checking Cluster State
Can you tell (see the status queries below):
How many nodes are in the cluster?
Is the cluster Primary?
Are cluster replication events being generated?
Is replication even running?
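A quick way to answer these, as a sketch using standard Galera status counters (all real wsrep status variables; the values you see will depend on your VMs):

node3 mysql> SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';      -- how many nodes
node3 mysql> SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';    -- Primary / non-Primary
node3 mysql> SHOW GLOBAL STATUS LIKE 'wsrep_replicated';        -- write-sets sent by this node
node3 mysql> SHOW GLOBAL STATUS LIKE 'wsrep_received';          -- write-sets received
node3 mysql> SHOW SLAVE STATUS\G                                -- async replication from node1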

17 Replicated Events
Cannot async replicate to node3! What would the result be if we added node2 to the cluster? What can be done to be sure writes from replication into node3 are replicated to the cluster?
(*) Hint: If you've ever configured a slave-of-a-slave in MySQL replication, you know the fix!

18 Fix Configuration
Make this change:
[mysqld]
...
log-slave-updates
...
Restart MySQL:
# systemctl restart mysql@bootstrap
Why did we have to restart with bootstrap?

19 Fix Configuration (cont.)
Why did we have to restart with bootstrap? Because this is still a one-node cluster.

20 Checking Cluster State Again
Use 'myq_status' to check the state of Galera on node3:
node3# /usr/local/bin/myq_status wsrep

21 Upgrade node2
We only need a single node replicating from our production master.
node2> STOP SLAVE; RESET SLAVE ALL;
Upgrade/swap packages as before:
node2# systemctl stop mysql
node2# yum swap -- remove Percona-Server-shared-57 Percona-Server-server-57 \
       -- install Percona-XtraDB-Cluster-shared-57 Percona-XtraDB-Cluster-server-57
Don't start mysql yet! Edit node2's my.cnf as you did for node3. What needs to be different from node3's config?

22 node2's Config
[mysqld]
# Leave existing settings and add these
binlog_format = ROW
log-slave-updates
# galera settings
wsrep_provider = /usr/lib64/libgalera_smm.so
wsrep_cluster_name = mycluster
wsrep_cluster_address = gcomm://node1,node2,node3
wsrep_node_name = node2          # !!! CHANGE !!!
wsrep_node_address = <node2 IP>  # !!! CHANGE !!!
wsrep_sst_auth = sst:secret
innodb_autoinc_lock_mode = 2
innodb_locks_unsafe_for_binlog = ON
Don't start MySQL yet!

23 What's going to happen?
Don't start MySQL yet! When node2 is started, what's going to happen? How does node2 get a copy of the dataset? Can it use the existing data it already has? Do we have to bootstrap node2?

24 State Snapshot Transfer (SST)
Transfer a full backup of an existing cluster member (donor) to a new node entering the cluster (joiner). We configured our SST method to use xtrabackup-v2.
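For reference, the SST method is selected in my.cnf. A minimal sketch, assuming the tutorial images already carry this setting (wsrep_sst_method is the real variable name):

[mysqld]
wsrep_sst_method = xtrabackup-v2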

25 Additional Xtrabackup SST Setup
What about that sst user in your my.cnf?
[mysqld]
...
wsrep_sst_auth = sst:secret
...
User/grants need to be added for Xtrabackup:
node3> CREATE USER 'sst'@'localhost';
node3> ALTER USER 'sst'@'localhost' IDENTIFIED BY 'secret';
node3> GRANT PROCESS, RELOAD, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'sst'@'localhost';
The good news is that once you get it working the first time, SST is typically very easy on subsequent nodes.

26 Up and Running node2
Now you can start node2:
node2# systemctl start mysql
Watch node2's error log for progress:
node2# tail -f /var/lib/mysql/error.log

27 Taking Stock of our Achievements
We have a two-node cluster: our production master replicates, asynchronously, into the cluster via node3. What might we do now in a real migration before proceeding?

28 Finishing the Migration
How should we finish the migration? How do we minimize downtime? What do we need to do to ensure there are no data inconsistencies? How might we roll back?

29 Our Migration Steps
Shut down the application pointing to node1.
Shut down (and RESET) replication on node3 from node1:
node3> STOP SLAVE; RESET SLAVE ALL;
Start up the application pointing to node3: edit run_sysbench_oltp.sh, change the IP, and run ./run_sysbench_oltp.sh to restart it.
Rebuild node1 as we did for node2:
node1# systemctl stop mysql
node1# yum swap -- remove Percona-Server-shared-57 Percona-Server-server-57 \
       -- install Percona-XtraDB-Cluster-shared-57 Percona-XtraDB-Cluster-server-57

30 node1's Config
[mysqld]
# Leave existing settings and add these
binlog_format = ROW
log-slave-updates
# galera settings
wsrep_provider = /usr/lib64/libgalera_smm.so
wsrep_cluster_name = mycluster
wsrep_cluster_address = gcomm://node1,node2,node3
wsrep_node_name = node1          # !!! CHANGE !!!
wsrep_node_address = <node1 IP>  # !!! CHANGE !!!
wsrep_sst_auth = sst:secret
innodb_autoinc_lock_mode = 2
innodb_locks_unsafe_for_binlog = ON
node1# systemctl start mysql

31 Where are We Now? [cluster diagram]

32 XtraDB Cluster Tutorial GALERA REPLICATION INTERNALS

33 Replication Roles
Within the cluster, all nodes are equal.
master/donor node: the source of the transaction.
slave/joiner node: receiver of the transaction.
Writeset: a transaction; one or more row changes.

34 State Transfer Roles
New nodes joining an existing cluster get provisioned automatically.
Joiner = new node. Donor = node giving a copy of the data.
State Snapshot Transfer: full backup of donor to joiner.
Incremental State Transfer: only the changes since the node left the cluster.

35 What's inside a write-set?
The payload: generated from the RBR event.
Keys: generated by the source node (primary keys, unique keys, foreign keys).
Keys are how certification happens. A PRIMARY KEY on every table is required.
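Since certification keys are built from primary keys, it is worth checking for tables without one before migrating. A sketch using standard information_schema tables (nothing PXC-specific is assumed):

mysql> SELECT t.table_schema, t.table_name
       FROM information_schema.tables t
       LEFT JOIN information_schema.table_constraints c
         ON  c.table_schema = t.table_schema
         AND c.table_name = t.table_name
         AND c.constraint_type = 'PRIMARY KEY'
       WHERE t.table_type = 'BASE TABLE'
         AND c.constraint_name IS NULL
         AND t.table_schema NOT IN
             ('mysql','information_schema','performance_schema','sys');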

36 Replication Steps
As master/donor: apply, replicate, certify, commit.
As slave/joiner: replicate (receive), certify (includes reply to master/donor), apply, commit.

37 What is Certification?
In a nutshell: "Can this transaction be applied?"
It happens on every node for all write-sets, and it is deterministic; the results of certification are not relayed back to the master/donor.
In the event of failure, the transaction is dropped locally; the node may possibly remove itself from the cluster. On success, the write-set enters the apply queue.
Serialized via the group communication protocol (*)

38 The Apply
Apply is done locally, asynchronously, on each node, after certification. It can be parallelized like standard async replication (more on this later). If there is a conflict, it will create brute force aborts (bfa) on the local node.

39 Last but not least, the Commit Stage
Local, on each node: the InnoDB commit. An applier thread executes it on slave/joiner nodes; a regular connection thread executes it on the master/donor node.

40 Galera Replication [replication flow diagram]

41 Galera Replication Facts
"Replication" is synchronous; slave apply is not. A successful commit reply to the client means the transaction WILL get applied across the entire cluster.
A node that tries to commit a conflicting transaction before ours is applied will get a deadlock error. This is called a local certification failure or 'lcf'.
When our transaction is applied, any competing transactions that haven't committed will get a deadlock error. This is called a brute force abort or 'bfa'.

42 Unexpected Deadlocks
Create a test table:
node1 mysql> CREATE TABLE test.deadlocks (i INT UNSIGNED NOT NULL PRIMARY KEY, j VARCHAR(32));
node1 mysql> INSERT INTO test.deadlocks VALUES (1, NULL);
node1 mysql> BEGIN; -- Start explicit transaction
node1 mysql> UPDATE test.deadlocks SET j = 'node1' WHERE i = 1;
Before we commit on node1, go to node3 in a separate window:
node3 mysql> BEGIN; -- Start transaction
node3 mysql> UPDATE test.deadlocks SET j = 'node3' WHERE i = 1;
node3 mysql> COMMIT;
node1 mysql> COMMIT;
node1 mysql> SELECT * FROM test.deadlocks;

43 Unexpected Deadlocks (cont.)
Which commit succeeds? Will only a COMMIT generate the deadlock error? Is this an lcf or a bfa? How would you diagnose this error?
Things to watch during this exercise: myq_status (especially on node1), SHOW ENGINE INNODB STATUS.

44 Application Hotspots
Start sysbench on all the nodes at the same time:
$ /usr/local/bin/run_sysbench_update_index.sh
Watch the myq_status 'Conflct' columns for activity:
$ myq_status wsrep
Edit the script and adjust the following until you see some conflicts:
--oltp-table-size smaller (250)
--tx-rate higher (75)

45 When could you multi-node write?
When your application can handle the deadlocks gracefully (wsrep_retry_autocommit can help; see the sketch below), and when the ratio of successful commits to deadlocks is not too small.
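A minimal sketch of the retry knob mentioned above (wsrep_retry_autocommit is a real variable; 4 is just an illustrative value). On a certification failure, the node transparently retries an autocommit statement this many times before surfacing the deadlock error:

node1 mysql> SET GLOBAL wsrep_retry_autocommit = 4;
node1 mysql> SHOW GLOBAL VARIABLES LIKE 'wsrep_retry_autocommit';

Note it only helps autocommit statements; explicit multi-statement transactions still see the deadlock and must be retried by the application.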

46 AUTO_INCREMENT Handling
Pick one node to work on:
node2 mysql> create table test.autoinc (
          ->   i int not null auto_increment primary key,
          ->   j varchar(32) );
node2 mysql> insert into test.autoinc (j) values ('node2');
node2 mysql> insert into test.autoinc (j) values ('node2');
node2 mysql> insert into test.autoinc (j) values ('node2');
Now select all the data in the table:
node2 mysql> select * from test.autoinc;
What is odd about the values for 'i'? What happens if you do the inserts on each node in order? Check wsrep_auto_increment_control and the auto_increment% global variables.
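To see why the values skip, compare the interleave settings on each node. With wsrep_auto_increment_control = ON, Galera adjusts these automatically to the cluster size (all real variables; exact values depend on your cluster):

node2 mysql> SHOW GLOBAL VARIABLES LIKE 'wsrep_auto_increment_control';
node2 mysql> SHOW GLOBAL VARIABLES LIKE 'auto_increment%';
-- expect auto_increment_increment = 3 (the cluster size) and a
-- different auto_increment_offset on each node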

47 The Effects of Large Transactions
Run sysbench on node1 as usual (be sure to edit the script and reset the connection to localhost):
node1# run_sysbench_oltp.sh
Insert 100K rows into a new table on node2:
node2 mysql> CREATE TABLE test.large LIKE test.sbtest1;
node2 mysql> INSERT INTO test.large SELECT * FROM test.sbtest1;
Pay close attention to how the application (sysbench) is affected by the large transaction.

48 Serial and Parallel Replication
Every node replicates, certifies, and commits EVERY transaction in the SAME order. This is serialization. Replication allows interspersing smaller transactions simultaneously; big transactions tend to force serialization.
The parallel slave threads can APPLY transactions with no interdependencies in parallel on SLAVE nodes (i.e., nodes other than the one the trx originated from).
Infinite wsrep_slave_threads != infinite scalability.

49 XtraDB Cluster Tutorial ONLINE SCHEMA CHANGES

50 How do we ALTER TABLE?
Directly - ALTER TABLE:
  "Total Order Isolation" - TOI (default)
  "Rolling Schema Upgrades" - RSU
pt-online-schema-change

51 Direct ALTER TABLE (TOI)
Ensure sysbench is running against node1 as usual. Try these ALTER statements on the nodes listed and note the effect on sysbench throughput:
node1 mysql> ALTER TABLE test.sbtest1 ADD COLUMN `m` varchar(32);
node2 mysql> ALTER TABLE test.sbtest1 ADD COLUMN `n` varchar(32);
Create a different table not being touched by sysbench:
node2 mysql> CREATE TABLE test.foo LIKE test.sbtest1;
node2 mysql> INSERT INTO test.foo SELECT * FROM test.sbtest1;
node2 mysql> ALTER TABLE test.foo ADD COLUMN `o` varchar(32);
Does running the command from another node make a difference? What difference is there altering a table that sysbench is not using?

52 Direct ALTER TABLE (RSU)
Change to RSU:
node2 mysql> SET wsrep_osu_method = 'RSU';
node2 mysql> alter table test.foo add column `p` varchar(32);
node2 mysql> alter table test.sbtest1 add column `q` varchar(32);
Watch myq_status and the error log on node2:
node2 mysql> alter table test.sbtest1 drop column `c`;
What are the effects on sysbench after adding the columns p and q? What happens after dropping 'c'? Why? How do you recover? When is RSU safe and unsafe to use?

53 pt-online-schema-change
Be sure all your nodes are set back to TOI:
node* mysql> SET GLOBAL wsrep_osu_method = 'TOI';
Let's add one final column:
node2# pt-online-schema-change --alter "ADD COLUMN z varchar(32)" \
       D=test,t=sbtest1 --execute
Was pt-online-schema-change any better at doing DDL? Was the application blocked at all? What are the caveats to using pt-osc?

54 XtraDB Cluster Tutorial APPLICATION HIGH AVAILABILITY

55 ProxySQL Architecture [architecture diagram]

56 Setting up HAProxy
Install HAProxy, just on node1. Create /etc/haproxy/haproxy.cfg (copy from /root/haproxy.cfg). Restart haproxy. Test connections to the proxy ports:
node1# mysql -u test -ptest -h <node1 IP> -P 4306 \
       -e "show variables like 'wsrep_node_name'"
node1# mysql -u test -ptest -h <node1 IP> -P 5306 \
       -e "show variables like 'wsrep_node_name'"
Test shutting down nodes and watch the HAProxy status page.

57 HAProxy Maintenance Mode
Put a node into maintenance mode:
node1# echo "disable server cluster-reads/node3" | socat \
       /var/lib/haproxy/haproxy.sock stdio
Refresh the HAProxy status page. Test connections to port 5306: node3 is still online, but HAProxy won't route connections to it.
Re-enable the node:
node1# echo "enable server cluster-reads/node3" | socat \
       /var/lib/haproxy/haproxy.sock stdio
Refresh the status page and test connections again.

58 keepalived Architecture [architecture diagram]

59 keepalived for a simple VIP
Install keepalived (all nodes). Create /etc/keepalived/keepalived.conf (copy from /root/keepalived.conf).
# systemctl start keepalived
Check which host has the VIP:
# ip addr list | grep <VIP>
Experiment with restarting nodes while checking which host answers the VIP:
while true; do
  mysql -u test -ptest -h <VIP> --connect_timeout=5 \
    -e "show variables like 'wsrep_node_name'"
  sleep 1
done

60 Application HA Essentials
Nodes are properly health checked. HA is most commonly handled at layer 4 (TCP); therefore, expect node failures to mean app reconnects. The cluster will prevent apps from doing anything bad, but it may not necessarily force apps to reconnect by itself. Watch out for Donor nodes.
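One common way to health-check nodes at a level above plain TCP is the clustercheck script shipped with the PXC packages, exposed over HTTP via xinetd so HAProxy can poll it. A sketch, assuming the conventional xinetd service (mysqlchk on port 9200) has been configured and credentials are adapted to your environment:

node1# /usr/bin/clustercheck            # returns HTTP 200 if the node is Synced
node1# curl http://node2:9200/          # same check via xinetd, as HAProxy would do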

61 XtraDB Cluster Tutorial MONITORING GALERA

62 At a Low Level
mysql> SHOW GLOBAL VARIABLES LIKE 'wsrep%';
mysql> SHOW GLOBAL STATUS LIKE 'wsrep%';
Refer to the Galera status variable documentation for the full list.

63 Example Nagios Checks
node1# yum install percona-nagios-plugins -y
-- Verify we're in a Primary cluster
node1# /usr/lib64/nagios/plugins/pmp-check-mysql-status \
       -x wsrep_cluster_status -C == -T str -c non-primary
-- Verify the node is Synced
node1# /usr/lib64/nagios/plugins/pmp-check-mysql-status \
       -x wsrep_local_state_comment -C '!=' -T str -w Synced
-- Verify the size of the cluster is normal
node1# /usr/lib64/nagios/plugins/pmp-check-mysql-status \
       -x wsrep_cluster_size -C '<=' -w 2 -c 1
-- Watch for flow control
node1# /usr/lib64/nagios/plugins/pmp-check-mysql-status \
       -x wsrep_flow_control_paused -w 0.1 -c <critical threshold>

64 Alerting Best Practices
Alerting should be actionable, not noise. Any other ideas for types of checks to do?

65 Percona Monitoring and Management (*)

66 ClusterControl - SeveralNines

67 XtraDB Cluster Tutorial INCREMENTAL STATE TRANSFER

68 Incremental State Transfer (IST)
Helps avoid a full SST. When a node starts, it knows the UUID of the cluster to which it belongs and the last sequence number it applied. Upon connecting to other cluster nodes, it broadcasts this information. If another node has the needed write-sets cached, IST is triggered; if no other node has them, SST is triggered.
Firewall alert: uses port 4568 by default.

69 Galera Cache - "gcache"
Write-sets are contained in the "gcache" on each node: a preallocated file with a specific size (default 128M) that acts as a single-file circular buffer. It can be increased via a provider option:
wsrep_provider_options = "gcache.size=2g"
The cache is mmapped (I/O buffered to memory).
wsrep_local_cached_downto - the first seqno present in the cache for that node.

70 Simple IST Test
Make sure your application is running. Watch myq_status on all nodes. On another node, watch the mysql error log, then:
# systemctl stop mysql
# systemctl start mysql
How did the cluster react when the node was stopped and restarted? Was the messaging from the init script and in the log clear? Is it easy to tell when a node is a Donor for SST vs IST?

71 XtraDB Cluster Tutorial NODE FAILURE SCENARIOS

72 Node Failure Exercises
Run sysbench on node1 as usual; also watch myq_status.
Experiment 1: Clean node restart
node3# systemctl restart mysql

73 Node Failure Exercises (cont.)
Experiment 2: Partial network outage. Simulate a network outage by using iptables to drop traffic:
node3# iptables -A INPUT -s node1 -j DROP; \
       iptables -A OUTPUT -s node1 -j DROP
After observation, clear the rules:
node3# iptables -F

74 Node Failure Exercises (cont.)
Experiment 3: node3 stops responding to the cluster. Another network outage, affecting one node.
-- Pay attention to the command continuation backslashes!
-- This should all be 1 line!
node3# iptables -A INPUT -s node1 -j DROP; \
       iptables -A INPUT -s node2 -j DROP; \
       iptables -A OUTPUT -s node1 -j DROP; \
       iptables -A OUTPUT -s node2 -j DROP
Remove the iptables rules (after experiments 2 and 3):
node3# iptables -F

75 A Non-Primary Cluster
Experiment 1: Clean stop of 2/3 nodes (normal sysbench load)
node2# systemctl stop mysql
node3# systemctl stop mysql

76 A Non-Primary Cluster (cont.)
Experiment 2: Failure of 50% of the cluster (leave node2 down)
node3# systemctl start mysql
-- Wait a few minutes for IST to finish and for the
-- cluster to become stable. Watch the error logs and
-- SHOW GLOBAL STATUS LIKE 'wsrep%'
node3# iptables -A INPUT -s node1 -j DROP; \
       iptables -A OUTPUT -s node1 -j DROP
Force node1 to become Primary like this:
node1 mysql> set global wsrep_provider_options="pc.bootstrap=true";
Clear iptables when done:
node3# iptables -F

77 Using garbd as a third node
node2's mysql is still stopped! Install the arbitrator daemon onto node2:
node2# yum install Percona-XtraDB-Cluster-garbd-3.x86_64
node2# garbd -a gcomm://node1:4567,node3:4567 -g mycluster
Experiment 1: Reproduce the node3 failure
node3# iptables -A INPUT -s node1 -j DROP; \
       iptables -A INPUT -s node2 -j DROP; \
       iptables -A OUTPUT -s node1 -j DROP; \
       iptables -A OUTPUT -s node2 -j DROP

78 Using garbd as a third node (cont.)
Experiment 2: Partial node3 network failure
node3# iptables -A INPUT -s node1 -j DROP; \
       iptables -A OUTPUT -s node1 -j DROP
Remember to clear iptables when done:
node3# iptables -F

79 Recovering from a clean all-down state
Stop mysql on all 3 nodes:
$ systemctl stop mysql
Use the grastate.dat to find the node with the highest GTID:
$ cat /var/lib/mysql/grastate.dat
Bootstrap that node and then start the others normally:
$ systemctl start mysql@bootstrap   (on the most advanced node)
$ systemctl start mysql             (on each of the other nodes)
Why do we need to bootstrap the node with the highest GTID? Why can't the cluster just figure it out?
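For reference, a healthy grastate.dat looks roughly like this (the field names are real; the seqno shown is illustrative):

# GALERA saved state
version: 2.1
uuid:    <cluster state UUID>
seqno:   1234
safe_to_bootstrap: 0

The node with the highest seqno holds the most recent committed state, which is why it must be the one bootstrapped.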

80 Recovering from a dirty all-down state
Kill -9 mysqld on all 3 nodes:
$ killall -9 mysqld_safe; killall -9 mysqld
Check the grastate.dat again. Is it useful?
$ cat /var/lib/mysql/grastate.dat
Try wsrep_recover on each node to get the GTID:
$ mysqld_safe --wsrep_recover
Bootstrap the highest node and then start the others normally. What would happen if we started the wrong node? Wsrep recovery pulls GTID data from the InnoDB transaction logs. In what cases might this be inaccurate?

81 XtraDB Cluster Tutorial AVOIDING SST

82 Manual State Transfer with Xtrabackup
Take a backup on node3 (sysbench on node1 as usual):
node3# xtrabackup --backup --galera-info --target-dir=/var/backup
Stop node3 and prepare the backup:
node3# systemctl stop mysql
node3# xtrabackup --prepare --target-dir=/var/backup

83 Manual State Transfer with Xtrabackup (cont.)
Copy back the backup:
node3# rm -rf /var/lib/mysql/*
node3# xtrabackup --copy-back --target-dir=/var/backup
node3# chown -R mysql.mysql /var/lib/mysql
Get the GTID from the backup and update grastate.dat:
node3# cd /var/lib/mysql
node3# mv xtrabackup_galera_info grastate.dat
node3# $EDITOR grastate.dat
# GALERA saved state
version: 2.1
uuid:    8797f811-7f73-11e...b513b3819c1
seqno:   <seqno from xtrabackup_galera_info>
safe_to_bootstrap: 0
node3# systemctl start mysql

84 Cases where the state is zeroed
Add a bad config item to the [mysqld] section in /etc/my.cnf:
[mysqld]
fooey
Check what happens to the grastate:
node3# systemctl stop mysql
node3# cat /var/lib/mysql/grastate.dat
node3# systemctl start mysql
node3# cat /var/lib/mysql/grastate.dat
Remove the bad config and use wsrep_recover to avoid SST:
node3# mysqld_safe --wsrep_recover
Update the grastate.dat with that position and restart. What would have happened if you restarted with a zeroed grastate.dat?

85 When you WANT an SST
node3# cat /var/lib/mysql/grastate.dat
Turn off replication for the session:
node3 mysql> SET wsrep_on = OFF;
Repeat until node3 crashes (sysbench should be running on node1):
node3 mysql> delete from test.sbtest1 limit 10000;
..after the crash..
node3# cat /var/lib/mysql/grastate.dat
What is in the error log after the crash? Should we try to avoid SST in this case?

86 Performing Checksums
Create a user:
node1 mysql> CREATE USER 'checksum'@'%';
node1 mysql> ALTER USER 'checksum'@'%' IDENTIFIED BY 'checksum1';
node1 mysql> GRANT SUPER, PROCESS, SELECT ON *.* TO 'checksum'@'%';
node1 mysql> GRANT CREATE, SELECT, INSERT, UPDATE, DELETE ON percona.* TO 'checksum'@'%';
Create a dummy table:
node1 mysql> CREATE TABLE test.fooey (id int PRIMARY KEY, name varchar(20));
node1 mysql> INSERT INTO test.fooey VALUES (1, 'Matthew');
Verify the row exists on all nodes.

87 Performing Checksums (cont.)
Run the checksum:
node1# pt-table-checksum -d test --recursion-method cluster \
       -u checksum -p checksum1
(The output lists TS, ERRORS, DIFFS, ROWS, CHUNKS, SKIPPED, TIME, and TABLE for test.fooey and test.sbtest1.)
Change data without replicating!
node3 mysql> SET wsrep_on = OFF;
node3 mysql> UPDATE test.fooey SET name = 'Charles';
Re-run the checksum command above on node1. Who has the error?
ALL mysql> SELECT COUNT(*) FROM percona.checksums WHERE this_crc <> master_crc;

88 Fix Bad Data!
Use pt-table-sync between a good node and the bad node:
node1# pt-table-sync --execute h=node1,d=test,t=fooey h=node3

89 XtraDB Cluster Tutorial TUNING REPLICATION

90 What is Flow Control?
The ability of any node in the cluster to tell the other nodes to pause writes while it catches up; a feedback mechanism for the replication process. It is ONLY caused by wsrep_local_recv_queue exceeding a node's fc_limit. This can pause the entire cluster and look like a cluster stall!
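A quick way to watch for it, as a sketch (all real wsrep status counters):

mysql> SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';  -- fraction of time paused
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_sent';    -- FC messages this node sent
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';     -- current apply backlog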

91 Observing Flow Control
Run sysbench on node1 as usual. Take a lock on node3 and observe its effect on the cluster:
node3 mysql> SET GLOBAL pxc_strict_mode = 0;
node3 mysql> LOCK TABLE test.sbtest1 WRITE;
Make observations. When you are done, release the lock:
node3 mysql> UNLOCK TABLES;
What happens to sysbench throughput? What are the symptoms of flow control? How can you detect it?

92 Tuning Flow Control
Run sysbench on node1 as usual. Increase the fc_limit on node3:
node3 mysql> SET GLOBAL wsrep_provider_options="gcs.fc_limit=500";
Take a lock on node3 and observe its effect on the cluster:
node3 mysql> LOCK TABLE test.sbtest1 WRITE;
Make observations. When you are done, release the lock:
node3 mysql> UNLOCK TABLES;
Now, how big does the recv queue on node3 get before FC kicks in?

93 Flow Control Parameters
gcs.fc_limit: pause replication if the recv queue exceeds this many write-sets. Default: 100. For master-slave setups this number can be increased considerably.
gcs.fc_master_slave: when this is NO, the effective gcs.fc_limit is multiplied by sqrt(number of cluster members). Default: NO.
Ex: gcs.fc_limit = 500 * sqrt(3) ≈ 866

94 Avoiding Flow Control during Backups
Most backups use FLUSH TABLES WITH READ LOCK. By default, this causes flow control almost immediately. We can permit nodes to fall behind:
node3 mysql> SET GLOBAL wsrep_desync=on;
node3 mysql> FLUSH TABLES WITH READ LOCK;
Do not write to this node!
No longer the case with 5.7! Yay!

95 More Notes on Flow Control
If you set it too low: too many cries for help from random nodes in the cluster.
If you set it too high: increases in replication conflicts in multi-node writing, and increases in apply/commit lag on nodes.
That one node with FC issues: deal with that node. Bad hardware? Not the same specs?

96 Testing Node Apply Throughput
Using wsrep_desync and LOCK TABLE, we can stop replication on a node for an arbitrary amount of time (until it runs out of disk space). We can stop a node for a while, release it, and measure the top speed at which it applies. This gives us an idea of how fast the node is capable of handling the bulk of replication. Performance here is greatly improved in 5.7 (*)
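A sketch of the measurement sequence described above (the commands mirror the earlier desync exercise; the 60-second pause is an arbitrary choice):

node3 mysql> SET GLOBAL wsrep_desync = ON;
node3 mysql> FLUSH TABLES WITH READ LOCK;   -- queue builds, no flow control
-- wait e.g. 60 seconds while sysbench runs on node1
node3 mysql> UNLOCK TABLES;
-- watch wsrep_local_recv_queue drain in myq_status;
-- queue size / drain time approximates the node's maximum apply rate
node3 mysql> SET GLOBAL wsrep_desync = OFF;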

97 Multiple Applier Threads
Using the wsrep_desync trick, you can measure the throughput of a slave with any setting of wsrep_slave_threads you wish. wsrep_cert_deps_distance is the maximum "theoretical" number of parallel threads; practically, you should probably have far fewer slave threads (500 is a bit much). Parallel apply has been a source of node inconsistencies in the past.

98 Wsrep Sync Wait
Apply is asynchronous, so reads after writes on different nodes may see an older view of the database. Practically, flow control prevents this from becoming like regular async replication. Tighter consistency can be enforced by ensuring the replication queue is flushed before a read:
SET [SESSION | GLOBAL] wsrep_sync_wait = <0-7>;
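A minimal usage sketch: enable it only around reads that must observe the latest committed writes, since each such read first waits for the apply queue to drain (bit 1 covers READ statements; test.accounts is a hypothetical table for illustration):

mysql> SET SESSION wsrep_sync_wait = 1;                  -- causal reads for SELECTs
mysql> SELECT balance FROM test.accounts WHERE id = 1;   -- hypothetical table
mysql> SET SESSION wsrep_sync_wait = 0;                  -- back to the default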

99 XtraDB Cluster Tutorial CONCLUSION

100 Common Configuration Parameters
wsrep_sst_auth = un:pw            Set SST authentication
wsrep_osu_method = 'TOI'          Set DDL method
wsrep_desync = ON                 Desync node (don't trigger flow control)
wsrep_slave_threads = 4           Number of applier threads
wsrep_retry_autocommit = 4        Number of commit retries
wsrep_provider_options:
  gcs.fc_limit = 500              Increase flow control threshold
  pc.bootstrap = true             Put node into single-node Primary mode
  gcache.size = 2G                Increase Galera cache file
wsrep_sst_donor = node1           Specify donor node

101 Common Status Counters
wsrep_cluster_status          Primary / Non-Primary
wsrep_local_state_comment     Synced / Donor / Desynced
wsrep_cluster_size            Number of online nodes
wsrep_flow_control_paused     Fraction of time replication has been paused by flow control
wsrep_local_recv_queue        Current number of write-sets waiting to be applied

102 Questions?
(*) Slide content authors: Jay Janssen, Fred Descamps, Matthew Boehm

103 PXC - Hands-on Tutorial

104 ist/sst donor selection

105 concept of segment
PXC has the concept of a segment (gmcast.segment). Assign the same segment id to cluster nodes that are co-located in the same subnet/LAN; assign a different segment id to cluster nodes located in a different subnet/WAN.
subnet-1: gmcast.segment=0    subnet-2: gmcast.segment=1
exercise: check if you can find your node's gmcast.segment by querying wsrep_provider_options, e.g.:
SELECT SUBSTRING(@@global.wsrep_provider_options, LOCATE('gmcast.segment', @@global.wsrep_provider_options), 18);

106 sst donor selection
SST DONOR selection:
1. Search by name (if specified by the user via --wsrep_sst_donor). If the node's state restricts it from acting as DONOR, things get delayed:
   Member 0.0 (n3) requested state transfer from 'n2', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2. Search donor by state: scan all nodes and check if a node can act as DONOR (avoid DESYNCed nodes and the arbitrator).
   YES: Is the node part of the same segment as the JOINER? Yes -> we have our DONOR.
   NO: Keep searching, but cache this remote-segment node too. Worst case, we will use it.

107 ist donor selection
The cluster sees a workload. [diagram: write-sets wset-1, wset-2 arriving at n1; n2 and n3 in the cluster]

108 ist donor selection
The workload is processed and replicated on all the nodes. [diagram: n1, n2, n3 each hold wset-1 and wset-2]

109 ist donor selection
n3 needs to leave the cluster for some time. [diagram: n1 and n2 hold wset-1, wset-2; n3 leaves]

110 ist donor selection
The cluster makes further progress. [diagram: n1 and n2 hold wset-1 through wset-4; n3 still holds only wset-1, wset-2]

111 ist donor selection
n1/n2 need to purge their gcache. [diagram: n1 and n2 purge older write-sets from the gcache]

112 ist donor selection
n3 needs to re-join the cluster. Which node should it select, and why? [diagram: n1 and n2 with partially purged gcaches; n3 re-joining]

113 ist donor selection
n3 will select n2, given that n2 has purged enough and the probability of it purging further (with respect to n1) is low. wsrep_local_cached_downto is the variable to look for: it indicates the lowest sequence number held. There is also the concept of a safety gap. [diagram]

114 ist donor selection: Test?
4 nodes: n1, n2, n3, n4. n1 and n2 are in one segment; n3 and n4 are in another segment. n3 needs to leave the cluster. n3 has trx up to write-set 100. The cluster makes progress and purges too. Latest available write-sets: n1 (200), n2 (250), n4 (200). n3 wants to re-join. Which DONOR will be selected, and why?

115 galera gtid

116 gtid and global-seqno
MySQL GTID vs Galera GTID: MySQL GTID is based on server_uuid, which means it is different on all the nodes. Galera GTID is used explicitly for replication, and it is the same on all the nodes that are part of the cluster.
mysql> select @@server_uuid;
mysql> show status like 'wsrep_cluster_state_uuid';
Global sequence number: the incoming queue is like a FIFO. Each write-set is delivered to all the nodes of the cluster in the same order. A Galera GTID is assigned only for a successful action.
exercise (example Galera GTID): 6a...da4b-ee18-6bf3-f0b4e6e19c7d:1-3
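A sketch for the exercise: watch the global seqno advance as writes commit anywhere in the cluster (both are real status counters; values depend on your workload):

mysql> SHOW STATUS LIKE 'wsrep_last_committed';    -- global seqno, same ordering on all nodes
mysql> SHOW STATUS LIKE 'wsrep_local_state_uuid';  -- should match wsrep_cluster_state_uuid
Run an INSERT on any node, then re-check wsrep_last_committed on every node.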

117 understanding timeout

118 understanding timeout
State-1: 3-node cluster (N1, N2, N3), all healthy. When there is no workload, a keep_alive signal is sent every 1 sec.
State-2: N1 has a flaky network connection and can't respond within the set inactive_check_period. The cluster (N2, N3) marks N1 to be added to the delayed list, and adds it once it crosses the delayed margin.
State-3: N1's network connection is restored, but to get it out of the delayed list N2 and N3 need assurance from N1 for delayed_keep_period.
State-4: N1 again has a flaky connection and doesn't respond back for inactive_timeout (the node is already in the delayed list). This will cause the node to be pronounced DEAD.

119 geo-distributed cluster

120 geo-distributed cluster
Nodes of the cluster are distributed across data centers, yet all nodes are part of one big cluster. [diagram: N1, N2, N3 and N4, N5, N6 in two data centers connected over WAN]
Important aspects to consider:
gmcast.segment helps notify topology. All local nodes should have the same gmcast.segment. This helps in optimizing replication traffic and of course SST and IST.
Tune timeouts if needed for your WAN setup.
Avoid flow control and fragmentation at the Galera level (example settings: gcs.max_packet_size=<value>; evs.send_window=512; evs.user_send_window=256).

121 pxc-5.7

122 Percona XtraDB Cluster 5.7
PXC 5.7 GAed during PLAM-2016 (Sep-2016). Since then we have done 3 more releases. [timeline: releases in Sep-16, Dec-16, Mar-17, Apr-17]

123 Percona XtraDB Cluster 5.7
Cluster-safe mode (pxc-strict-mode)
Extended support for PFS
Support for encrypted tablespaces
Improved security
ProxySQL-compatible PXC
Tracking IST/flow control
Improved logging and wsrep-stage framework
Fully scalable, performance-optimized PXC
PXC + PMM
Simplified packaging

124 cluster-safe-mode

125 cluster-safe-mode
Why pxc-strict-mode? To block operations that are unsafe or experimental in a cluster: non-transactional SEs, CTAS, tables w/o a primary key, binlog_format other than ROW, LOCAL LOCKS, LOCAL ops (ALTER IMPORT/DISCARD). [mind-map diagram]

126 cluster-safe-mode
pxc_strict_mode values:
ENFORCING = ERROR
PERMISSIVE = WARNING
MASTER = ERROR (except locks)
DISABLED = 5.6-compatible
The mode is local to the given node; ideally, all nodes should operate in the same mode. It is not applicable to replicated traffic, only to user traffic. Before moving to a stricter mode, make sure the validation checks pass.

127 cluster-safe-mode
exercise: try to insert into a non-PK table.
mysql> use test;
Database changed
mysql> create table t (i int) engine=innodb;
Query OK, 0 rows affected (0.04 sec)
mysql> insert into t values (1);
ERROR 1105 (HY000): Percona-XtraDB-Cluster prohibits use of DML command on a table (test.t) without an explicit primary key with pxc_strict_mode = ENFORCING or MASTER
mysql> alter table t add primary key pk(i);
Query OK, 0 rows affected (0.06 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> insert into t values (1);
Query OK, 1 row affected (0.01 sec)

128 cluster-safe-mode
exercise: try to alter a MyISAM table to InnoDB.
mysql> alter table t engine=myisam;
ERROR 1105 (HY000): Percona-XtraDB-Cluster prohibits changing storage engine of a table (test.t) from transactional to non-transactional with pxc_strict_mode = ENFORCING or MASTER
mysql> drop table t;
mysql> create table myisam (i int) engine=myisam;
Query OK, 0 rows affected (0.01 sec)
mysql> alter table myisam engine=innodb;
Query OK, 0 rows affected (0.04 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> show create table myisam;
Create Table: CREATE TABLE `myisam` (
  `i` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
mysql> drop table myisam;

129 cluster-safe-mode
exercise: switching to DISABLED and back to ENFORCING.
mysql> set global pxc_strict_mode='disabled';
Query OK, 0 rows affected (0.00 sec)
mysql> create table myisam (i int) engine=myisam;
Query OK, 0 rows affected (0.01 sec)
mysql> set wsrep_replicate_myisam=1;
Query OK, 0 rows affected (0.00 sec)
mysql> insert into myisam values (10);
Query OK, 1 row affected (0.00 sec)
Check if you see the entry (10) on node-2. What will happen if we now switch BACK from DISABLED to ENFORCING?

130 cluster-safe-mode
exercise: back to ENFORCING.
mysql> set global pxc_strict_mode='enforcing';
ERROR 1105 (HY000): Can't change pxc_strict_mode to ENFORCING as wsrep_replicate_myisam is ON
mysql> set wsrep_replicate_myisam=0;
Query OK, 0 rows affected (0.00 sec)
mysql> set global pxc_strict_mode='enforcing';
Query OK, 0 rows affected (0.00 sec)
node1 mysql> drop table myisam;
Query OK, 0 rows affected (0.00 sec)
Switching will check the following preconditions:
wsrep_replicate_myisam = 0
binlog_format = ROW
log_output = NONE/FILE
isolation level != SERIALIZABLE

131 cluster-safe-mode
exercise: replicated traffic is not scrutinized (nor are TOI statements or MyISAM DML).
node-1:
mysql> set global pxc_strict_mode='disabled';
mysql> create table t (i int) engine=innodb;
mysql> insert into t values (100);
mysql> set global pxc_strict_mode='enforcing';
node-2:
mysql> select * from test.t;
(returns the row with i = 100)
Remove all tables we created in this experiment (t, myisam). Check that the modes are reset back to ENFORCING.

132 pxc+pfs

133 performance-schema
PFS has the concept of instruments: DB internal objects (mutexes, condition variables, files, threads, etc.) that can be instrumented. The user can decide which instruments to monitor based on need.
PXC objects to monitor:
PXC threads (don't generate events directly)
PXC files (don't generate events directly)
PXC stages
PXC mutexes
PXC condition variables

134 performance-schema: threads
Applier/slave thread(s)
Rollback/aborter thread
IST receiver/sender thread
SST joiner/donor thread
Gcache file remove thread
Receiver thread
Service thread
GComm thread
Write-set checksum thread

135 performance-schema: files
Ring buffer file (GCache)
GCache overflow page file
RecordSet file
grastate file / gvwstate file

136 performance-schema: stages
Any PXC action passes through different stages:
Writing/deleting/updating rows
Applying write-sets
Commit/rollback
Replicating
Commit
Replaying
Preparing for TO isolation
Applier idle
Rollback/aborter idle
Aborter active

137 performance-schema: mutex/cvar
Mutexes/condition variables are needed for coordination at each stage: wsrep_ready, wsrep_sst_init, wsrep_sst, wsrep_rollback, wsrep_slave_threads, wsrep_desync, wsrep_sst_thread, wsrep_replaying; the certification mutex, stats collection mutex, monitor mutexes (which control parallel replication), recvbuf mutex, transaction handling mutex, saved state mutexes, etc., plus the corresponding condition variables.
A monitor is like a mutex that enforces ordering: only a single thread is allowed to reside in the monitor, and which thread goes next depends on the global_seqno.

138 why use pfs in pxc?
Mainly to monitor the cluster, and for performance monitoring and tuning:
What is taking time? Certification? Replication? Commit?
What is the current state of the processing thread (processing + commit wait)?
Track transaction rollback incidents, etc.
More to come from a performance-analysis perspective in the future.

139 performance schema
exercise: let's track some wsrep stages. Enable the instruments on both nodes of the cluster (node-1, node-2):
use performance_schema;
update setup_instruments set enabled=1, timed=1 where name like 'stage%wsrep%';
update setup_consumers set enabled=1 where name like 'events_stages_%';
Start a sysbench workload, then track some stages and the time they take. Remember to disable PFS back to off once done.
mysql> select thread_id, event_name, timer_wait/1000000000 as msec
       from events_stages_history
       where event_name like '%in replicate stage%'
       order by msec desc limit 10;
(output shows thread_id, event_name 'stage/wsrep/wsrep: in replicate stage', and msec per event)
Let's now see the slave view.

140 performance schema
exercise: let's track conflicts. Enable the instruments on both nodes of the cluster (node-1, node-2):
use performance_schema;
update setup_instruments set enabled='YES', timed='YES' where name like 'wait%wsrep%';
update setup_consumers set enabled='YES' where name like '%wait%';
Run a conflicting workload (bf_abort) and track the events generated. Remember to disable PFS back to off once done.
mysql> select thread_id, event_id, event_name, timer_wait/1000000000 as msec
       from events_waits_history_long
       where event_name like '%COND%rollback%';
(output shows rows for wait/synch/cond/sql/COND_wsrep_rollback)
2 rows in set (0.00 sec)

141 pxc + secure

142 support for encrypted tablespace
MySQL 5.7 introduced encrypted tablespaces. If an existing node of your cluster (the DONOR) has an encrypted tablespace, you should be able to retain that tablespace, still encrypted, after donating it to the JOINER. PXC takes care of moving the keyring (needed for decrypting the tablespace) and then re-encrypting with the new keyring. (All this is done with the help of XB (Xtrabackup).) All of it is done transparently. Since the keyring is transmitted between the 2 nodes, it is important to ensure that the SST connection is encrypted, and both nodes should have the keyring plugin configured.

143 improved security
PXC inter-node traffic needs to be secured. There are 2 main types of traffic: SST, during the initial node setup, with the involvement of a 3rd-party tool (XB); and IST/replication traffic, internal between PXC nodes.
PXC 5.7 supports a mode, encrypt=4, that accepts the familiar triplet of ssl-key/ca/cert just like MySQL client-server communication. Remember to enable it on all nodes:
[sst]
encrypt=4
ssl-ca=<path>/ca.pem
ssl-cert=<path>/server-cert.pem
ssl-key=<path>/server-key.pem
MySQL auto-generates these certificates during bootstrap. You can continue to use them, but ensure all nodes use the same files, or the same ca-file for generating each server cert.

144 improved security
IST/replication traffic already supports the triplet; here is the way to specify it:
wsrep_provider_options="socket.ssl=yes;socket.ssl_key=/path/to/server-key.pem;socket.ssl_cert=/path/to/server-cert.pem;socket.ssl_ca=/path/to/cacert.pem"

145 improved security
WAIT! I am a first-time user; this sounds a bit tedious. pxc-encrypt-cluster-traffic helps secure SST/IST/replication traffic by passing along the default key/ca/cert generated by MySQL. Just set this variable to ON (pxc-encrypt-cluster-traffic=ON). It looks for the files in the [mysqld] section (the ones mysql uses for its own SSL connections), or for the auto-generated files in the data directory, and then auto-passes them for secure SST/IST connections. OK, that's simple now!
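A minimal sketch of the simple path (the variable name is real; the commented-out certificate paths are illustrative locations, not defaults from this tutorial):

[mysqld]
pxc-encrypt-cluster-traffic = ON
# optional: point at explicit certs instead of the auto-generated ones
# ssl-ca   = /etc/mysql/certs/ca.pem
# ssl-cert = /etc/mysql/certs/server-cert.pem
# ssl-key  = /etc/mysql/certs/server-key.pem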

146 pxc + proxysql

147 pxc + proxysql
ProxySQL is one of the leading load balancers, and PXC is fully integrated with it. ProxySQL can monitor PXC nodes. Percona has developed a simple single-step solution (proxysql-admin) to auto-configure the ProxySQL tracking tables by auto-discovering PXC nodes. The same script can also configure different setups:
LoadBalancer: each node is marked for read/write.
SingleWriter: only one node services WRITEs; the other nodes can be used for READs.

148 proxysql assisted pxc mode
Often it is important to take a node down for some maintenance: shut down the server, or take the load off it. Naturally a user would like to perform these operations directly on the node, but that could be risky with a load balancer involved, as it may still be routing active queries to the node. To make this easier we now allow the user to handle it with a single-step process (see the sketch below):
pxc_maint_mode (default=DISABLED)
pxc_maint_transition_period (default=10)
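A usage sketch with the real variables named above (the 30-second period is an illustrative value):

node2 mysql> SET GLOBAL pxc_maint_transition_period = 30;
node2 mysql> SET GLOBAL pxc_maint_mode = 'MAINTENANCE';  -- returns after the period
-- ... do maintenance; ProxySQL stops routing new traffic here ...
node2 mysql> SET GLOBAL pxc_maint_mode = 'DISABLED';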

149 proxysql assisted pxc mode
pxc_maint_mode values:
DISABLED: normal operation.
SHUTDOWN: set on shutdown of the node; the node waits for pxc_maint_transition_period before delivering the shutdown signal.
MAINTENANCE: the user switches the mode to MAINTENANCE; the node waits for pxc_maint_transition_period before returning the prompt, but marks the status immediately, allowing ProxySQL to update its routing. On completion of maintenance the user can switch it back to DISABLED.
ProxySQL monitors for the change in status and accordingly updates its internal table to avoid sending traffic to the node.

150 proxysql assisted pxc mode
exercise: install ProxySQL.
cd /var/repo; yum install proxysql-<version>.x86_64.rpm
which proxysql; proxysql --version
which proxysql-admin; proxysql-admin -v
cat /etc/proxysql-admin.cnf
node1 mysql> CREATE USER '<admin user>'@'%' IDENTIFIED BY 'admin';
node1 mysql> GRANT ALL PRIVILEGES ON *.* TO '<admin user>'@'%';
node1 mysql> FLUSH PRIVILEGES;

151 proxysql assisted pxc mode
exercise: ensure proxysql is running; if not, start it in the foreground using proxysql -f.
mysql -uadmin -padmin -P 6032 -h 127.0.0.1   (this opens the proxysql admin terminal)
proxysql-admin -e --write-node="<node1 IP>:3306"
(if the proxysql user doesn't have privileges:)
node1 mysql> GRANT ALL PRIVILEGES ON *.* TO '<admin user>'@'192.%';
node1 mysql> FLUSH PRIVILEGES;
Now check the resulting configuration in proxysql:
mysql -h 127.0.0.1 -P 6032 -uadmin -padmin   (connecting to proxysql)
select * from mysql_users; select * from mysql_servers; select * from scheduler;
Check if proxysql is able to route connections to the server:
mysql --user=proxysql_user --password=passw0rd --host=localhost --port=6033 --protocol=tcp -e "<query>"

152 proxysql assisted pxc mode
exercise: check the routing rules through the mysql_servers table of proxysql.
mysql> select * from mysql_servers;
(columns: hostgroup_id, hostname, port, status, weight, compression, max_connections, max_replication_lag, use_ssl, max_latency_ms, comment; one node is ONLINE in the WRITE hostgroup, the others ONLINE in the READ hostgroup; 3 rows in set)
Start monitoring to check if proxysql is multiplexing the queries among the hosts:
watch -n 0.1 "mysql -u proxysql_user -ppassw0rd -P 6033 -h 127.0.0.1 --protocol=tcp -e '<query>'"

153 proxysql assisted pxc mode
exercise: let's shut down node-2 and check multiplexing.
Monitor status:
watch -n 0.1 "mysql -uadmin -padmin -P 6032 -h 127.0.0.1 -e \"select status from mysql_servers\""
Monitor pxc_maint_mode:
watch -n 0.1 "mysql -e \"show variables like 'pxc_maint_mode'\""
Shut down and then restart node-2:
systemctl stop mysql   (check if multiplexing now stabilizes to node3)
systemctl start mysql

154 track more using show status

155 tracking ist
Often a node is taken down (server shutdown) for some maintenance, and restarting it can cause IST. The size of the IST depends on the write-sets processed, and it can take noticeable time for the IST activity to complete (while IST is in progress the node is not usable). Now we can track IST progress through wsrep_ist_receive_status:
wsrep_ist_receive_status: 75% complete, received seqno 1980 of ...
watch -n 0.1 "mysql -e \"show status like 'wsrep_ist_receive_status'\""
exercise: let's see this happen. We will shut down node-2, run a quick 30-sec OLTP workload on node-1, then restart node-2 and track the IST. Check what wsrep_last_committed is post-IST.

156 tracking flow control
Flow control is a self-regulating mechanism to avoid a case wherein the weakest/slowest node of the cluster falls behind the main cluster by a significant gap. Each replicating node has an incoming/receiving queue. The size of this queue is configurable through the gcs.fc_limit variable; the default for now is 100 (it was 16 initially). If there are multiple nodes in the cluster, this size is adjusted based on the number of nodes: sqrt(number-of-nodes) * configured size. Setting it on one node is not replicated to the other nodes.

157 tracking flow control
Starting with the new release, this size is exposed through the variable wsrep_flow_control_interval:
mysql> show status like 'wsrep_flow_control_interval';
wsrep_flow_control_interval   [ 141, 141 ]
There is a concept of lower and higher water marks (gcs.fc_factor).
exercise: we will change the flow-control limit using gcs.fc_limit and observe the effect.

158 tracking flow control
A node often enters flow control depending on the size of the queue and the load on the server, so it is important to be able to track whether a node is in flow control. To date this was difficult to find out; now we have an easy way: wsrep_flow_control_status.
mysql> show status like 'wsrep_flow_control_status';
wsrep_flow_control_status   ON
The variable is boolean and will be turned ON and OFF depending on whether the node is in flow control.
exercise: we will force a node to pause (sysbench-based flow control) and see if FLOW_CONTROL kicks in.

159 effect of increasing flow control
Based on the reading so far, it may sound tempting to increase the flow-control limit, since the node will eventually catch up. So why not increase it to a larger value (say 10K, 25K)? It has 3 main drawbacks (say you set fc_limit to 10K and you have 8K write-sets replicated on a slave waiting to be applied):
If the slave fails, it loses those 8K write-sets, and on restart will demand those 8K+ write-sets through IST (even if you restart the node within a few seconds).
If the DONOR's gcache doesn't have these write-sets (since the gap is large), it will invite an SST, which could be time-consuming.
If you care about fresh data and have wsrep_sync_wait=7, a SELECT query will be placed at the 8K+1 slot and will wait for those 8K write-sets to be serviced, increasing SELECT (and DML) latency.

160 improved logging and stages

161 improved sst/ist logging
We have improved SST/IST logging; we would love to hear back if you have more comments on this front. Major changes:
Made the logging structure easy to grasp.
FATAL errors are now highlighted to sight them easily.
In case of a FATAL error, the XB logs are appended to the main node logs.
Lots of sanity checks inserted to help catch errors early.
wsrep-log-debug=1 (in the [sst] section) to suppress the informational messages for a normal run.
Some major causes of error: wrong config params (usually credentials, port, or host); stale processes from a previous run or some other failed attempt left behind.

162 improved sst/ist logging
Example of a successful SST (timestamps abridged):
[Note] WSREP: GCache history reset: old(:0) -> new(461a9eac-20fa-11e7-a31c-4f98008f8d11:0)
WSREP_SST: [INFO] Proceeding with SST
WSREP_SST: [INFO] ...Waiting for SST streaming to complete!
[Note] WSREP: (4b82fdcc, 'tcp://...:5030') turning message relay requesting off
[Note] WSREP: 0.0 (n1): State transfer to 1.0 (n2) complete
[Note] WSREP: Member 0.0 (n1) synced with group
WSREP_SST: [INFO] Preparing the backup at /opt/projects/codebase/pxc/installed/pxc57/pxc-node/dn2//.sst
WSREP_SST: [INFO] Moving the backup to /opt/projects/codebase/pxc/installed/pxc57/pxc-node/dn2/
WSREP_SST: [INFO] Galera co-ords from recovery: 461a9eac-20fa-11e7-a31c-4f98008f8d11:...
[Note] WSREP: SST complete, seqno: 0

163 improved sst/ist logging
Example of a FATAL SST error (timestamps abridged):
[Warning] WSREP: Gap in state sequence. Need state transfer.
[Note] WSREP: Initiating SST/IST transfer on JOINER side (wsrep_sst_xtrabackup-v2 --role 'joiner' --address '...:5000' --datadir '/opt/projects/codebase/pxc/installed/pxc57/pxc-node/dn2/' --defaults-file '/opt/projects/codebase/pxc/installed/pxc57/pxc-node/n2.cnf' --defaults-group-suffix '' --parent '3433' --binlog 'mysql-bin')
WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
WSREP_SST: [ERROR] The xtrabackup version is ... Needs xtrabackup ... or higher to perform SST
WSREP_SST: [ERROR] ******************************************************
[ERROR] WSREP: Failed to read 'ready <addr>' from: wsrep_sst_xtrabackup-v2 --role 'joiner' ... Read: '(null)'
[ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' ... : 2 (No such file or directory)
[ERROR] WSREP: Failed to prepare for 'xtrabackup-v2' SST. Unrecoverable.
[ERROR] Aborting

164 improved wsrep stages
wsrep stages are important to track what a given thread is working on at any given time.
MASTER node (processlist excerpt):
7408 root ...:49938 test Query 0 wsrep: pre-commit/certification passed (15406) COMMIT
Replicating/slave node (processlist excerpt):
1569 system user NULL Sleep 0 wsrep: committed write set (14001) NULL
exercise: let us run a quick workload (say UPDATE_INDEX) and track some of the important stages that you may often see.

165 pxc performance

166 pxc performance [benchmark chart]

167 pxc performance [benchmark chart]

168 pxc performance [benchmark chart]

169 pxc performance [benchmark chart]

170 pxc performance
A MAJOR MILESTONE IN PXC HISTORY. Performance improved for all kinds of workloads and configurations:
sync_binlog=0/1
log-bin=enabled/disabled
OLTP/UPDATE_NON_INDEX/UPDATE_INDEX
Scales even with 1024 threads and 3 nodes.

171 pxc performance
So what were the problems and how did we solve them? We found 3 major issues that blocked effective use of REDO log flush parallelism and the group-commit protocol (the LEADER-FOLLOWER batch operation protocol).

172 pxc performance
How commit operates (2 steps):
prepare: mark the trx as prepared, redo flush (if log-bin=0). If operating in cluster mode (PXC), replicate the transaction + enforce ordering.
commit: flush binlogs + redo flush + commit in InnoDB.
PROBLEM-1: The CommitMonitor was exercised such that REDO flushes got serialized. Fix: split replicate + enforce ordering.
PROBLEM-2: The group-commit logic is optimized for batch load, but with commit ordering in PXC only 1 transaction was forwarded to the GC logic. Fix: release commit ordering early and rely on the GC logic to enforce ordering (interim_commit).
PROBLEM-3: The second fsync/redo flush (with log-bin=0) was again serialized. Fix: release the CommitMonitor once ordering is already enforced and the trx is committed in memory; only the redo log flush is pending.

173 pxc + pmm

174 monitor pxc through pmm
PXC is fully integrated with PMM, which helps monitor different aspects of PXC. Some of the key aspects:
How long a node can sustain without needing SST (projected based on historical data).
Size of replicated write-sets. Conflicts.
Basic statistical information (how many nodes are up and running); if a node goes down, that is reflected.
Whether a node is in a DESYNCed state or in flow control.
exercise: pmmdemo.percona.com

175 misc

176 simplified packaging
Starting with PXC 5.7 we have simplified the packaging. PXC 5.7 is a single package that you need to install for all PXC components (mainly the PXC server + replication library). garbd remains a separate package. External tools like xtrabackup and proxysql are available from the Percona repo. PXC is widely supported on most Ubuntu + Debian + CentOS platforms.

177 mysql/ps compatible
PXC is MySQL/PS compatible, and we normally keep it refreshed with major releases of the MySQL version (typically once per quarter). PXC is compatible with the Codership cluster (5.7 GAed a few months back).

178 tips to run successful cluster

179 tips to run successful cluster
TIP-1: All nodes of the cluster should have the same configuration settings (except those that are h/w dependent or node specific). For example: wsrep_max_ws_size, which is used to control large transactions; gcs.fc_limit (configured through wsrep_provider_options); wsrep_replicate_myisam (pxc-strict-mode doesn't allow it any more), etc.
You should review the default settings and adjust them as per your environment, but avoid changing settings you are not sure about. For example: wsrep_sync_wait helps read fresh data but increases SELECT latency; it is recommended to set it only if there is a need for fresh data.

180 Tips to Run a Successful Cluster
TIP-2: Ensure timeouts are configured for your environment. The default timeouts in PXC work in most cases (mainly for LAN setups), but every network setup is different, so understand the latency, the server configuration, and the load the server needs to handle, and set the timeouts accordingly.
PXC loves stable network connectivity, so try to install PXC on servers with stable network connectivity. GARBD may not need a high-end server, but it still needs good network connectivity.
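The timeouts in question live in wsrep_provider_options; a hedged example for a higher-latency network (the durations are illustrative only):

[mysqld]
wsrep_provider_options = "evs.keepalive_period=PT3S; evs.suspect_timeout=PT30S; evs.inactive_timeout=PT1M; evs.install_timeout=PT1M"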

181 Tips to Run a Successful Cluster
TIP-3: Monitor your PXC cluster regularly. Monitoring is a must for the smooth operation of any system, and PXC is no exception. Plan outages carefully; if you expect a sudden increase in workload for a short time, make sure enough resources are allocated for PXC to handle it (say, wsrep_slave_threads).
Ensure that nodes don't keep leaving and rejoining the cluster without any known reason. Don't boot multiple nodes to join through SST all of a sudden; one at a time. (A node is offered membership first, and only then does the data transfer begin.)
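A few status counters worth polling regularly; a sketch (sensible thresholds depend on your workload):

node1> SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
node1> SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';       -- sustained growth = node cannot keep up
node1> SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';
node1> SET GLOBAL wsrep_slave_threads = 16;   -- add applier threads ahead of a known workload spike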

182 Tips to Run a Successful Cluster
TIP-4: Understand your workload. Every user's workload is different, and it is important to understand what kind of workload your application generates. One such user observed that his cluster dragged at one point in the day and then took a long time to catch up: the application was firing an ALTER command on a large table while DML was in progress, causing a complete stall of the DML. Once the ALTER finished, the DML workload resumed, and a slave failed to catch up due to insufficient resources (read: slave threads).
Avoid use of experimental features (strict mode will block them anyway). Double-check the semantics when using commands like FTWRL, RSU, and Events.
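One way to avoid the cluster-wide stall described above is to run the ALTER node by node with RSU instead of the default TOI; a sketch (the table name is a placeholder, and RSU requires the change to be backward compatible):

node2> SET SESSION wsrep_OSU_method = 'RSU';   -- desyncs this node; the DDL is not replicated
node2> ALTER TABLE test.big_table ADD COLUMN c1 INT;
node2> SET SESSION wsrep_OSU_method = 'TOI';
-- repeat on each node in turn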

183 Tips to Run a Successful Cluster
TIP-5: What to capture if you hit an issue:
- A clear description of the setup you had (3-node cluster, 2 nodes + garbd, master replicating to a cluster, slave, etc.).
- my.cnf + command-line options + show-status output + any other configuration you know you have modified. (If they differ between nodes, capture them for all nodes.)
- The complete error.log from all of the nodes. (If the issue involves an SST failure, supporting error logs are present in the data-dir, or in the hidden .sst folder.)
- The state of the cluster while it was working fine, and a rough idea of the workload that was active and pushed the cluster into its current state.
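A minimal collection script along these lines (the paths and output directory are assumptions; adjust to your layout):

node1# mkdir -p /tmp/pxc-report
node1# cp /etc/my.cnf /tmp/pxc-report/
node1# cp /var/lib/mysql/*.err /tmp/pxc-report/ 2>/dev/null    # error-log location may differ
node1# mysql -e "SHOW GLOBAL STATUS"    > /tmp/pxc-report/status.txt
node1# mysql -e "SHOW GLOBAL VARIABLES" > /tmp/pxc-report/variables.txt
# repeat on every node, then attach /tmp/pxc-report from each node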

184 Thanks for Attending

185 Connect with Us
We would love to hear what more you want from PXC, or what problems you face in using it; we would be happy to assist you, or to take a feature request and put it on the product roadmap.
Connect with us:
- forum
- krunal.bauskar@percona.com, matthew.boehm@percona.com
- drop in at the Percona booth
- catch us during the conference
