IBM InfoSphere Data Replication's Change Data Capture (CDC) Fast Apply
IBM Corporation

Agenda
- Overview of Fast Apply
- When to use Fast Apply
- The available strategies and when to use them
- Common concepts
- How to configure Fast Apply
- Limitations
- Documentation

Overview of Fast Apply
- Fast Apply is a collection of algorithms that can be used to increase throughput for a subscription
- Each algorithm is designed for a particular pattern of change data
- To select the algorithm appropriate for the workload, you need to know something about the changes that will be replicated and the relationships between the target tables
- Fast Apply will often use more resources per change row than the CDC default apply, so it should only be used when it is necessary to meet business requirements

When to use Fast Apply
- Only use Fast Apply when the default apply cannot meet the business requirements for latency because of bottlenecks in the target database apply
- The Performance Monitoring and Tuning Guide provides instructions on how to use the Performance Monitor to make this determination
- Latency must be increasing because CDC cannot keep up with the source database (i.e. latency increases during heavy load on the source)

The four strategies currently available
- Group By Table
- Parallelize By Single Table
- Parallelize By Table
- Parallelize Single Table By Hash

Group By Table
- CDC reorders a set of operations, creating a list of operations for each table, and then attempts to apply them to the target system
- Source operations:
    INSERT TABLE2
    UPDATE TABLE1
    INSERT TABLE3
    INSERT TABLE2
    INSERT TABLE3
- Applied as:
    UPDATE TABLE1
    INSERT TABLE2
    INSERT TABLE2
    INSERT TABLE3
    INSERT TABLE3
- Reordering the operations allows CDC to use the JDBC batch facility for the inserts for both TABLE2 and TABLE3

When to use Group By Table
- It is best if there are no referential integrity constraints defined in the target database for the tables, because then the reordering carries no risk of apply errors
- It is best if there tend to be series of similar (insert/update/delete) operations for a given table. For example, if the application tends to create transactions where a new row is inserted and then immediately updated to fill in the extra columns, this strategy would not create an opportunity for batching operations
- This algorithm can still provide benefit when there is referential integrity, as long as the referential integrity exists on the source as well, but this requires more knowledge of the behavior of the source application
  - CDC will apply the operations in the order that they first appear in the unit of work
  - If the application tends to manipulate both the parent and the child in the same transaction, then the appropriate one will always appear first
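The batching opportunity described above can be sketched in a few lines of Python. This is only an illustration, not CDC code: the function names are hypothetical, operations are modelled as (verb, table) pairs, and the sketch assumes that table groups are emitted in the order each table first appears in the unit of work. A "batch" here is simply a run of consecutive identical (verb, table) pairs.

```python
def group_by_table(operations):
    """Reorder (verb, table) pairs so that each table's operations are contiguous.

    Assumption: groups are emitted in the order each table first appears.
    """
    groups = {}
    for verb, table in operations:
        groups.setdefault(table, []).append((verb, table))
    return [op for table_ops in groups.values() for op in table_ops]


def count_batches(operations):
    """Count runs of consecutive identical (verb, table) pairs."""
    batches = 0
    previous = None
    for op in operations:
        if op != previous:
            batches += 1
        previous = op
    return batches


# Interleaved inserts: grouping creates longer runs that could be batched.
unit_of_work = [
    ("UPDATE", "TABLE1"),
    ("INSERT", "TABLE2"),
    ("INSERT", "TABLE3"),
    ("INSERT", "TABLE2"),
    ("INSERT", "TABLE3"),
]
reordered = group_by_table(unit_of_work)
print(count_batches(unit_of_work), "->", count_batches(reordered))  # 5 -> 3

# Insert immediately followed by update on one table: nothing to gain.
insert_then_update = [("INSERT", "T"), ("UPDATE", "T"), ("INSERT", "T"), ("UPDATE", "T")]
print(count_batches(insert_then_update), "->",
      count_batches(group_by_table(insert_then_update)))  # 4 -> 4
```

The second case mirrors the insert-then-update pattern mentioned above: the operations are already grouped by table, so the reordering produces no longer runs.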

When to use Group By Table (continued)
- Group By Table can also help when the CDC target engine is the bottleneck, which is an indication that the image builder is the bottleneck
- Group By Table will utilize multiple image builders to format the SQL statements
- By default it utilizes four image builders

Parallelize By Single Table
- This assumes that there is only a single table mapping in the subscription and that there are no dependencies between operations
- It is designed for mappings where only inserts are done
- Database connections will always be evenly loaded

When to use Parallelize By Single Table
- The subscription must have a single table mapping
- You should be certain that the source application only inserts into the table, or that you are using a Live Audit subscription
- The target database shouldn't be resource constrained

Parallelize By Table
- Similar to Group By Table, but instead of applying the reordered operations on a single database connection, the operations are applied concurrently across multiple database connections
- Source operations:
    INSERT TABLE2
    UPDATE TABLE1
    INSERT TABLE3
    INSERT TABLE2
    INSERT TABLE3
- Applied as:
    UPDATE TABLE1
    INSERT TABLE2
    INSERT TABLE2
    INSERT TABLE3
    INSERT TABLE3
- Parallel database connections allow for greater throughput into the target database
- This can also allow JDBC batch to be used

Parallelize By Table (continued)
- Note that there is no attempt to balance the load. The tables are assigned to a database connection in the order they are seen in the stream of changes
- Source operations:
    INSERT TABLE2
    UPDATE TABLE1
    INSERT TABLE3
    INSERT TABLE4
    INSERT TABLE4
- Applied as:
    UPDATE TABLE1
    INSERT TABLE4
    INSERT TABLE4
    INSERT TABLE2
    INSERT TABLE3
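To make the lack of load balancing concrete, here is a minimal Python sketch. It is not CDC code: the function name and the table names are hypothetical, and the round-robin rule used to pick a connection for each newly seen table is an assumption (the slide only says that tables are assigned in the order they are seen).

```python
def assign_tables(operations, connection_count):
    """Route (verb, table) pairs to connections, one connection per table.

    Assumption: a table seen for the first time takes the next connection
    in round-robin order; no attempt is made to balance the load.
    """
    assignment = {}
    queues = [[] for _ in range(connection_count)]
    for verb, table in operations:
        if table not in assignment:
            assignment[table] = len(assignment) % connection_count
        queues[assignment[table]].append((verb, table))
    return queues


# Hypothetical stream in which one table carries most of the volume.
stream = [
    ("INSERT", "ORDERS"),
    ("INSERT", "AUDIT"),
    ("INSERT", "ORDERS"),
    ("UPDATE", "CUSTOMERS"),
    ("INSERT", "ORDERS"),
    ("INSERT", "ORDERS"),
]
queues = assign_tables(stream, 2)
print([len(queue) for queue in queues])  # [5, 1]
```

With two connections, the busy table and a later table land on the same connection while the other connection receives a single operation, which is the kind of uneven loading the note above warns about.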

When to use Parallelize By Table
- There are multiple tables in the subscription
- Ideally there would be no referential integrity constraints defined in the target database
- Ideally there would be a number of tables which tend to have a significant volume of changes, so that the database connections will tend to be evenly loaded
- The target database shouldn't be resource constrained

Parallelize Single Table By Hash
- Similar to Parallelize By Table, but a hash is used to distribute operations to the different threads so that dependent operations are processed by the same thread
- Operations with the same key value are assigned to the same database connection, in the same order as performed on the source system
- The key value is determined from the key columns identified for the target table mapping
- A hash function on the key values (rather than the key values themselves) is used for distribution, to more easily support multi-column keys and to more evenly distribute the operations across the available threads
  - A hash code is generated based on the target replication keys
  - A modulo with the number of threads configured is then used to select the apply thread that the operation is routed to
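The routing rule (hash of the key values, then modulo the thread count) can be sketched as follows. This is an illustration under stated assumptions, not the CDC implementation: the actual hash function is not given in these slides, so CRC32 from Python's standard library stands in for it, and the function names and the separator byte are hypothetical.

```python
import zlib


def route(key_values, thread_count):
    """Pick an apply thread for an operation from its key column values."""
    # Join multi-column keys with a separator so ("ab", "c") differs from ("a", "bc").
    encoded = "\x1f".join(str(value) for value in key_values).encode("utf-8")
    return zlib.crc32(encoded) % thread_count  # CRC32 is a stand-in hash


def distribute(operations, thread_count):
    """Append each (verb, table, key) operation to its thread's queue in source order."""
    queues = [[] for _ in range(thread_count)]
    for verb, table, key in operations:
        queues[route(key, thread_count)].append((verb, table, key))
    return queues


operations = [
    ("INSERT", "A", ("KEY1",)),
    ("INSERT", "B", ("KEY2",)),
    ("UPDATE", "A", ("KEY1",)),
    ("INSERT", "C", ("KEY3",)),
    ("INSERT", "D", ("KEY4",)),
    ("DELETE", "C", ("KEY3",)),
]
queues = distribute(operations, 3)
for thread, queue in enumerate(queues):
    print(thread, queue)
```

Because the thread is a pure function of the key, every operation on a given key lands in the same queue, and appending in arrival order preserves the source order within that queue. How evenly the keys spread across the queues depends entirely on the hash values.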

Parallelize Single Table By Hash (continued)
Example One
- Source operations:
    INSERT Table A KEY1
    INSERT Table B KEY2
    UPDATE Table A KEY1
    INSERT Table C KEY3
    INSERT Table D KEY4
    DELETE Table C KEY3
- Applied as:
    INSERT Table A KEY1
    UPDATE Table A KEY1
    INSERT Table B KEY2
    INSERT Table D KEY4
    INSERT Table C KEY3
    DELETE Table C KEY3

Parallelize Single Table By Hash (continued)
Example Two
- Source operations:
    INSERT Table A KEY1
    INSERT Table B KEY2
    UPDATE Table A KEY1
    INSERT Table C KEY3
    INSERT Table A KEY5
    UPDATE Table A KEY5
    INSERT Table A KEY6
    INSERT Table D KEY4
    DELETE Table C KEY3
    UPDATE Table A KEY6
- Applied as:
    INSERT Table A KEY1
    UPDATE Table A KEY1
    INSERT Table B KEY2
    INSERT Table A KEY6
    INSERT Table D KEY4
    UPDATE Table A KEY6
    INSERT Table C KEY3
    INSERT Table A KEY5
    UPDATE Table A KEY5
    DELETE Table C KEY3

When to use Parallelize Single Table By Hash
- Despite what the name implies, Parallelize Single Table By Hash works in subscriptions with one or more tables
- Parallelize By Hash will ensure that any operations that are dependent on each other, i.e. are on the same row of the same table, will be applied by the same thread in the original order
  - Only operations that are not dependent on each other will be applied by different threads
  - Non-dependent operations are not guaranteed to be applied in the same order
- This is often useful in the following cases:
  - Where a single table has already been placed in its own subscription because of either a high data volume or a relatively slow apply speed
  - When there is very uneven volume amongst the tables and one or two tables have the majority of the activity

When not to use Parallelize Single Table By Hash
- Parallelize Single Table By Hash may not be appropriate under the following conditions:
  - If the target database is resource constrained
  - It cannot be used when a row may have its key value updated
  - When the changes for a given table tend to resolve to the same key values instead of being spread across many keys
- There may occasionally be units of work where the hash codes generated for the key values skew the distribution across the threads, so that some threads have significantly more work than others
  - For example, if all changes resolve to the same keys for a given table, the activity would be skewed to one thread
  - This is expected to be rare

Common Concepts: Unit of work
- All of the Fast Apply strategies operate on a Unit Of Work (UOW) consisting of several source transactions
- A UOW will always end at a source transaction boundary
- Creating a larger UOW will tend to cause higher latency than when CDC is keeping up using its default apply
- For the algorithms that utilize several database connections, CDC will commit across all of these connections at the same time, but not atomically
  - Each thread will complete applying its changes before any thread begins to commit
  - The bookmark will only be written when the master thread commits
  - A query run against the target database may temporarily see just part of the UOW

Common Concepts: Unit of work (continued)
- When a subscription ends normally, it will always complete the current UOW
- When a subscription ends in an abnormal fashion, it may be that only some of the connections have committed
  - CDC will ensure that the incomplete unit of work is completed when the subscription is restarted
- The UOW threshold value works like the other threshold values around grouping transactions. If there is latency, CDC will create units of work based on this value to maximize throughput and reduce latency. However, when there is no latency, CDC will use smaller units of work to ensure it is not artificially adding latency
- Building a larger UOW requires more memory for the CDC instance
  - Typically a minimum of 4 GB of memory is required to use Fast Apply
  - When performance tuning, there is a balance between the size of the UOW and the memory required (and the time utilized by garbage collection)

Common Concepts: Unit of work (continued)
- For the Fast Apply algorithms you specify the number of operations (inserts/updates/deletes) for the UOW
- CDC will stop adding transactions to the UOW once it has reached the threshold value you specified
  - As such, CDC will only initiate the commit of a UOW at a source transaction boundary
- If CDC is building the UOW, has reached two times the UOW value specified, and has not seen a source commit, Fast Apply will be abandoned and the large transaction will be moved to a single thread and applied serially in the original order
  - If a UOW is abandoned, there is a message in the CDC target trace files stating "exceeds the maximum UOW size"
  - If the UOW is abandoned frequently, increase the UOW size specified, which will also likely require additional memory allocated to the CDC instance
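The threshold rules above can be sketched as a small Python function. This is a simplified model, not CDC code: each transaction is a list of operations with an implied commit at its end, the function and mode names are hypothetical, and when Fast Apply is abandoned the sketch assumes the whole pending unit is applied serially (the slide does not say what happens to transactions already collected).

```python
def build_units_of_work(transactions, threshold):
    """Group transactions into units of work as (mode, operations) pairs.

    mode is "fast" for a normal unit, or "serial" when the unit reached
    2 x threshold while still waiting for a source commit.
    """
    units = []
    current = []
    for transaction in transactions:
        mode = "fast"
        for index, operation in enumerate(transaction):
            current.append(operation)
            commit_pending = index < len(transaction) - 1
            if commit_pending and len(current) >= 2 * threshold:
                mode = "serial"  # abandon Fast Apply for this unit
        # The source commit has now been seen: close the unit if it is full.
        if mode == "serial" or len(current) >= threshold:
            units.append((mode, current))
            current = []
    if current:
        units.append(("fast", current))  # leftover unit at end of stream
    return units


transactions = [
    ["a1", "a2"],
    ["b1", "b2", "b3"],
    [f"c{i}" for i in range(1, 11)],  # one large transaction of 10 operations
    ["d1"],
]
units = build_units_of_work(transactions, threshold=4)
print([(mode, len(operations)) for mode, operations in units])
# [('fast', 5), ('serial', 10), ('fast', 1)]
```

Note that the first unit holds 5 operations even though the threshold is 4: a unit only closes at a transaction boundary, so it can overshoot the configured size.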

Common Concepts: Optimistic strategy
- CDC will attempt to use the defined strategy for a unit of work, but if apply errors occur it will roll back all that work and retry that unit of work using the default strategy (in the original source order over a single database connection)
- This retry/fallback has a performance impact. If it occurs often enough, Fast Apply might ultimately provide lower throughput than the default CDC apply
- There is no risk of data loss except in the case where deferred constraint checking has been enabled (noted as a limitation)
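The try-then-fall-back behaviour can be illustrated with a toy Python model. Everything here is hypothetical and only stands in for the idea: the target is a pair (parent keys, child keys) with one referential rule, the "fast" ordering is a stable sort by table name that happens to place CHILD before PARENT, and working on a copy of the state plays the role of the rollback.

```python
class ApplyError(Exception):
    """Raised when the simulated target rejects an operation."""


def apply_in_order(state, operations):
    """Apply (table, key) operations to a copy of state and return the new state."""
    parents, children = set(state[0]), list(state[1])
    for table, key in operations:
        if table == "PARENT":
            parents.add(key)
        elif key not in parents:
            raise ApplyError(f"no parent row for child {key}")
        else:
            children.append(key)
    return parents, children


def fast_order(operations):
    """Hypothetical reordering: stable sort by table name (CHILD sorts before PARENT)."""
    return sorted(operations, key=lambda operation: operation[0])


def optimistic_apply(state, unit_of_work):
    """Try the reordered apply first; on error, retry in the original source order."""
    try:
        return apply_in_order(state, fast_order(unit_of_work)), "fast"
    except ApplyError:
        # The original state was never mutated, which stands in for the rollback.
        return apply_in_order(state, unit_of_work), "default"


empty = (set(), [])
state, mode = optimistic_apply(empty, [("PARENT", 1), ("CHILD", 1)])
print(mode, state)  # default ({1}, [1])
state, mode = optimistic_apply(state, [("PARENT", 2), ("PARENT", 3)])
print(mode, state)  # fast ({1, 2, 3}, [1])
```

The first unit of work pays for two apply attempts, which is the performance cost of the fallback; the second unit succeeds on the first attempt.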

Parallelize By Table with UOW Concept
- Below is an example that illustrates how Fast Apply with Parallelize By Table builds a Unit Of Work (UOW) before it commits
- The example has three transactions being processed (Tx1, Tx2, Tx3)
- For illustration, let's say that CDC groups Tx1, Tx2, and Tx3 together into a single unit of work
- Source operations:
    Tx1: UPDATE TABLE1
    Tx1: INSERT TABLE2
    Tx2: UPDATE TABLE1
    Tx3: INSERT TABLE3
    Tx3: INSERT TABLE2
    Tx3: INSERT TABLE3
- Applied as:
    Thread 1: UPDATE TABLE1, UPDATE TABLE1
    Thread 2: INSERT TABLE2, INSERT TABLE2
    Thread 3: INSERT TABLE3, INSERT TABLE3

Parallelize By Table with UOW Concept (continued)
- Once CDC has passed the UOW threshold set for Fast Apply and sees a source commit, each slave thread will commit, and then the master apply thread will commit (along with writing the CDC bookmark)
- Thus, if thread 1 is the master thread in the example, you would see thread 2 commit, then thread 3 commit, and finally master thread 1 commit
- For any given table, the source transaction sequence is maintained
- Commit boundaries are different from the source until the master thread commits
- When the master thread commits, the target will be at a consistent commit boundary with the source
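The commit sequence can be imitated with Python threads and two barriers. This is a sketch of the ordering only, not of CDC internals: the function name and event labels are hypothetical, index 0 plays the master thread, and the order in which the non-master threads commit relative to each other is left to the scheduler here.

```python
import threading


def run_unit_of_work(work_by_thread):
    """Apply each thread's operations, then commit the master (index 0) last."""
    events = []
    lock = threading.Lock()
    all_applied = threading.Barrier(len(work_by_thread))
    others_committed = threading.Barrier(len(work_by_thread))

    def log(entry):
        with lock:
            events.append(entry)

    def worker(index, operations):
        for operation in operations:
            log(("apply", index, operation))
        all_applied.wait()            # no commit until every thread has applied
        if index != 0:
            log(("commit", index))
        others_committed.wait()       # the master waits for the other commits
        if index == 0:
            log(("commit+bookmark", index))

    threads = [
        threading.Thread(target=worker, args=(index, operations))
        for index, operations in enumerate(work_by_thread)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return events


events = run_unit_of_work([
    ["UPDATE TABLE1", "UPDATE TABLE1"],
    ["INSERT TABLE2", "INSERT TABLE2"],
    ["INSERT TABLE3", "INSERT TABLE3"],
])
for event in events:
    print(event)
```

Between the first non-master commit and the master's commit, a reader of the target would see only part of the unit of work, which matches the non-atomic commit described for multi-connection strategies.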

Configuring
- Select a Fast Apply strategy by identifying a special subscription-level user exit
  - Example: com.datamirror.ts.target.publication.userexit.fastapply.groupbytable
- You specify the unit of work size and the number of parallel database connections via the parameter for the user exit
  - E.g. 3:10000 means 3 database connections and a unit of work of 10,000 operations
- For Group By Table you just specify the unit of work size
  - E.g. 25000 means a unit of work of 25,000 operations
- Note that you should only change modes after you've ended the subscription normally (so that it wasn't left with an incomplete unit of work applied to the target database)
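As a reading aid for the two parameter forms shown above, the following Python sketch splits a parameter string into its parts. It is not part of CDC and does not reflect how CDC itself parses the value: the function name and the returned field names are hypothetical, and None is used here simply to mean that no connection count was given.

```python
def parse_fast_apply_parameter(parameter):
    """Split "connections:uow_size" or "uow_size" into named integer fields."""
    parts = parameter.strip().split(":")
    if len(parts) == 1:
        # Group By Table form: only the unit of work size is given.
        return {"connections": None, "uow_size": int(parts[0])}
    if len(parts) == 2:
        return {"connections": int(parts[0]), "uow_size": int(parts[1])}
    raise ValueError(f"unexpected parameter format: {parameter!r}")


print(parse_fast_apply_parameter("3:10000"))  # {'connections': 3, 'uow_size': 10000}
print(parse_fast_apply_parameter("25000"))    # {'connections': None, 'uow_size': 25000}
```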

Limitations (continued)
- Care must be taken if triggers are used on the target tables
- The general rule is that the trigger can't read or modify rows that may be read or modified by a different apply thread
- This is because other tables are being modified asynchronously by other apply threads, so the trigger may not see the operations in the order it requires, or the trigger could cause locking conflicts

Documentation
- Documentation on Fast Apply is available in the Performance Monitoring and Tuning Guide
- The guide provides a process for identifying and resolving latency issues
- The Fast Apply technology is described in the section on resolving target database bottlenecks

Additional Resources
- IBM developerWorks CDC community: https://www.ibm.com/developerworks/mydeveloperworks/groups/service/html/communityview?communityuuid=a9b542e4-7c66-4cf3-8f7b-8a37a4fdef0c
- IBM CDC Information Center: http://www-01.ibm.com/support/knowledgecenter/SSTRGZ_11.3.0/com.ibm.idr.frontend.doc/pv_welcome.html
- CDC Redbook: http://www.redbooks.ibm.com/redbooks.nsf/redbookabstracts/sg247941.html?open
- IBM CDC Support: http://www-947.ibm.com/support/entry/portal/product/information_management/infosphere_change_data_capture?productcontext=-873715215
- Passport Advantage: https://www-112.ibm.com/software/howtobuy/softwareandservices/passportadvantage

Legal Disclaimer
© IBM Corporation 2014. All Rights Reserved.
The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM's current product plans and strategy, which are subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in this presentation may change at any time at IBM's sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.