Using MDM Big Data Relationship Management to Perform the Match Process for MDM Multidomain Edition

Copyright Informatica LLC 1993, 2017. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica LLC. All other company and product names may be trade names or trademarks of their respective owners and/or copyrighted materials of such owners.

Abstract

You can use MDM Multidomain Edition to create and manage master data. The MDM Hub uses multiple batch processes, including the match process, to create master data. You can use MDM Big Data Relationship Management to perform the match process on a large data set. The MDM Hub can leverage the scalability of MDM Big Data Relationship Management to perform the match process on the initial data and the incremental data. This article describes how to use MDM Big Data Relationship Management to perform the match process for MDM Multidomain Edition.

Supported Versions

- MDM Big Data Relationship Management 10.0 HotFix 2 through 10.0 HotFix 6
- PowerCenter 9.6.1 HotFix 2 and 9.6.1 HotFix 3
- PowerExchange for Hadoop 9.6.1 HotFix 2 and 9.6.1 HotFix 3
- MDM Multidomain Edition 9.7.0 and 9.7.1

Table of Contents

Overview
Match Process in the MDM Hub
Match Process with MDM Big Data Relationship Management
Scenario
Migrate Data from the Hub Store to HDFS
Step 1. Import Source Definitions for the Base Objects in the Hub Store
Step 2. Create a Mapping
Step 3. Create an HDFS Connection
Step 4. Run the Mapping
Create the Configuration and Matching Rules Files
Run the Extractor Utility
Run the Initial Clustering Job
Output Files of the Initial Clustering Job
Migrate the Match-Pair Data from HDFS to the Hub Store
Step 1. Create a Source Definition for the Match-Pair Output Data
Step 2. Create a Mapping
Step 3. Run the Mapping
Updating the Consolidation Indicator of the Matched Records
Updating the Consolidation Indicator by Running an SQL Query
Updating the Consolidation Indicator by Using the Designer and the Workflow Manager
Manage the Incremental Data
Step 1. Migrate the Incremental Data from the Hub Store to HDFS
Step 2. Run the Initial Clustering Job in the Incremental Mode
Step 3. Migrate the Match-Pair Data from HDFS to the Hub Store
Step 4. Update the Consolidation Indicator of the Matched Records

Overview

You can use MDM Multidomain Edition to create and manage master data. Before you create master data, you must load data into the MDM Hub and process it through multiple batch processes, such as the land, stage, load, tokenize, match, consolidate, and publish processes. The match process identifies duplicate records and determines the records that the MDM Hub can automatically consolidate and the records that you must manually review before consolidation. For more information about MDM Multidomain Edition and the batch processes, see the MDM Multidomain Edition documentation.

You can use MDM Big Data Relationship Management to perform the match process on a large data set in a Hadoop environment. The MDM Hub can leverage the scalability of MDM Big Data Relationship Management to perform the match process, and you can continue to use the MDM Hub to manage master data. For more information about MDM Big Data Relationship Management, see the Informatica MDM Big Data Relationship Management User Guide.

Match Process in the MDM Hub

You can run the Match job in the MDM Hub to match and merge the duplicate data that the job identifies.

The following image shows how the Match job functions:

The Match job performs the following tasks:
1. Reads data from the base objects in the Hub Store.
2. Generates match keys for the data in the base object and stores the match keys in the match key table associated with the base object.
3. Uses the match keys to identify the records that match.
4. Updates the Match table with references to the matched record pairs and related information.
5. Updates the consolidation indicator of the matched records in the base objects to 2. The indicator 2 indicates the QUEUED_FOR_MERGE state, which means that the records are ready for a merge.

Match Process with MDM Big Data Relationship Management

MDM Big Data Relationship Management reads data from HDFS, performs the match process, and creates the match-pair output in HDFS. Use PowerCenter with PowerExchange for Hadoop to migrate data from the Hub Store to HDFS and from HDFS to the Hub Store.

The following image shows how you can migrate data from the Hub Store to HDFS:

To perform the match process with MDM Big Data Relationship Management, perform the following tasks:
1. Use PowerCenter with PowerExchange for Hadoop to migrate data from the base objects in the Hub Store to HDFS in the fixed-width format.
2. Run the extractor utility of MDM Big Data Relationship Management to generate the configuration file and the matching rules file.
3. Run the initial clustering job of MDM Big Data Relationship Management that performs the following tasks:
   a. Reads the input data from HDFS.
   b. Indexes and groups the input data based on the rules in the matching rules file.
   c. Matches the grouped records and links the matched records.
   d. Creates the match-pair output files in HDFS that contain a list of records and their matched records.
4. Use PowerCenter with PowerExchange for Hadoop to migrate the match-pair output data to the Match table in the Hub Store.
5. Run an SQL query or use PowerCenter with PowerExchange for Hadoop to update the consolidation indicator of the matched records in the base objects to 2. The indicator 2 indicates the QUEUED_FOR_MERGE state, which means that the records are ready for a merge.

Scenario

You work for a large healthcare organization that has a large data set as the initial data. You want to use MDM Multidomain Edition to create and manage master data. You also want to leverage the scalability of MDM Big Data Relationship Management to perform the match process.

Migrate Data from the Hub Store to HDFS

Use the PowerCenter Designer and the PowerCenter Workflow Manager to migrate data from the base objects in the Hub Store to HDFS.
1. Import source definitions for the base objects in the Hub Store.
2. Create a mapping with the source instances, a Joiner transformation, and a target definition.
3. Create an HDFS connection.

4. Run the mapping to migrate data from the base objects in the Hub Store to HDFS.

Step 1. Import Source Definitions for the Base Objects in the Hub Store

Use the Designer to import a relational source definition for the base object in the Hub Store. A relational source definition extracts data from a relational source.

If you want to integrate data from more than one base object, create a relational source definition for each base object. For example, if you have separate base objects for person names and addresses, create two relational source definitions, one for the person names and another for the addresses. You can use a Joiner transformation to integrate the person names and the address details.

The following image shows the source definitions for the person names and addresses base objects:

Step 2. Create a Mapping

Use the Designer to create a mapping. A mapping represents the data flow between the source and target instances.
1. Create a mapping and add the source definitions that you import. When you add the source definitions to the mapping, the Designer adds a Source Qualifier transformation for each source definition.
2. Set the value of the Source Filter property to CONSOLIDATION_IND <> 9 for each Source Qualifier transformation. The Source Filter property ensures that the mapping migrates only the records whose consolidation indicator is not equal to 9.
The following image shows how to configure the Source Filter property for a Source Qualifier transformation:

Note: You can also set additional conditions to configure filters for a match path. The filter of a match path includes or excludes records for matching based on the values of the columns that you specify.

3. Create a Joiner transformation. A Joiner transformation joins data from two or more base objects in the Hub Store. The Joiner transformation uses a condition that matches one or more pairs of columns between the base objects.
4. Link the ports between the Source Qualifier transformations and the Joiner transformation.
5. Configure a condition to integrate data from the base objects.
The following image shows a condition based on which the Joiner transformation integrates data:

6. Create a target definition based on the Joiner transformation. When you create a target definition from a transformation, the target type defaults to the type of the source instance. You must update the writer type of the target definition to HDFS Flat File Writer in the Workflow Manager.
7. Link the ports between the Joiner transformation and the target definition.
The following image shows a mapping that contains two source instances, a Joiner transformation, and a target definition:
8. Validate the mapping. The Designer marks a mapping as not valid when it detects errors.

Step 3. Create an HDFS Connection

Use the Workflow Manager to create an HDFS connection. You can use the HDFS connection to access the Hadoop cluster.
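Before you configure the HDFS connection, you can confirm that the Hadoop cluster is reachable. The following is a minimal sketch, not taken from the product documentation; it assumes that a Hadoop client is configured on the machine that runs the PowerCenter Integration Service, and /usr/hdfs is the hypothetical HDFS location used in the examples in this article.

# Confirm that HDFS is reachable from this machine.
hadoop fs -ls /

# Create the hypothetical target directory that the session writes the flat file to.
hadoop fs -mkdir -p /usr/hdfs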

Step 4. Run the Mapping

Use the Workflow Manager to run a mapping.
1. Create a workflow and configure the PowerCenter Integration Service for the workflow.
2. Create a Session task and select the mapping.
3. Edit the session to update the following properties:
   - Writer type of the target definition to HDFS Flat File Writer.
   - HDFS connection details for the target definition.
   The following image shows the properties of a target definition:
4. Start the workflow to integrate the source data and write the integrated data to the flat file in HDFS.

Create the Configuration and Matching Rules Files

You must create a configuration file and a matching rules file in the XML format to run the MapReduce jobs of MDM Big Data Relationship Management. A configuration file defines parameters related to the Hadoop distribution, repository, SSA-NAME3, input record layout, and metadata. A matching rules file defines indexes, searches, and matching rules for each index.

Use one of the following methods to create the configuration file and the matching rules file:
- Customize the MDMBDRMTemplateConfiguration.xml sample configuration file and the MDMBDRMMatchRuleTemplate.xml sample matching rules file located in the following directory: /usr/local/mdmbdrm-<Version Number>/sample (see the example after this list).
- Run the following extractor utility to generate the configuration and matching rules files:
  - On MDM Big Data Relationship Management 10.0 HotFix 2 through 10.0 HotFix 5: BDRMExtractConfig.jar
  - On MDM Big Data Relationship Management 10.0 HotFix 6 or later: config-tools.jar
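If you customize the sample templates, copy them to a working location before you edit them. The following is a minimal sketch; it assumes the default installation directory and uses /usr/local/conf as a hypothetical location for the customized files, matching the file names that appear in the examples later in this article.

# Copy the sample templates to a hypothetical working location and edit the copies.
cp /usr/local/mdmbdrm-<Version Number>/sample/MDMBDRMTemplateConfiguration.xml /usr/local/conf/config_big.xml
cp /usr/local/mdmbdrm-<Version Number>/sample/MDMBDRMMatchRuleTemplate.xml /usr/local/conf/matching_rules.xml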

Run the Extractor Utility

You can run the extractor utility to generate a configuration file and a matching rules file. The utility extracts the index, search, and matching rules definitions from the Hub Store and adds the definitions to the matching rules file. The utility also creates a configuration file to which you must manually add the parameters related to the Hadoop distribution, repository, input record layout, and metadata.

Use the following extractor utility based on the MDM Big Data Relationship Management version:
- On MDM Big Data Relationship Management 10.0 HotFix 2 through 10.0 HotFix 5: /usr/local/mdmbdrm-<Version Number>/bin/BDRMExtractConfig.jar
- On MDM Big Data Relationship Management 10.0 HotFix 6 or later: /usr/local/mdmbdrm-<Version Number>/bin/config-tools.jar

Use the following command to run the BDRMExtractConfig.jar extractor utility:

java -Done-jar.class.path=<JDBC Driver Name> -jar <Extractor Utility Name> <Driver Class> <Connection String> <User Name> <Password> <Schema Name> <Rule Set Name>

Use the following command to run the config-tools.jar extractor utility:

java -cp <JDBC Driver Name>:<Extractor Utility Name>:. com.informatica.mdmbde.driver.ExtractConfig <Driver Class> <Connection String> <User Name> <Password> <Schema Name> <Rule Set Name>

Use the following parameters to run the utility:

Extractor Utility Name
   Absolute path and file name of the extractor utility. You can use one of the following values based on the MDM Big Data Relationship Management version:
   - On MDM Big Data Relationship Management 10.0 HotFix 2 through 10.0 HotFix 5: /usr/local/mdmbdrm-<Version Number>/bin/BDRMExtractConfig.jar
   - On MDM Big Data Relationship Management 10.0 HotFix 6 or later: /usr/local/mdmbdrm-<Version Number>/bin/config-tools.jar

JDBC Driver Name
   Absolute path and name of the JDBC driver. Use one of the following values based on the database you use for the Hub Store:
   - Oracle: ojdbc6.jar
   - Microsoft SQL Server: sqljdbc.jar
   - IBM DB2: db2jcc.jar

Driver Class
   Driver class for the JDBC driver. Use one of the following values based on the database you use for the Hub Store:
   - Oracle: oracle.jdbc.driver.OracleDriver
   - Microsoft SQL Server: com.microsoft.sqlserver.jdbc.SQLServerDriver
   - IBM DB2: com.ibm.db2.jcc.DB2Driver

Connection String
   Connection string to access the metadata in the Hub Store.

   Use the following format for the connection string based on the database you use for the Hub Store:
   - Oracle: jdbc:informatica:oracle://<Host Name>:<Port>;SID=<Database Name>
   - Microsoft SQL Server: jdbc:informatica:sqlserver://<Host Name>:<Port>;DatabaseName=<Database Name>
   - IBM DB2: jdbc:informatica:db2://<Host Name>:<Port>;DatabaseName=<Database Name>
   Host Name indicates the name of the server on which you host the database, Port indicates the port through which the database listens, and Database Name indicates the name of the Operational Reference Store (ORS).

User Name
   User name to access the ORS.

Password
   Password for the user name to access the ORS.

Schema Name
   Name of the ORS.

Rule Set Name
   Name of the match rule set. The utility retrieves the match definitions from the rule set and writes the definitions to the matching rules file.

The following sample command runs the config-tools.jar extractor utility in an IBM DB2 environment:

java -cp /root/db2jcc.jar:/usr/local/mdmbdrm-<Version Number>/bin/config-tools.jar:. com.informatica.mdmbde.driver.ExtractConfig com.ibm.db2.jcc.DB2Driver jdbc:db2://mdmauto10:50000/devut DQUSER Password1 DQUSER TP_Match_Rule_Set

The following sample command runs the BDRMExtractConfig.jar extractor utility in an IBM DB2 environment:

java -Done-jar.class.path=/root/db2jcc.jar -jar /usr/local/mdmbdrm-<Version Number>/bin/BDRMExtractConfig.jar com.ibm.db2.jcc.DB2Driver jdbc:db2://mdmauto10:50000/devut DQUSER Password1 DQUSER TP_Match_Rule_Set
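For an Oracle Hub Store, a comparable command would follow the same pattern. The following is a hedged sketch, not from the product documentation: the host name orahub01, port 1521, database name ORCL, schema CMX_ORS, and rule set MRS_FULL are hypothetical placeholder values; substitute your own.

# Hypothetical Oracle example of the config-tools.jar extractor utility.
# Quote the connection string because it contains a semicolon.
java -cp /root/ojdbc6.jar:/usr/local/mdmbdrm-<Version Number>/bin/config-tools.jar:. com.informatica.mdmbde.driver.ExtractConfig oracle.jdbc.driver.OracleDriver "jdbc:informatica:oracle://orahub01:1521;SID=ORCL" CMX_ORS Password1 CMX_ORS MRS_FULL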

Run the Initial Clustering Job

An initial clustering job indexes and links the input data in HDFS and persists the indexed and linked data in HDFS. The initial clustering job also creates the match-pair output files in HDFS that contain a list of records and their matched records.

To run the initial clustering job, on the machine where you install MDM Big Data Relationship Management, run the run_genclusters.sh script located in the following directory: /usr/local/mdmbdrm-<Version Number>

Use the following command to run the run_genclusters.sh script in the initial mode:

run_genclusters.sh --config=configuration_file_name --input=input_file_in_hdfs [--reducer=number_of_reducers] --hdfsdir=working_directory_in_hdfs --rule=matching_rules_file_name [--matchinfo]

You can specify the following options and arguments when you run the run_genclusters.sh script in the initial mode:

--config configuration_file_name
   Absolute path and file name of the configuration file that you create.

--input input_file_in_hdfs
   Absolute path to the input files in HDFS.

--reducer number_of_reducers
   Optional. Number of reducer jobs that you want to run to perform initial clustering. Default is 1.

--hdfsdir working_directory_in_hdfs
   Absolute path to a working directory in HDFS. The initial clustering job uses the working directory to store the output and library files.

--rule matching_rules_file_name
   Absolute path and file name of the matching rules file that you create.

--matchinfo
   Optional. Adds the match score against each matched record. Use the --matchinfo option only if you want to apply post-match rules on the match-pair output data before you migrate the data to the Match table.

For example, the following command runs the initial clustering job in the initial mode:

run_genclusters.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/source10million --reducer=16 --hdfsdir=/usr/hdfs/workingdir --rule=/usr/local/conf/matching_rules.xml --matchinfo

Output Files of the Initial Clustering Job

An initial clustering job indexes and links the input data in HDFS and persists the indexed and linked data in HDFS. The initial clustering job also creates the match-pair output files in HDFS that contain a list of records and their matched records. The format of the data in the match-pair output files is the same as that of the Match table in the Hub Store. You must migrate the data in the match-pair output files to the Match table in the Hub Store.

You can find the output files in the following directories:
- Linked records: <Working Directory of Initial Clustering Job in HDFS>/BDRMClusterGen/<Job ID>/output/dir/pass-join
- Match-pair output files: <Working Directory of Initial Clustering Job in HDFS>/BDRMClusterGen/<Job ID>/output/dir/match-pairs

Each initial clustering job generates a unique ID, and you can identify the job ID based on the time stamp of the <Job ID> directory. You can also use the output log of the initial clustering job to find the directory path of its output files.
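To locate the output of a particular run, you can browse the working directory in HDFS. The following is a minimal sketch that assumes the /usr/hdfs/workingdir working directory used in the examples in this article; the exact file names under the output directories depend on the job.

# List the job ID directories; the most recent time stamp identifies the latest run.
hadoop fs -ls /usr/hdfs/workingdir/BDRMClusterGen

# List the match-pair output files for a specific job ID.
hadoop fs -ls /usr/hdfs/workingdir/BDRMClusterGen/<Job ID>/output/dir/match-pairs

# Preview a few match-pair records before you migrate them to the Match table.
hadoop fs -cat /usr/hdfs/workingdir/BDRMClusterGen/<Job ID>/output/dir/match-pairs/* | head -5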

Migrate the Match-Pair Data from HDFS to the Hub Store

Use the Designer and the Workflow Manager to migrate data from HDFS to the Match table in the Hub Store.
1. Import a source definition for the match-pair output file.
2. Create a mapping with the source instance, an Expression transformation, and a target definition.
3. Run the mapping to migrate the match-pair data to the Match table in the Hub Store.

Step 1. Create a Source Definition for the Match-Pair Output Data

Use the Designer to import a source definition for the match-pair output files. You can use a single source instance in the mapping even if you have multiple match-pair output files that have the same columns. At session run time, you can concatenate all the match-pair output files.

Step 2. Create a Mapping

Use the Designer to create a mapping. A mapping represents the data flow between the source and target instances.
1. Create a mapping and add the source definition that you import to the mapping. When you add a source definition to the mapping, the Designer adds a Source Qualifier transformation for the source definition.
2. Create an Expression transformation.
3. Link the ports between the Source Qualifier transformation and the Expression transformation.
4. Set the date format of all the date columns in the source to the following format: dd-mon-yyyy hh24:mi:ss
The following image shows how to set the date format for a date column:
5. Create a target definition based on the Expression transformation.

When you create a target definition from a transformation, the target type defaults to the type of the source instance. You must update the writer type of the target definition to Relational Writer in the Workflow Manager.
6. Link the ports between the Expression transformation and the target definition.
The following image shows a mapping that migrates data from HDFS to the Match table in the Hub Store:
7. Validate the mapping. The Designer marks a mapping as not valid when it detects errors.

Step 3. Run the Mapping

Use the Workflow Manager to run a mapping.
1. Create a workflow and configure the Integration Service for the workflow.
2. Create a Session task and select the mapping.
3. Edit the session to update the writer type of the target definition to Relational Writer.
The following image shows how to configure the writer type of the target definition:

4. If the match-pair output contains multiple files, perform the following steps:
   a. Create a parameter file with the .par extension and add the following entry to the file:

      [Global]
      $$Param_Filepath=<Match-Pair Output Directory>
      [Parameterization.WF:<Workflow Name>.ST:<Session Task Name>]

      Match-Pair Output Directory indicates the absolute path to the directory that contains the match-pair output, Workflow Name indicates the name of the workflow, and Session Task Name indicates the name of the Session task that you create. For example:

      [Global]
      $$Param_Filepath=/usr/hdfs/workingdir/BDRMClusterGen/MDMBDRM_931211654144593570/output/dir/match-pairs
      [Parameterization.WF:wf_Mapping2_write_match_pairs_MTCH.ST:s_Mapping1a_write_match_pairs_MTCH]

      Create the parameter file to use a single source instance for all the match-pair output files. A command sketch that creates this file follows this procedure.
   b. Edit the session to update the following properties for the parameter file:
      - Absolute path and the file name of the parameter file that you create.
      The following image shows how to specify the absolute path and the file name of the parameter file:

      - File Path property to $$Param_Filepath for the source qualifier.
      The following image shows how to set the File Path property for a source qualifier:
5. Define the $$Param_Filepath variable for the workflow.
The following image shows how to define a variable that you use in the workflow:
6. Start the workflow to migrate the match-pair output data in HDFS to the Match table in the Hub Store.
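For reference, the following is a minimal sketch that creates the parameter file described in step 4a with the example values from this article. The /tmp/match_pairs_session.par path is hypothetical; adjust the folder (Parameterization), workflow, and session names to match your repository.

# Write the parameter file; the target path is a hypothetical example.
cat > /tmp/match_pairs_session.par << 'EOF'
[Global]
$$Param_Filepath=/usr/hdfs/workingdir/BDRMClusterGen/MDMBDRM_931211654144593570/output/dir/match-pairs
[Parameterization.WF:wf_Mapping2_write_match_pairs_MTCH.ST:s_Mapping1a_write_match_pairs_MTCH]
EOF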

Updating the Consolidation Indicator of the Matched Records

In the MDM Hub, the Match job loads the matched record pairs into the Match table and sets the consolidation indicator of the matched records to 2 in the base objects. The indicator 2 indicates the QUEUED_FOR_MERGE state, which means that the records are ready for a merge. Similarly, after you migrate the match-pair output data from HDFS to the Hub Store, you must update the consolidation indicator of the matched records in the base objects to 2.

To update the consolidation indicator of the matched records in the base objects, use one of the following methods:
- Run an SQL query.
- Use the Designer and the Workflow Manager.

For more information about the consolidation indicator, see the Informatica MDM Multidomain Edition Configuration Guide.

Updating the Consolidation Indicator by Running an SQL Query

You can run an SQL query to update the consolidation indicator of the matched records to 2 in the base objects. Use the following SQL query to update the consolidation indicator in the base objects:

Update <Base Object> set consolidation_ind = 2 where consolidation_ind = 3

The consolidation indicator 2 indicates the QUEUED_FOR_MERGE state, and 3 indicates the NOT MERGED or MATCHED state.

For example:

Update C_RL_PARTY_GROUP set consolidation_ind = 2 where consolidation_ind = 3

Updating the Consolidation Indicator by Using the Designer and the Workflow Manager

You can use the Designer and the Workflow Manager to update the consolidation indicator of the matched records in the base objects.
1. Import a source definition for the Match table.
2. Import a target definition for the base object.
3. Create a mapping with two instances of the source definition, a Union transformation, an Expression transformation, an Update Strategy transformation, and a target definition.
4. Run the mapping to update the consolidation indicator of the matched records to 2 in the base object.

Step 1. Create Source Definitions for the Match Table

Use the Designer to import a source definition for the Match table. Based on the data source, you can create the source definition. For example, if the data source is Oracle, create a relational source.

Step 2. Create a Target Definition for the Base Object

Use the Designer to import a target definition for the base object in the Hub Store from which you migrate data to HDFS. Based on the data source, you can create the target definition. For example, if the data source is Oracle, create a relational target.

Step 3. Create a Mapping

Use the Designer to create a mapping. A mapping represents the data flow between the source and target instances.
1. Create a mapping and add two instances of the source definition that you import to the mapping. You can use the two source instances to compare the records and retrieve unique records. When you add the source definitions to the mapping, the Designer adds a Source Qualifier transformation for each source definition.
2. Add one of the following conditions to each Source Qualifier transformation:
   - SELECT distinct <Match Table Name>.ROWID_OBJECT FROM <Match Table Name>
   - SELECT distinct <Match Table Name>.ROWID_OBJECT_MATCHED FROM <Match Table Name>
   Note: Ensure that each Source Qualifier transformation contains a unique condition. The conditions retrieve records that have unique ROWID_OBJECT and ROWID_OBJECT_MATCHED values.
   The following image shows how to configure the condition for the Source Qualifier transformation:
3. Create a Union transformation. A Union transformation merges data from multiple pipelines or pipeline branches into one pipeline branch.
4. Create two input groups for the Union transformation.
5. Link the ports between the Source Qualifier transformations and the Union transformation.
6. Create an Expression transformation. Use the Expression transformation to set the consolidation indicator for the matched records.
7. Link the ports between the Union transformation and the Expression transformation.

8. Set the value of the output_consolidation_ind output port to 2 in the Expression transformation.
The following image shows how to set the output_consolidation_ind expression:
9. Create an Update Strategy transformation. Use the Update Strategy transformation to update the consolidation indicator of the matched records based on the value that you set in the Expression transformation.
10. Link the ports between the Expression transformation and the Update Strategy transformation.
11. Configure the DD_UPDATE constant as the update strategy expression to update the consolidation indicator.
The following image shows how to configure the update strategy expression:

12. Add the target definition that you import to the mapping.
13. Link the ports between the Update Strategy transformation and the target instance.
The following image shows a mapping that updates the consolidation indicator of the matched records in the base object to 2:
14. Validate the mapping. The Designer marks a mapping as not valid when it detects errors.

Step 4. Run the Mapping

Use the Workflow Manager to run a mapping.
1. Create a workflow and configure the Integration Service for the workflow.
2. Create a Session task and select the mapping.
3. Start the workflow to update the consolidation indicator of the matched records to 2 in the base object.

Manage the Incremental Data

After you migrate the match-pair output data from HDFS to the Hub Store, you can manage the incremental data for MDM Multidomain Edition.
1. Migrate the incremental data from the Hub Store to HDFS.
2. Run the initial clustering job in the incremental mode.
3. Migrate the match-pair output data from HDFS to the Match table in the Hub Store.
4. Update the consolidation indicator of the matched records in the base object to 2.

Step 1. Migrate the Incremental Data from the Hub Store to HDFS

Use the Designer and the Workflow Manager to migrate the incremental data from the Hub Store to HDFS.
1. Import source definitions for the base objects in the Hub Store.
2. Create a mapping with the source instances, a Joiner transformation, and a target definition for a flat file in HDFS.
3. Run the mapping to migrate the data from the Hub Store to HDFS.

Step 2. Run the Initial Clustering Job in the Incremental Mode

An initial clustering job indexes and links the input data in HDFS and persists the indexed and linked data in HDFS. The initial clustering job also creates the match-pair output files in HDFS that contain a list of records and their matched records.

To run the initial clustering job in the incremental mode, use the run_genclusters.sh script located in the following directory: /usr/local/mdmbdrm-<Version Number>

Use the following command to run the run_genclusters.sh script in the incremental mode:

run_genclusters.sh --config=configuration_file_name --input=input_file_in_hdfs [--reducer=number_of_reducers] --hdfsdir=working_directory_in_hdfs --rule=matching_rules_file_name --incremental --clustereddirs=indexed_linked_data_directory [--consolidate] [--matchinfo]

You can specify the following options and arguments when you run the run_genclusters.sh script in the incremental mode:

--config configuration_file_name
   Absolute path and file name of the configuration file that you create.

--input input_file_in_hdfs
   Absolute path to the input files in HDFS.

--reducer number_of_reducers
   Optional. Number of reducer jobs that you want to run to perform initial clustering. Default is 1.

--hdfsdir working_directory_in_hdfs
   Absolute path to a working directory in HDFS. The initial clustering job uses the working directory to store the output and library files.

--rule matching_rules_file_name
   Absolute path and file name of the matching rules file that you create.

--incremental
   Runs the initial clustering job in the incremental mode. If you want to incrementally update the indexed and linked data in HDFS, run the job in the incremental mode. By default, the initial clustering job runs in the initial mode.

--clustereddirs indexed_linked_data_directory
   Absolute path to the output files that the previous run of the initial clustering job creates. You can find the output files of the initial clustering job in the following directory: <Working Directory of Initial Clustering Job in HDFS>/BDRMClusterGen/<Job ID>/output/dir/pass-join
   Each initial clustering job generates a unique ID, and you can identify the job ID based on the time stamp of the <Job ID> directory. You can also use the output log of the initial clustering job to find the directory path of its output files.

--consolidate
   Consolidates the incremental data with the existing indexed and linked data in HDFS. By default, the initial clustering job indexes and links only the incremental data.

--matchinfo
   Optional. Adds the match score against each matched record. Use the --matchinfo option only if you want to apply post-match rules on the match-pair output data before you migrate the data to the Match table.

For example, the following command runs the initial clustering job in the incremental mode:

run_genclusters.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/source10million --reducer=16 --hdfsdir=/usr/hdfs/workingdir --rule=/usr/local/conf/matching_rules.xml --clustereddirs=/usr/hdfs/workingdir/BDRMClusterGen/MDMBDRM_931211654144593570/output/dir/pass-join --incremental --matchinfo

You can find the output files of the initial clustering job in the following directories:
- Linked records: <Working Directory of Initial Clustering Job in HDFS>/BDRMClusterGen/<Job ID>/output/dir/pass-join
- Match-pair output files: <Working Directory of Initial Clustering Job in HDFS>/BDRMClusterGen/<Job ID>/output/dir/match-pairs

The format of the data in the match-pair output files is the same as that of the Match table in the Hub Store. You must migrate the data in the match-pair output files to the Hub Store.

Step 3. Migrate the Match-Pair Data from HDFS to the Hub Store

Use the Designer and the Workflow Manager to migrate the match-pair data from HDFS to the Hub Store.
1. Import a source definition for the match-pair output file.
2. Create a mapping with the source instance, an Expression transformation, and a target definition for the Match table in the Hub Store.
3. Run the mapping to migrate the match-pair data from HDFS to the Match table in the Hub Store.

Step 4. Update the Consolidation Indicator of the Matched Records

Run an SQL query or use the Designer and the Workflow Manager to update the consolidation indicator of the matched records in the base objects.

Author

Bharathan Jeyapal
Lead Technical Writer

Acknowledgements

The author would like to thank Vijaykumar Shenbagamoorthy, Vinod Padmanabha Iyer, Krishna Kanth Annamraju, and Venugopala Chedella for their technical assistance.