Deduplication

Records are deduplicated as they are read into the CRS, but it is sometimes useful to deduplicate manually. There are two tabs within the deduplication module:

1. Deduplication: checks records in the segment against other records in the segment and flags potential duplicates.
2. Central Deduplication: checks the records in the segment that have been sent to be published in CENTRAL against existing CENTRAL records and flags possible duplicates.

Deduplication within a segment
    Creating a deduplication filter
    The deduplication process
    Merging duplicates one at a time
    Marking records as not duplicates
    Merging all records
    Deleting a record
CENTRAL deduplication

Deduplication within a segment

The first thing to do is decide on a strategy for deduplicating the records. There is a default filter built into the CRS, which checks the first 35 characters of the title, the year and the first four characters of the author. This filter is selected by default, but it is also possible to edit it and choose specific criteria for deduplication. As many filters as necessary can be created and stored within the local CRS installation.

The default filter works well for most types of records, but it is sometimes useful to explore the other options to find duplicates that do not match the default criteria. Adding a fuzzy search to the Title field, for example, can help find records where the title has been misspelt or has slightly different wording. Fuzzy searching is much slower, so it is not activated by default, but it is well worth exploring.

Creating a deduplication filter

1. Go to View Deduplication Preferences on the toolbar.
2. The deduplication preferences box will then appear. Click Create New, enter a name for the new filter, and then click OK.
3. Check the Match records by subtype box to match records by subtype before deduplicating. For example, if there are records in the register with the subtype BOOK, they will not be deduplicated against records with the subtype JOURNAL.
4. Choose the Record category to be deduplicated (References or Studies) from the drop-down menu.
5. Pick the fields that will give the best chance of a match by checking the box next to each field.
The title field may be the most useful. Options are given to remove punctuation and strip out any brackets, which increases the chances of finding a match between records whose punctuation differs slightly. These options can be selected by checking the boxes. The search can also be restricted to a specific number of characters, so if there are a lot of very long titles that start with the same 35 characters, it might be worth increasing that value to check those records.

A fuzzy search goes through all the records and compares each title, for example, with every other title, one at a time. It works out how different the two titles are and then decides whether they are too different to be the same value with some misspellings, or whether they potentially ought to be the same value. The precision setting determines the cut-off point for making that decision, and the default setting of 38 characters has been found to be a good option for general use.

When matching on Author there is the option to search on the Author surname only, which ignores any author initials in the field. So, for example, Chen XI would match with Chen I, which may be correct, but it would also match Chen XP, which is unlikely to be correct. It is worth experimenting with this feature to find the optimum match.
If Year of publication is included as part of the deduplication filter, just the year can be extracted, leaving out any months, e.g. 1995 Jan would match with 1995. In practice this is likely to be the best setting, because a potential duplicate would not be missed simply because one record had a month and the other did not.

6. Once all the required fields have been selected, click Save to save the filter.
7. The filter will now appear in the drop-down menu at the top of the screen for future use.
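The filter options described in this section (a title prefix with punctuation removed, year only, and an author prefix or surname) can be pictured as building a comparison key for each record. The Python sketch below is illustrative only and is not the CRS implementation: the function names `match_key` and `surname_only`, the argument layout, and the assumption that authors are written as "Surname Initials" are all hypothetical.

```python
import re
import string


def match_key(title, year, author, title_chars=35, author_chars=4):
    """Build a comparison key for one record (hypothetical helper)."""
    # Remove punctuation (this also removes bracket characters),
    # then normalise case and collapse repeated spaces.
    cleaned = title.translate(str.maketrans("", "", string.punctuation))
    cleaned = " ".join(cleaned.lower().split())
    # Keep only a four-digit year, so "1995 Jan" and "1995" agree.
    found = re.search(r"\b(\d{4})\b", year)
    year_only = found.group(1) if found else year.strip()
    # Compare only the first few characters of the title and the author.
    return (cleaned[:title_chars], year_only, author.strip().lower()[:author_chars])


def surname_only(author):
    """Return the surname, assuming a 'Surname Initials' layout."""
    parts = author.split()
    return parts[0].lower() if parts else ""


key_a = match_key("Aspirin for headache: a trial.", "1995 Jan", "Chen XI")
key_b = match_key("Aspirin for headache - a trial", "1995", "Chen XP")
print(key_a == key_b)  # the two records produce the same key
```

Two records whose keys are equal would be flagged as potential duplicates. Note that `surname_only("Chen XI")` and `surname_only("Chen XP")` both give `"chen"`, which is the over-matching risk described for the surname-only option.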
The deduplication process

1. Once a filter has been selected or created, the deduplication process can begin. Choose which fields are displayed on the results screen by selecting Table Template.
2. The Custom Table Template box will then appear. Choose the fields to be displayed by checking the box next to each field, and then click the plus sign to add them to the Selected Fields frame on the right. Arrange the order in which the fields appear by highlighting a field name and using the green arrows to move it up or down.
3. All the records in the segment can be deduplicated, or just the marked records.
The default option is to deduplicate all the records in the segment.
4. Click on the green Process arrow to begin the deduplication process.
5. A progress bar will appear at the bottom of the screen, along with a count of possible duplicates found. It is possible to cancel the process at this stage by clicking on the process icon (Stop deduplication) at the bottom of the screen, to the right of the progress bar. The speed of the process depends on how many records have been chosen for deduplication and how the deduplication filter has been set up. Using fuzzy searching as part of the deduplication process will slow the process considerably.
6. Once the deduplication process is complete, a message displays showing how many records have been deduplicated successfully and how many possible duplicates have been returned.

The results of the deduplication process will be displayed, and the number of possible matches appears in the top right-hand corner. Results are colour-coded: records in the same colour next to each other are potential duplicates.
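The reason fuzzy searching slows the process can be seen in a small sketch: every title has to be compared with every other title, so the number of comparisons grows roughly with the square of the number of records. The example below uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure. The algorithm the CRS actually uses is not described here, and the `threshold` value is a hypothetical cut-off that is unrelated to the CRS precision setting.

```python
from difflib import SequenceMatcher
from itertools import combinations


def fuzzy_pairs(titles, threshold=0.9):
    """Return index pairs of titles that look alike (illustrative only)."""
    pairs = []
    # combinations() visits every unordered pair once: n * (n - 1) / 2 comparisons.
    for (i, a), (j, b) in combinations(enumerate(titles), 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((i, j))
    return pairs


titles = [
    "Aspirin for acute headache",
    "Asprin for acute headache",   # misspelt title
    "Statins in heart failure",
]
print(fuzzy_pairs(titles))  # the misspelt pair (0, 1) is reported
```

With 1,000 records this approach makes 499,500 comparisons, whereas an exact match on a fixed key needs only one pass over the records.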
Merging duplicates one at a time

It is possible to look at the records individually, side by side, and then merge them:

1. Select both potential duplicates by marking the check boxes.
2. Click the Merge icon on the toolbar.
3. The records are displayed side by side. The record marked with an icon is a CENTRAL record. Where field values differ, they are highlighted in yellow.
4. By default, the CRS chooses the field with the most data as the field that will be kept in the merged record. To change this setting, click on the Merge Preferences icon. It is possible to turn off the preference to use the field with the most text, and also the preference to use fields containing non-Latin characters, by unchecking the boxes.
Checking the Merge multiple value fields box turns on the preference to merge together fields that have more than one value. Click Apply to return to the records.
5. It is possible to keep all of the fields in either one of the records by clicking on the Select all fields in this record icon. Notice that all the boxes on the selected record are now ticked and the corresponding boxes on the deselected record are now blank.
6. Alternatively, select individual fields from each record to be included in the merged record by marking the box next to the field in the Use column.
7. When the required fields are selected, the records can be merged by clicking the Merge icon.
8. A message asks you to confirm. Click Yes and the two records will be merged.
9. The merged records now appear in green in the list of results.

Marking records as not duplicates

If the two records being displayed are not deemed to be duplicates, they can be marked as Not Duplicate so that they do not appear in the list again.

1. Select both potential duplicates by marking the check boxes.
2. Select the Not Duplicate icon.
3. The records will now be flagged as unique and will not be displayed in the next deduplication session.
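The default merge preference described under Merging duplicates one at a time (keep whichever field holds the most data, with an option to merge multiple value fields) can be sketched as follows. This is a rough illustration under stated assumptions, not the CRS merge logic: records are modelled as plain dictionaries, multiple value fields as lists, and the names `merge_records` and `merge_multi_value` are hypothetical.

```python
def _amount(value):
    """Amount of data in a field: total characters (lists are summed)."""
    if isinstance(value, list):
        return sum(len(item) for item in value)
    return len(value)


def merge_records(left, right, merge_multi_value=False):
    """Merge two records field by field (illustrative only)."""
    merged = {}
    for field in sorted(set(left) | set(right)):
        a = left.get(field, "")
        b = right.get(field, "")
        if merge_multi_value and isinstance(a, list) and isinstance(b, list):
            # Combine both value lists, skipping values already present.
            merged[field] = a + [item for item in b if item not in a]
        else:
            # Keep the field with the most data; ties go to the left record.
            merged[field] = a if _amount(a) >= _amount(b) else b
    return merged


local = {"title": "Aspirin for headache", "keywords": ["aspirin"]}
central = {
    "title": "Aspirin for headache: a randomised trial",
    "keywords": ["headache", "aspirin"],
}
print(merge_records(local, central, merge_multi_value=True))
```

Selecting fields by hand in the Use column corresponds to overriding these automatic choices for individual fields.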
Merging all records

There is also the option to merge all of the records without viewing them individually, if you are confident that the records are all duplicates and that the default fields are the required fields in the merged record.

Select Mark all records. Notice that only one record of each pair is marked.

1. Click the Merge All icon.
2. A message will be generated: Merging all records will delete all matching records that have been found within the deduplication process from the local database. Do you wish to continue? Click Yes and the records will be merged and appear in green on the deduplication screen.

Deleting a record

If one (or both) of the duplicates is not needed, it can be deleted from your segment.

1. Select the record to be deleted by marking the check box.
2. Click the Delete marked records icon.
3. A message will be generated: Are you sure you want to delete 1 marked record? Click Yes and the record will be deleted.

Tip: It is not possible to delete records which are published in CENTRAL, only records which appear in the user's segment and are not in CENTRAL.

If the deduplication process must be cancelled, click the Restart deduplication process arrow on the toolbar to start again.
When the deduplication process has completed, click the End Deduplication icon.

CENTRAL Deduplication

When a record is sent from the user's segment to be published in CENTRAL, the CRS automatically searches for duplicates. The results are displayed in the CENTRAL deduplication tab. Clicking on a record to highlight it displays the local record (My record value, on the left) and the CENTRAL record (Currently published CENTRAL value, on the right) side by side for comparison.
There are then five options:

1. If the records are not duplicates, click the Add as New CENTRAL record button. This will send the local record to CENTRAL as a new record and it will no longer appear as a duplicate.
2. If the record is a duplicate, but no changes are required to either the local record or the CENTRAL record, click the Add my Group code button, in blue. This will associate the user's group code with the record and cancel the publication of the local record.
3. If the record is a duplicate, but the local version has correct fields and the CENTRAL record has incorrect fields, click the Add group code and update CENTRAL fields button, in green. This will update the CENTRAL record with the correct fields.
4. If the record is a duplicate, but the CENTRAL version has correct fields and the local record has incorrect fields, click the Add group code & copy CENTRAL fields button, in black. This will update the local record with the correct fields and assign your group's code to the CENTRAL record.
5. If the record is a duplicate, but neither version is completely correct, click the button marked Cancel records CENTRAL publication request. It will then be necessary to find the record by searching for it within the CRS, and update the fields before sending the record for publication in CENTRAL again.

The decisions applied to each record will appear as ticked boxes in the corresponding columns next to each record. When the process has been completed for each record, click Submit. The records will then be sent for publication in CENTRAL.

If a record has more than one duplicate, the CRS will display this in the Match Count column. Click on the record to highlight it; this will display both matches in two separate tabs.
Assign one of the five options using the same methods outlined above. Any of the five options can also be applied globally to all of the matches. To do this, use the relevant option in the smaller icons above the list of records.

Tip: When a record is owned by more than one group, the CRS will notify the other group's Trials Search Co-ordinator of a conflict. Both owners must agree changes before they are finalised.
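The five CENTRAL deduplication options amount to a small decision table based on three questions: is the record a duplicate, is the local version correct, and is the CENTRAL version correct? The Python sketch below restates that table as an aid to memory. It is not part of the CRS; the enum member names and the `choose_action` function are hypothetical, and only the button labels are taken from the text above.

```python
from enum import Enum


class CentralAction(Enum):
    # Values are the button labels quoted in the text; member names are made up.
    ADD_AS_NEW = "Add as New CENTRAL record"
    ADD_GROUP_CODE = "Add my Group code"
    UPDATE_CENTRAL = "Add group code and update CENTRAL fields"
    COPY_CENTRAL = "Add group code & copy CENTRAL fields"
    CANCEL_REQUEST = "Cancel records CENTRAL publication request"


def choose_action(is_duplicate, local_correct, central_correct):
    """Map the reviewer's judgement to one of the five options."""
    if not is_duplicate:
        return CentralAction.ADD_AS_NEW
    if local_correct and central_correct:
        return CentralAction.ADD_GROUP_CODE      # no changes required
    if local_correct:
        return CentralAction.UPDATE_CENTRAL      # CENTRAL record is wrong
    if central_correct:
        return CentralAction.COPY_CENTRAL        # local record is wrong
    return CentralAction.CANCEL_REQUEST          # fix locally, then resend


print(choose_action(True, False, True).value)
```

The last branch corresponds to option 5, where the record has to be found and corrected within the CRS before it is sent for publication again.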