Overview of Record Linkage Techniques

LinkageWiz Software

Record linkage, or data matching, is the process of identifying records that relate to the same entity (e.g. patient, customer, household) in one or more data sets that do not share a unique database key.

1.1 Deterministic Record Linkage

The simplest type of linkage involves exact matches on unique identifiers, or on combinations of fields that uniquely identify a given individual. This type of linkage is known as Deterministic Record Linkage: all identifiers must agree for a link to be made. It works well for unique identifiers such as Medical Record Number, Social Security Number or Driver's License Number. However, it performs poorly for non-unique identifiers such as name and date of birth: names are frequently misspelt, nicknames or aliases are used, and dates of birth are often estimated.
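As a minimal sketch of the exact-match idea in section 1.1, the Python below links two lists of records only when every chosen identifier agrees. The function name `deterministic_links`, the dictionary record layout and the `mrn` field are illustrative assumptions, not part of any particular product.

```python
def deterministic_links(records_a, records_b, key_fields):
    """Return (record_a, record_b) pairs whose key fields all agree exactly."""
    # Index the second data set by its identifier values.
    index = {}
    for rec in records_b:
        key = tuple(rec.get(field) for field in key_fields)
        if None in key:
            continue  # a missing identifier can never produce an exact match
        index.setdefault(key, []).append(rec)

    links = []
    for rec in records_a:
        key = tuple(rec.get(field) for field in key_fields)
        if None in key:
            continue
        for match in index.get(key, []):
            links.append((rec, match))
    return links


# Hypothetical example data: the name is misspelt but the identifier agrees.
hospital = [{"mrn": "A1", "name": "Jon Smith"}, {"mrn": "A2", "name": "Ann Lee"}]
clinic = [{"mrn": "A1", "name": "John Smith"}, {"mrn": "A3", "name": "Ann Lee"}]

by_mrn = deterministic_links(hospital, clinic, ["mrn"])
by_name = deterministic_links(hospital, clinic, ["name"])
```

Linking on `mrn` finds the Smith pair despite the spelling difference, while linking on `name` misses it, which illustrates why exact matching suits unique identifiers better than names.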

1.2 Fuzzy Matching

Another method is Fuzzy Matching: partial matches are permitted, and matches are usually determined according to a number of subjective rules created by the user. For example, it might be determined that a pair of records should be linked if the first initial, family name and date of birth agree, or if the first name, family name and address agree but the day, month or year of birth disagrees. While this method appears relatively simple, a large number of rules and exceptions may need to be specified to maximize the accuracy of the linkage process. Alternatively, a simple scoring system is sometimes used.

1.3 Probabilistic Record Linkage

A very popular method is Probabilistic Record Linkage. Unlike deterministic record linkage, typographical differences and other errors do not preclude candidate record pairs from being matched. Probabilities derived from a large reference dataset of known linkages and non-linkages are used to derive Linkage Weights for each field. Separate weights are derived for field agreements, disagreements and missing values. The linkage weights are higher for variables with more specificity (such as family name), and lower for variables with less specificity (such as sex).

The weights are calculated from the logarithm of the frequency ratio of the field being examined, where:

$$\text{Weight} = \log_2 \left( \frac{\text{frequency of agreement in LINKED pairs}}{\text{frequency of agreement in UNLINKED pairs}} \right)$$

Example Calculation - Family Name: If the family name agrees in 90% of linked pairs and only 1% of unlinked pairs, then:

Field Agreement Weight = LOG2(90/1) = 6.49

Example Calculation - Sex: If sex agrees in 95% of linked pairs and 50% of unlinked pairs, then:

Field Agreement Weight = LOG2(95/50) = 0.93

During the linkage process, the agreement or disagreement weight for each field is added to derive a combined score that reflects how likely it is that the records refer to the same entity.
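The weight calculation can be sketched in a few lines of Python. The agreement weight follows the formula in the text. The disagreement weight is not given in the text; the form used here, log2 of the ratio of disagreement rates, is the conventional one and should be read as an assumption. The names `FIELDS` and `pair_score` are hypothetical.

```python
import math


def agreement_weight(m, u):
    """log2 of (agreement rate in linked pairs / agreement rate in unlinked pairs)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Assumed conventional form: log2 of the ratio of disagreement rates."""
    return math.log2((1.0 - m) / (1.0 - u))


# (agreement rate in linked pairs, agreement rate in unlinked pairs),
# using the two worked examples from the text.
FIELDS = {"family_name": (0.90, 0.01), "sex": (0.95, 0.50)}


def pair_score(agreements):
    """Sum the agreement or disagreement weight of every compared field."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELDS[field]
        total += agreement_weight(m, u) if agrees else disagreement_weight(m, u)
    return total


print(round(agreement_weight(0.90, 0.01), 2))  # 6.49
print(round(agreement_weight(0.95, 0.50), 2))  # 0.93
```

Agreement on family name contributes far more to the score than agreement on sex, which matches the point that more specific fields carry higher weights.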
There is usually a threshold above which a pair is considered a match (True Linkage), and another threshold below which it is considered not to be a match (Non-Linkage). Between the two thresholds a pair is considered to be a Potential Linkage and may require manual review by a clerical officer. There are no precise rules for determining these thresholds, as they are affected by a range of factors, including data quality and the characteristics of the population being studied. A histogram of the Linkage Scores can often be useful in displaying patterns in the data, which can subsequently be used to estimate the thresholds.

The following graph indicates the relationship between scores, linkage thresholds and linkage errors:

[Graph: distribution of linkage scores, showing the linkage thresholds and linkage errors]

There are large numbers of record pairs with lower scores (non-linkages), and smaller numbers with higher scores, indicating true linkages. In this example, there appears to be a natural break in the curve at scores higher than 21-23, with a mid-range around 17-19, and a steep increase in the curve below 17. This suggests that true linkages have a score of 21 or higher, whilst potential linkages fall between 17 and 21 and non-linkages have a score of less than 17.

As illustrated in the graph, probabilistic record linkage almost always includes a number of False Positives (records that have been linked but do not belong to the same individual) as well as False Negatives (records that have not been linked but really do belong to the same individual). Increasing the True Linkage threshold will reduce the number of false positives, but at the expense of more false negatives, and vice versa. When undertaking record linkage in a medical setting it is important to set a reasonably high threshold so that results from different patients are not inadvertently combined, whereas for a police investigation the threshold might be lowered to maximize the chance of detecting a specific criminal.
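The two-threshold decision described above can be sketched as follows. The default cut-offs of 17 and 21 are taken from the example in the text and are assumptions for illustration only; in practice they would be estimated from the score histogram of the data being linked. The function name `classify` is hypothetical.

```python
def classify(score, lower=17.0, upper=21.0):
    """Assign a record pair to one of the three outcomes by its linkage score."""
    if score >= upper:
        return "True Linkage"
    if score < lower:
        return "Non-Linkage"
    return "Potential Linkage"  # falls between the thresholds: clerical review


for score in (23.5, 19.0, 4.2):
    print(score, classify(score))
```

Raising `upper` sends more pairs to review or rejection (fewer false positives, more false negatives), while lowering it does the opposite.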

It is not possible to eliminate errors entirely, as reducing one type of error always results in an increase in the other type.

1.3.1 Clerical Review

Clerical review describes the process by which potential linkages are manually reviewed. Additional information may be sought from the original custodian to confirm a possible link, or, if several records for an individual have already been linked, it may be possible to confirm the link using information from the other records. Depending upon the type of data, it may also be important to scan for false positives such as twins or triplets. These cases usually have a high linkage score, and only the first or middle initials differ; all other demographic fields are usually in agreement.

1.3.2 Blocking

Blocking is the process of stratifying the linkage process to reduce the number of comparisons that must be undertaken. If blocking were not used, every record would need to be compared with every other record in the data set, so the number of comparisons would grow with the square of the number of records and severely degrade system performance. The table below indicates the number of comparisons that would be required for three different sized database tables:

Number of Records    Number of Comparisons
1,000                1,000,000
10,000               100,000,000
1,000,000            1,000,000,000,000

Blocking thus subdivides data into a set of mutually exclusive subsets (blocks) under the assumption that no matches occur across different blocks. Blocks are typically based upon fields such as family name, date of birth or business name. As blocking fields may be subject to typographical and spelling errors, they are usually standardized by applying a phonetic coding system such as NYSIIS or by applying a data standardization scheme.

In practice some matches may actually occur across blocks. For example, records for a woman who has changed her family name might not be linked on a block based on family name, but would be linked if a subsequent blocking pass on date of birth was run. For this reason multiple blocking variables are often used, maximizing the likelihood that a linkage missed by one pass will be detected by a subsequent blocking pass.
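A rough sketch of multi-pass blocking is shown below: each pass groups records by one blocking key and only pairs within the same block become candidates for comparison. The record layout, the key functions and the name `candidate_pairs` are illustrative assumptions; a real system would typically block on a phonetic code rather than the raw family name.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key_funcs):
    """Union of within-block index pairs over one blocking pass per key function."""
    pairs = set()
    for key_func in key_funcs:
        blocks = defaultdict(list)
        for idx, rec in enumerate(records):
            key = key_func(rec)
            if key:  # records with a missing blocking value are skipped in this pass
                blocks[key].append(idx)
        for members in blocks.values():
            pairs.update(combinations(members, 2))
    return pairs


# Hypothetical data: records 0 and 2 are the same woman before and after
# a change of family name.
people = [
    {"family_name": "SMITH", "dob": "1980-02-01"},
    {"family_name": "SMITH", "dob": "1975-06-30"},
    {"family_name": "JONES", "dob": "1980-02-01"},
    {"family_name": "BROWN", "dob": "1990-12-12"},
]

by_name = candidate_pairs(people, [lambda r: r["family_name"]])
two_pass = candidate_pairs(people, [lambda r: r["family_name"], lambda r: r["dob"]])
```

The single pass on family name never compares records 0 and 2, whereas adding a second pass on date of birth brings that pair back as a candidate while still avoiding most of the all-pairs comparisons.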

1.3.3 Phonetic Algorithms

A phonetic algorithm is an algorithm for the indexing of words by their pronunciation.

Soundex, the best-known algorithm, was developed to provide a manual filing code for USA Census documents in the early 1900s. Soundex codes are four-character strings composed of one letter followed by three numbers. For example, Johnson → J525.

NYSIIS is a phonetic algorithm devised in 1970 as part of the New York State Identification and Intelligence System, from which it takes its name. The result is a string that can be pronounced by the reader without decoding. Unlike the Soundex algorithm, relative vowel positioning is maintained. For example, Johnson → JASAN.

Metaphone was developed by Lawrence Philips as a response to deficiencies in the Soundex algorithm. It is more accurate than Soundex because it uses a larger set of rules for English pronunciation. For example, Johnson → JNSN.

The main use of phonetic algorithms is during blocking, to ensure that records with common spelling variations are included in the subset of records being compared. Blocking variables are usually based on the phonetic representation of a field rather than the raw values.

Phonetic algorithms should not be used to compare fields, as they do not possess sufficient specificity. For example, two quite different first names share the same Soundex code: John → J500, Jane → J500. In this example, accepting an agreement on Soundex codes would result in a false positive.

When comparing two fields it is more appropriate to use a string comparison algorithm, which typically measures the difference between two strings, also known as the edit distance. Such functions include the Levenshtein distance and the Jaro-Winkler distance.
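To make the John/Jane example concrete, here is a small self-contained sketch of American Soundex alongside a Levenshtein edit distance. Both functions are simplified illustrations written for this overview, under the usual Soundex rules (H and W are skipped, vowels separate repeated codes); they are not taken from any specific linkage product.

```python
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Four-character code: first letter followed by three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept
    return (code + "000")[:4]


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


print(soundex("Johnson"))                 # J525
print(soundex("John"), soundex("Jane"))   # J500 J500
print(levenshtein("JOHN", "JANE"))        # 3
```

The two names collide on their Soundex code, which is acceptable for blocking, but an edit distance of 3 on four-letter strings shows they should not be treated as agreeing when the fields are compared.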

1.3.4 Data Standardization

Standardization of the data prior to linkage is very important for reducing variability and thereby increasing the accuracy of the linkage process. It involves the removal of special characters such as punctuation and extraneous spaces, ensuring the consistent use of upper and lower case, and the removal of invalid numerics and other noise data. Individual words such as address elements are replaced with standardized words or abbreviations. Organizational noise is removed from business names. For example:

Street → St
Acme Motors International Pty Ltd → ACME MOTORS

More Information

For a detailed description of probabilistic linkage techniques, refer to reference material such as the following publication by Howard Newcombe:

Newcombe, H.B. Handbook of Record Linkage Methods for Health and Statistical Studies, Administration and Business. Oxford, U.K.: Oxford University Press.
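Returning to the standardization steps in section 1.3.4, a minimal sketch of them might look like the following. The abbreviation table and the list of organizational noise words are small hypothetical samples chosen to reproduce the examples in the text, and this sketch upper-cases everything for consistency; a production scheme would use much larger, locale-specific tables.

```python
import re

# Hypothetical sample tables, for illustration only.
ADDRESS_ABBREVIATIONS = {"STREET": "ST", "ROAD": "RD", "AVENUE": "AVE"}
BUSINESS_NOISE = {"INTERNATIONAL", "PTY", "LTD"}


def clean(text):
    """Upper-case, replace punctuation with spaces, collapse repeated spaces."""
    text = re.sub(r"[^\w\s]", " ", text.upper())
    return " ".join(text.split())


def standardize_address(text):
    """Replace address words with their standard abbreviations."""
    return " ".join(ADDRESS_ABBREVIATIONS.get(w, w) for w in clean(text).split())


def standardize_business(text):
    """Drop organizational noise words from a business name."""
    return " ".join(w for w in clean(text).split() if w not in BUSINESS_NOISE)


print(standardize_address("12  High Street,"))
print(standardize_business("Acme Motors International Pty. Ltd."))
```

Applying the same functions to both data sets before blocking and comparison means that differences in case, punctuation and common abbreviations no longer count as disagreements.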

Session 6 Population and Housing Censuses; Registers of Population, Dwelling, and Buildings Brunei, August 2017

Session 6 Population and Housing Censuses; Registers of Population, Dwelling, and Buildings Brunei, August 2017 Mariet Tetty Nuryetty mariet@bps.go.id Session 6 Population and Housing Censuses; Registers of Population, Dwelling, and Buildings Brunei, 22-24 August 2017 1. Record Linkage 2. How to do it? As a rule

More information

Probabilistic Scoring Methods to Assist Entity Resolution Systems Using Boolean Rules

Probabilistic Scoring Methods to Assist Entity Resolution Systems Using Boolean Rules Probabilistic Scoring Methods to Assist Entity Resolution Systems Using Boolean Rules Fumiko Kobayashi, John R Talburt Department of Information Science University of Arkansas at Little Rock 2801 South

More information

Privacy Preserving Probabilistic Record Linkage

Privacy Preserving Probabilistic Record Linkage Privacy Preserving Probabilistic Record Linkage Duncan Smith (Duncan.G.Smith@Manchester.ac.uk) Natalie Shlomo (Natalie.Shlomo@Manchester.ac.uk) Social Statistics, School of Social Sciences University of

More information

Applying Phonetic Hash Functions to Improve Record Linking in Student Enrollment Data

Applying Phonetic Hash Functions to Improve Record Linking in Student Enrollment Data Int'l Conf. Information and Knowledge Engineering IKE'15 187 Applying Phonetic Hash Functions to Improve Record Linking in Student Enrollment Data (Research in progress) A. Pei Wang 1, B. Daniel Pullen

More information

Introduction to blocking techniques and traditional record linkage

Introduction to blocking techniques and traditional record linkage Introduction to blocking techniques and traditional record linkage Brenda Betancourt Duke University Department of Statistical Science bb222@stat.duke.edu May 2, 2018 1 / 32 Blocking: Motivation Naively

More information

Introduction Entity Match Service. Step-by-Step Description

Introduction Entity Match Service. Step-by-Step Description Introduction Entity Match Service In order to incorporate as much institutional data into our central alumni and donor database (hereafter referred to as CADS ), we ve developed a comprehensive suite of

More information

Record Linkage 11:35 12:04 (Sharp!)

Record Linkage 11:35 12:04 (Sharp!) Record Linkage 11:35 12:04 (Sharp!) Rich Pinder Los Angeles Cancer Surveillance Program rpinder@usc.edu NAACCR Short Course Central Cancer Registries: Design, Management and Use Presented at the NAACCR

More information

An Ensemble Approach for Record Matching in Data Linkage

An Ensemble Approach for Record Matching in Data Linkage Digital Health Innovation for Consumers, Clinicians, Connectivity and Community A. Georgiou et al. (Eds.) 2016 The authors and IOS Press. This article is published online with Open Access by IOS Press

More information

Record Linkage. with SAS and Link King. Dinu Corbu. Queensland Health Health Statistics Centre Integration and Linkage Unit

Record Linkage. with SAS and Link King. Dinu Corbu. Queensland Health Health Statistics Centre Integration and Linkage Unit Record Linkage with SAS and Link King Dinu Corbu Queensland Health Health Statistics Centre Integration and Linkage Unit Presented at Queensland Users Exploring SAS Technology QUEST 4 June 2009 Basics

More information

C exam.34q C IBM InfoSphere QualityStage v9.1 Solution Developer

C exam.34q   C IBM InfoSphere QualityStage v9.1 Solution Developer C2090-304.exam.34q Number: C2090-304 Passing Score: 800 Time Limit: 120 min C2090-304 IBM InfoSphere QualityStage v9.1 Solution Developer Exam A QUESTION 1 You re-ran a job to update the standardized data.

More information

Proceedings of the Eighth International Conference on Information Quality (ICIQ-03)

Proceedings of the Eighth International Conference on Information Quality (ICIQ-03) Record for a Large Master Client Index at the New York City Health Department Andrew Borthwick ChoiceMaker Technologies andrew.borthwick@choicemaker.com Executive Summary/Abstract: The New York City Department

More information

dtalink Faster probabilistic record linking and deduplication methods in Stata for large data files Keith Kranker

dtalink Faster probabilistic record linking and deduplication methods in Stata for large data files Keith Kranker dtalink Faster probabilistic record linking and deduplication methods in Stata for large data files Presentation at the 2018 Stata Conference Columbus, Ohio July 20, 2018 Keith Kranker Abstract Stata users

More information

Overview of Record Linkage for Name Matching

Overview of Record Linkage for Name Matching Overview of Record Linkage for Name Matching W. E. Winkler, william.e.winkler@census.gov NSF Workshop, February 29, 2008 Outline 1. Components of matching process and nuances Match NSF file of Ph.D. recipients

More information

Maximizing Statistical Interactions Part II: Database Issues Provided by: The Biostatistics Collaboration Center (BCC) at Northwestern University

Maximizing Statistical Interactions Part II: Database Issues Provided by: The Biostatistics Collaboration Center (BCC) at Northwestern University Maximizing Statistical Interactions Part II: Database Issues Provided by: The Biostatistics Collaboration Center (BCC) at Northwestern University While your data tables or spreadsheets may look good to

More information

Record Linkage using Probabilistic Methods and Data Mining Techniques

Record Linkage using Probabilistic Methods and Data Mining Techniques Doi:10.5901/mjss.2017.v8n3p203 Abstract Record Linkage using Probabilistic Methods and Data Mining Techniques Ogerta Elezaj Faculty of Economy, University of Tirana Gloria Tuxhari Faculty of Economy, University

More information

Techniques for Large Scale Data Linking in SAS. By Damien John Melksham

Techniques for Large Scale Data Linking in SAS. By Damien John Melksham Techniques for Large Scale Data Linking in SAS By Damien John Melksham What is Data Linking? Called everything imaginable: Data linking, record linkage, mergepurge, entity resolution, deduplication, fuzzy

More information

Data Linkage Methods: Overview of Computer Science Research

Data Linkage Methods: Overview of Computer Science Research Data Linkage Methods: Overview of Computer Science Research Peter Christen Research School of Computer Science, ANU College of Engineering and Computer Science, The Australian National University, Canberra,

More information

Data Linkages - Effect of Data Quality on Linkage Outcomes

Data Linkages - Effect of Data Quality on Linkage Outcomes Data Linkages - Effect of Data Quality on Linkage Outcomes Anders Alexandersson July 27, 2016 Anders Alexandersson Data Linkages - Effect of Data Quality on Linkage OutcomesJuly 27, 2016 1 / 13 Introduction

More information

Grouping methods for ongoing record linkage

Grouping methods for ongoing record linkage Grouping methods for ongoing record linkage Sean M. Randall sean.randall@curtin.edu.au James H. Boyd j.boyd@curtin.edu.au Anna M. Ferrante a.ferrante@curtin.edu.au Adrian P. Brown adrian.brown@curtin.edu.au

More information

Automatic training example selection for scalable unsupervised record linkage

Automatic training example selection for scalable unsupervised record linkage Automatic training example selection for scalable unsupervised record linkage Peter Christen Department of Computer Science, The Australian National University, Canberra, Australia Contact: peter.christen@anu.edu.au

More information

Data linkages in PEDSnet

Data linkages in PEDSnet 2016/2017 CRISP Seminar Series - Part IV Data linkages in PEDSnet Toan C. Ong, PhD Assistant Professor Department of Pediatrics University of Colorado, Anschutz Medical Campus Content Record linkage background

More information

Record Linkage. BYU ScholarsArchive. Brigham Young University. Stasha Ann Bown Larsen Brigham Young University - Provo. All Theses and Dissertations

Record Linkage. BYU ScholarsArchive. Brigham Young University. Stasha Ann Bown Larsen Brigham Young University - Provo. All Theses and Dissertations Brigham Young University BYU ScholarsArchive All Theses and Dissertations 2013-12-11 Record Linkage Stasha Ann Bown Larsen Brigham Young University - Provo Follow this and additional works at: https://scholarsarchive.byu.edu/etd

More information

Linking Patients in PDMP Data

Linking Patients in PDMP Data Linking Patients in PDMP Data Kentucky All Schedule Prescription Electronic Reporting (KASPER) PDMP Training & Technical Assistance Center Webinar October 15, 2014 Jean Hall Lindsey Pierson Office of Administrative

More information

Use of Synthetic Data in Testing Administrative Records Systems

Use of Synthetic Data in Testing Administrative Records Systems Use of Synthetic Data in Testing Administrative Records Systems K. Bradley Paxton and Thomas Hager ADI, LLC 200 Canal View Boulevard, Rochester, NY 14623 brad.paxton@adillc.net, tom.hager@adillc.net Executive

More information

The effect of data cleaning on record linkage quality

The effect of data cleaning on record linkage quality Randall et al. BMC Medical Informatics and Decision Making 2013, 13:64 RESEARCH ARTICLE Open Access The effect of data cleaning on record linkage quality Sean M Randall *, Anna M Ferrante, James H Boyd

More information

jellyfish Documentation

jellyfish Documentation jellyfish Documentation Release 0.5.6 James Turk December 01, 2016 Contents 1 Overview 1 1.1 Phonetic Encoding............................................ 1 1.1.1 American Soundex.......................................

More information

RLC RLC RLC. Merge ToolBox MTB. Getting Started. German. Record Linkage Software, Version RLC RLC RLC. German. German.

RLC RLC RLC. Merge ToolBox MTB. Getting Started. German. Record Linkage Software, Version RLC RLC RLC. German. German. German RLC German RLC German RLC Merge ToolBox MTB German RLC Record Linkage Software, Version 0.742 Getting Started German RLC German RLC 12 November 2012 Tobias Bachteler German Record Linkage Center

More information

United Council for Neurologic Subspecialties Examination Registration and Testing Guidelines

United Council for Neurologic Subspecialties Examination Registration and Testing Guidelines United Council for Neurologic Subspecialties Examination Registration and Testing Guidelines VERY IMPORTANT INFORMATION This message serves as your notification to register for the 2018 UCNS Behavioral

More information

IBM InfoSphere Master Data Management Version 11 Release 5. IBM InfoSphere MDM Inspector User's Guide IBM SC

IBM InfoSphere Master Data Management Version 11 Release 5. IBM InfoSphere MDM Inspector User's Guide IBM SC IBM InfoSphere Master Data Management Version 11 Release 5 IBM InfoSphere MDM Inspector User's Guide IBM SC27-6720-01 IBM InfoSphere Master Data Management Version 11 Release 5 IBM InfoSphere MDM Inspector

More information

IBM InfoSphere MDM Inspector User's Guide

IBM InfoSphere MDM Inspector User's Guide IBM InfoSphere Master Data Management Version 11 Release 0 IBM InfoSphere MDM Inspector User's Guide GI13-2653-00 IBM InfoSphere Master Data Management Version 11 Release 0 IBM InfoSphere MDM Inspector

More information

Adaptive Temporal Entity Resolution on Dynamic Databases

Adaptive Temporal Entity Resolution on Dynamic Databases Adaptive Temporal Entity Resolution on Dynamic Databases Peter Christen 1 and Ross Gayler 2 1 Research School of Computer Science, ANU College of Engineering and Computer Science, The Australian National

More information

Unsupervised Duplicate Detection (UDD) Of Query Results from Multiple Web Databases. A Thesis Presented to

Unsupervised Duplicate Detection (UDD) Of Query Results from Multiple Web Databases. A Thesis Presented to Unsupervised Duplicate Detection (UDD) Of Query Results from Multiple Web Databases A Thesis Presented to The Faculty of the Computer Science Program California State University Channel Islands In (Partial)

More information

Match Engine Reference Release

Match Engine Reference Release [1]Oracle Healthcare Master Person Index Match Engine Reference Release 2.0.11 E25254-04 April 2016 Oracle Healthcare Master Person Index Match Engine Reference, Release 2.0.11 E25254-04 Copyright 2011,

More information

Week 2: Frequency distributions

Week 2: Frequency distributions Types of data Health Sciences M.Sc. Programme Applied Biostatistics Week 2: distributions Data can be summarised to help to reveal information they contain. We do this by calculating numbers from the data

More information

Quality and Complexity Measures for Data Linkage and Deduplication

Quality and Complexity Measures for Data Linkage and Deduplication Quality and Complexity Measures for Data Linkage and Deduplication Peter Christen and Karl Goiser Department of Computer Science, The Australian National University, Canberra ACT 0200, Australia {peter.christen,karl.goiser}@anu.edu.au

More information

Private Record Linkage

Private Record Linkage Undefined 0 (2016) 1 1 IOS Press Private Record Linkage An analysis of the accuracy, efficiency, and security of selected techniques for name matching Pawel Grzebala and Michelle Cheatham Wright State

More information

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS TRIE BASED METHODS FOR STRING SIMILARTIY JOINS Venkat Charan Varma Buddharaju #10498995 Department of Computer and Information Science University of MIssissippi ENGR-654 INFORMATION SYSTEM PRINCIPLES RESEARCH

More information

Duplicate Constituents and Merge Tasks Guide

Duplicate Constituents and Merge Tasks Guide Duplicate Constituents and Merge Tasks Guide 06/12/2017 Altru 4.96 Duplicate Constituents and Merge Tasks US 2017 Blackbaud, Inc. This publication, or any part thereof, may not be reproduced or transmitted

More information

Comparison of Online Record Linkage Techniques

Comparison of Online Record Linkage Techniques International Research Journal of Engineering and Technology (IRJET) e-issn: 2395-0056 Volume: 02 Issue: 09 Dec-2015 p-issn: 2395-0072 www.irjet.net Comparison of Online Record Linkage Techniques Ms. SRUTHI.

More information

Single Error Analysis of String Comparison Methods

Single Error Analysis of String Comparison Methods Single Error Analysis of String Comparison Methods Peter Christen Department of Computer Science, Australian National University, Canberra ACT 2, Australia peter.christen@anu.edu.au Abstract. Comparing

More information

Package phonics. February 13, Type Package Title Phonetic Spelling Algorithms Version Date Encoding UTF-8

Package phonics. February 13, Type Package Title Phonetic Spelling Algorithms Version Date Encoding UTF-8 Package phonics February 13, 2018 Type Package Title Phonetic Spelling Algorithms Version 1.0.0 Date 2018-02-13 Encoding UTF-8 URL https://jameshoward.us/software/phonics/, https://github.com/howardjp/phonics

More information

Enriching Knowledge Domain Visualizations: Analysis of a Record Linkage and Information Fusion Approach to Citation Data

Enriching Knowledge Domain Visualizations: Analysis of a Record Linkage and Information Fusion Approach to Citation Data Enriching Knowledge Domain Visualizations: Analysis of a Record Linkage and Information Fusion Approach to Citation Data Marie B. Synnestvedt, MSEd 1, 2 1 Drexel University College of Information Science

More information

IBM Initiate Inspector Version 10 Release 0. User's Guide GI

IBM Initiate Inspector Version 10 Release 0. User's Guide GI IBM Initiate Inspector Version 10 Release 0 User's Guide GI13-2604-00 IBM Initiate Inspector Version 10 Release 0 User's Guide GI13-2604-00 Note Before using this information and the product that it supports,

More information

Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records

Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records Ted Enamorado Benjamin Fifield Kosuke Imai Princeton University Talk at Seoul National University Fifth Asian Political

More information

Similarity Analysis of Patients Data: Bangladesh Perspective

Similarity Analysis of Patients Data: Bangladesh Perspective Bangladesh University of Engineering and Technology From the SelectedWorks of Shahidul Islam Khan December 17, 2016 Similarity Analysis of Patients Data: Bangladesh Perspective Shahidul Islam Khan, Bangladesh

More information

Exam : C : IBM InfoSphere Quality Stage v8 Examination. Title. Version : DEMO

Exam : C : IBM InfoSphere Quality Stage v8 Examination. Title. Version : DEMO Exam : C2090-419 Title : IBM InfoSphere Quality Stage v8 Examination Version : DEMO 1. When running Word Investigation, producing a pattern report will help you do what? A. Refine a standardization rule

More information

Private Record linkage: Comparison of selected techniques for name matching

Private Record linkage: Comparison of selected techniques for name matching Private Record linkage: Comparison of selected techniques for name matching Pawel Grzebala and Michelle Cheatham DaSe Lab, Wright State University, Dayton OH 45435, USA, grzebala.2@wright.edu, michelle.cheatham@wright.edu

More information

Security Control Methods for Statistical Database

Security Control Methods for Statistical Database Security Control Methods for Statistical Database Li Xiong CS573 Data Privacy and Security Statistical Database A statistical database is a database which provides statistics on subsets of records OLAP

More information

USPTO INVENTOR DISAMBIGUATION

USPTO INVENTOR DISAMBIGUATION Team Member: Yang GuanCan Zhang Jing Cheng Liang Zhang HaiChao Lv LuCheng Wang DaoRen USPTO INVENTOR DISAMBIGUATION Institute of Scientific and Technical Information of China SEP 20, 2015 Content 1. Data

More information

Understanding the Master Index Match Engine

Understanding the Master Index Match Engine Understanding the Master Index Match Engine Sun Microsystems, Inc. 4150 Network Circle Santa Clara, CA 95054 U.S.A. Part No: 820 4000 15 December 2008 Copyright 2008 Sun Microsystems, Inc. 4150 Network

More information

Patient Matching A-Z Wednesday, March 2nd 2016

Patient Matching A-Z Wednesday, March 2nd 2016 Patient Matching A-Z Wednesday, March 2nd 2016 Adam W. Culbertson, Innovator-in-Residence HHS, HIMSS Overview Overview of Innovator-in-Residence Program Background on Patient Matching Challenges to Matching

More information

Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records

Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records Kosuke Imai Princeton University Talk at SOSC Seminar Hong Kong University of Science and Technology June 14, 2017 Joint

More information

Evaluating String Comparator Performance for Record Linkage William E. Yancey Statistical Research Division U.S. Census Bureau

Evaluating String Comparator Performance for Record Linkage William E. Yancey Statistical Research Division U.S. Census Bureau Evaluating String Comparator Performance for Record Linkage William E. Yancey Statistical Research Division U.S. Census Bureau KEY WORDS string comparator, record linkage, edit distance Abstract We compare

More information

Get Better Genealogical Results from

Get Better Genealogical Results from S.C. Computer / Genealogy Special Interest Group Karen Ristic Get Better Genealogical Results from Part 1: Basic Search Strategies March 14, 2013 2013 Karen Ristic A. What is Google? 1. Meaning 1. Googol

More information

Entity Resolution, Clustering Author References

Entity Resolution, Clustering Author References , Clustering Author References Vlad Shchogolev vs299@columbia.edu May 1, 2007 Outline 1 What is? Motivation 2 Formal Definition Efficieny Considerations Measuring Text Similarity Other approaches 3 Clustering

More information

CountryData Technologies for Data Exchange. SDMX Information Model: An Introduction

CountryData Technologies for Data Exchange. SDMX Information Model: An Introduction CountryData Technologies for Data Exchange SDMX Information Model: An Introduction SDMX Information Model An abstract model, from which actual implementations are derived. Implemented in XML and GESMES,

More information

Patient Identity Integrity Toolkit

Patient Identity Integrity Toolkit Patient Identity Integrity Toolkit Patient Identity Integrity Key Performance Indicators 1.0 Introduction It is well established that Patient Identity (PI) Integrity impacts the success and effectiveness

More information

9.2 Types of Errors in Hypothesis testing

9.2 Types of Errors in Hypothesis testing 9.2 Types of Errors in Hypothesis testing 1 Mistakes we could make As I mentioned, when we take a sample we won t be 100% sure of something because we do not take a census (we only look at information

More information

Locating People Using Advanced Person Search

Locating People Using Advanced Person Search Locating People Using Advanced Person Search Advanced Person Search allows you to include additional information about your subject, such as a relative name or previous state of residence, or even use

More information

DIRECT CERTIFICATION

DIRECT CERTIFICATION DIRECT CERTIFICATION New Jersey Department of Agriculture Division of Food and Nutrition School Nutrition Programs TABLE OF CONTENTS INTRODUCTION 3 ACCESS TO SCHOOL NUTRITION ELECTRONINC APPLICATION &

More information
