Data Linkages - Effect of Data Quality on Linkage Outcomes


Anders Alexandersson, July 27, 2016

Introduction

Data linkage synonyms: record linkage, record matching, re-identification, entity heterogeneity, and merge/purge.

Aim: Determine the true match status of each comparison pair: a match if the records belong to the same individual, and a non-match if they belong to different individuals.

Use linkage criteria to assign a link status to each comparison pair: a link if the records are classified as belonging to the same individual, and a non-link if they are classified as belonging to different individuals.

The Problem

Ideally, all matches are classified as links, and all non-matches are classified as non-links. This presentation will demonstrate how data quality affects linkage outcomes.

There are two types of possible errors:
Type 1: False matches = linked non-matches ("false positives")
Type 2: Missed matches = non-linked matches ("false negatives")
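The four possible outcomes for a comparison pair follow directly from its match status and its link status. As a minimal sketch in Python (the slides themselves use R and Stata), assuming each pair is represented as a hypothetical (is_match, is_link) tuple of booleans:

```python
from collections import Counter


def outcome(is_match, is_link):
    """Name the outcome for one comparison pair."""
    if is_link:
        # A linked pair is either a true link or a false match.
        return "TP" if is_match else "FP"
    # A non-linked pair is either a missed match or a true non-link.
    return "FN" if is_match else "TN"


def tally(pairs):
    """Count the four outcomes over an iterable of (is_match, is_link) tuples."""
    counts = Counter(outcome(m, l) for m, l in pairs)
    return {key: counts.get(key, 0) for key in ("TP", "FP", "FN", "TN")}


# Hypothetical toy data: five comparison pairs.
pairs = [(True, True), (True, False), (False, True), (False, False), (False, False)]
print(tally(pairs))  # {'TP': 1, 'FP': 1, 'FN': 1, 'TN': 2}
```

In practice the true match status is rarely known for every pair, which is why the quality of the evaluation data matters.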

The Table of Confusion

The four outcomes can be displayed in a 2x2 "table of confusion" or "error matrix".

Figure 1: Table of Confusion

Linkage Quality Measures

Match status errors:
True positive rate (TPR), matching rate, sensitivity, power = TP / Matches
False negative rate (FNR), miss rate, beta (alpha in R) error = FN / Matches
False positive rate (FPR), false match rate, alpha (beta in R) error = FP / Non-matches
True negative rate (TNR) or specificity = TN / Non-matches

Linkage errors:
Positive predictive value (PPV) or precision = TP / Links
False discovery rate, false match rate (again!) = FP / Links
False omission rate = FN / Non-links
Negative predictive value (NPV) = TN / Non-links

Record pairs quality measures:
Accuracy = (TP + TN) / Record pairs
Prevalence = Matches / Record pairs
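The rate measures above can all be derived from the four cell counts. The following Python sketch is one way to compute them; the function name, the dictionary keys, and the example counts are illustrative assumptions, not taken from the slides.

```python
def linkage_measures(tp, fp, fn, tn):
    """Compute linkage quality measures from the four outcome counts."""
    matches, non_matches = tp + fn, fp + tn   # true match status totals
    links, non_links = tp + fp, fn + tn       # assigned link status totals
    pairs = matches + non_matches             # all record pairs

    def ratio(num, den):
        # Return NaN rather than failing when a denominator is zero.
        return num / den if den else float("nan")

    return {
        "TPR": ratio(tp, matches),
        "FNR": ratio(fn, matches),
        "FPR": ratio(fp, non_matches),
        "TNR": ratio(tn, non_matches),
        "PPV": ratio(tp, links),
        "FDR": ratio(fp, links),
        "FOR": ratio(fn, non_links),
        "NPV": ratio(tn, non_links),
        "accuracy": ratio(tp + tn, pairs),
    }


# Hypothetical counts for 1,000 record pairs.
m = linkage_measures(tp=90, fp=10, fn=10, tn=890)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Because non-matches usually far outnumber matches among record pairs, accuracy and TNR tend to look high even for a poor linkage, so PPV and TPR are often the more informative pair.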

The Solution: Probabilistic Record Linkage

Probabilistic record linkage uses probabilities to decide the link status of each comparison pair. This improves on traditional, simple rule-based, deterministic record linkage. The standard reference is Fellegi and Sunter (1969).

FPR = FP / Non-matches = u-probability
TPR = TP / Matches = m-probability

In practice, the process involves three key steps:
1. Preprocessing
2. Linking
3. Clerical review
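In the Fellegi-Sunter framework, the m- and u-probabilities of a field are commonly turned into an agreement weight log2(m/u) and a disagreement weight log2((1-m)/(1-u)), and the weights are summed over fields to score a pair. The Python sketch below illustrates that calculation; the field names and probability values are hypothetical, and this is not the internal computation of any particular package.

```python
import math


def field_weights(m, u):
    """Return (agreement weight, disagreement weight) for one field.

    m = P(field agrees | pair is a match)
    u = P(field agrees | pair is a non-match)
    Both must lie strictly between 0 and 1.
    """
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree


def pair_weight(agreements, probs):
    """Sum field weights for one comparison pair.

    agreements: {field: True/False}, probs: {field: (m, u)}.
    """
    total = 0.0
    for field, agrees in agreements.items():
        agree, disagree = field_weights(*probs[field])
        total += agree if agrees else disagree
    return total


# Hypothetical m- and u-probabilities for two fields.
probs = {"sex": (0.95, 0.50), "last_name": (0.90, 0.01)}
print(round(pair_weight({"sex": True, "last_name": True}, probs), 2))
print(round(pair_weight({"sex": True, "last_name": False}, probs), 2))
```

A field such as last name, which rarely agrees by chance, contributes a much larger agreement weight than a field such as sex, which agrees by chance about half the time.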

Step 1: Preprocessing

Typically, preprocessing consists of two substeps:
1. Parse a field (variable, column) into the relevant subcomponents.
2. Standardize common character strings.

Several data linkage software packages have no preprocessing features. Examples are BigMatch and the R package RecordLinkage. For preprocessing, any good statistical software will work. We use the NYSIIS phonetic code to handle spelling mistakes in names.
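The two substeps can be sketched in Python as follows. This is only an illustration under stated assumptions: the input is a single name field in a hypothetical "Last, First Middle Suffix" layout, the suffix list is a made-up example, and the phonetic (NYSIIS) coding step is not shown.

```python
import re

# Hypothetical list of name suffixes to separate from the given names.
SUFFIXES = {"JR", "SR", "II", "III"}


def standardize(text):
    """Uppercase, drop everything except letters and spaces, collapse spaces."""
    text = re.sub(r"[^A-Z ]", "", text.upper())
    return re.sub(r"\s+", " ", text).strip()


def parse_name(raw):
    """Parse 'Last, First Middle Suffix' into standardized subcomponents."""
    last, _, rest = raw.partition(",")
    tokens = standardize(rest).split()
    suffix = tokens.pop() if tokens and tokens[-1] in SUFFIXES else ""
    first = tokens[0] if tokens else ""
    middle_initial = tokens[1][0] if len(tokens) > 1 else ""
    return {
        "last": standardize(last),
        "first": first,
        "mi": middle_initial,
        "suffix": suffix,
    }


print(parse_name("O'Brien,  Robert j. Jr."))
# {'last': 'OBRIEN', 'first': 'ROBERT', 'mi': 'J', 'suffix': 'JR'}
```

Removing punctuation outright also joins hyphenated surnames into one token; whether that is desirable depends on how the files being linked record such names.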

Example code

Here is example code in Stata. The original data are in R.

    . R: load("rldata500.rda")
    . R: load.data(rldata500)
    . decode fname_c1, gen(fname_c1s)
    . nysiis fname_c1s, gen(nysf)
    . list fname_c1 nysf in 1/

         fname_c1   nysf
      1. CARSTEN    carstan
      2. GERD       gad
      3. ROBERT     rabad

Step 2: Linking

At FCDS, we use the user-written R package RecordLinkage for the linking. We previously used the software AutoMatch.

Example code in R:

    rpairs <- compare.linkage(rmort1, rpatient1, blockfld = c("ssn", "sex"),
        strcmp = 4:7, exclude = c("pid", "address", "st", "county", "zip", "mi"))
    rpairs$pairs[c(1:5), ]             # (list obs 1-5, comparison pattern only)
    rpairs <- emWeights(rpairs)        # (calculate EM weights)
    summary(rpairs)                    # (show weight distribution ### pairs)
    tail(getPairs(rpairs, 40, 30))     # review obs to determine thresholds
    result <- emClassify(rpairs, 40, 30)   # classification
    summary(result)
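The two thresholds passed to the classification step (40 and 30 in the example) split the comparison pairs into three groups by weight: links, possible links that go to clerical review, and non-links. A minimal Python sketch of that rule follows; how ties at exactly the threshold values are handled is an assumption here, and the example weights are invented.

```python
def classify(weight, upper, lower):
    """Assign a link status to one comparison pair from its weight."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if weight >= upper:
        return "link"
    if weight >= lower:
        return "possible link"   # candidate for clerical review
    return "non-link"


# Hypothetical weights for three comparison pairs, thresholds as in the example.
for w in (45.2, 33.0, 12.5):
    print(w, classify(w, upper=40, lower=30))
```

Raising the upper threshold trades false matches for more clerical review work, while lowering the lower threshold trades missed matches for the same cost.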

Example output in R

Figure 2: Linkage result in R

Step 3: Clerical Review

At FCDS, we use the user-written Stata command clrevmatch for the clerical review.

Example code in Stata:

    clrevmatch using cler_reviewed_14jul2016, idm(mort_id) idu(pat_id) ///
        varm(pass mort_id id1 fname_1 lname_1 ssn_1 dob_1 sex_1 race_1) ///
        varu(pass pat_id id2 fname_2 lname_2 ssn_2 dob_2 sex_2 race_2) ///
        clrev_result(crev) clrev_label(0 "not match" 1 "match") ///
        clrev_note(crnote) ///
        rlscoremin(30) rlscoremax(45) reclinkscore(weight) ///
        nobssave(1) replace saveold

Data Linkage Requirements

1. At a minimum, the following information is required to link records with FCDS: First name, Last name, Sex, and Date of Birth and/or Social Security Number.
2. Additional information such as Middle Initial, Alias Name, Maiden Name, Race, Street Address, City, State, Zip Code and Birthplace improves linkage outcomes.
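The minimum requirement can be checked before a file is submitted for linkage. The Python sketch below shows one way to do so; the field names (first_name, last_name, sex, dob, ssn) are hypothetical and would need to match the actual file layout.

```python
# Hypothetical field names for the always-required identifiers.
REQUIRED = ("first_name", "last_name", "sex")


def missing_minimum_fields(record):
    """Return the minimum linkage fields that are absent or blank in a record."""

    def present(key):
        value = record.get(key)
        return value is not None and str(value).strip() != ""

    missing = [key for key in REQUIRED if not present(key)]
    # Date of birth and/or Social Security Number: at least one must be present.
    if not (present("dob") or present("ssn")):
        missing.append("dob or ssn")
    return missing


record = {"first_name": "GERD", "last_name": "", "sex": "M", "dob": "1950-01-31"}
print(missing_minimum_fields(record))  # ['last_name']
```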

Conclusion

Data quality is central to data linkage outcomes!
1. Quality of identifiers: Most important.
2. Quality of linkage methods: Probabilistic linkage is recommended but has limitations.
3. Quality of evaluation: A clerical review note is better than usual. Match-status data would be best.

Future work:
1. Improve existing code template. For example, Stata users can use more efficient code with command Rcall than with rsource.
2. Learn more R to better understand the package RecordLinkage. For example, it is possible but very challenging to create match-status data. R users can use Stata code with the package RStata.
3. Stay on top of methods. Examples are machine learning and literate programming.
4. Stay on top of software developments. For instance, a new version of LinkPlus is expected this year.


Data Mining D E C I S I O N T R E E. Matteo Golfarelli Data Mining D E C I S I O N T R E E Matteo Golfarelli Decision Tree It is one of the most widely used classification techniques that allows you to represent a set of classification rules with a tree. Tree:

More information

10 Classification: Evaluation

10 Classification: Evaluation CSE4334/5334 Data Mining 10 Classification: Evaluation Chengkai Li Department of Computer Science and Engineering University of Texas at Arlington Fall 2018 (Slides courtesy of Pang-Ning Tan, Michael Steinbach

More information

Enter Background Check Request

Enter Background Check Request DFPS Enter Background Check Request A step-by-step guide for Designated ABCS Representatives Department of Family and Protective Services 1/29/2015 Enter Background Check in ABCS This tip sheet will show

More information

EVALUATIONS OF THE EFFECTIVENESS OF ANOMALY BASED INTRUSION DETECTION SYSTEMS BASED ON AN ADAPTIVE KNN ALGORITHM

EVALUATIONS OF THE EFFECTIVENESS OF ANOMALY BASED INTRUSION DETECTION SYSTEMS BASED ON AN ADAPTIVE KNN ALGORITHM EVALUATIONS OF THE EFFECTIVENESS OF ANOMALY BASED INTRUSION DETECTION SYSTEMS BASED ON AN ADAPTIVE KNN ALGORITHM Assosiate professor, PhD Evgeniya Nikolova, BFU Assosiate professor, PhD Veselina Jecheva,

More information

A NOVEL ALGORITHM FOR THE AUTHENTICATION OF INDIVIDUALS THROUGH RETINAL VASCULAR PATTERN RECOGNITION

A NOVEL ALGORITHM FOR THE AUTHENTICATION OF INDIVIDUALS THROUGH RETINAL VASCULAR PATTERN RECOGNITION XX IMEKO World Congress Metrology for Green Growth September 9 14, 2012, Busan, Republic of Korea A NOVEL ALGORITHM FOR THE AUTHENTICATION OF INDIVIDUALS THROUGH RETINAL VASCULAR PATTERN RECOGNITION L.

More information

Data Mining and Knowledge Discovery: Practice Notes

Data Mining and Knowledge Discovery: Practice Notes Data Mining and Knowledge Discovery: Practice Notes Petra Kralj Novak Petra.Kralj.Novak@ijs.si 8.11.2017 1 Keywords Data Attribute, example, attribute-value data, target variable, class, discretization

More information

Human Object Classification in Daubechies Complex Wavelet Domain

Human Object Classification in Daubechies Complex Wavelet Domain Human Object Classification in Daubechies Complex Wavelet Domain Manish Khare 1, Rajneesh Kumar Srivastava 1, Ashish Khare 1(&), Nguyen Thanh Binh 2, and Tran Anh Dien 2 1 Image Processing and Computer

More information

Ester Bernadó-Mansilla. Research Group in Intelligent Systems Enginyeria i Arquitectura La Salle Universitat Ramon Llull Barcelona, Spain

Ester Bernadó-Mansilla. Research Group in Intelligent Systems Enginyeria i Arquitectura La Salle Universitat Ramon Llull Barcelona, Spain Learning Classifier Systems for Class Imbalance Problems Research Group in Intelligent Systems Enginyeria i Arquitectura La Salle Universitat Ramon Llull Barcelona, Spain Aim Enhance the applicability

More information

Measuring Intrusion Detection Capability: An Information- Theoretic Approach

Measuring Intrusion Detection Capability: An Information- Theoretic Approach Measuring Intrusion Detection Capability: An Information- Theoretic Approach Guofei Gu, Prahlad Fogla, David Dagon, Wenke Lee Georgia Tech Boris Skoric Philips Research Lab Outline Motivation Problem Why

More information

North Carolina State Laboratory of Public Health HIS HIV Sample Submission Label Format Specifications Forms: DHHS 1111 and 3707

North Carolina State Laboratory of Public Health HIS HIV Sample Submission Label Format Specifications Forms: DHHS 1111 and 3707 North Carolina State Laboratory of Public Health HIS HIV Sample Submission Label Format Specifications Forms: DHHS 1111 and 3707 Updated: Version 1.3 This document defines the State Laboratory of Public

More information

Background Motion Video Tracking of the Memory Watershed Disc Gradient Expansion Template

Background Motion Video Tracking of the Memory Watershed Disc Gradient Expansion Template , pp.26-31 http://dx.doi.org/10.14257/astl.2016.137.05 Background Motion Video Tracking of the Memory Watershed Disc Gradient Expansion Template Yao Nan 1, Shen Haiping 2 1 Department of Jiangsu Electric

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Classification Advanced Reading: Chapter 8.4 & 8.5 Han, Chapters 4.5 & 4.6 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei. Data

More information

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,

More information

Online Batch Services

Online Batch Services Online Batch Services LexisNexis has enhanced its batch services to allow more user-friendly functionality for uploading batches and mapping layouts. Users log into the main product to access the online

More information

Missouri State Highway Patrol. OCN Query Application. Detailed Requirements Specification Version 1.3

Missouri State Highway Patrol. OCN Query Application. Detailed Requirements Specification Version 1.3 Missouri State Highway Patrol OCN Query Application Detailed Requirements Specification Version 1.3 Table of Contents 1 Document Description... 6 1.1 Intent... 6 1.2 Executive Summary... 6 1.3 Overview...

More information

DATA MINING OVERFITTING AND EVALUATION

DATA MINING OVERFITTING AND EVALUATION DATA MINING OVERFITTING AND EVALUATION 1 Overfitting Will cover mechanisms for preventing overfitting in decision trees But some of the mechanisms and concepts will apply to other algorithms 2 Occam s

More information

SSH Compromise Detection using NetFlow/IPFIX. Rick Hofstede, Luuk Hendriks

SSH Compromise Detection using NetFlow/IPFIX. Rick Hofstede, Luuk Hendriks SSH Compromise Detection using NetFlow/IPFIX Rick Hofstede, Luuk Hendriks 51 percent of respondents admitted that their organizations have already been impacted by an SSH key-related compromise in the

More information

Record Linkage 11:35 12:04 (Sharp!)

Record Linkage 11:35 12:04 (Sharp!) Record Linkage 11:35 12:04 (Sharp!) Rich Pinder Los Angeles Cancer Surveillance Program rpinder@usc.edu NAACCR Short Course Central Cancer Registries: Design, Management and Use Presented at the NAACCR

More information

Package riskyr. February 19, 2018

Package riskyr. February 19, 2018 Type Package Title Rendering Risk Literacy more Transparent Version 0.1.0 Date 2018-02-16 Author Hansjoerg Neth [aut, cre], Felix Gaisbauer [aut], Nico Gradwohl [aut], Wolfgang Gaissmaier [aut] Maintainer

More information

DATA MINING LECTURE 9. Classification Decision Trees Evaluation

DATA MINING LECTURE 9. Classification Decision Trees Evaluation DATA MINING LECTURE 9 Classification Decision Trees Evaluation 10 10 Illustrating Classification Task Tid Attrib1 Attrib2 Attrib3 Class 1 Yes Large 125K No 2 No Medium 100K No 3 No Small 70K No 4 Yes Medium

More information

.. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. for each element of the dataset we are given its class label.

.. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. for each element of the dataset we are given its class label. .. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Classification/Supervised Learning Definitions Data. Consider a set A = {A 1,...,A n } of attributes, and an additional

More information

Classification and Regression

Classification and Regression Classification and Regression Announcements Study guide for exam is on the LMS Sample exam will be posted by Monday Reminder that phase 3 oral presentations are being held next week during workshops Plan

More information

CHAPTER 3 MACHINE LEARNING MODEL FOR PREDICTION OF PERFORMANCE

CHAPTER 3 MACHINE LEARNING MODEL FOR PREDICTION OF PERFORMANCE CHAPTER 3 MACHINE LEARNING MODEL FOR PREDICTION OF PERFORMANCE In work educational data mining has been used on qualitative data of students and analysis their performance using C4.5 decision tree algorithm.

More information