Privacy Preserving Probabilistic Record Linkage


1 Privacy Preserving Probabilistic Record Linkage

Duncan Smith, Natalie Shlomo
Social Statistics, School of Social Sciences, University of Manchester

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/ ) under grant agreement n (DwB - Data without Boundaries).

2 Topics Covered

- Introduction
- Probabilistic Record Linkage
- String Anonymisation
- Putting the probabilities back into Privacy Preserving Record Linkage
- Experiment
- Discussion

3 Introduction

- Probabilistic record linkage was developed by Fellegi and Sunter (1969).
- Administrative sources are being used to improve the quality of surveys or to replace traditional censuses.
- Traditionally, all datasets are held in one location (NSI) and matching variables (first name, last name, address) are used to link data without the need for anonymisation.
- Data on individuals may be in distinct databases and may be owned by different custodians: Alice (A) and Bob (B).
- Privacy restrictions prevent the release of certain variables, or information that uniquely identifies an individual is suppressed/coarsened.

4 Introduction

- CS literature: techniques for anonymising identifying variables.
- A third party (Carole) only sees matching variables and returns pairs of unique record IDs (assigned by Alice and Bob).
- Two possible scenarios (there are more):
  - Trusted Carole sees the true values of a single matching variable.
  - Non-trusted Carole sees anonymised values of a single matching variable.
- Privacy preserving record linkage (PPRL) allows exact matching and can allow linkage based on similarity scores generated from anonymised values.
- F&S probabilistic record linkage is typically not used in PPRL.

5 Introduction

- Alice and Bob clean, harmonize and standardize data and anonymise matching variables (using the same method and seed).
- In our new approach, we apply probabilistic record linkage to anonymised values to obtain a probability of a correct match (PPPRL).
- Motivation: data can be held within an archive, and users can carry out PPPRL within a black box for dynamic database integration.
- Three-party Alice, Bob, Carole scenario as set out in the UK Beyond 2011 project, where Carole has access to original values and can calculate string comparators.
- In PPRL there is no possibility of clerical review, and links are classified into 2 classes: true matches and false matches.

6 Probabilistic Record Linkage

- F&S probabilistic record linkage uses a Binomial EM algorithm based on agree/disagree indicators $\{\gamma_i,\ i = 1, \dots, p\}$ to estimate the likelihood ratio.
- The matching score is based on the sum of the logs of the likelihood ratios $m(\gamma_i) / u(\gamma_i)$, where $m(\gamma_i)$ is the probability of agreement given that the pair is a match and $u(\gamma_i)$ is the probability of agreement given that it is not a match.
- String comparators, e.g. Jaro-Winkler, are used to adjust the matching score based on partial agreements, e.g. typing errors.
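The matching score on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `match_score` and the parameter values are hypothetical, and the disagreement term log((1 - m)/(1 - u)) is the usual Fellegi-Sunter counterpart of the agreement term that the slide spells out.

```python
import math

def match_score(agreements, m, u):
    """Sum of log likelihood ratios over the p matching variables.

    agreements[i] is True when variable i agrees for the record pair;
    m[i] = P(agree | match), u[i] = P(agree | non-match).
    """
    score = 0.0
    for agree, m_i, u_i in zip(agreements, m, u):
        if agree:
            score += math.log(m_i / u_i)
        else:
            # Standard Fellegi-Sunter weight for a disagreement.
            score += math.log((1.0 - m_i) / (1.0 - u_i))
    return score

# Hypothetical parameters for three matching variables (illustrative only).
m = [0.9, 0.95, 0.8]
u = [0.5, 0.1, 0.01]

full_agreement = match_score([True, True, True], m, u)
partial_agreement = match_score([True, True, False], m, u)
```

A pair that agrees on every variable gets a higher score than one that disagrees on a discriminating variable, which is the ordering a threshold on the score relies on.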

7 String Anonymisation

- String anonymisation can use hash functions on bigrams:
  'john' → {'jo', 'oh', 'hn'} → { , , }
  'jon' → {'jo', 'on'} → { , }
- Minwise hashing (Broder 1997) generates a random permutation of a set of elements and returns the hash for the first ordered element.
- The probability of a hash collision on the first ordered element is the Jaccard similarity score: $J(A, B) = |A \cap B| / |A \cup B|$
- The Jaccard similarity score is estimated from many hash values, where the number of collisions is distributed $n_{A,B} \sim \mathrm{Bin}(m, J(A, B))$ ($m$ is the number of hash functions), and is estimated by $\hat{J}(A, B) = n_{A,B} / m$
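A minimal sketch of the minwise estimate described above, under stated assumptions: the random permutations are approximated by seeded SHA-256 hashes, and the helper names (`bigrams`, `minhash_signature`, `estimate_jaccard`) are illustrative rather than taken from the slides. A real deployment would presumably use a keyed hash shared by Alice and Bob.

```python
import hashlib

def bigrams(s):
    """Set of adjacent character pairs, e.g. 'jon' -> {'jo', 'on'}."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    """True Jaccard similarity |A n B| / |A u B|."""
    return len(a & b) / len(a | b)

def _hash(token, i, seed):
    """i-th hash function, simulated by salting SHA-256 (an assumption)."""
    data = f"{seed}:{i}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def minhash_signature(tokens, m, seed=0):
    """For each of m hash functions keep the smallest hash over the set."""
    return [min(_hash(t, i, seed) for t in tokens) for i in range(m)]

def estimate_jaccard(sig_a, sig_b):
    """J-hat = n / m, where n counts positions with equal min-hashes."""
    n = sum(x == y for x, y in zip(sig_a, sig_b))
    return n / len(sig_a)

a, b = bigrams("john"), bigrams("jon")
sig_a = minhash_signature(a, m=1000)
sig_b = minhash_signature(b, m=1000)
estimate = estimate_jaccard(sig_a, sig_b)  # should be close to jaccard(a, b)
```

Both parties must use the same `seed`, otherwise their signatures are not comparable; this mirrors the "same method and seed" requirement stated earlier in the talk.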

8 String Anonymisation

- Proposed method: concatenated 1-bit minwise hashing.
- The estimate of the Jaccard similarity score is $\hat{J}(A, B) = 2\, n_{A,B} / m - 1$
- Example: minwise hashes and 1-bit minwise hashes under a binary representation for S1 = {'jo', 'oh', 'hn'} and S2 = {'jo', 'on'}.
  [Tables: minwise hashes and 1-bit minwise hashes; columns H1, H2, H3, H4, H5, ..., Hm; rows S1, S2, ..., Sn]
- With 5 hash functions, the estimate of the Jaccard similarity score is 2/5 for minwise hashes and 3/5 for 1-bit hashes; the true value is 1/4.
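The 1-bit estimator can be sketched as below. The reasoning behind the formula: two 1-bit hashes agree whenever the underlying min-hashes collide (probability J) and otherwise agree by chance half the time, so P(agree) = J + (1 - J)/2, which rearranges to J = 2 P(agree) - 1. The code is an illustrative sketch; taking the lowest bit of a salted SHA-256 min-hash is an assumption, not necessarily the construction the authors used.

```python
import hashlib

def _hash(token, i, seed):
    """i-th hash function, simulated by salting SHA-256 (an assumption)."""
    data = f"{seed}:{i}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def one_bit_signature(tokens, m, seed=0):
    """Keep only the lowest bit of each of the m min-hashes."""
    return [min(_hash(t, i, seed) for t in tokens) & 1 for i in range(m)]

def estimate_jaccard_1bit(bits_a, bits_b):
    """J-hat = 2 n / m - 1, where n counts agreeing bit positions."""
    n = sum(x == y for x, y in zip(bits_a, bits_b))
    return 2.0 * n / len(bits_a) - 1.0

# Slide example: 4 agreeing bits out of 5 gives 2 * 4/5 - 1 = 3/5.
slide_estimate = estimate_jaccard_1bit([1, 0, 1, 1, 0], [1, 0, 1, 1, 1])

# Bigram sets for 'john' and 'jon' (true Jaccard score 1/4).
s1 = {"jo", "oh", "hn"}
s2 = {"jo", "on"}
bits_1 = one_bit_signature(s1, m=4000)
bits_2 = one_bit_signature(s2, m=4000)
estimate = estimate_jaccard_1bit(bits_1, bits_2)

# The concatenated bit string is what would be stored or exchanged.
encoded = "".join(str(bit) for bit in bits_1)
```

Each hash function costs one bit of storage instead of a full hash value, at the price of a larger variance for the same m; the estimate can also fall below zero for dissimilar strings.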

9 String Anonymisation

- Simulation study: File A has 300 names; File B was obtained by perturbing File A to simulate typographical errors.
- Tokenized bigrams with leading and trailing underscores.
- True Jaccard scores compared with estimated scores on all pairs in A x B.
- Bias in Bloom filter approaches.
- Smaller variance in minwise hash compared to concatenated 1-bit hash, but it requires more storage.
- Concatenated hash has approximately the same MSE as the Bloom filter.
- Precision can be controlled by the choice of m, the number of hash functions.
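The padded tokenization and the true Jaccard scores over all pairs can be sketched like this. The function names and the two tiny name lists are hypothetical stand-ins for Files A and B.

```python
from itertools import product

def padded_bigrams(name):
    """Bigrams of the name with a leading and a trailing underscore."""
    padded = f"_{name}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def jaccard(a, b):
    """True Jaccard similarity; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical miniature versions of File A and its perturbed copy File B.
file_a = ["john", "mary"]
file_b = ["jon", "mary"]

# True Jaccard score for every pair in A x B.
true_scores = {
    (x, y): jaccard(padded_bigrams(x), padded_bigrams(y))
    for x, y in product(file_a, file_b)
}
```

Padding rewards agreement on the first and last characters: 'john' and 'jon' score 1/4 on plain bigrams but 3/6 = 1/2 with padding.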

10 Privacy Preserving Probabilistic Record Linkage

- Extend the Binomial EM algorithm to K categories, $k = 1, \dots, K$, where each category is a grouping of similarity scores (Jaro for original values; Jaccard for anonymised values), e.g. 8 categories with (inclusive) upper bounds: [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1].
- Element in the agreement vector: $\gamma^{j}_{q,k} = 1$ if variable $q$ of pair $j$ has a similarity score in category $k$, otherwise 0.
- Multinomial EM algorithm to estimate the matching parameters: $\hat{m}_{q,k}$, $\hat{u}_{q,k}$ and $\hat{p}$.
- Blocking: methods in the PPRL literature include canopy clustering (McCallum et al., 2000), which divides the pairs into overlapping subsets before classification; multibit tree structures to identify similar comparison vectors under the Bloom filter framework (Bachteler et al., 2013); and more.
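A rough sketch of the binning step and a multinomial EM fit, under several assumptions that are not in the slides: conditional independence of the matching variables given match status, the starting values for m and u, the small smoothing constant `eps`, and the toy data. The names `category` and `multinomial_em` are illustrative.

```python
from bisect import bisect_left
from math import prod

UPPER_BOUNDS = [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1.0]

def category(score, bounds=UPPER_BOUNDS):
    """Index of the first bin whose inclusive upper bound is >= score."""
    return bisect_left(bounds, score)

def multinomial_em(pairs, n_cat, p=0.1, iters=50, eps=1e-6):
    """pairs: one tuple per record pair, one category index per variable."""
    n_var = len(pairs[0])
    # Assumed start: m favours high-similarity bins, u favours low ones.
    weights = [k + 1.0 for k in range(n_cat)]
    total = sum(weights)
    m = [[w / total for w in weights] for _ in range(n_var)]
    u = [[w / total for w in reversed(weights)] for _ in range(n_var)]
    g = []
    for _ in range(iters):
        # E-step: posterior probability that each pair is a match.
        g = []
        for c in pairs:
            pm = p * prod(m[q][c[q]] for q in range(n_var))
            pu = (1.0 - p) * prod(u[q][c[q]] for q in range(n_var))
            g.append(pm / (pm + pu))
        # M-step: update p, m and u from the posterior weights.
        sum_g = sum(g)
        sum_not_g = len(g) - sum_g
        p = sum_g / len(g)
        for q in range(n_var):
            for k in range(n_cat):
                g_k = sum(gj for gj, c in zip(g, pairs) if c[q] == k)
                n_k = sum(1 for c in pairs if c[q] == k)
                m[q][k] = (g_k + eps) / (sum_g + eps * n_cat)
                u[q][k] = (n_k - g_k + eps) / (sum_not_g + eps * n_cat)
    return p, m, u, g

# Toy similarity scores for two variables: 10 match-like, 90 non-match-like.
score_pairs = ([(1.0, 1.0)] * 6 + [(1.0, 0.97)] * 2 + [(0.97, 1.0)] * 2
               + [(0.1, 0.1)] * 50 + [(0.1, 0.3)] * 20 + [(0.3, 0.1)] * 20)
pairs = [tuple(category(s) for s in sp) for sp in score_pairs]
p_hat, m_hat, u_hat, g = multinomial_em(pairs, n_cat=len(UPPER_BOUNDS))
```

The returned `g` holds the posterior match probability for each pair, which is the quantity a single PPPRL threshold would be applied to.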

11 Experiment

- 1000 records from a Census database with attached English names (File A).
- File B generated by perturbing File A under a probabilistic approach, including swapping, deleting and transposing characters, on the variables Gender, Year of Birth, Month of Birth and First Name.
- 4 perturbed datasets, each at a different level of perturbation.
- A random sample of 700 records from File A and a random sample of 400 records from the perturbed files were used for matching.
- No blocking was carried out.

12 Experiment

PPPRL:
- Binary EM: standard EM approach based on exact matching of strings. No similarity score used.
- LR weighted: takes the outputs of Binary EM and downweights the likelihood ratios.
- Log LR weighted: takes the outputs of Binary EM and downweights the log likelihood ratios.
- EM (8): multinomial EM approach with 8 bins having upper bounds [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1]. Jaccard similarity score (with padded underscores on bigrams).
- EM (15): multinomial EM approach with 15 bins having upper bounds [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.925, 0.95, 0.975, 0.999, 1]. Similarity score as above.

PRL:
- Jaro: multinomial EM approach with 8 bins using the Jaro string comparator.
- Jaro-Winkler: as above but with the Jaro-Winkler string comparator.

13 Experiment

- Correct links identified and used to construct precision-recall plots.
- Plots show, for any given threshold, the precision and recall based on false positives, true positives, false negatives and true negatives, and can be used to compare approaches.
- Good approaches will produce curves in the upper right of the plot.

$\text{Precision} = \dfrac{tp}{tp + fp} \qquad \text{Recall} = \dfrac{tp}{tp + fn}$
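The points of a precision-recall curve can be computed as in the sketch below. The scores and match labels are made up for illustration, and returning precision 1.0 when nothing is linked is a convention chosen here, not something the slides specify.

```python
def precision_recall(scores, is_match, threshold):
    """Precision and recall when pairs scoring >= threshold are linked."""
    tp = sum(1 for s, y in zip(scores, is_match) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, is_match) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, is_match) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention: no links made
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical posterior match probabilities and true match status.
scores = [0.95, 0.9, 0.7, 0.4, 0.2]
is_match = [True, True, False, True, False]

# One (threshold, precision, recall) point per threshold.
curve = [(t, *precision_recall(scores, is_match, t)) for t in (0.3, 0.5, 0.8)]
```

Raising the threshold trades recall for precision, which is why a curve closer to the upper right indicates a better approach.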

14 Experiment

[Figure: precision-recall plots; panels: low perturbation, high perturbation]

- All approaches perform better with a low level of perturbation.
- Binary EM without similarity scores performs the worst.
- Downweighting log likelihood ratios outperforms downweighting likelihood ratios.
- Multinomial EM outperforms Binary EM, with no clear difference between the 8-category and 15-category Jaccard score schemes.
- Jaro schemes provide the best performance, although these are not privacy preserving.

15 Discussion

- PPPRL does not allow clerical review, and one threshold is determined based on the posterior probability of a correct link.
- PPPRL can be tailored to different types of variables via the choice/design of the tokenization scheme.
- So far we have dealt with 1-to-1 matching.
- Multinomial EM offers improved classification over the unweighted and weighted binary EM schemes.
- Under trusted Carole, the Jaro and Jaro-Winkler schemes outperformed the padded bigram tokenization scheme under PPPRL.

16 Thank you for your attention

Data Linkage Methods: Overview of Computer Science Research

Data Linkage Methods: Overview of Computer Science Research Data Linkage Methods: Overview of Computer Science Research Peter Christen Research School of Computer Science, ANU College of Engineering and Computer Science, The Australian National University, Canberra,

More information

RLC RLC RLC. Merge ToolBox MTB. Getting Started. German. Record Linkage Software, Version RLC RLC RLC. German. German.

RLC RLC RLC. Merge ToolBox MTB. Getting Started. German. Record Linkage Software, Version RLC RLC RLC. German. German. German RLC German RLC German RLC Merge ToolBox MTB German RLC Record Linkage Software, Version 0.742 Getting Started German RLC German RLC 12 November 2012 Tobias Bachteler German Record Linkage Center

More information

Overview of Record Linkage Techniques

Overview of Record Linkage Techniques Overview of Record Linkage Techniques Record linkage or data matching refers to the process used to identify records which relate to the same entity (e.g. patient, customer, household) in one or more data

More information

Overview of Record Linkage for Name Matching

Overview of Record Linkage for Name Matching Overview of Record Linkage for Name Matching W. E. Winkler, william.e.winkler@census.gov NSF Workshop, February 29, 2008 Outline 1. Components of matching process and nuances Match NSF file of Ph.D. recipients

More information

Automatic training example selection for scalable unsupervised record linkage

Automatic training example selection for scalable unsupervised record linkage Automatic training example selection for scalable unsupervised record linkage Peter Christen Department of Computer Science, The Australian National University, Canberra, Australia Contact: peter.christen@anu.edu.au

More information

Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records

Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records Ted Enamorado Benjamin Fifield Kosuke Imai Princeton University Talk at Seoul National University Fifth Asian Political

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Reading for this topic: T. Fawcett, An introduction to ROC analysis, Sections 1-4, 7 (linked from class website) Evaluating Classifiers What we want: Classifier that best predicts

More information

Data Linkage Techniques: Past, Present and Future

Data Linkage Techniques: Past, Present and Future Data Linkage Techniques: Past, Present and Future Peter Christen Department of Computer Science, The Australian National University Contact: peter.christen@anu.edu.au Project Web site: http://datamining.anu.edu.au/linkage.html

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Reading for this topic: T. Fawcett, An introduction to ROC analysis, Sections 1-4, 7 (linked from class website) Evaluating Classifiers What we want: Classifier that best predicts

More information

Data linkages in PEDSnet

Data linkages in PEDSnet 2016/2017 CRISP Seminar Series - Part IV Data linkages in PEDSnet Toan C. Ong, PhD Assistant Professor Department of Pediatrics University of Colorado, Anschutz Medical Campus Content Record linkage background

More information

Record Linkage using Probabilistic Methods and Data Mining Techniques

Record Linkage using Probabilistic Methods and Data Mining Techniques Doi:10.5901/mjss.2017.v8n3p203 Abstract Record Linkage using Probabilistic Methods and Data Mining Techniques Ogerta Elezaj Faculty of Economy, University of Tirana Gloria Tuxhari Faculty of Economy, University

More information

Automatic Record Linkage using Seeded Nearest Neighbour and SVM Classification

Automatic Record Linkage using Seeded Nearest Neighbour and SVM Classification Automatic Record Linkage using Seeded Nearest Neighbour and SVM Classification Peter Christen Department of Computer Science, ANU College of Engineering and Computer Science, The Australian National University,

More information

An Ensemble Approach for Record Matching in Data Linkage

An Ensemble Approach for Record Matching in Data Linkage Digital Health Innovation for Consumers, Clinicians, Connectivity and Community A. Georgiou et al. (Eds.) 2016 The authors and IOS Press. This article is published online with Open Access by IOS Press

More information

Entity Resolution, Clustering Author References

Entity Resolution, Clustering Author References , Clustering Author References Vlad Shchogolev vs299@columbia.edu May 1, 2007 Outline 1 What is? Motivation 2 Formal Definition Efficieny Considerations Measuring Text Similarity Other approaches 3 Clustering

More information

Collective Entity Resolution in Relational Data

Collective Entity Resolution in Relational Data Collective Entity Resolution in Relational Data I. Bhattacharya, L. Getoor University of Maryland Presented by: Srikar Pyda, Brett Walenz CS590.01 - Duke University Parts of this presentation from: http://www.norc.org/pdfs/may%202011%20personal%20validation%20and%20entity%20resolution%20conference/getoorcollectiveentityresolution

More information

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati Evaluation Metrics (Classifiers) CS Section Anand Avati Topics Why? Binary classifiers Metrics Rank view Thresholding Confusion Matrix Point metrics: Accuracy, Precision, Recall / Sensitivity, Specificity,

More information

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics

More information

CCRMA MIR Workshop 2014 Evaluating Information Retrieval Systems. Leigh M. Smith Humtap Inc.

CCRMA MIR Workshop 2014 Evaluating Information Retrieval Systems. Leigh M. Smith Humtap Inc. CCRMA MIR Workshop 2014 Evaluating Information Retrieval Systems Leigh M. Smith Humtap Inc. leigh@humtap.com Basic system overview Segmentation (Frames, Onsets, Beats, Bars, Chord Changes, etc) Feature

More information

Introduction to blocking techniques and traditional record linkage

Introduction to blocking techniques and traditional record linkage Introduction to blocking techniques and traditional record linkage Brenda Betancourt Duke University Department of Statistical Science bb222@stat.duke.edu May 2, 2018 1 / 32 Blocking: Motivation Naively

More information

Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records

Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records Kosuke Imai Princeton University Talk at SOSC Seminar Hong Kong University of Science and Technology June 14, 2017 Joint

More information

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,

More information

Classification Part 4

Classification Part 4 Classification Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Model Evaluation Metrics for Performance Evaluation How to evaluate

More information

Private Record linkage: Comparison of selected techniques for name matching

Private Record linkage: Comparison of selected techniques for name matching Private Record linkage: Comparison of selected techniques for name matching Pawel Grzebala and Michelle Cheatham DaSe Lab, Wright State University, Dayton OH 45435, USA, grzebala.2@wright.edu, michelle.cheatham@wright.edu

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Murhaf Fares & Stephan Oepen Language Technology Group (LTG) September 27, 2017 Today 2 Recap Evaluation of classifiers Unsupervised

More information

Use of Synthetic Data in Testing Administrative Records Systems

Use of Synthetic Data in Testing Administrative Records Systems Use of Synthetic Data in Testing Administrative Records Systems K. Bradley Paxton and Thomas Hager ADI, LLC 200 Canal View Boulevard, Rochester, NY 14623 brad.paxton@adillc.net, tom.hager@adillc.net Executive

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu April 24, 2017 Homework 2 out Announcements Due May 3 rd (11:59pm) Course project proposal

More information

Privacy-Preserving Data Sharing and Matching

Privacy-Preserving Data Sharing and Matching Privacy-Preserving Data Sharing and Matching Peter Christen School of Computer Science, ANU College of Engineering and Computer Science, The Australian National University, Canberra, Australia Contact:

More information

Design-Based Estimation with Record-Linked Administrative Files and a Clerical Review Sample

Design-Based Estimation with Record-Linked Administrative Files and a Clerical Review Sample Journal of Official Statistics, Vol. 34, No. 1, 2018, pp. 41 54, http://dx.doi.org/10.1515/jos-2018-0003 Design-Based Estimation with Record-Linked Administrative Files and a Clerical Review Sample Abel

More information

Information Integration of Partially Labeled Data

Information Integration of Partially Labeled Data Information Integration of Partially Labeled Data Steffen Rendle and Lars Schmidt-Thieme Information Systems and Machine Learning Lab, University of Hildesheim srendle@ismll.uni-hildesheim.de, schmidt-thieme@ismll.uni-hildesheim.de

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Erik Velldal & Stephan Oepen Language Technology Group (LTG) September 23, 2015 Agenda Last week Supervised vs unsupervised learning.

More information

Probabilistic Scoring Methods to Assist Entity Resolution Systems Using Boolean Rules

Probabilistic Scoring Methods to Assist Entity Resolution Systems Using Boolean Rules Probabilistic Scoring Methods to Assist Entity Resolution Systems Using Boolean Rules Fumiko Kobayashi, John R Talburt Department of Information Science University of Arkansas at Little Rock 2801 South

More information

Estimating parameters for probabilistic linkage of privacy-preserved datasets

Estimating parameters for probabilistic linkage of privacy-preserved datasets Brown et al. BMC Medical Research Methodology (2017) 17:95 DOI 10.1186/s12874-017-0370-0 RESEARCH ARTICLE Open Access Estimating parameters for probabilistic linkage of privacy-preserved datasets Adrian

More information

Logistic Regression: Probabilistic Interpretation

Logistic Regression: Probabilistic Interpretation Logistic Regression: Probabilistic Interpretation Approximate 0/1 Loss Logistic Regression Adaboost (z) SVM Solution: Approximate 0/1 loss with convex loss ( surrogate loss) 0-1 z = y w x SVM (hinge),

More information

Extending Naive Bayes Classifier with Hierarchy Feature Level Information for Record Linkage

Extending Naive Bayes Classifier with Hierarchy Feature Level Information for Record Linkage Extending Naive Bayes Classifier with Hierarchy Feature Level Information for Record Linkage Yun Zhou, John Howroyd, Sebastian Danicic, and J. Mark Bishop Tungsten Centre for Intelligent Data Analytics

More information

Data Linkages - Effect of Data Quality on Linkage Outcomes

Data Linkages - Effect of Data Quality on Linkage Outcomes Data Linkages - Effect of Data Quality on Linkage Outcomes Anders Alexandersson July 27, 2016 Anders Alexandersson Data Linkages - Effect of Data Quality on Linkage OutcomesJuly 27, 2016 1 / 13 Introduction

More information

Global Probability of Boundary

Global Probability of Boundary Global Probability of Boundary Learning to Detect Natural Image Boundaries Using Local Brightness, Color, and Texture Cues Martin, Fowlkes, Malik Using Contours to Detect and Localize Junctions in Natural

More information

List of Exercises: Data Mining 1 December 12th, 2015

List of Exercises: Data Mining 1 December 12th, 2015 List of Exercises: Data Mining 1 December 12th, 2015 1. We trained a model on a two-class balanced dataset using five-fold cross validation. One person calculated the performance of the classifier by measuring

More information

Cleanup and Statistical Analysis of Sets of National Files

Cleanup and Statistical Analysis of Sets of National Files Cleanup and Statistical Analysis of Sets of National Files William.e.winkler@census.gov FCSM Conference, November 6, 2013 Outline 1. Background on record linkage 2. Background on edit/imputation 3. Current

More information

Security Control Methods for Statistical Database

Security Control Methods for Statistical Database Security Control Methods for Statistical Database Li Xiong CS573 Data Privacy and Security Statistical Database A statistical database is a database which provides statistics on subsets of records OLAP

More information

Quality and Complexity Measures for Data Linkage and Deduplication

Quality and Complexity Measures for Data Linkage and Deduplication Quality and Complexity Measures for Data Linkage and Deduplication Peter Christen and Karl Goiser Department of Computer Science, The Australian National University, Canberra ACT 0200, Australia {peter.christen,karl.goiser}@anu.edu.au

More information

On the automatic classification of app reviews

On the automatic classification of app reviews Requirements Eng (2016) 21:311 331 DOI 10.1007/s00766-016-0251-9 RE 2015 On the automatic classification of app reviews Walid Maalej 1 Zijad Kurtanović 1 Hadeer Nabil 2 Christoph Stanik 1 Walid: please

More information

RELAIS: A Record Linkage Toolkit Training course on record linkage Monica Scannapieco Istat

RELAIS: A Record Linkage Toolkit Training course on record linkage Monica Scannapieco Istat RELAIS: A Record Linkage Toolkit Training course on record linkage Monica Scannapieco Istat scannapi@istat.it RELAIS: Milestones Alfa Version (January 2007) Beta RELAIS 1.0 (February 2008) RELAIS 2.0 (June

More information

Machine learning in fmri

Machine learning in fmri Machine learning in fmri Validation Alexandre Savio, Maite Termenón, Manuel Graña 1 Computational Intelligence Group, University of the Basque Country December, 2010 1/18 Outline 1 Motivation The validation

More information

Disambiguating Multiple Links in Historical Record Linkage

Disambiguating Multiple Links in Historical Record Linkage Disambiguating Multiple Links in Historical Record Linkage by Laura Richards A Thesis presented to The University of Guelph In partial fulfilment of requirements for the degree of Master of Science in

More information

Evaluating String Comparator Performance for Record Linkage William E. Yancey Statistical Research Division U.S. Census Bureau

Evaluating String Comparator Performance for Record Linkage William E. Yancey Statistical Research Division U.S. Census Bureau Evaluating String Comparator Performance for Record Linkage William E. Yancey Statistical Research Division U.S. Census Bureau KEY WORDS string comparator, record linkage, edit distance Abstract We compare

More information

Estimation methods for the integration of administrative sources

Estimation methods for the integration of administrative sources Estimation methods for the integration of administrative sources Task 5b: Review of estimation methods identified in Task 3 a report containing technical summary sheet for each identified estimation/statistical

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data

More information

TABLE OF CONTENTS PAGE TITLE NO.

TABLE OF CONTENTS PAGE TITLE NO. TABLE OF CONTENTS CHAPTER PAGE TITLE ABSTRACT iv LIST OF TABLES xi LIST OF FIGURES xii LIST OF ABBREVIATIONS & SYMBOLS xiv 1. INTRODUCTION 1 2. LITERATURE SURVEY 14 3. MOTIVATIONS & OBJECTIVES OF THIS

More information

Extending Naive Bayes Classifier with Hierarchy Feature Level Information for Record Linkage. Yun Zhou. AMBN-2015, Yokohama, Japan 16/11/2015

Extending Naive Bayes Classifier with Hierarchy Feature Level Information for Record Linkage. Yun Zhou. AMBN-2015, Yokohama, Japan 16/11/2015 Extending Naive Bayes Classifier with Hierarchy Feature Level Information for Record Linkage Yun Zhou AMBN-2015, Yokohama, Japan 16/11/2015 Overview Record Linkage aka Matching aka Merge. Finding records

More information

Private Record Linkage

Private Record Linkage Undefined 0 (2016) 1 1 IOS Press Private Record Linkage An analysis of the accuracy, efficiency, and security of selected techniques for name matching Pawel Grzebala and Michelle Cheatham Wright State

More information

INTRODUCTION TO MACHINE LEARNING. Measuring model performance or error

INTRODUCTION TO MACHINE LEARNING. Measuring model performance or error INTRODUCTION TO MACHINE LEARNING Measuring model performance or error Is our model any good? Context of task Accuracy Computation time Interpretability 3 types of tasks Classification Regression Clustering

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information

Record Linkage for the American Opportunity Study: Formal Framework and Research Agenda

Record Linkage for the American Opportunity Study: Formal Framework and Research Agenda 1 / 14 Record Linkage for the American Opportunity Study: Formal Framework and Research Agenda Stephen E. Fienberg Department of Statistics, Heinz College, and Machine Learning Department, Carnegie Mellon

More information

Probabilistic Classifiers DWML, /27

Probabilistic Classifiers DWML, /27 Probabilistic Classifiers DWML, 2007 1/27 Probabilistic Classifiers Conditional class probabilities Id. Savings Assets Income Credit risk 1 Medium High 75 Good 2 Low Low 50 Bad 3 High Medium 25 Bad 4 Medium

More information

Information Retrieval. Lecture 7 - Evaluation in Information Retrieval. Introduction. Overview. Standard test collection. Wintersemester 2007

Information Retrieval. Lecture 7 - Evaluation in Information Retrieval. Introduction. Overview. Standard test collection. Wintersemester 2007 Information Retrieval Lecture 7 - Evaluation in Information Retrieval Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1 / 29 Introduction Framework

More information

Topic:- DU_J18_MA_STATS_Topic01

Topic:- DU_J18_MA_STATS_Topic01 DU MA MSc Statistics Topic:- DU_J18_MA_STATS_Topic01 1) In analysis of variance problem involving 3 treatments with 10 observations each, SSE= 399.6. Then the MSE is equal to: [Question ID = 2313] 1. 14.8

More information

CS 229 Midterm Review

CS 229 Midterm Review CS 229 Midterm Review Course Staff Fall 2018 11/2/2018 Outline Today: SVMs Kernels Tree Ensembles EM Algorithm / Mixture Models [ Focus on building intuition, less so on solving specific problems. Ask

More information

ECE 5470 Classification, Machine Learning, and Neural Network Review

ECE 5470 Classification, Machine Learning, and Neural Network Review ECE 5470 Classification, Machine Learning, and Neural Network Review Due December 1. Solution set Instructions: These questions are to be answered on this document which should be submitted to blackboard

More information

A Bagging Method using Decision Trees in the Role of Base Classifiers

A Bagging Method using Decision Trees in the Role of Base Classifiers A Bagging Method using Decision Trees in the Role of Base Classifiers Kristína Machová 1, František Barčák 2, Peter Bednár 3 1 Department of Cybernetics and Artificial Intelligence, Technical University,

More information

Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records

Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records Ted Enamorado Benjamin Fifield Kosuke Imai Princeton Harvard Talk at the Tech Science Seminar IQSS, Harvard University

More information

CSE 158. Web Mining and Recommender Systems. Midterm recap

CSE 158. Web Mining and Recommender Systems. Midterm recap CSE 158 Web Mining and Recommender Systems Midterm recap Midterm on Wednesday! 5:10 pm 6:10 pm Closed book but I ll provide a similar level of basic info as in the last page of previous midterms CSE 158

More information

City, University of London Institutional Repository

City, University of London Institutional Repository City Research Online City, University of London Institutional Repository Citation: Schnell, R. (2015). Privacy-preserving Record Linkage. In: K. Harron, H. Goldstein & C. Dibben (Eds.), Methodological

More information

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers
