Privacy Preserving Probabilistic Record Linkage
1. Privacy Preserving Probabilistic Record Linkage
Duncan Smith, Natalie Shlomo
Social Statistics, School of Social Sciences, University of Manchester
The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/ ) under grant agreement n (DwB - Data without Boundaries).

2. Topics Covered
- Introduction
- Probabilistic Record Linkage
- String Anonymisation
- Putting the probabilities back into Privacy Preserving Record Linkage
- Experiment
- Discussion

3. Introduction
- Probabilistic record linkage was developed by Fellegi and Sunter (1969).
- Administrative sources are being used to improve the quality of surveys or to replace traditional censuses.
- Traditionally, all datasets are held in one location (the NSI), and matching variables (first name, last name, address) are used to link data without the need for anonymisation.
- Data on individuals may sit in distinct databases owned by different custodians: Alice (A) and Bob (B).
- Privacy restrictions prevent the release of certain variables, or information that uniquely identifies an individual is suppressed or coarsened.

4. Introduction
- The computer science (CS) literature offers techniques for anonymising identifying variables.
- A third party (Carole) sees only the matching variables and returns pairs of unique record IDs (assigned by Alice and Bob).
- Two possible scenarios (there are more):
  - Trusted: Carole sees the true values of a single matching variable.
  - Non-trusted: Carole sees anonymised values of a single matching variable.
- Privacy preserving record linkage (PPRL) allows exact matching and can allow linkage based on similarity scores generated from anonymised values.
- Fellegi and Sunter (F&S) probabilistic record linkage is typically not used in PPRL.

5. Introduction
- Alice and Bob clean, harmonise and standardise their data and anonymise the matching variables (using the same method and seed).
- In our new approach, we apply probabilistic record linkage to the anonymised values to obtain a probability of a correct match (PPPRL).
- Motivation: data can be held within an archive, and users can carry out PPPRL within a black box for dynamic database integration.
- The three-party Alice, Bob, Carole scenario is as set out in the UK Beyond 2011 project, where Carole has access to the original values and can calculate string comparators.
- In PPRL there is no possibility of clerical review, so links are classified into 2 classes: true matches and false matches.

6. Probabilistic Record Linkage
- F&S probabilistic record linkage uses a Binomial EM algorithm based on agree/disagree indicators $\{\gamma_i,\ i = 1, \dots, p\}$ to estimate the likelihood ratio.
- The matching score is the sum of the logs of the likelihood ratios $m(\gamma_i) / u(\gamma_i)$, where $m(\gamma_i)$ is the probability of agreement given that the pair is a match and $u(\gamma_i)$ is the probability of agreement given that it is not a match.
- String comparators, e.g. Jaro-Winkler, are used to adjust the matching score for partial agreements, e.g. typing errors.
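
The matching score described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the m- and u-probabilities are already known (in practice they come from the EM algorithm), uses the standard Fellegi-Sunter disagreement weight $(1 - m)/(1 - u)$, and the function name `match_weight` and the example probabilities are hypothetical.

```python
import math

def match_weight(agreements, m, u):
    """Sum of log2 likelihood ratios over the matching variables.

    agreements[i] is True if variable i agrees for the record pair,
    m[i] = P(agree | match) and u[i] = P(agree | non-match).
    """
    total = 0.0
    for agree, m_i, u_i in zip(agreements, m, u):
        if agree:
            total += math.log2(m_i / u_i)
        else:
            # Standard Fellegi-Sunter weight for a disagreeing variable.
            total += math.log2((1.0 - m_i) / (1.0 - u_i))
    return total

# Hypothetical probabilities for (first name, year of birth, gender).
m_probs = [0.90, 0.95, 0.98]
u_probs = [0.01, 0.02, 0.50]

all_agree = match_weight([True, True, True], m_probs, u_probs)
name_differs = match_weight([False, True, True], m_probs, u_probs)
```

A pair that agrees on every variable receives a larger score than one that disagrees on first name; a string comparator would replace the hard agree/disagree decision with a partial weight.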

7. String Anonymisation
- String anonymisation can apply a hash function h to bigrams:
  - 'john' -> {'jo', 'oh', 'hn'} -> {h('jo'), h('oh'), h('hn')}
  - 'jon' -> {'jo', 'on'} -> {h('jo'), h('on')}
- Minwise hashing (Broder 1997) generates a random permutation of a set of elements and returns the hash of the first element in that order.
- The probability of a hash collision on the first ordered element is the Jaccard similarity score: $J_{A,B} = \dfrac{|A \cap B|}{|A \cup B|}$
- The Jaccard similarity score is estimated from many hash values, where the number of collisions is distributed as $n_{A,B} \sim \mathrm{Bin}(m, J_{A,B})$ (m is the number of hash functions).
- The estimate is $\hat{J}_{A,B} = \dfrac{n_{A,B}}{m}$
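
The minwise estimate can be illustrated with the 'john' / 'jon' example. The sketch below is an assumption-laden stand-in, not the authors' code: it simulates each random permutation with a seeded SHA-256 hash (the helper `seeded_hash` is hypothetical), takes the minimum hash per seed as the signature, and estimates the Jaccard score as the fraction of signature positions that collide.

```python
import hashlib

def bigrams(s):
    """Set of unpadded character bigrams, e.g. 'jon' -> {'jo', 'on'}."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

def seeded_hash(token, seed):
    # Assumption: a seeded cryptographic hash stands in for one random permutation.
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(tokens, m):
    """One minimum hash value per hash function (seeds 0..m-1)."""
    return [min(seeded_hash(t, seed) for t in tokens) for seed in range(m)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures collide (n / m)."""
    collisions = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return collisions / len(sig_a)

a, b = bigrams("john"), bigrams("jon")
true_j = jaccard(a, b)
est_j = estimate_jaccard(minhash_signature(a, 500), minhash_signature(b, 500))
```

Both parties must use the same seeds so that their signatures are comparable; increasing the number of hash functions m shrinks the binomial variance of the estimate.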

8. String Anonymisation
- Proposed method: concatenated 1-bit minwise hashing.
- The estimate of the Jaccard similarity score is $\hat{J}_{A,B} = 2\,\dfrac{n_{A,B}}{m} - 1$
- Example: minwise hashes and 1-bit minwise hashes under a binary representation for S1 = {'jo', 'oh', 'hn'} and S2 = {'jo', 'on'}.
- [Tables: minwise hash values and 1-bit minwise hash values for sets S1, S2, ..., Sn under hash functions H1, H2, ..., Hm]
- With 5 hash functions, the estimate of the Jaccard similarity score is 2/5 for minwise hashes and 3/5 for 1-bit hashes; the true value is 1/4.
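
A hedged sketch of the 1-bit variant follows. It assumes that only the lowest bit of each minwise hash is kept and that, when the two minima differ, their low bits agree with probability 1/2, so the collision rate is roughly $(1 + J)/2$ and the estimator is $2n/m - 1$. The helpers `seeded_hash` and `one_bit_signature` are hypothetical names, and the seeded SHA-256 hash again stands in for a random permutation.

```python
import hashlib

def bigrams(s):
    """Set of unpadded character bigrams."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def seeded_hash(token, seed):
    # Assumption: a seeded cryptographic hash stands in for one random permutation.
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def one_bit_signature(tokens, m):
    """Lowest bit of each of the m minwise hashes, concatenated as a list of 0/1."""
    return [min(seeded_hash(t, seed) for t in tokens) & 1 for seed in range(m)]

def estimate_jaccard_1bit(sig_a, sig_b):
    """1-bit estimator 2 * n / m - 1, where n counts agreeing bit positions."""
    n = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return 2.0 * n / len(sig_a) - 1.0

sig_john = one_bit_signature(bigrams("john"), 2000)
sig_jon = one_bit_signature(bigrams("jon"), 2000)
est_j = estimate_jaccard_1bit(sig_john, sig_jon)
```

With 4 agreeing bits out of 5 the estimator gives 2(4/5) - 1 = 3/5, the figure quoted for the 1-bit hashes. Each signature costs one bit per hash function, at the price of a larger variance than full minwise hashes for the same m.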

9. String Anonymisation
- Simulation study: File A contains 300 names; File B is obtained by perturbing File A to simulate typographical errors.
- Names are tokenised into bigrams with leading and trailing underscores.
- True Jaccard scores are compared with estimated scores on all pairs in A x B.
- Bloom filter approaches show bias.
- Minwise hashing has smaller variance than concatenated 1-bit hashing but requires more storage.
- Concatenated 1-bit hashing has approximately the same MSE as the Bloom filter.
- Precision can be controlled through the choice of m, the number of hash functions.
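
The padded tokenisation can be sketched as below. This is a minimal example assuming a single underscore on each side of the name; the function name `padded_bigrams` is hypothetical.

```python
def padded_bigrams(name, pad="_"):
    """Bigrams of the name with one leading and one trailing pad character."""
    padded = f"{pad}{name}{pad}"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b)

john = padded_bigrams("john")   # {'_j', 'jo', 'oh', 'hn', 'n_'}
jon = padded_bigrams("jon")     # {'_j', 'jo', 'on', 'n_'}
score = jaccard(john, jon)      # 3 shared bigrams out of 6 distinct ones
```

Padding gives extra weight to the first and last characters: 'john' and 'jon' score 1/2 with padding compared with 1/4 on unpadded bigrams.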

10. Privacy Preserving Probabilistic Record Linkage
- Extend the Binomial EM algorithm to K categories, k = 1, ..., K, where each category is a grouping of similarity scores (Jaro for original values; Jaccard for anonymised values), e.g. 8 categories with (inclusive) upper bounds [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1].
- Element of the agreement vector: $\gamma^{j}_{q,k} = 1$ if variable q of pair j has a similarity score in category k, otherwise 0.
- A multinomial EM algorithm estimates the matching parameters $\hat{m}_{q,k}$, $\hat{u}_{q,k}$ and $\hat{p}$.
- Blocking: methods in the PPRL literature include canopy clustering (McCallum et al., 2000), which divides the pairs into overlapping subsets before classification; multibit tree structures to identify similar comparison vectors under the Bloom filter framework (Bachteler et al., 2013); and more.
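
The binning and the multinomial EM step can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: it assumes conditional independence between variables, a single match proportion p, a small smoothing constant `eps` to avoid zero probabilities, and starting values that favour high categories for matches. All function names are hypothetical.

```python
import bisect

UPPER_BOUNDS = [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1.0]

def category(score, bounds=UPPER_BOUNDS):
    """Index of the first bin whose inclusive upper bound is >= score."""
    return bisect.bisect_left(bounds, score)

def agreement_vector(score, bounds=UPPER_BOUNDS):
    """One-hot vector: 1 in the score's category, 0 elsewhere."""
    k = category(score, bounds)
    return [1 if i == k else 0 for i in range(len(bounds))]

def _normalise(row):
    total = sum(row)
    return [x / total for x in row]

def multinomial_em(pairs, n_cats, n_iter=50, p=0.1, eps=1e-6):
    """pairs[j][q] is the category index of variable q for record pair j.

    Returns the match proportion p, category probabilities m[q][k] (matches)
    and u[q][k] (non-matches), and the posterior match probability per pair.
    """
    n_pairs, n_vars = len(pairs), len(pairs[0])
    # Assumed starting values: matches favour high categories, non-matches low ones.
    m = [_normalise([k + 1.0 for k in range(n_cats)]) for _ in range(n_vars)]
    u = [_normalise([float(n_cats - k) for k in range(n_cats)]) for _ in range(n_vars)]
    g = [0.0] * n_pairs
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        for j, pair in enumerate(pairs):
            like_m, like_u = p, 1.0 - p
            for q, k in enumerate(pair):
                like_m *= m[q][k]
                like_u *= u[q][k]
            g[j] = like_m / (like_m + like_u)
        # M-step: re-estimate p, m and u from the posterior weights.
        sum_g = sum(g)
        p = sum_g / n_pairs
        for q in range(n_vars):
            for k in range(n_cats):
                g_k = sum(g_j for g_j, pair in zip(g, pairs) if pair[q] == k)
                n_k = sum(1 for pair in pairs if pair[q] == k)
                m[q][k] = (g_k + eps) / (sum_g + eps * n_cats)
                u[q][k] = (n_k - g_k + eps) / (n_pairs - sum_g + eps * n_cats)
    return p, m, u, g
```

In use, each similarity score is first mapped to a category index with `category`, and the resulting integer tuples are passed to `multinomial_em`. The posterior returned per pair plays the role of the probability of a correct match on which a single threshold is set.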

11. Experiment
- File A: 1000 records from a Census database with attached English names.
- File B: generated by perturbing File A under a probabilistic approach, including swapping, deleting and transposing characters, on the variables Gender, Year of Birth, Month of Birth and First Name.
- 4 perturbed datasets were produced, each at a different level of perturbation.
- A random sample of 700 records from File A and a random sample of 400 records from each perturbed file were used for matching.
- No blocking was carried out.

12. Experiment
PPPRL:
- Binary EM: standard EM approach based on exact matching of strings. No similarity score is used.
- LR weighted: takes the outputs of Binary EM and downweights the likelihood ratios.
- Log LR weighted: takes the outputs of Binary EM and downweights the log likelihood ratios.
- EM (8): multinomial EM approach with 8 bins having upper bounds [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1], using the Jaccard similarity score (with padded underscores on bigrams).
- EM (15): multinomial EM approach with 15 bins having upper bounds [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.925, 0.95, 0.975, 0.999, 1]. Otherwise as EM (8).
PRL:
- Jaro: multinomial EM approach with 8 bins using the Jaro string comparator.
- Jaro-Winkler: as Jaro, but with the Jaro-Winkler string comparator.

13. Experiment
- Correct links are identified and used to construct precision-recall plots.
- For any given threshold, the plots show the precision and recall, computed from the counts of true positives (tp), false positives (fp) and false negatives (fn), and can be used to compare approaches.
- Good approaches produce curves in the upper right of the plot.
- $\text{Precision} = \dfrac{tp}{tp + fp}$, $\text{Recall} = \dfrac{tp}{tp + fn}$
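
The construction of a precision-recall curve from scored pairs can be sketched as below. This is a minimal example, assuming each candidate pair has a matching score and a known true match status; the function name `precision_recall_curve` and the toy data are hypothetical.

```python
def precision_recall_curve(scores, is_match):
    """Return (threshold, precision, recall) for each distinct score.

    A pair is declared a link when its score is >= the threshold.
    """
    total_matches = sum(1 for y in is_match if y)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, is_match) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, is_match) if s >= threshold and not y)
        fn = total_matches - tp
        precision = tp / (tp + fp)  # tp + fp >= 1: the threshold is an observed score
        recall = tp / (tp + fn) if total_matches else 0.0
        points.append((threshold, precision, recall))
    return points

# Toy data: four candidate pairs with their scores and true match status.
curve = precision_recall_curve([0.9, 0.8, 0.4, 0.2], [True, False, True, False])
```

Lowering the threshold can only raise recall, while precision may move in either direction, which is why whole curves rather than single points are compared across approaches.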

14. Experiment
[Figure: precision-recall plots, panels "low perturbation" and "high perturbation"]
- All approaches perform better with a low level of perturbation.
- Binary EM without similarity scores performs the worst.
- Downweighting log likelihood ratios outperforms downweighting likelihood ratios.
- Multinomial EM outperforms Binary EM, with no clear difference between the 8-category and 15-category Jaccard score schemes.
- The Jaro schemes provide the best performance, although these are not privacy preserving.

15. Discussion
- PPPRL does not allow clerical review, so a single threshold is determined based on the posterior probability of a correct link.
- PPPRL can be tailored to different types of variables via the choice or design of the tokenisation scheme.
- So far we have dealt only with 1-to-1 matching.
- Multinomial EM offers improved classification over the unweighted and weighted binary EM schemes.
- Under a trusted Carole, the Jaro and Jaro-Winkler schemes outperformed the padded bigram tokenisation scheme used under PPPRL.

16. Thank you for your attention
More informationTopic: Duplicate Detection and Similarity Computing
Table of Content Topic: Duplicate Detection and Similarity Computing Motivation Shingling for duplicate comparison Minhashing LSH UCSB 290N, 2013 Tao Yang Some of slides are from text book [CMS] and Rajaraman/Ullman
More informationAssessing Deduplication and Data Linkage Quality: What to Measure?
Assessing Deduplication and Data Linkage Quality: What to Measure? http://datamining.anu.edu.au/linkage.html Peter Christen and Karl Goiser Department of Computer Science, Australian National University,
More informationCHAPTER 5 CLUSTERING USING MUST LINK AND CANNOT LINK ALGORITHM
82 CHAPTER 5 CLUSTERING USING MUST LINK AND CANNOT LINK ALGORITHM 5.1 INTRODUCTION In this phase, the prime attribute that is taken into consideration is the high dimensionality of the document space.
More informationSigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds
SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Erich Schubert, Michael Weiler, Hans-Peter Kriegel! Institute of Informatics Database Systems Group
More informationEvaluating Machine-Learning Methods. Goals for the lecture
Evaluating Machine-Learning Methods Mark Craven and David Page Computer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Some of the slides in these lectures have been adapted/borrowed from
More informationPrivacy-Preserving. Introduction to. Data Publishing. Concepts and Techniques. Benjamin C. M. Fung, Ke Wang, Chapman & Hall/CRC. S.
Chapman & Hall/CRC Data Mining and Knowledge Discovery Series Introduction to Privacy-Preserving Data Publishing Concepts and Techniques Benjamin C M Fung, Ke Wang, Ada Wai-Chee Fu, and Philip S Yu CRC
More informationStream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...
Data Ingestion ETL, Distcp, Kafka, OpenRefine, Query & Exploration SQL, Search, Cypher, Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...
More informationLarge Scale Data Analysis Using Deep Learning
Large Scale Data Analysis Using Deep Learning Machine Learning Basics - 1 U Kang Seoul National University U Kang 1 In This Lecture Overview of Machine Learning Capacity, overfitting, and underfitting
More information