Data Linkage Methods: Overview of Computer Science Research

1 Data Linkage Methods: Overview of Computer Science Research
Peter Christen
Research School of Computer Science, ANU College of Engineering and Computer Science, The Australian National University, Canberra, Australia
Contact:
January 2013 p. 1/24

2 Outline
- Recent interest in data linkage
- Applications and challenges
- The data linkage process
- Types of data linkage techniques
- Improving scalability: Indexing techniques
- Improving linkage quality: Learning techniques
- Synthetic data: Advantages and challenges
- Privacy-preserving record linkage
- Challenges and research directions

3 Recent interest in data linkage
- Traditionally, data linkage has been used in statistics (census) and health (epidemiology)
- In recent years, increased interest from businesses and governments
- Increased computing power and storage capacities
- Massive amounts of data are being collected
- Often data from different sources need to be integrated
- Need for data sharing between organisations
- Data mining (analysis) of large data collections
- E-Commerce and Web applications
- Geocode matching and spatial data analysis

4 Applications of data linkage
- Remove duplicates in one data set (internal linkage)
- Merge new records into a larger master data set
- Create patient or customer oriented statistics (for example for longitudinal studies)
- Clean and enrich data for analysis and mining
- Geocode matching (with reference address data)
- Widespread use of data linkage
  - Immigration, taxation, social security, census
  - Fraud, crime and terrorism intelligence
  - Business mailing lists, exchange of customer data
  - Biomedical and social science research

5 Data linkage challenges
- Often no unique entity identifiers are available
- Real world data are dirty (typographical errors and variations, missing and out-of-date values, different coding schemes, etc.)
- Scalability
  - Naïve comparison of all record pairs is quadratic
  - Blocking, searching, or filtering is needed (indexing)
- No training data in many linkage applications (no record pairs with known true match status)
- Privacy and confidentiality (because personal information, like names and addresses, is commonly required for linking)

6 The data linkage process
[Figure: flow diagram of the data linkage process. Boxes: Database A and Database B, each followed by data pre-processing; indexing / searching; comparison; classification into matches, non-matches, and potential matches; clerical review of potential matches; evaluation.]

7 Types of data linkage techniques
- Deterministic linkage
  - Exact linkage (if a unique identifier of high quality is available: precise, robust, stable over time). Examples: Social security or driver's licence numbers
  - Rule-based linkage (complex to build and maintain)
- Probabilistic linkage (Fellegi and Sunter, 1969)
  - Use available attributes for linkage (often personal information, like names, addresses, dates of birth, etc.)
  - Discussed in Bill's talk
- Computer science approaches (based on machine learning, data mining, database, or information retrieval techniques)

8 Improving scalability: Indexing
- Number of record pair comparisons equals the product of the sizes of the two databases (linking two databases containing 1 and 5 million records will result in 1,000,000 × 5,000,000 = 5,000,000,000,000 record pairs)
- Number of true matches is generally less than the number of records in the smaller of the two databases (assuming no duplicate records)
- Performance bottleneck is usually the (expensive) detailed comparison of attribute values between records (using approximate string comparison functions)
- Aim of indexing: Cheaply remove record pairs that are obviously not matches
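The arithmetic on this slide can be checked with a few lines of Python. This is only an illustration: the database sizes come from the slide, while `assumed_pairs_per_second` is a made-up throughput figure used to give a feel for the cost.

```python
# Database sizes taken from the slide.
size_a = 1_000_000
size_b = 5_000_000

# Naive linkage compares every record in A with every record in B.
naive_pairs = size_a * size_b

# With no duplicates inside either database, each record in the smaller
# database can match at most one record in the larger one.
max_true_matches = min(size_a, size_b)

# Hypothetical comparison rate, assumed purely for illustration.
assumed_pairs_per_second = 1_000_000
naive_days = naive_pairs / assumed_pairs_per_second / 86_400

print(f"naive comparisons:           {naive_pairs:,}")
print(f"upper bound on true matches: {max_true_matches:,}")
print(f"days at the assumed rate:    {naive_days:.1f}")
```

Almost all of the naive comparisons are between records that cannot match, which is why indexing aims to discard them cheaply before the detailed comparison step.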

9 Traditional blocking
- Traditional blocking works by only comparing record pairs that have the same value for a blocking variable (for example, only compare records that have the same postcode value)
- Problems with traditional blocking
  - An erroneous value in a blocking variable results in a record being inserted into the wrong block (several passes with different blocking variables can solve this)
  - Values of blocking variable should have uniform frequencies (as the most frequent values determine the size of the largest blocks)
  - Example: Frequency of Smith in NSW: 25,425
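A minimal sketch of traditional blocking, assuming records are plain dictionaries. The function names, the field names (`id`, `surname`, `postcode`), and the toy records are all hypothetical. It shows how a postcode typo hides a true match, and how a second pass on another blocking variable can recover it.

```python
from collections import defaultdict
from itertools import product

def build_blocks(records, blocking_key):
    """Group record ids by the value of the blocking variable."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[blocking_key]].append(rec["id"])
    return blocks

def candidate_pairs(records_a, records_b, blocking_key):
    """Only pair up records that share the same blocking value."""
    blocks_a = build_blocks(records_a, blocking_key)
    blocks_b = build_blocks(records_b, blocking_key)
    pairs = set()
    for value in blocks_a.keys() & blocks_b.keys():
        pairs.update(product(blocks_a[value], blocks_b[value]))
    return pairs

# Toy data (invented). b2 is meant to be the same person as a2,
# but its postcode contains a transposition error.
db_a = [
    {"id": "a1", "surname": "smith", "postcode": "2600"},
    {"id": "a2", "surname": "miller", "postcode": "2612"},
    {"id": "a3", "surname": "smith", "postcode": "2000"},
]
db_b = [
    {"id": "b1", "surname": "smith", "postcode": "2600"},
    {"id": "b2", "surname": "miller", "postcode": "2621"},
    {"id": "b3", "surname": "smith", "postcode": "2000"},
]

by_postcode = candidate_pairs(db_a, db_b, "postcode")  # misses (a2, b2)
by_surname = candidate_pairs(db_a, db_b, "surname")    # finds (a2, b2)
multi_pass = by_postcode | by_surname                  # union of both passes
```

The surname pass also shows the frequency problem in miniature: the common value `smith` produces more candidate pairs than any other block.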

10 Recent indexing approaches (1)
- Sorted neighbourhood approach
  - Sliding window over sorted databases
  - Use several passes with different blocking variables
- Q-gram based blocking (e.g. 2-grams / bigrams)
  - Convert values into q-gram lists, then generate sub-lists
    peter → [pe, et, te, er], [pe, et, te], [pe, et, er], ...
    pete → [pe, et, te], [pe, et], [pe, te], [et, te], ...
  - Each record will be inserted into several blocks
- Overlapping canopy clustering
  - Based on q-grams and a cheap similarity measure, such as Jaccard (set intersection) or TF-IDF/Cosine
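The q-gram sub-list idea can be sketched as follows. The function names and the `threshold` parameter (the fraction of q-grams a sub-list must keep, set to 0.8 here) are assumptions for illustration. With that value the sketch reproduces the sub-lists shown for `peter` and `pete`.

```python
from itertools import combinations

def qgrams(value, q=2):
    """Split a string into its overlapping q-grams, in order."""
    return [value[i:i + q] for i in range(len(value) - q + 1)]

def qgram_sublists(value, q=2, threshold=0.8):
    """Return the full q-gram list plus all shorter sub-lists down to
    a minimum length of max(1, int(len(list) * threshold))."""
    grams = qgrams(value, q)
    min_len = max(1, int(len(grams) * threshold))
    sublists = []
    for length in range(len(grams), min_len - 1, -1):
        sublists.extend(list(c) for c in combinations(grams, length))
    return sublists

def blocking_keys(value, q=2, threshold=0.8):
    """Concatenate each sub-list into one blocking key value."""
    return {"".join(sub) for sub in qgram_sublists(value, q, threshold)}

# Records whose key sets overlap end up in at least one common block.
shared = blocking_keys("peter") & blocking_keys("pete")
```

Because every record contributes several keys, a single typo no longer pushes it out of all blocks it should share with its true match. The cost is that the number of keys grows quickly for long values or low thresholds.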

11 Recent indexing approaches (2)
- StringMap based blocking
  - Map strings into a multi-dimensional space such that distances between pairs of strings are preserved
  - Use similarity join to find similar pairs (close strings)
- Suffix array based blocking
  - Generate suffix array based inverted index (suffix array: peter → eter, ter, er, r)
- Post-blocking filtering (for example, string length or q-grams count differences)
- US Census Bureau: BigMatch (pre-process smaller data set so its values can be directly accessed; with all blocking passes in one go)
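A rough sketch of a suffix-based inverted index, under the assumption that each record is represented by a single string value. The names `suffixes` and `build_suffix_index`, and the `min_length` and `max_block_size` parameters, are hypothetical choices made for this example.

```python
from collections import defaultdict

def suffixes(value, min_length=1):
    """All suffixes of value that are at least min_length characters long."""
    return [value[i:] for i in range(len(value) - min_length + 1)]

def build_suffix_index(records, min_length=3, max_block_size=None):
    """Map each suffix to the set of record ids whose value ends with it.

    records: dict of record id -> string value.
    max_block_size: optionally drop suffixes shared by too many records.
    """
    index = defaultdict(set)
    for rec_id, value in records.items():
        for suffix in suffixes(value, min_length):
            index[suffix].add(rec_id)
    if max_block_size is not None:
        return {s: ids for s, ids in index.items() if len(ids) <= max_block_size}
    return dict(index)

# Toy values, invented for illustration.
names = {"r1": "peter", "r2": "pieter", "r3": "petra"}
index = build_suffix_index(names, min_length=3)
```

Here `peter` and `pieter` share the suffixes `eter` and `ter` and so become a candidate pair, while `petra` shares no suffix of length three or more with either.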

12 Improving linkage quality: Learning techniques
- View record pair classification as a multidimensional binary classification problem (use numerical attribute similarities to classify record pairs as matches or non-matches)
- Many machine learning techniques can be used
  - Supervised: Decision trees, SVMs, neural networks, learnable string comparisons, active learning, etc.
  - Unsupervised: Various clustering algorithms
- Recently, collective classification techniques have been investigated (build graph of database and do overall classification, rather than classifying each record pair independently)
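To make the "similarity vector in, match or non-match out" view concrete, here is a small sketch. It uses a nearest-centroid rule, which is a deliberately simple stand-in and not one of the techniques named on the slide. The similarity function is `difflib.SequenceMatcher` from the Python standard library, and the training vectors, attribute names, and records are invented.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Approximate string similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def comparison_vector(rec_a, rec_b, attributes):
    """One numerical similarity per compared attribute."""
    return [similarity(rec_a[attr], rec_b[attr]) for attr in attributes]

def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def train(labelled_vectors):
    """labelled_vectors: list of (vector, is_match) pairs."""
    matches = [v for v, is_match in labelled_vectors if is_match]
    non_matches = [v for v, is_match in labelled_vectors if not is_match]
    return centroid(matches), centroid(non_matches)

def classify(vector, model):
    """True if the vector is closer to the match centroid."""
    match_centroid, non_match_centroid = model
    return squared_distance(vector, match_centroid) < squared_distance(vector, non_match_centroid)

# Hypothetical training data: similarity vectors with known match status.
training = [
    ([1.0, 0.9, 1.0], True),
    ([0.8, 1.0, 0.9], True),
    ([0.2, 0.1, 0.0], False),
    ([0.4, 0.0, 0.3], False),
]
model = train(training)

attributes = ["given", "surname", "postcode"]
rec_a = {"given": "peter", "surname": "christen", "postcode": "2600"}
rec_b = {"given": "pete", "surname": "christen", "postcode": "2600"}
is_match = classify(comparison_vector(rec_a, rec_b, attributes), model)
```

Any of the supervised learners listed on the slide could replace `train` and `classify` without changing how the comparison vectors are built. The hard part in practice is obtaining the labelled vectors.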

13 Collective linkage example
[Figure: graph connecting papers (Paper 1 to Paper 6), authors (Dave White, Don White, Susan Grey, John Black, Joe Brown, Liz Pink), and affiliations (Intel, CMU, MIT), with some edges labelled with unknown weights w1 to w4.]
Author records: (A1, Dave White, Intel) (A2, Don White, CMU) (A3, Susan Grey, MIT) (A4, John Black, MIT) (A5, Joe Brown, unknown) (A6, Liz Pink, unknown)
Paper records: (P1, John Black / Don White) (P2, Sue Grey / D. White) (P3, Dave White) (P4, Don White / Joe Brown) (P5, Joe Brown / Liz Pink) (P6, Liz Pink / D. White)
Adapted from Kalashnikov and Mehrotra, ACM TODS, 31(2), 2006

14 Managing transitive closure
[Figure: chain of records a1, a2, a3, a4]
- If record a1 is classified as matching with record a2, and record a2 as matching with record a3, then records a1 and a3 must also be matching.
- Possibility of record chains occurring
- Various algorithms have been developed to find optimal solutions (special clustering algorithms)
- Collective classification deals with this problem by default
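The simplest way to enforce transitivity is to take connected components of the match graph, sketched below with a union-find structure. This is only the naive closure, not one of the specialised clustering algorithms the slide refers to, and the function names are hypothetical.

```python
def find(parent, x):
    """Follow parent links to the cluster representative (with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def transitive_closure(matched_pairs):
    """Merge pairwise matches into clusters of records that are all
    connected through some chain of matches."""
    parent = {}
    for a, b in matched_pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a
    clusters = {}
    for rec in parent:
        clusters.setdefault(find(parent, rec), set()).add(rec)
    return sorted(sorted(cluster) for cluster in clusters.values())

# a1-a2 and a2-a3 were classified as matches, so a1 and a3 end up together.
clusters = transitive_closure([("a1", "a2"), ("a2", "a3"), ("a5", "a6")])
```

The weakness of this naive closure is exactly the record chain problem: one wrong pairwise match is enough to merge two otherwise unrelated clusters.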

15 Classification challenges
- In many cases there is no training data available
  - Possible to use results of earlier linkage projects?
  - Or from manual clerical review process?
  - How confident can we be about correct manual classification of potential matches?
- Often there is no gold standard available (no data sets with known true match status)
- No large test data set collection available (like in information retrieval or machine learning)
- Much computer science research in data linkage is using bibliographic data (which have very different characteristics)

16 Synthetic data: Advantages
- Privacy issues prohibit publication of real personal information
- De-identified or encrypted data cannot be used to link databases (as real name and address values are required)
- Several advantages of synthetic data
  - Volume and characteristics can be controlled (errors and variations in records, number of duplicates, etc.)
  - It is known which records are duplicates of each other, and so linkage quality can be calculated
  - Data and the data generator program can be published (allowing others to repeat experiments)

17 Synthetic data: Challenges
- Modelling the content and characteristics of real data (frequencies of values; variations and errors)
- Modelling dependencies between attributes (for example, given names often depend on gender)
- Several data generators have been developed
  - Hernandez and Stolfo (mid 1990s): Only based on value tables, no frequencies, simple typographic errors
  - Bertolazzi et al. (2003): Added frequency tables, allowed missing values, still simple error generation
  - Christen et al. (2009): Added look-up tables with misspellings and nicknames, model different error types (typing, phonetic, OCR), attribute dependencies, generate households

18 Modelling of variations and errors
[Figure: diagram of data entry paths leading to an electronic document. Nodes: Printed, Handwritten, Memory, Dictate, Typed, OCR, Speech recognition, Electronic document. Edge labels include: cc (ty): sub, ins, del, trans; cc (ph): sub, ins, del; cc (ocr): sub, ins, del; wc: split, merge; attr: swap, repl.]
Abbreviations: cc: character change; wc: word change; sub: substitution; ins: insertion; del: deletion; trans: transpose; repl: replace; ty: typographic; ph: phonetic; attr: attribute
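The four character changes named in the abbreviations (sub, ins, del, trans) can be sketched as below. This is a uniform-random toy, not the error model of any published generator: real generators weight positions and replacement characters by error type (typing, phonetic, OCR). All function names are hypothetical, and values are assumed to have at least two characters.

```python
import random
import string

def substitute(value, rng):
    """Replace one character with a different lowercase letter."""
    pos = rng.randrange(len(value))
    choices = [c for c in string.ascii_lowercase if c != value[pos]]
    return value[:pos] + rng.choice(choices) + value[pos + 1:]

def insert(value, rng):
    """Insert one random lowercase letter at a random position."""
    pos = rng.randrange(len(value) + 1)
    return value[:pos] + rng.choice(string.ascii_lowercase) + value[pos:]

def delete(value, rng):
    """Remove one character."""
    pos = rng.randrange(len(value))
    return value[:pos] + value[pos + 1:]

def transpose(value, rng):
    """Swap two adjacent characters (needs at least two characters)."""
    pos = rng.randrange(len(value) - 1)
    return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]

CHARACTER_CHANGES = {"sub": substitute, "ins": insert, "del": delete, "trans": transpose}

def corrupt(value, rng, operations=("sub", "ins", "del", "trans")):
    """Apply one randomly chosen character change; return (operation, result)."""
    op = rng.choice(operations)
    return op, CHARACTER_CHANGES[op](value, rng)

rng = random.Random(2013)  # fixed seed so a generated data set is repeatable
duplicates = [corrupt("christen", rng) for _ in range(5)]
```

Because the generator records which operation produced each duplicate, the true match status of every record pair is known, which is the main advantage of synthetic data mentioned earlier.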

19 Privacy-preserving record linkage
[Figure: protocol diagrams showing numbered message steps (1) to (3) exchanged between Alice and Bob, and in the three-party variant also Carol.]
- Assume two data sources, and possibly a third (trusted) party to conduct the linkage
- Objective: No party learns about the other parties' private data, only linked records are revealed
- Various approaches with different assumptions about threats, what can be inferred by parties, and what is being released
- Based on some form of encoding or encryption techniques

20 Privacy technologies
- Secure hash encoding (peter → 4R#x+Y4i!e@t4o])
- Generalisation techniques (replace specific values with more general ones, using for example k-anonymity)
- Secure multi-party computation (calculate a function such that parties only learn the final result)
- Differential privacy (statistical database queries)
- Bloom filters (e.g. bit-strings for set membership testing)
- Reference values (e.g. public telephone directory)
- Phonetic encoding (such as Soundex, NYSIIS, etc.)
- Add extra random values (hide sensitive real values)
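As one example from this list, a Bloom filter encoding of name bigrams can be sketched as follows. Everything here is an assumption made for illustration: the filter length (64 bits), the number of hash functions (2), the keyed hashing via HMAC-SHA256, the placeholder secret, and the use of the Dice coefficient to compare filters. The filter is represented as the set of bit positions that are set. This toy has not been reviewed for security.

```python
import hashlib
import hmac

def bigrams(value):
    """Set of overlapping 2-character substrings."""
    return {value[i:i + 2] for i in range(len(value) - 1)}

def bloom_encode(value, num_bits=64, num_hashes=2, secret=b"shared-secret"):
    """Hash every bigram num_hashes times into positions 0..num_bits-1.

    Returns the set of positions whose bit is 1. The secret key is assumed
    to be known to the data owners but not to the linkage party.
    """
    positions = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            message = bytes([k]) + gram.encode("utf-8")
            digest = hmac.new(secret, message, hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % num_bits)
    return positions

def dice(bits_a, bits_b):
    """Dice coefficient of two bit-position sets, in [0, 1]."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))

enc_peter = bloom_encode("peter")
enc_pete = bloom_encode("pete")
sim = dice(enc_peter, enc_pete)
```

Unlike a plain hash of the whole value, where one typo changes the entire code, similar strings share most bigrams and therefore most set bits, so approximate matching remains possible on the encoded values.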

21 The state of PPRL
- Significant advances towards achieving the goal of PPRL have been made in recent years
- Various approaches based on different techniques
- Can link records securely, approximately, and in a (somewhat) scalable fashion
- So far, most PPRL techniques have concentrated on approximate matching techniques, and on making PPRL more scalable to large databases
- However, no large-scale comparative evaluations of PPRL techniques have been published
- Only limited investigation of classification and linking assessment in PPRL

22 Challenges and research directions
- Improved classification for data linkage (collective classification for personal data)
- Linking data from many sources
- Use cloud computing platforms for large-scale data linkage (cost efficiency)
- Real-time linking and linking dynamic data
- Develop and implement frameworks for data linkage that allow comparative studies of different techniques (benchmarks)
- Develop practical PPRL techniques (that facilitate accurate and automatic classification, as well as evaluation of linkage quality and completeness)

23 Definition: Privacy-preserving record linkage
- Assume $O_1, \ldots, O_d$ are the $d$ owners of their respective databases $D_1, \ldots, D_d$
- They wish to determine which of their records $r_1^i \in D_1$, $r_2^j \in D_2$, $\ldots$, and $r_d^k \in D_d$ match according to a decision model $C(r_1^i, r_2^j, \ldots, r_d^k)$ that classifies pairs (or groups) of records into one of the two classes $M$ of matches and $U$ of non-matches
- $O_1, \ldots, O_d$ do not wish to reveal their actual records $r_1^i, \ldots, r_d^k$ to any other party (they are however prepared to disclose to a (maybe external) selected party the actual values of some attributes of the record pairs that are in class $M$ to allow further analysis)

24 Secure multi-party computation
- Compute a function across several parties, such that no party learns the information from the other parties, but all receive the final results [Yao 1982; Goldreich 1998/2002]
- Simple example: Secure summation $s = \sum_i x_i$
  - Party 1: x1 = 55; Party 2: x2 = 73; Party 3: x3 = 42
  - Step 0: Z = 999
  - Step 1: Z + x1 = 1054
  - Step 2: (Z + x1) + x2 = 1127
  - Step 3: ((Z + x1) + x2) + x3 = 1169
  - Step 4: s = 1169 - Z = 170
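The secure summation example can be simulated in a single process as below. This is a sketch only: a real protocol would pass each running total over the network between parties. The function name, the `modulus` parameter (added so the mask can be drawn from a fixed range), and the use of `secrets.randbelow` for the default mask are assumptions that are not on the slide.

```python
import secrets

def secure_sum(private_values, modulus=10**6, mask=None):
    """Simulate ring-based secure summation.

    The first party adds a secret random mask Z to its value, each further
    party adds its own value to the running total it receives, and the
    first party finally subtracts Z. Returns (total, messages), where
    messages are the running totals passed between parties.
    """
    if mask is None:
        mask = secrets.randbelow(modulus)  # step 0: first party picks Z
    running = mask
    messages = []
    for x in private_values:               # steps 1..n: add own value, pass on
        running = (running + x) % modulus
        messages.append(running)
    total = (running - mask) % modulus     # last step: first party removes Z
    return total, messages

# Values from the slide, with the mask fixed to Z = 999 to reproduce it.
total, messages = secure_sum([55, 73, 42], mask=999)
```

Each party only ever sees a masked running total, so Party 2 cannot recover x1 from the value 1054 without knowing Z. The scheme assumes that parties do not collude: Party 1 and Party 3 together could work out x2 from the totals before and after Party 2.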


More information

Craig Knoblock University of Southern California. These slides are based in part on slides from Sheila Tejada and Misha Bilenko

Craig Knoblock University of Southern California. These slides are based in part on slides from Sheila Tejada and Misha Bilenko Record Linkage Craig Knoblock University of Southern California These slides are based in part on slides from Sheila Tejada and Misha Bilenko Craig Knoblock University of Southern California 1 Record Linkage

More information

Introduction to blocking techniques and traditional record linkage

Introduction to blocking techniques and traditional record linkage Introduction to blocking techniques and traditional record linkage Brenda Betancourt Duke University Department of Statistical Science bb222@stat.duke.edu May 2, 2018 1 / 32 Blocking: Motivation Naively

More information

Entity Resolution with Heavy Indexing

Entity Resolution with Heavy Indexing Entity Resolution with Heavy Indexing Csaba István Sidló Data Mining and Web Search Group, Informatics Laboratory Institute for Computer Science and Control, Hungarian Academy of Sciences sidlo@ilab.sztaki.hu

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/25/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 3 In many data mining

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

Adaptive Temporal Entity Resolution on Dynamic Databases

Adaptive Temporal Entity Resolution on Dynamic Databases Adaptive Temporal Entity Resolution on Dynamic Databases Peter Christen 1 and Ross W. Gayler 2 1 Research School of Computer Science, The Australian National University, Canberra ACT 0200, Australia peter.christen@anu.edu.au

More information

Data Migration Platform

Data Migration Platform appmigrate TM Data Migration Platform QUALITY MANAGEMENT RECONCILIATION MAINTENANCE MIGRATION PROFILING COEXISTENCE CUTOVER PLANNING AND EXECUTION ? Problem Data migration done in the traditional way,

More information

Secure Multiparty Computation

Secure Multiparty Computation CS573 Data Privacy and Security Secure Multiparty Computation Problem and security definitions Li Xiong Outline Cryptographic primitives Symmetric Encryption Public Key Encryption Secure Multiparty Computation

More information

Overview of Datavant's De-Identification and Linking Technology for Structured Data

Overview of Datavant's De-Identification and Linking Technology for Structured Data Overview of Datavant's De-Identification and Linking Technology for Structured Data Introduction Datavant is firmly committed to advancing healthcare through data analytics while protecting patients privacy.

More information

High Performance Computing and Data Mining

High Performance Computing and Data Mining High Performance Computing and Data Mining Performance Issues in Data Mining Peter Christen Peter.Christen@anu.edu.au Data Mining Group Department of Computer Science, FEIT Australian National University,

More information

Pre-Requisites: CS2510. NU Core Designations: AD

Pre-Requisites: CS2510. NU Core Designations: AD DS4100: Data Collection, Integration and Analysis Teaches how to collect data from multiple sources and integrate them into consistent data sets. Explains how to use semi-automated and automated classification

More information

Outline. Eg. 1: DBLP. Motivation. Eg. 2: ACM DL Portal. Eg. 2: DBLP. Digital Libraries (DL) often have many errors that negatively affect:

Outline. Eg. 1: DBLP. Motivation. Eg. 2: ACM DL Portal. Eg. 2: DBLP. Digital Libraries (DL) often have many errors that negatively affect: Outline Effective and Scalable Solutions for Mixed and Split Citation Problems in Digital Libraries Dongwon Lee, Byung-Won On Penn State University, USA Jaewoo Kang North Carolina State University, USA

More information

Maritime Union of Australia. Privacy Policy 2014

Maritime Union of Australia. Privacy Policy 2014 Maritime Union of Australia Privacy Policy 2014 Introduction The Maritime Union of Australia (Union) is the Union representing persons employed in diving, ferries, offshore oil and gas, port services,

More information

A Machine Learning Approach to Privacy-Preserving Data Mining Using Homomorphic Encryption

A Machine Learning Approach to Privacy-Preserving Data Mining Using Homomorphic Encryption A Machine Learning Approach to Privacy-Preserving Data Mining Using Homomorphic Encryption Seiichi Ozawa Center for Mathematical Data Science Graduate School of Engineering Kobe University 2 What is PPDM?

More information

Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...

Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,... Data Ingestion ETL, Distcp, Kafka, OpenRefine, Query & Exploration SQL, Search, Cypher, Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...

More information

A Scalable Blocking Framework for Multidatabase Privacy-preserving Record Linkage Thilina Ranbaduge

A Scalable Blocking Framework for Multidatabase Privacy-preserving Record Linkage Thilina Ranbaduge A Scalable Blocking Framework for Multidatabase Privacy-preserving Record Linkage Thilina Ranbaduge A thesis submitted for the degree of Doctor of Philosophy in Computer Science The Australian National

More information

A Review on Unsupervised Record Deduplication in Data Warehouse Environment

A Review on Unsupervised Record Deduplication in Data Warehouse Environment A Review on Unsupervised Record Deduplication in Data Warehouse Environment Keshav Unde, Dhiraj Wadhawa, Bhushan Badave, Aditya Shete, Vaishali Wangikar School of Computer Engineering, MIT Academy of Engineering,

More information

Encrypting PHI for HIPAA Compliance on IBM i. All trademarks and registered trademarks are the property of their respective owners.

Encrypting PHI for HIPAA Compliance on IBM i. All trademarks and registered trademarks are the property of their respective owners. Encrypting PHI for HIPAA Compliance on IBM i HelpSystems LLC. All rights reserved. All trademarks and registered trademarks are the property of their respective owners. Introductions Bob Luebbe, CISSP

More information

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Database and Knowledge-Base Systems: Data Mining. Martin Ester Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 3/6/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 In many data mining

More information

Emerging Measures in Preserving Privacy for Publishing The Data

Emerging Measures in Preserving Privacy for Publishing The Data Emerging Measures in Preserving Privacy for Publishing The Data K.SIVARAMAN 1 Assistant Professor, Dept. of Computer Science, BIST, Bharath University, Chennai -600073 1 ABSTRACT: The information in the

More information

Maximizing Statistical Interactions Part II: Database Issues Provided by: The Biostatistics Collaboration Center (BCC) at Northwestern University

Maximizing Statistical Interactions Part II: Database Issues Provided by: The Biostatistics Collaboration Center (BCC) at Northwestern University Maximizing Statistical Interactions Part II: Database Issues Provided by: The Biostatistics Collaboration Center (BCC) at Northwestern University While your data tables or spreadsheets may look good to

More information

Web Mining Evolution & Comparative Study with Data Mining

Web Mining Evolution & Comparative Study with Data Mining Web Mining Evolution & Comparative Study with Data Mining Anu, Assistant Professor (Resource Person) University Institute of Engineering and Technology Mahrishi Dayanand University Rohtak-124001, India

More information

Data Mining. Jeff M. Phillips. January 9, 2013

Data Mining. Jeff M. Phillips. January 9, 2013 Data Mining Jeff M. Phillips January 9, 2013 Data Mining What is Data Mining? Finding structure in data? Machine learning on large data? Unsupervised learning? Large scale computational statistics? Data

More information

Automated Information Retrieval System Using Correlation Based Multi- Document Summarization Method

Automated Information Retrieval System Using Correlation Based Multi- Document Summarization Method Automated Information Retrieval System Using Correlation Based Multi- Document Summarization Method Dr.K.P.Kaliyamurthie HOD, Department of CSE, Bharath University, Tamilnadu, India ABSTRACT: Automated

More information

Entity Resolution and Master Data Life Cycle Management in the Era of Big Data

Entity Resolution and Master Data Life Cycle Management in the Era of Big Data ABSTRACT Paper 3920-2015 and Master Data Life Cycle Management in the Era of Big Data John R. Talburt, University of Arkansas at Little Rock & Black Oak Analytics, Inc. Proper management of master data

More information

Partition Based Perturbation for Privacy Preserving Distributed Data Mining

Partition Based Perturbation for Privacy Preserving Distributed Data Mining BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 17, No 2 Sofia 2017 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.1515/cait-2017-0015 Partition Based Perturbation

More information

Privacy Preserving Group Linkage

Privacy Preserving Group Linkage Privacy Preserving Group Linkage Fengjun Li 1, Yuxin Chen 1, Bo Luo 1, Dongwon Lee 2, and Peng Liu 2 1 EECS Department, University of Kansas, 2 College of IST, Penn State University 1 Record Linkage Record

More information

Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...

Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,... Data Ingestion ETL, Distcp, Kafka, OpenRefine, Query & Exploration SQL, Search, Cypher, Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...

More information

Probabilistic Scoring Methods to Assist Entity Resolution Systems Using Boolean Rules

Probabilistic Scoring Methods to Assist Entity Resolution Systems Using Boolean Rules Probabilistic Scoring Methods to Assist Entity Resolution Systems Using Boolean Rules Fumiko Kobayashi, John R Talburt Department of Information Science University of Arkansas at Little Rock 2801 South

More information

Indrajit Roy, Srinath T.V. Setty, Ann Kilzer, Vitaly Shmatikov, Emmett Witchel The University of Texas at Austin

Indrajit Roy, Srinath T.V. Setty, Ann Kilzer, Vitaly Shmatikov, Emmett Witchel The University of Texas at Austin Airavat: Security and Privacy for MapReduce Indrajit Roy, Srinath T.V. Setty, Ann Kilzer, Vitaly Shmatikov, Emmett Witchel The University of Texas at Austin Computing in the year 201X 2 Data Illusion of

More information

Data Mining. Jeff M. Phillips. January 8, 2014

Data Mining. Jeff M. Phillips. January 8, 2014 Data Mining Jeff M. Phillips January 8, 2014 Data Mining What is Data Mining? Finding structure in data? Machine learning on large data? Unsupervised learning? Large scale computational statistics? Data

More information

Data Mining: Models and Methods

Data Mining: Models and Methods Data Mining: Models and Methods Author, Kirill Goltsman A White Paper July 2017 --------------------------------------------------- www.datascience.foundation Copyright 2016-2017 What is Data Mining? Data

More information

Entity Resolution over Graphs

Entity Resolution over Graphs Entity Resolution over Graphs Bingxin Li Supervisor: Dr. Qing Wang Australian National University Semester 1, 2014 Acknowledgements I would take this opportunity to thank my supervisor, Dr. Qing Wang,

More information

PPKM: Preserving Privacy in Knowledge Management

PPKM: Preserving Privacy in Knowledge Management PPKM: Preserving Privacy in Knowledge Management N. Maheswari (Corresponding Author) P.G. Department of Computer Science Kongu Arts and Science College, Erode-638-107, Tamil Nadu, India E-mail: mahii_14@yahoo.com

More information

Unsupervised Detection of Duplicates in User Query Results using Blocking

Unsupervised Detection of Duplicates in User Query Results using Blocking Unsupervised Detection of Duplicates in User Query Results using Blocking Dr. B. Vijaya Babu, Professor 1, K. Jyotsna Santoshi 2 1 Professor, Dept. of Computer Science, K L University, Guntur Dt., AP,

More information

Privacy-Preserving Using Data mining Technique in Cloud Computing

Privacy-Preserving Using Data mining Technique in Cloud Computing Cis-601 Graduate Seminar Privacy-Preserving Using Data mining Technique in Cloud Computing Submitted by: Rajan Sharma CSU ID: 2659829 Outline Introduction Related work Preliminaries Association Rule Mining

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 21 Table of contents 1 Introduction 2 Data mining

More information

Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix

Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix Carlos Ordonez, Yiqun Zhang Department of Computer Science, University of Houston, USA Abstract. We study the serial and parallel

More information

Techniques for Large Scale Data Linking in SAS. By Damien John Melksham

Techniques for Large Scale Data Linking in SAS. By Damien John Melksham Techniques for Large Scale Data Linking in SAS By Damien John Melksham What is Data Linking? Called everything imaginable: Data linking, record linkage, mergepurge, entity resolution, deduplication, fuzzy

More information