Data Linkage Methods: Overview of Computer Science Research
1 Data Linkage Methods: Overview of Computer Science Research Peter Christen Research School of Computer Science, ANU College of Engineering and Computer Science, The Australian National University, Canberra, Australia Contact: January 2013 p. 1/24
2 Outline
- Recent interest in data linkage
- Applications and challenges
- The data linkage process
- Types of data linkage techniques
- Improving scalability: Indexing techniques
- Improving linkage quality: Learning techniques
- Synthetic data: Advantages and challenges
- Privacy-preserving record linkage
- Challenges and research directions
3 Recent interest in data linkage
- Traditionally, data linkage has been used in statistics (census) and health (epidemiology)
- In recent years, increased interest from businesses and governments
- Increased computing power and storage capacities
- Massive amounts of data are being collected
- Often data from different sources need to be integrated
- Need for data sharing between organisations
- Data mining (analysis) of large data collections
- E-Commerce and Web applications
- Geocode matching and spatial data analysis
4 Applications of data linkage
- Remove duplicates in one data set (internal linkage)
- Merge new records into a larger master data set
- Create patient or customer oriented statistics (for example for longitudinal studies)
- Clean and enrich data for analysis and mining
- Geocode matching (with reference address data)
- Widespread use of data linkage
  - Immigration, taxation, social security, census
  - Fraud, crime and terrorism intelligence
  - Business mailing lists, exchange of customer data
  - Biomedical and social science research
5 Data linkage challenges
- Often no unique entity identifiers are available
- Real world data are dirty (typographical errors and variations, missing and out-of-date values, different coding schemes, etc.)
- Scalability
  - Naïve comparison of all record pairs is quadratic
  - Blocking, searching, or filtering is needed (indexing)
- No training data in many linkage applications (no record pairs with known true match status)
- Privacy and confidentiality (because personal information, like names and addresses, is commonly required for linking)
6 The data linkage process
[Figure: flow diagram of the data linkage process. Components: Database A and Database B, each followed by data pre-processing; indexing / searching; comparison; classification into matches, non-matches, and potential matches; clerical review; evaluation.]
7 Types of data linkage techniques
- Deterministic linkage
  - Exact linkage (if a unique identifier of high quality is available: precise, robust, stable over time)
  - Examples: Social security or driver's licence numbers
  - Rule-based linkage (complex to build and maintain)
- Probabilistic linkage (Fellegi and Sunter, 1969)
  - Use available attributes for linkage (often personal information, like names, addresses, dates of birth, etc.)
  - Discussed in Bill's talk
- Computer science approaches (based on machine learning, data mining, database, or information retrieval techniques)
8 Improving scalability: Indexing
- The number of record pair comparisons equals the product of the sizes of the two databases (linking two databases containing 1 and 5 million records results in 1,000,000 × 5,000,000 = 5,000,000,000,000 record pairs)
- The number of true matches is generally less than the number of records in the smaller of the two databases (assuming no duplicate records)
- The performance bottleneck is usually the (expensive) detailed comparison of attribute values between records (using approximate string comparison functions)
- Aim of indexing: Cheaply remove record pairs that are obviously not matches
9 Traditional blocking
- Traditional blocking works by only comparing record pairs that have the same value for a blocking variable (for example, only compare records that have the same postcode value)
- Problems with traditional blocking
  - An erroneous value in a blocking variable results in a record being inserted into the wrong block (several passes with different blocking variables can solve this)
  - Values of the blocking variable should have uniform frequencies (as the most frequent values determine the size of the largest blocks)
  - Example: Frequency of Smith in NSW: 25,425
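The blocking idea on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in any particular linkage system: the record layout, the identifiers, and the function names `build_blocks` and `candidate_pairs` are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def build_blocks(records, key):
    """Group record identifiers by the value of the blocking variable."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[rec[key]].append(rec_id)
    return blocks


def candidate_pairs(db_a, db_b, key):
    """Yield only those (a, b) pairs whose blocking values are identical."""
    blocks_a = build_blocks(db_a, key)
    blocks_b = build_blocks(db_b, key)
    for value in blocks_a.keys() & blocks_b.keys():
        yield from product(blocks_a[value], blocks_b[value])


# Hypothetical toy data; "b2" carries an erroneous postcode.
db_a = {
    "a1": {"surname": "smith", "postcode": "2600"},
    "a2": {"surname": "miller", "postcode": "2611"},
    "a3": {"surname": "smyth", "postcode": "2600"},
}
db_b = {
    "b1": {"surname": "smith", "postcode": "2600"},
    "b2": {"surname": "miller", "postcode": "2617"},
}

pairs = sorted(candidate_pairs(db_a, db_b, "postcode"))
full_comparison_count = len(db_a) * len(db_b)
```

On this toy data, blocking on postcode keeps two of the six possible pairs, and the pair ("a2", "b2") is lost because of the erroneous postcode. This is the first problem listed on the slide, and the reason several passes with different blocking variables are used.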
10 Recent indexing approaches (1)
- Sorted neighbourhood approach
  - Sliding window over sorted databases
  - Use several passes with different blocking variables
- Q-gram based blocking (e.g. 2-grams / bigrams)
  - Convert values into q-gram lists, then generate sub-lists
  - peter -> [ pe, et, te, er ], [ pe, et, te ], [ pe, et, er ], ...
  - pete -> [ pe, et, te ], [ pe, et ], [ pe, te ], [ et, te ], ...
  - Each record will be inserted into several blocks
- Overlapping canopy clustering
  - Based on q-grams and a cheap similarity measure, such as Jaccard (set intersection) or TF-IDF/Cosine
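The q-gram sub-list generation and the Jaccard measure mentioned on this slide can be sketched as follows. The function names and the `threshold` parameter (which sets the shortest sub-list that is kept) are assumptions made for illustration; the slide itself does not say how the minimum sub-list length is chosen.

```python
from itertools import combinations


def qgrams(value, q=2):
    """Return the list of overlapping q-grams of a string."""
    return [value[i:i + q] for i in range(len(value) - q + 1)]


def qgram_block_keys(value, q=2, threshold=0.8):
    """Generate blocking keys from the q-gram list and its sub-lists.

    Sub-lists keep the original q-gram order; the shortest sub-list has
    max(1, int(len(grams) * threshold)) q-grams (an assumed rule).
    """
    grams = qgrams(value, q)
    min_len = max(1, int(len(grams) * threshold))
    keys = set()
    for size in range(len(grams), min_len - 1, -1):
        for sub_list in combinations(grams, size):
            keys.add("".join(sub_list))
    return keys


def jaccard(value_a, value_b, q=2):
    """Jaccard similarity of the q-gram sets of two strings."""
    set_a, set_b = set(qgrams(value_a, q)), set(qgrams(value_b, q))
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


keys_peter = qgram_block_keys("peter")
keys_pete = qgram_block_keys("pete")
shared_keys = keys_peter & keys_pete
```

With these settings, "peter" and "pete" share the key built from [ pe, et, te ], so both records land in at least one common block even though the values differ.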
11 Recent indexing approaches (2)
- StringMap based blocking
  - Map strings into a multi-dimensional space such that distances between pairs of strings are preserved
  - Use similarity join to find similar pairs (close strings)
- Suffix array based blocking
  - Generate suffix array based inverted index (suffix array: peter -> eter, ter, er, r)
- Post-blocking filtering (for example, string length or q-gram count differences)
- US Census Bureau: BigMatch (pre-process smaller data set so its values can be directly accessed; with all blocking passes in one go)
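A minimal sketch of the suffix-based inverted index, assuming each record is represented by a single blocking key string. The names `suffixes` and `build_suffix_index`, the `min_len` parameter, and the toy records are hypothetical, and refinements such as discarding overly large blocks are left out.

```python
from collections import defaultdict


def suffixes(value, min_len=1):
    """Return the value itself and all its suffixes of at least min_len characters."""
    return [value[i:] for i in range(len(value) - min_len + 1)]


def build_suffix_index(records, min_len=3):
    """Build an inverted index that maps each suffix to the records containing it."""
    index = defaultdict(set)
    for rec_id, value in records.items():
        for suffix in suffixes(value, min_len):
            index[suffix].add(rec_id)
    return index


# Hypothetical blocking key values.
records = {"r1": "peter", "r2": "pieter", "r3": "paul"}
index = build_suffix_index(records, min_len=3)
```

Records "r1" and "r2" end up together in the blocks for the suffixes "eter" and "ter", so the prefix variation between "peter" and "pieter" does not separate them.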
12 Improving linkage quality: Learning techniques
- View record pair classification as a multidimensional binary classification problem (use numerical attribute similarities to classify record pairs as matches or non-matches)
- Many machine learning techniques can be used
  - Supervised: Decision trees, SVMs, neural networks, learnable string comparisons, active learning, etc.
  - Unsupervised: Various clustering algorithms
- Recently, collective classification techniques have been investigated (build graph of database and do overall classification, rather than each record pair independently)
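To make the notion of numerical attribute similarities concrete, the sketch below builds a comparison vector for a record pair and classifies it with two fixed thresholds. This is a deliberately simple stand-in, not one of the learning techniques named on the slide: the attribute names, the threshold values, and the use of `difflib.SequenceMatcher` as the approximate string comparison are all assumptions. A supervised classifier would take the same vectors as its input features.

```python
from difflib import SequenceMatcher

# Hypothetical attribute names used for comparison.
ATTRIBUTES = ("given_name", "surname", "postcode")


def comparison_vector(rec_a, rec_b, attributes=ATTRIBUTES):
    """Return one similarity value in [0, 1] per compared attribute."""
    return [
        SequenceMatcher(None, rec_a[attr], rec_b[attr]).ratio()
        for attr in attributes
    ]


def classify(vector, lower=1.5, upper=2.4):
    """Classify a comparison vector by its summed similarity (assumed thresholds)."""
    total = sum(vector)
    if total >= upper:
        return "match"
    if total <= lower:
        return "non-match"
    return "potential match"


rec_1 = {"given_name": "peter", "surname": "smith", "postcode": "2600"}
rec_2 = {"given_name": "anna", "surname": "lopez", "postcode": "3111"}

same_vector = comparison_vector(rec_1, rec_1)
different_vector = comparison_vector(rec_1, rec_2)
```

Pairs that fall between the two thresholds are the potential matches that would go to clerical review in the linkage process.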
13 Collective linkage example
[Figure: graph connecting papers (Paper 1 to Paper 6), author names (Dave White, Don White, Susan Grey, John Black, Joe Brown, Liz Pink), and affiliations (Intel, CMU, MIT); edges with unknown weights (w1=? to w4=?) are attached to Dave White and Don White.]
Authors: (A1, Dave White, Intel) (A2, Don White, CMU) (A3, Susan Grey, MIT) (A4, John Black, MIT) (A5, Joe Brown, unknown) (A6, Liz Pink, unknown)
Papers: (P1, John Black / Don White) (P2, Sue Grey / D. White) (P3, Dave White) (P4, Don White / Joe Brown) (P5, Joe Brown / Liz Pink) (P6, Liz Pink / D. White)
Adapted from Kalashnikov and Mehrotra, ACM TODS, 31(2), 2006
14 Managing transitive closure
[Figure: chain of records a1, a2, a3, a4]
- If record a1 is classified as matching with record a2, and record a2 as matching with record a3, then records a1 and a3 must also be matching
- Possibility of record chains occurring
- Various algorithms have been developed to find optimal solutions (special clustering algorithms)
- Collective classification deals with this problem by default
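The simplest way to enforce transitivity is to merge all records connected by pairwise matches into one group, which a union-find structure does efficiently. The sketch below shows only this baseline, with a hypothetical function name; it is not one of the special clustering algorithms referred to on the slide, and it does nothing to break up long record chains.

```python
def transitive_closure(pairs):
    """Group records into clusters connected by pairwise match decisions."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Union step: merge the clusters of every matched pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each cluster under its root.
    clusters = {}
    for rec in list(parent):
        clusters.setdefault(find(rec), set()).add(rec)
    return sorted(sorted(cluster) for cluster in clusters.values())


matches = [("a1", "a2"), ("a2", "a3"), ("a5", "a6")]
clusters = transitive_closure(matches)
```

Here a1 and a3 end up in the same cluster although they were never classified as a match directly. A chain a1, a2, a3, a4 would be merged in the same way, even if a1 and a4 are quite dissimilar, which is why more careful clustering is often needed.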
15 Classification challenges
- In many cases there is no training data available
  - Possible to use results of earlier linkage projects?
  - Or from manual clerical review process?
  - How confident can we be about correct manual classification of potential matches?
- Often there is no gold standard available (no data sets with known true match status)
- No large test data set collection available (like in information retrieval or machine learning)
- Much computer science research in data linkage uses bibliographic data (which have very different characteristics)
16 Synthetic data: Advantages
- Privacy issues prohibit publication of real personal information
- De-identified or encrypted data cannot be used to link databases (as real name and address values are required)
- Several advantages of synthetic data
  - Volume and characteristics can be controlled (errors and variations in records, number of duplicates, etc.)
  - It is known which records are duplicates of each other, and so linkage quality can be calculated
  - Data and the data generator program can be published (allowing others to repeat experiments)
17 Synthetic data: Challenges
- Modelling the content and characteristics of real data (frequencies of values; variations and errors)
- Modelling dependencies between attributes (for example, given names often depend on gender)
- Several data generators have been developed
  - Hernandez and Stolfo (mid 1990s): Only based on value tables, no frequencies, simple typographic errors
  - Bertolazzi et al. (2003): Added frequency tables, allowed missing values, still simple error generation
  - Christen et al. (2009): Added look-up tables with misspellings and nicknames, model different error types (typing, phonetic, OCR), attribute dependencies, generate households
18 Modelling of variations and errors
[Figure: diagram with the nodes Printed, Handwritten, Memory, Dictate, Typed, OCR, Speech recognition, and Electronic document; the connections are labelled with the error types they can introduce, for example "cc (ty) sub, ins, del, trans", "cc (ocr) sub, ins, del", "wc split, merge", and "attr swap, repl".]
Abbreviations:
- cc: character change
- wc: word change
- sub: substitution
- ins: insertion
- del: deletion
- trans: transpose
- repl: replace
- ty: typographic
- ph: phonetic
- attr: attribute
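The four character-level changes in the abbreviation list (sub, ins, del, trans) can be sketched as small string functions. This is an illustration only: it is not the published data generator, the function names are hypothetical, and errors are picked uniformly at random rather than from typographic, phonetic, or OCR error models.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def substitute(value, pos, char):
    """Replace the character at pos."""
    return value[:pos] + char + value[pos + 1:]


def insert(value, pos, char):
    """Insert a character before pos."""
    return value[:pos] + char + value[pos:]


def delete(value, pos):
    """Remove the character at pos."""
    return value[:pos] + value[pos + 1:]


def transpose(value, pos):
    """Swap the characters at pos and pos + 1."""
    return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]


def corrupt(value, rng):
    """Apply one randomly chosen character change (uniform choice, an assumption)."""
    if len(value) < 2:
        return value
    pos = rng.randrange(len(value) - 1)  # keeps pos + 1 valid for transpose
    operation = rng.choice(["sub", "ins", "del", "trans"])
    if operation == "sub":
        other_chars = [c for c in ALPHABET if c != value[pos]]
        return substitute(value, pos, rng.choice(other_chars))
    if operation == "ins":
        return insert(value, pos, rng.choice(ALPHABET))
    if operation == "del":
        return delete(value, pos)
    return transpose(value, pos)


rng = random.Random(42)  # fixed seed so the generated data set is repeatable
corrupted = corrupt("peter", rng)
```

Because the original value and its corrupted copy are both known, the true match status of every generated record pair is known as well, which is the main advantage of synthetic data noted earlier.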
19 Privacy-preserving record linkage
[Figure: two protocol diagrams with numbered message steps; one between Alice and Bob (steps 1 and 2), and one between Alice, Bob, and a third party Carol (steps 1 to 3).]
- Assume two data sources, and possibly a third (trusted) party to conduct the linkage
- Objective: No party learns about the other parties' private data, only linked records are revealed
- Various approaches with different assumptions about threats, what can be inferred by parties, and what is being released
- Based on some form of encoding or encryption techniques
20 Privacy technologies
- Secure hash encoding ( peter -> 4R#x+Y4i!e@t4o] )
- Generalisation techniques (replace specific values with more general ones, using for example k-anonymity)
- Secure multi-party computation (calculate a function such that parties only learn the final result)
- Differential privacy (statistical database queries)
- Bloom filters (e.g. bit-strings for set membership testing)
- Reference values (e.g. public telephone directory)
- Phonetic encoding (such as Soundex, NYSIIS, etc.)
- Add extra random values (hide sensitive real values)
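As an illustration of the Bloom filter entry, the sketch below hashes the bigrams of a string into a bit-string with a keyed hash and compares two bit-strings with the Dice coefficient. The parameters (64 bits, two hash functions), the function names, and the shared secret key are assumptions chosen for readability; they are far too small for real use, and the sketch makes no claim about resistance to attacks.

```python
import hashlib
import hmac


def bigrams(value):
    """Return the set of 2-grams of a string."""
    return {value[i:i + 2] for i in range(len(value) - 1)}


def bloom_encode(value, secret, num_bits=64, num_hashes=2):
    """Encode the bigrams of value into a bit-string stored as an int."""
    bits = 0
    for gram in bigrams(value):
        for k in range(num_hashes):
            message = f"{k}|{gram}".encode("utf-8")
            digest = hmac.new(secret, message, hashlib.sha256).digest()
            position = int.from_bytes(digest[:8], "big") % num_bits
            bits |= 1 << position
    return bits


def dice(bits_a, bits_b):
    """Dice coefficient of two bit-strings: 2 * common / (ones_a + ones_b)."""
    ones_a = bin(bits_a).count("1")
    ones_b = bin(bits_b).count("1")
    if ones_a + ones_b == 0:
        return 0.0
    return 2 * bin(bits_a & bits_b).count("1") / (ones_a + ones_b)


secret = b"key shared by the data owners"  # hypothetical shared secret
bf_peter = bloom_encode("peter", secret)
bf_pete = bloom_encode("pete", secret)
similarity = dice(bf_peter, bf_pete)
```

Since every bigram of "pete" also occurs in "peter", all bits set for "pete" are also set for "peter", so the two encodings remain similar without either party exchanging the plain-text names. This is what makes approximate matching on encoded values possible.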
21 The state of PPRL
- Significant advances towards achieving the goal of PPRL have been made in recent years
  - Various approaches based on different techniques
  - Can link records securely, approximately, and in a (somewhat) scalable fashion
- So far, most PPRL techniques have concentrated on approximate matching techniques, and on making PPRL more scalable to large databases
- However, no large-scale comparative evaluations of PPRL techniques have been published
- Only limited investigation of classification and linkage assessment in PPRL
22 Challenges and research directions
- Improved classification for data linkage (collective classification for personal data)
- Linking data from many sources
- Use cloud computing platforms for large-scale data linkage (cost efficiency)
- Real-time linking and linking dynamic data
- Develop and implement frameworks for data linkage that allow comparative studies of different techniques (benchmarks)
- Develop practical PPRL techniques (that facilitate accurate and automatic classification, as well as evaluation of linkage quality and completeness)
23 Definition: Privacy-preserving record linkage
- Assume $O_1, \ldots, O_d$ are the $d$ owners of their respective databases $D_1, \ldots, D_d$
- They wish to determine which of their records $r_1^i \in D_1$, $r_2^j \in D_2$, $\ldots$, and $r_d^k \in D_d$ match according to a decision model $C(r_1^i, r_2^j, \ldots, r_d^k)$ that classifies pairs (or groups) of records into one of the two classes $M$ of matches and $U$ of non-matches
- $O_1, \ldots, O_d$ do not wish to reveal their actual records $r_1^i, \ldots, r_d^k$ to any other party (they are, however, prepared to disclose to a selected, possibly external, party the actual values of some attributes of the record pairs that are in class $M$ to allow further analysis)
24 Secure multi-party computation
- Compute a function across several parties, such that no party learns the information from the other parties, but all receive the final results [Yao 1982; Goldreich 1998/2002]
- Simple example: Secure summation $s = \sum_i x_i$
  - Private values: Party 1 holds x1 = 55, Party 2 holds x2 = 73, Party 3 holds x3 = 42
  - Step 0: Z = 999
  - Step 1: Z + x1 = 1054
  - Step 2: (Z + x1) + x2 = 1127
  - Step 3: ((Z + x1) + x2) + x3 = 1169
  - Step 4: s = 1169 - Z = 170
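The secure summation example can be replayed in code. The sketch below is a single-process simulation under stated assumptions: the function name `secure_sum` is hypothetical, the modulus is an addition not shown on the slide (it keeps the masked running totals in a fixed range), and in practice the mask Z would be a secret random number chosen by the first party rather than a constant.

```python
def secure_sum(private_values, mask, modulus=10**6):
    """Simulate the ring-based secure summation protocol.

    Returns the masked running totals passed from party to party and the
    final sum recovered by the party that chose the mask.
    """
    messages = []
    running = mask % modulus  # Step 0: the first party starts from its secret mask Z
    for value in private_values:
        # Each party adds its private value to the masked total it received.
        running = (running + value) % modulus
        messages.append(running)
    # Final step: the first party removes the mask to obtain the true sum.
    total = (running - mask) % modulus
    return messages, total


messages, total = secure_sum([55, 73, 42], mask=999)
```

Each party only ever sees a masked running total (1054, 1127, 1169), never another party's individual value. The sketch assumes the true sum is smaller than the modulus and that the parties do not collude.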
More informationStream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...
Data Ingestion ETL, Distcp, Kafka, OpenRefine, Query & Exploration SQL, Search, Cypher, Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...
More informationA Scalable Blocking Framework for Multidatabase Privacy-preserving Record Linkage Thilina Ranbaduge
A Scalable Blocking Framework for Multidatabase Privacy-preserving Record Linkage Thilina Ranbaduge A thesis submitted for the degree of Doctor of Philosophy in Computer Science The Australian National
More informationA Review on Unsupervised Record Deduplication in Data Warehouse Environment
A Review on Unsupervised Record Deduplication in Data Warehouse Environment Keshav Unde, Dhiraj Wadhawa, Bhushan Badave, Aditya Shete, Vaishali Wangikar School of Computer Engineering, MIT Academy of Engineering,
More informationEncrypting PHI for HIPAA Compliance on IBM i. All trademarks and registered trademarks are the property of their respective owners.
Encrypting PHI for HIPAA Compliance on IBM i HelpSystems LLC. All rights reserved. All trademarks and registered trademarks are the property of their respective owners. Introductions Bob Luebbe, CISSP
More informationDatabase and Knowledge-Base Systems: Data Mining. Martin Ester
Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro
More informationPart I: Data Mining Foundations
Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 3/6/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 In many data mining
More informationEmerging Measures in Preserving Privacy for Publishing The Data
Emerging Measures in Preserving Privacy for Publishing The Data K.SIVARAMAN 1 Assistant Professor, Dept. of Computer Science, BIST, Bharath University, Chennai -600073 1 ABSTRACT: The information in the
More informationMaximizing Statistical Interactions Part II: Database Issues Provided by: The Biostatistics Collaboration Center (BCC) at Northwestern University
Maximizing Statistical Interactions Part II: Database Issues Provided by: The Biostatistics Collaboration Center (BCC) at Northwestern University While your data tables or spreadsheets may look good to
More informationWeb Mining Evolution & Comparative Study with Data Mining
Web Mining Evolution & Comparative Study with Data Mining Anu, Assistant Professor (Resource Person) University Institute of Engineering and Technology Mahrishi Dayanand University Rohtak-124001, India
More informationData Mining. Jeff M. Phillips. January 9, 2013
Data Mining Jeff M. Phillips January 9, 2013 Data Mining What is Data Mining? Finding structure in data? Machine learning on large data? Unsupervised learning? Large scale computational statistics? Data
More informationAutomated Information Retrieval System Using Correlation Based Multi- Document Summarization Method
Automated Information Retrieval System Using Correlation Based Multi- Document Summarization Method Dr.K.P.Kaliyamurthie HOD, Department of CSE, Bharath University, Tamilnadu, India ABSTRACT: Automated
More informationEntity Resolution and Master Data Life Cycle Management in the Era of Big Data
ABSTRACT Paper 3920-2015 and Master Data Life Cycle Management in the Era of Big Data John R. Talburt, University of Arkansas at Little Rock & Black Oak Analytics, Inc. Proper management of master data
More informationPartition Based Perturbation for Privacy Preserving Distributed Data Mining
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 17, No 2 Sofia 2017 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.1515/cait-2017-0015 Partition Based Perturbation
More informationPrivacy Preserving Group Linkage
Privacy Preserving Group Linkage Fengjun Li 1, Yuxin Chen 1, Bo Luo 1, Dongwon Lee 2, and Peng Liu 2 1 EECS Department, University of Kansas, 2 College of IST, Penn State University 1 Record Linkage Record
More informationStream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...
Data Ingestion ETL, Distcp, Kafka, OpenRefine, Query & Exploration SQL, Search, Cypher, Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...
More informationProbabilistic Scoring Methods to Assist Entity Resolution Systems Using Boolean Rules
Probabilistic Scoring Methods to Assist Entity Resolution Systems Using Boolean Rules Fumiko Kobayashi, John R Talburt Department of Information Science University of Arkansas at Little Rock 2801 South
More informationIndrajit Roy, Srinath T.V. Setty, Ann Kilzer, Vitaly Shmatikov, Emmett Witchel The University of Texas at Austin
Airavat: Security and Privacy for MapReduce Indrajit Roy, Srinath T.V. Setty, Ann Kilzer, Vitaly Shmatikov, Emmett Witchel The University of Texas at Austin Computing in the year 201X 2 Data Illusion of
More informationData Mining. Jeff M. Phillips. January 8, 2014
Data Mining Jeff M. Phillips January 8, 2014 Data Mining What is Data Mining? Finding structure in data? Machine learning on large data? Unsupervised learning? Large scale computational statistics? Data
More informationData Mining: Models and Methods
Data Mining: Models and Methods Author, Kirill Goltsman A White Paper July 2017 --------------------------------------------------- www.datascience.foundation Copyright 2016-2017 What is Data Mining? Data
More informationEntity Resolution over Graphs
Entity Resolution over Graphs Bingxin Li Supervisor: Dr. Qing Wang Australian National University Semester 1, 2014 Acknowledgements I would take this opportunity to thank my supervisor, Dr. Qing Wang,
More informationPPKM: Preserving Privacy in Knowledge Management
PPKM: Preserving Privacy in Knowledge Management N. Maheswari (Corresponding Author) P.G. Department of Computer Science Kongu Arts and Science College, Erode-638-107, Tamil Nadu, India E-mail: mahii_14@yahoo.com
More informationUnsupervised Detection of Duplicates in User Query Results using Blocking
Unsupervised Detection of Duplicates in User Query Results using Blocking Dr. B. Vijaya Babu, Professor 1, K. Jyotsna Santoshi 2 1 Professor, Dept. of Computer Science, K L University, Guntur Dt., AP,
More informationPrivacy-Preserving Using Data mining Technique in Cloud Computing
Cis-601 Graduate Seminar Privacy-Preserving Using Data mining Technique in Cloud Computing Submitted by: Rajan Sharma CSU ID: 2659829 Outline Introduction Related work Preliminaries Association Rule Mining
More informationData Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395
Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 21 Table of contents 1 Introduction 2 Data mining
More informationTime Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix
Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix Carlos Ordonez, Yiqun Zhang Department of Computer Science, University of Houston, USA Abstract. We study the serial and parallel
More informationTechniques for Large Scale Data Linking in SAS. By Damien John Melksham
Techniques for Large Scale Data Linking in SAS By Damien John Melksham What is Data Linking? Called everything imaginable: Data linking, record linkage, mergepurge, entity resolution, deduplication, fuzzy
More information