Comparison of Online Record Linkage Techniques

Similar documents
Rule-Based Method for Entity Resolution Using Optimized Root Discovery (ORD)

INTELLIGENT SUPERMARKET USING APRIORI

Correlation Based Feature Selection with Irrelevant Feature Removal

Record Linkage using Probabilistic Methods and Data Mining Techniques

An Overview of various methodologies used in Data set Preparation for Data mining Analysis

Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering

Sampling Selection Strategy for Large Scale Deduplication for Web Data Search

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

R. R. Badre Associate Professor Department of Computer Engineering MIT Academy of Engineering, Pune, Maharashtra, India

ISSN: (Online) Volume 2, Issue 7, July 2014 International Journal of Advance Research in Computer Science and Management Studies

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM

Partition Based Perturbation for Privacy Preserving Distributed Data Mining

Deduplication of Hospital Data using Genetic Programming

Preparation of Data Set for Data Mining Analysis using Horizontal Aggregation in SQL

International Journal of Advanced Research in Computer Science and Software Engineering

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

An Ameliorated Methodology to Eliminate Redundancy in Databases Using SQL

WEB DATA EXTRACTION METHOD BASED ON FEATURED TERNARY TREE

A Framework for Securing Databases from Intrusion Threats

Optimization of Query Processing in XML Document Using Association and Path Based Indexing

Collaborative Framework for Testing Web Application Vulnerabilities Using STOWS

Supporting Fuzzy Keyword Search in Databases

Efficient Record De-Duplication Identifying Using Febrl Framework

Comprehensive and Progressive Duplicate Entities Detection

Best Keyword Cover Search

A Review on Unsupervised Record Deduplication in Data Warehouse Environment

Learning Blocking Schemes for Record Linkage

Mining User - Aware Rare Sequential Topic Pattern in Document Streams

XML Clustering by Bit Vector

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

FREQUENT ITEMSET MINING USING PFP-GROWTH VIA SMART SPLITTING

ETP-Mine: An Efficient Method for Mining Transitional Patterns

Density Based Clustering using Modified PSO based Neighbor Selection

Closest Keywords Search on Spatial Databases

Frequent Pattern Mining On Un-rooted Unordered Tree Using FRESTM

QUERY RECOMMENDATION SYSTEM USING USERS QUERYING BEHAVIOR

Item Set Extraction of Mining Association Rule

A Methodology in Mobile Networks for Global Roaming

EXTRACTION AND ALIGNMENT OF DATA FROM WEB PAGES

Ontology-based Integration and Refinement of Evaluation-Committee Data from Heterogeneous Data Sources

Efficient Entity Matching over Multiple Data Sources with MapReduce

Modelling Structures in Data Mining Techniques

Reduce convention for Large Data Base Using Mathematical Progression

PREDICTION OF POPULAR SMARTPHONE COMPANIES IN THE SOCIETY

A Survey on Keyword Diversification Over XML Data

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management

Concept-Based Document Similarity Based on Suffix Tree Document

Study of Personalized Ontology Model for Web Information Gathering

International Jmynal of Intellectual Advancements and Research in Engineering Computations

PTclose: A novel algorithm for generation of closed frequent itemsets from dense and sparse datasets

A Hybrid Algorithm Using Apriori Growth and Fp-Split Tree For Web Usage Mining

Data Extraction and Alignment in Web Databases

Closed Pattern Mining from n-ary Relations

Chapter 1, Introduction

Text Document Clustering Using DPM with Concept and Feature Analysis

Novel Hybrid k-d-apriori Algorithm for Web Usage Mining

ISSN Vol.05,Issue.09, September-2017, Pages:

Dynamic Load Balancing of Unstructured Computations in Decision Tree Classifiers

Improving effectiveness in Large Scale Data by concentrating Deduplication

A Transaction Processing Technique in Real-Time Object- Oriented Databases

Accumulative Privacy Preserving Data Mining Using Gaussian Noise Data Perturbation at Multi Level Trust

Generating Cross level Rules: An automated approach

Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure

An Efficient Technique for Tag Extraction and Content Retrieval from Web Pages

Data Mining in the Context of Entity Resolution

A Hybrid Unsupervised Web Data Extraction using Trinity and NLP

EFFECTIVE INTRUSION DETECTION AND REDUCING SECURITY RISKS IN VIRTUAL NETWORKS (EDSV)

Comparative studyon Partition Based Clustering Algorithms

REMOVAL OF REDUNDANT AND IRRELEVANT DATA FROM TRAINING DATASETS USING SPEEDY FEATURE SELECTION METHOD

A Modified Hierarchical Clustering Algorithm for Document Clustering

Knowledge discovery from XML Database

Entity Resolution with Heavy Indexing

A Review on Mining Top-K High Utility Itemsets without Generating Candidates

Keywords Data alignment, Data annotation, Web database, Search Result Record

RECORD DEDUPLICATION USING GENETIC PROGRAMMING APPROACH

Learning Graph Grammars

Improving Classifier Performance by Imputing Missing Values using Discretization Method

NETWORK SECURITY PROVISION BY MEANS OF ACCESS CONTROL LIST

International Journal of Advance Engineering and Research Development. Survey of Web Usage Mining Techniques for Web-based Recommendations

A STUDY OF SOME DATA MINING CLASSIFICATION TECHNIQUES

COMPUSOFT, An international journal of advanced computer technology, 4 (6), June-2015 (Volume-IV, Issue-VI)

SCENARIO BASED ADAPTIVE PREPROCESSING FOR STREAM DATA USING SVM CLASSIFIER

Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels

Image Similarity Measurements Using Hmok- Simrank

(Big Data Integration) : :

Distributed Bottom up Approach for Data Anonymization using MapReduce framework on Cloud

A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods

ISSN: (Online) Volume 4, Issue 1, January 2016 International Journal of Advance Research in Computer Science and Management Studies

FUFM-High Utility Itemsets in Transactional Database

Web Data mining-a Research area in Web usage mining

Detection and Deletion of Outliers from Large Datasets

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey

A KNOWLEDGE BASED ONLINE RECORD MATCHING OVER QUERY RESULTS FROM MULTIPLE WEB DATABASE

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

Elimination Of Redundant Data using user Centric Data in Delay Tolerant Network

KEYWORDS: Clustering, RFPCM Algorithm, Ranking Method, Query Redirection Method.

DESIGN OF HYBRID PARALLEL PREFIX ADDERS

Research Article QOS Based Web Service Ranking Using Fuzzy C-means Clusters

Answering XML Query Using Tree Based Association Rule

Implementing Synchronous Counter using Data Mining Techniques

Transcription:

International Research Journal of Engineering and Technology (IRJET) e-issn: 2395-0056 Volume: 02 Issue: 09 Dec-2015 p-issn: 2395-0072 www.irjet.net Comparison of Online Record Linkage Techniques Ms. SRUTHI. S Assistant Professor, Amaljyothi college of engineering, Kerala, India ---------------------------------------------------------------------***--------------------------------------------------------------------- Abstract Record linkage refers to the act of linking records from different data sources. In order to accomplish this goal, one must resolve several types of schema integration problems since the records are contained in different databases. Statistical record linkage techniques could be used for solving this problem. However, the use of such techniques for online record linkage could result a huge communication overhead in a distributed environment where entity heterogeneity problems are often encountered. In order to resolve this issue, a matching tree is developed, which is similar to a decision tree. Using this matching tree we could obtain results that are guaranteed to be the same as those obtained using the conventional linkage techniques. Using this technique, communication overhead while linking records can be reduced significantly. In this paper three such record linkage techniques which work based on the matching tree are described.this work compares the techniques based on communication overhead and system parameters. The communication overhead can be calculated as the percentage of size of the remote database. System parameters like system memory, system speed, load at the remote site etc are considered for monitoring the performance of the techniques. KeyWords Record linkage, matching tree, decision tree, entity heterogeneity, sequential partitioning, and concurrent partitioning. 1. INTRODUCTION Data mining also known as data or knowledge discovery is the process of analyzing data from different perspectives and summarizing it into useful information. 2015, IRJET Data mining software is one of a number of analytical tools for analyzing data. It helps users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, it is the process of finding correlations or patterns among dozens of fields in large relational databases. Record linkage is the process of quickly and accurately identifying records that belongs to the same entity across a number of heterogeneous databases. It is also known as duplicate record detection[3], data cleaning, entity reconciliation, deduplication (when applied to a single database) etc.this process is one of the major initial steps in many data mining applications. Record linkage techniques have been widely used in real-world situations such as health management systems, census where all the records are available locally. However, when the matching records reside at a remote site, existing techniques cannot be directly applied because they would involve transferring the entire remote relation, thereby incurring a huge communication overhead and entity heterogeneity problems [5],[9]. Usually the current researches [6] use statistical record linkage methods [8]. The simplest kind of record linkage, called deterministic or rule-based record linkage, generates links based on the number of individual identifiers that match among the available data sets. Two records are said to match via a deterministic record linkage method if all or some identifiers are identical. Deterministic record linkage is a good option when the entities in the data sets can be identified by a common identifier. Probabilistic record linkage, also called fuzzy matching, takes into account a wider range of potential identifiers, computing weights for each identifier based on its estimated ability to correctly identify a match or a nonmatch, and using these weights to calculate the probability that two given records refer to the same entity. Record pairs with probabilities above a certain threshold are considered to be matches, while pairs with probabilities below another threshold are considered to be nonmatches and pairs that fall between these two thresholds are considered to be possible matches. Whereas deterministic record linkage requires a series of potentially complex rules to be programmed ahead of time, probabilistic record linkage methods can be ISO 9001:2008 Certified Journal Page 2236

"trained" to perform well with much less human effort. In order to improve the accuracy of record linkage, techniques such as blocking [2], filtering [4],indexing [10]etc also can be used. 2. AN OVERVIEW OF RECORD LINKAGE existing system covers all the methods based on the decision tree[7], and this paper does a comparison or an analysis of all the techniques based on the system parameters.the parameters include load at the remote system, speed and memory of the client systems. 3. SEQUENTIAL RECORD LINKAGE The main aim of these techniques is to reduce the number of candidate record pair comparisons to a feasible number, at the same time accuracy must be maintained. This is because as the number of record pair comparison reduces, the communication overhead can be reduced. Fig-1. linkage The over all process of tree based record In recent years, the need to collect information contained in heterogeneous databases has been documented. In order to do this, we need to resolve the entity heterogeneity problems. The statistical record linkage techniques can be used in situations like census, hospital management systems etc where all the records which are to be compared reside at the local system itself. But when it comes online, the statistical methods could pose a large amount of overhead, since the number of records to be compared will be very large and also because of the heterogeneity problems. In the existing system, a matching tree technique is used instead of the statistical techniques. The matching tree is very similar to that of a decision tree. Using this method, existing system could reduce the communication overhead significantly. The results obtained using the matching tree techniques are guaranteed to be that of the results which are obtained using the traditional technique. Thus while retaining accuracy of the record linkage; this method could reduce the communication overhead to a large extent. The As an initial step, the matching tree is created offline using a trained dataset and the record linkage procedure is done online when required. It means that in the case of tree based techniques, there will be a decision tree created offline for each client system. The main benefit of this approach is that the tree can be precomputed and stored. So computational overhead at the time of querying can be avoided. Sequential record linkage uses the technique of sequential acquisition of information. It means that there will be a sequence in which the next best attribute is decided. It is completely dependent on the previously acquired attributes. The matching tree actually gives the order in which the attributes can be matched while comparing records. The records at the remote site are partitioned in accordance with the attribute acquisition order, specified in the matching tree. This could help to reduce the number of candidate pairs to a greater extent. 4. MATCHING TREE GENERATION The main benefit of the sequential linkage is that not all the attributes are brought to the local site. This is the major difference in traditional record linkage and tree based record linkage. In this method attributes are brought one at a time to the local site. 2015, IRJET ISO 9001:2008 Certified Journal Page 2237

After acquiring an attribute, a decision is made whether or not to acquire more attribute, based on the matching probability. The attributes are acquired till the matching probability can not be revised sufficiently. 5. TREE BASED LINKAGE TECHNIQUES The records at the remote site can be partitioned in two ways: 1. Sequential partitioning 2. Concurrent Partitioning In the case of sequential partitioning the set of remote records is partitioned recursively, till the desired partition of all the relevant records are obtained. This recursive partitioning can be done in one of two ways: 1) By transferring the attributes of the remote records and comparing them at the local site, or 2) By sending a local attribute value, comparing it with the values of the remote records, and then transferring the identifiers of those remote records that match on the attribute value. Fig-2. The matching tree The generation of the tree is based on the trained data set. The tree creation is one of the major steps involved in this process. It can be described as follows: First select the trained data set. After a discrete process select the attribute which will be having the highest probability for being a true match from the set. Make it as the root node. Now split the remaining attributes by terms of the selected attributes. When the selected attribute is acquired, find out the attribute with the highest probability to be acquired. Also find out the attribute with the least probability to be true when the root node is acquired. They will be the left and right children of the root node respectively. Continue this process until all leaf nodes are generated. The figure 2 indicates a matching tree, with 1 indicates a match (true) and 0 indicates a mismatch (false). In the concurrent partitioning method, the tree is used to resolve a database query that selects the relevant remote records directly, in one single step. Hence, there is no need for identifier transfer. Once the relevant records are identified, all their attribute values are transferred to the local site. A. Sequential Attribute Acquisition (SAA) In this scheme as mentioned before, attributes are acquired from the remote site in a sequential fashion. Working with the tree in figure 2, we first acquire attribute y3 for all the remote records in R, where R is the remote system. When y3 value is checked to that of the local enquiry record, either there would be a match or mismatch. If it is a mismatch acquire attribute y2. In the case of a match acquire attribute y7. Now the sequential records are actually partitioned into two sets, the one which matches the attribute y3, and the other which does not match y3. The partitioning is continued till the entire local enquiry records are acquired, based on the tree. 2015, IRJET ISO 9001:2008 Certified Journal Page 2238

The communication overhead of the SAA technique is composed of three elements. 1) Transfer of attribute values from remote to the local site 2) the transfer of all the identifiers between the remote and the local sites, and 3) the transfer of those records that have a matching probability. B. Sequential Identifier Acquisition (SIA) This technique is similar to SAA, but the difference is that the comparison is done at the remote site unlike SAA. Here one attribute value of the enquiry record is sent to the remote site and comparison is done over there. Again the partition at the remote site is done recursively; with the order of acquisition of attributes based on the tree. In each step the identifiers of the partitioned records are sent to the local site. Here also communication overhead consists of three elements; identifier overhead, attribute overhead, and record overhead. C. Concurrent Attribute Acquisition (CAA) The main drawback of the sequential schemes (SAA and SIA) is the back and forth transfers of attributes between sites. The resulting communication overhead is high, especially when there are a large number of remote records. In this approach, the matching tree developed earlier is used to formulate a database query which is posed to the remote site to acquire only the relevant records. Although such a query would usually be quite long and complex, conceptually it is easily constructed by using the matching tree and can be generated automatically. Here the matching tree sent to the remote site when a query for a record is made, and matching is done at the remote site itself. So the records which match the query can be sent back to the local system in a single step. No identifier transfer is required in this technique. 6. COMPARISON RESULTS The communication overhead involved in each technique [1] is compared at various system speed and memory. The speed and memory variation creates effects in the three techniques in equal manner, but the major factor deciding the performance was the load at the remote site. Among the three techniques the performance of the Concurrent Attribute Acquisition is very efficient when the number of remote records is large. Let n be the number of remote records, then the communication complexity increases the SAA is used. This is because as the load at the remote site increases the back and forth transfer of attributes also increases. Because this technique does the comparison at the local site. But the performance of SIA is far better when compared to that of SAA. This is because SIA compares records at the remote site itself, the attribute overhead can be reduced. When n is small, the SIA performs the best. Because the computational overhead in constructing the tree will be greater than that of the overhead in the transfer of identifiers in SIA. So in such cases CAA does not show a good performance. Usually SIA is faster than SAA and for large remote databases CAA is the most efficient technique. CONCLUSIONS In this work the comparison of online record linkage is done based on the load at the remote site. Even though speed and memory were also considered, it is found that load at the remote system is the important factor in deciding the performance of linkage techniques. These results will be helpful to make further modifications in the techniques to improve the efficiency. 2015, IRJET ISO 9001:2008 Certified Journal Page 2239

REFERENCES [1] Debabrata Dey, Vijay S. Mookerjee, and Dengpan Liu, Efficient Techniques for Online Record Linkage, IEEE Transactions on Knowledge and Data Engg, Vol.23, March 2011. [2] R. Baxter, P. Christen, and T. Churches, A Comparison of Fast Blocking Methods for Record Linkage, Proc. ACM Workshop Data Cleaning, Record Linkage and Object Consolidation, pp. 25-27, Aug. 2003 [3] A.K. Elmagarmid, P.G. Ipeirotis, and V.S. Verykios, Duplicate Record Detection: A Survey, IEEE Trans. Knowledge and Data Eng., vol. 19, no. 1, pp. 1-16, Jan. 2007. [4] L. Gu and R. Baxter, Adaptive Filtering for Efficient Record Linkage, Proc. Fourth SIAM Int l Conf. Data Mining (SDM 04), pp. 22-24, Apr. 2004. [5] C. Batini, M. Lenzerini, and S.B. Navathe, A Comparative Analysis of Methodologies for Database Schema Integration, ACM Computing Surveys, vol. 18, no. 4, pp. 323-364, 1986. [6] W.E. Winkler, Advanced Methods of Record Linkage, Proc. Section Survey Research Methods, pp. 467-472, 1994. [7] J.R. Quinlan, Induction of Decision Trees, Machine Learning, vol. 1, no. 1, pp. 81-106, 1986. [8] J.B. Copas and F.J. Hilton, Record Linkage: Statistical Models for Matching Computer Records, J. Royal Statistical Soc., vol. 153, no. 3, pp. 287-320, 1990. [9] S.N. Minton, C. Nanjo, C.A. Knoblock, M. Michalowski, and M. Michelson, A Heterogeneous Field Matching Method for Record Linkage, Proc. Fifth IEEE Int l Conf. Data Mining (ICDM 05), pp. 314-321, Nov. 2005. [10] P. Christen, A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication, IEEE Transactions on Knowledge and Data 2015, IRJET ISO 9001:2008 Certified Journal Page 2240