A fast approach for parallel deduplication on multicore processors

Similar documents
Efficient Entity Matching over Multiple Data Sources with MapReduce

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

SimEval - A Tool for Evaluating the Quality of Similarity Functions

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Tuning Large Scale Deduplication with Reduced Effort

Similarity Joins in MapReduce

MapReduce: A Programming Model for Large-Scale Distributed Computation

Survey of String Similarity Join Algorithms on Large Scale Data

Performance and scalability of fast blocking techniques for deduplication and data linkage

Distance-based Outlier Detection: Consolidation and Renewed Bearing

Data Partitioning for Parallel Entity Matching

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Record Linkage using Probabilistic Methods and Data Mining Techniques

Use of Synthetic Data in Testing Administrative Records Systems

Comparison of Online Record Linkage Techniques

MRBench : A Benchmark for Map-Reduce Framework

A Review on Unsupervised Record Deduplication in Data Warehouse Environment

Entity Resolution with Heavy Indexing

Leveraging Set Relations in Exact Set Similarity Join

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

Similarity Joins of Text with Incomplete Information Formats

Load Balancing for Entity Matching over Big Data using Sorted Neighborhood

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Data Linkage Methods: Overview of Computer Science Research

A Fast and High Throughput SQL Query System for Big Data

Automatic training example selection for scalable unsupervised record linkage

AUTOMATICALLY GENERATING DATA LINKAGES USING A DOMAIN-INDEPENDENT CANDIDATE SELECTION APPROACH

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

RESTORE: REUSING RESULTS OF MAPREDUCE JOBS. Presented by: Ahmed Elbagoury

Optimizing Recursive Queries in SQL

Sampling Selection Strategy for Large Scale Deduplication for Web Data Search

Sorted Neighborhood Methods Felix Naumann

Parallel data processing with MapReduce

CHAPTER 7 CONCLUSION AND FUTURE WORK

SQL-to-MapReduce Translation for Efficient OLAP Query Processing

Information Integration of Partially Labeled Data

A Comparative Study on Exact Triangle Counting Algorithms on the GPU

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS

Analyzing Dshield Logs Using Fully Automatic Cross-Associations

Top-K Entity Resolution with Adaptive Locality-Sensitive Hashing

Efficient Record De-Duplication Identifying Using Febrl Framework

Large-Scale Duplicate Detection

BISTed cores and Test Time Minimization in NOC-based Systems

Parallelizing String Similarity Join Algorithms

Joint Entity Resolution

Outline. Eg. 1: DBLP. Motivation. Eg. 2: ACM DL Portal. Eg. 2: DBLP. Digital Libraries (DL) often have many errors that negatively affect:

Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix

Learning Blocking Schemes for Record Linkage

Mitigating Data Skew Using Map Reduce Application

Big Data Management and NoSQL Databases

A mining method for tracking changes in temporal association rules from an encoded database

Parallel Algorithms for the Third Extension of the Sieve of Eratosthenes. Todd A. Whittaker Ohio State University

Leveraging Transitive Relations for Crowdsourced Joins*

Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...

Survey on MapReduce Scheduling Algorithms

Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...

An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction

Don t Match Twice: Redundancy-free Similarity Computation with MapReduce

A Comprehensive Complexity Analysis of User-level Memory Allocator Algorithms

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

CS 61C: Great Ideas in Computer Architecture. MapReduce

Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase

MapReduce for Data Intensive Scientific Analyses

Garbage Collection (2) Advanced Operating Systems Lecture 9

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

Single Pass Connected Components Analysis

Parallel Nested Loops

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011

Adaptive Windows for Duplicate Detection

Deduplication of Hospital Data using Genetic Programming

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Fast Fuzzy Clustering of Infrared Images. 2. brfcm

Advanced Databases: Parallel Databases A.Poulovassilis

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases. Introduction

Cassandra- A Distributed Database

CS 138: Google. CS 138 XVII 1 Copyright 2016 Thomas W. Doeppner. All rights reserved.

Enhanced Performance of Database by Automated Self-Tuned Systems

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs

Applying Phonetic Hash Functions to Improve Record Linking in Student Enrollment Data

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

Database System Concepts

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Pay-As-You-Go Entity Resolution

RLC RLC RLC. Merge ToolBox MTB. Getting Started. German. Record Linkage Software, Version RLC RLC RLC. German. German.

Learning Probabilistic Ontologies with Distributed Parameter Learning

Web page recommendation using a stochastic process model

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

HISTORICAL BACKGROUND

CS 138: Google. CS 138 XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

Part 1: Indexes for Big Data

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13

A Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique

Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming

Transcription:

A fast approach for parallel deduplication on multicore processors Guilherme Dal Bianco Instituto de Informática Universidade Federal do Rio Grande do Sul (UFRGS) Porto Alegre, RS, Brazil gbianco@inf.ufrgs.br Renata Galante Instituto de Informática Universidade Federal do Rio Grande do Sul (UFRGS) Porto Alegre, RS, Brazil galante@inf.ufrgs.br Carlos A. Heuser Instituto de Informática Universidade Federal do Rio Grande do Sul (UFRGS) Porto Alegre, RS, Brazil heuser@inf.ufrgs.br ABSTRACT In this paper, we propose a fast approach that parallelizes the deduplication process on multicore processors. Our approach, named MD-Approach, combines an efficient blocking method with a robust data parallel programming model. The blocking phase is composed of two steps. The first step generates large blocks by grouping records with low degree of similarity. The second step segments large blocks, that may result in unbalanced load, in more precise sub-blocks. A parallel data programming model is used to implement our approach in a sequence of both map and reduce operations. An empirical evaluation has shown that our deduplication approach is almost twice faster than BTO-BK, that is a scalable parallel deduplication solution in distributed environment. To the best of our knowledge, MD-Approach is the first to focus on multicore processors for parallel deduplication. Categories and Subject Descriptors H.2.4 Information Systems [Database Systems Parallel databases General Terms Algorithms, Performance Keywords Data integration, Deduplication, Parallel Systems Management]: 1. INTRODUCTION A central process in data integration is record deduplication. Record deduplication aims to identify which records represent the same underlying entity. Naive deduplication techniques have quadratic processing cost (O(n 2 )), that require a high runtime to process large datasets. For instance, considering a dataset of size n, n n pairs must be analyzed. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC 11 March 21-25, 2011, TaiChung, Taiwan. Copyright 2011 ACM 978-1-4503-0113-8/11/03...$10.00. In order to improve the performance of record deduplication in large datasets, techniques such as blocking [12] have been proposed. Blocking methods are used as an alternative to reduce the number of comparisons. Blocking aims to group candidate similar records in an efficient way. Pair comparison just occurs inside each block. Blocks can be distributed among independent processes without expensive message exchanges. Data independence creates the scenario to extend the deduplication process to parallel environments. Complex tasks that would require days to be completed, may be executed in a short time with high effectiveness. Several approaches have been proposed [3, 5, 10, 14, 15] in order to reduce the processing time of the deduplication. These works employ distributed architectures that offer large number of processors and memory resources to speed up applications. However, distributed architectures require both specialized implementation and management, that implies in a rarely popular use. In an opposite way, this work focus on an efficient deduplication approach on multicore processors. In this paper, we present a new parallel deduplication approach for large datasets that use multicore processors. Our approach, named MD-Approach, combines a fast blocking method with a robust data parallel programming model. The blocking phase splits the dataset in small and balanced blocks into two steps. The first step aims to create large blocks to ensure that no duplicate record pairs reside in different blocks. A second step aims to segment the large blocks in sub-blocks with high degree of similarity. This step results in blocks with less false positive pairs. The two blocking steps are efficiently implemented by MapReduce parallel programming model [8]. The MapReduce is composed of a sequence of Map-Reduce operations. Our experiments, have shown that the MD-Approach is almost twice faster than BTO-BK [15], and also keeps high effectiveness. In summary, the contributions of this paper are: The definition of the two blocking steps using traditional blocking that aims to create a balanced partition of dataset. We combine our blocking method into a data parallel programming model, achieving both high performance and effectiveness. The rest of this paper is organized as follows. We first present the background in Section 2. In section 3, we discuss the related work. Section 4 presents our parallel dedu- 1027

plication approach, while the experiments are demonstrated in Section 5. Section 6 contains concluding remarks and describes future work. 2. BACKGROUND AND MOTIVATION 2.1 Deduplication Deduplication techniques identify which records refer to the same underlying entity. The duplicate records identification is essential in many integration systems. However, the deduplication is a complex process due to incomplete records, misspellings, typographical errors, abbreviations, among others. The deduplication is partitioned in four phases. The first phase is named preprocessing. In this step, preprocessing realizes a superficial cleaning and standardization. Preprocessing aims to improve the quality of datasets by optimizing the subsequent stages (for instance, to convert date format into an international standard). The second phase, blocking process, decreases the comparison number by splitting the dataset in groups (blocks) of similar records. In the third phase, named match step, the records that belong to the same block are compared. The comparison is done by using similarity functions that quantify the similarity of candidate pairs. The match step is considered a bottleneck in the deduplication systems [2]. It becomes unaffordable to compare a massive number of pairs in large datasets. Finally, the fourth phase, named classification step, defines if the candidate pairs are duplicate or not, by using of a threshold that can be previously calibrated and defined. 2.2 Blocking process In order to reduce the number of records comparison, blocking approaches have been used to segment datasets in blocks of similar records. Thus, only the records presents in the same block are analyzed. It reduces the number of pair to be processed. The classic blocking method defines the concept of Blocking Function (BF). The BF is composed of a set of operators that is applied to each record. The operator selects fields, or fragments of fields, to create a key. Each key can be composed of many fields. The records with the same key are grouped in the same block. Formally, we can assume that a dataset contains n records: D={R 1, R 2,..., R n}. Each record is composed of several fields: R 1={f 1, f 2,.., f t}. The operator is defined by concatenation of fields or fragments of fields, for example, Op={ first three letters f 1 first three letters f 2 }. The BF contains a set of i operators: BF ={Op j} i j=1. When the BF is applied to each record, R i produces a set of keys: BF (R 1)={key Op1, key Op2,..., key Opt }. Each record can result in several keys according to BF. We assume that an indexing function h(key i) transforms a key in a number. Equal numbers are grouped in the same block. However, a tradeoff of blocking function is that small variation in some characters makes impossible to insert the record in the true block. In order to avoid loss of the true matches, the operator must be generic enough to cover a largest number of keys. The definition of generic operators can frequently result in largest blocks. Largest blocks may result in unbalanced load that increases the application runtime. 1028 2.3 Parallel programming model Data parallel programming models aim to reduce the complexity of the parallel applications developed in large datasets. It provides a high-level interface that frees the programmer from detailed of parallel issues. Figure 1: MapReduce architecture MapReduce (MR) [7] was proposed by Google as a new solution for intensive processing in large scale. MR is a data parallel programming model in which each process has a fragment of dataset and run without any block condition. In MR, the users specify their applications by sequence of Map-Reduce operations. All issues involving the parallelism are assigning to MR, simplifying the application development. The MR programming model is based on two functional programming primitives: Map - receives a portion of the input file, applies a function defined by an user, and launches a set of intermediate pairs in the key/value format. Reduce - each reduction function is responsible for a set of values associated with each key, like the statement group by in SQL language. The processing, defined by the user, is performed on each set of values. Finally, each reduction function launches a set of tuples. Figure 1 illustrates the MR architecture. After the main program has started, the following succession of operations must be performed: (1) the dataset is fragmented in slices of a certain size. (2) Each partition is assigned to a specific worker, which call the Map function. (3) Each Map worker stores intermediate tuples in a private hash table. The keys are inserted in a specific hash table entry. A chain is created when the hash entry is filled. The chain is sorted following the alphabetically order of the keys. (4) After all map workers have finished their work, the entries of hash table are sent to the Reduce function. The same entry of each hash table is assigned to the same reduce worker. This join the same keys created by different workers in the same block. (5) The Reduce function analyses the values of each key agreed with the implementation of the application. The outputs of Reduce function are emitted in a set of key/value pairs. At last, (6) the key/value is stored in the file system. 3. RELATED WORK Deduplication has been extensively studied in order to look for an effectiveness and fast approach by using a robust function similarity [10]. The naive approach can be defined by using two nested loops comparing each pair. Also,

the naive approach has quadratic cost and it becomes prohibitive when we work with large datasets. Blocking techniques have been used to improve the performance in several deduplication systems [3, 5, 14]. Several researches have addressed different blocking approaches [9]. Traditionally, blocking function (BF) are defined by a set of key fields. Keys are used to define which blocks the record belongs [2]. Canopy clustering [11] divides the blocking into two phases. First, an approximate distance metric is used to quickly build blocks, for example TF/IDF. Second, a robust and accurate method is applied to improve the blocking created in the first step. Suffix array blocking [1] creates keys by combining suffix of fields. The keys are grouped into an inverted index. An improvement is proposed in [6] where adjacent keys are compared using a similarity function. The keys that show a high degree similarity are blocking together. The similarity function intent to improve the effectiveness by blocking similar records. First distinction between previous researches and our approach is that the related work create several keys using the suffix of each field. It generates a large number of the key for dataset composed of millions of records. Our approach creates keys using composition of fields or segments of fields to minimize the number of false positive pairs and also maintaining a good effectiveness. Second distinction is that we apply the more expensive blocking step into just unbalanced blocks. The second blocking step minimize the number of false positive pairs. The large blocks are broken up into sub-blocks. Several parallel deduplication systems have been proposed in order to improve the performance. D-swoosh [3] proposes a generic approach for matching and merging duplicates in distributed environments. A set of approaches is proposed in [10] addressing data integration based in the status of the dataset. P-Febrl [5] proposes a parallel extension of the sequential algorithm, but no detail has been shown. Feraparda [14] presents an approach that explores the pipelining parallel programming model. Blocking function is used to reduce the number of the candidate pairs. In our approach, we improved the efficiency by combining the two blocking step, that significantly reduce the number of comparisons, with a robust data parallel programming model. Recently, [15] presented a signature-based approach that combines both PPJoin method and MR programming model. The signature-based algorithms sort the dataset into decreasing order, following the term frequency. The terms position defines the value of the signature. The blocking key is formed by the less frequent terms of each record. After the blocking step, a set of filters (PPJoin method) are specified in order to prone false positive pairs before the match step. Finally, a similarity function is used to quantify the number of exact signatures of each candidate pair. This algorithm has the following gap that motivates our research. The signature-based algorithms minimize the computational costs replacing the exact fields by a global ID. Nevertheless, records that contains a lot of mistakes probably receive distinct IDs. It forbids to insert in the correct block. Further, the creation of index adds an overhead to scan all dataset. To the best of our knowledge, our approach is the first to use a multicore architecture. In this way, we significantly reduce the overhead of communication. As a result, we have specified a parallel deduplication approach with both high performance and effectiveness. 1029 Figure 2: Overview of Two blocking step method 4. MD-APPROACH In this section, we present our parallel deduplication approach, named MD-Approach. First, we describe our blocking method. The blocking is composed of two steps that perform both good effectiveness and load balancing. Second, we show how we combine the blocking method with a data parallel programming model in order to achieve high performance. It is important to notice that our focus is to put all datasets into computer main memory. 4.1 Two blocking step The main challenge of blocking approaches is the segmentation of the dataset into blocks both precision and balanced. The creation of balanced blocks may determine the performance of the application. For example, consider that we have three unbalanced blocks, containing 200, 240, and 400 records, respectively. The runtime of parallel deduplication approach should be defined by the largest block. The block containing 400 records will create 160 thousands of the candidate pairs (O(n 2 )). In this case, while one processor is busy the others are kept idle. If possible, the ideal blocking approach should split the biggest block (400 records) into two sub-blocks. Figure 2 illustrates our proposed blocking method. The blocking is composed of two steps based on blocking functions. The first blocking step maximizes the generation of candidate pairs. The dataset is segmented in large blocks. Some blocks can be bigger than others which may results in unbalanced processing. It means that the distribution frequency of each value used in the blocking key can be different. The well balanced blocks are directly committed to the match step. Blocks identified as unbalanced are sent to the second blocking step. The second blocking step splits unbalanced blocks using more specific keys. The second blocking step output is sent to match step. The first blocking step splits the dataset in large blocks. The problem of removing true pair is solved once that we create large and imprecise blocks. A large number of the keys are produced by joining parts of different fields. For example, we may define the BF composed of birth year and first three letters of surname. The unspecific keys created in the first blocking step may produce large blocks with large number of the false positive pairs. In this way, we introduce the second blocking step that segments the unbalanced blocks removing the false positive pairs. In this step, the BF is changed to whole field. For example, the BF composed of birth year and the first three letter of surname (first blocking step), is changed to birthdate

and surname in the second blocking step. Consequently, the second blocking step groups records with exact fields. However, it is difficult to add records to the true block when there are several mistakes. To avoid this, we define a sliding window function to block similar keys. This function avoids that incomplete or misspelling records to be inserted into different blocks. The sliding window function is used to insert similar keys in the same block. First, the keys are appended in an index, following alphabetical order. It ensures that the similar records will be near to each other. Before a new key is appended at the specific position, the adjacent keys inside a sliding window are compared using a similarity function. The key is inserted in a neighbor block if some adjacent key is highly similar. Otherwise, the key is inserted at the original position. For example, if a key is appended in the index at the position number 200 and the adjacent neighbor number 199 is highly similar, then, the position of the the key will change to 199 and the record will be blocked together. Otherwise, if the adjacent keys are low similar, then the key will be appended to the original position. Finally, the second blocking step creates high specific blocks without removing the true matches. It reduces considerably the number of pairs to be compared. However, the similarity measure between adjacent key adds a cost in the creation of blocks. As a consequence, the first blocking step is a cheap and low precise process. In contrast, the second blocking step is precise but more expensive process. 4.2 Parallel deduplication approach Both blocking steps can be conveniently implemented using the MR data parallel programming model. A sequence of mapping, sort, and reduction operation are linked to implement our approach. Figure 3 shows the overview of the sequence tasks to be performed by our approach. To better explore the data parallelism, we have divided our approach into four phases: Phase I : dataset is sequentially fragmented. Each fragment is processed in parallel. At this time, the keys of the first blocking step are created. All the next phases are performed in parallel, following the MR model. Phase II : the blocks are merged and unbalanced blocks are sent to the Phase III. Records in balanced blocks are compared (match step). Pairs identified as duplicated are sent to Phase IV. Phase III : the unbalanced blocks are fragmented in sub-blocks. The BF is identified and the key is created using the second blocking step. The sliding window function is used to block high similar keys. The records inside the sub-blocks are compared and duplicates are sent to the next phase. Phase IV : pairs are merged into the output file and duplicates are removed. 4.2.1 Phase I Phase I begins creating the dataset segments. The Map workers take each row of input document, and apply the first blocking step. The Map workers output intermediates key-value pairs with each blocking key as the key and the record as the value. For each blocking function a set of 1030 Figure 3: Sequence of MD-Approach phases keys is generated. It produces a large number of redundant keys to be processed in the next phase. The misspelling or lack of information in some attributes force to create several blocks of the same record. Redundancy is essential to insert duplicate in the true block. Finally, the key-value is sent to MR, where the tuples are sorted and stored, while all the map tasks are finished. 4.2.2 Phase II In Phase II, the blocks are built join the entries of the hash tables. Before the blocks are sent to the reduction function, each block is analyzed searching for unbalanced blocks. Identification of unbalanced blocks is made by comparing a threshold with the records number of each block. The threshold determines if the block will be directly performed in the current phase (balanced block) or it will be sent to the second blocking step (unbalanced block). Each candidate pair is performed by the reduction function. The Reduce workers receive a key and a set of values that represent a block. Inside each block, each pair is individually compared using a similarity function. Finally, an arithmetic average is performed on the values returned by the similarity function. The arithmetic average is compared with the similarity threshold. If the pair average is higher than the threshold, it is considered a duplicate and it is sent to Phase IV. Otherwise, the pair is discarded. 4.2.3 Phase III This phase receives the unbalanced blocks and segment in specific sub-blocks using the second blocking step. Similar keys are blocked together employing the sliding window function. This function applies a sliding window looking over similar keys. More specifically, this phase receives blocks composed of the keys. First, the Map function identifies the BF that generates the blocking key created by the first blocking step. The identification of the BF avoids an explosion in the number of keys. After the BF is identified, the second blocking step expands each key to accommodate fully fields. The keys are emitted by the Map workers and stored in the hash table. However, mistakes into records result in the insertion of the incorrect block. In this way, before the keys are sorted into hash table, we perform the sliding window function. In this function, first a binary search is made to

find a position of each blocking key. When the position is known, we compare adjacent keys inside of a sliding window. If the similarity of a key pairs is upper than the threshold, the key is inserted into the already presented block. Otherwise, a new hash entry is created following the position defined by the binary search. After all keys are sorted, the reduce function is started. The reduce function is the same showed in the Phase II. 4.2.4 Phase IV In the last phase, we put all pairs together. Short mapreduce operations are used to clean the output file. It occurs because both blocking phases generate redundancy. Then, all replicated pair is identified and removed. Figure 4: Runtime of MD-Approach using the balanced and unbalanced blocking (1M Dataset) 5. EXPERIMENTAL EVALUATION In this section, we evaluate the MD-Approach on a multicore machine by comparing with [15]- a distributed MapReduce implementation of the PPJoin approach. We describe our experimental setting and report the main contribution of our approach. The focus of the MD-Approach is to improve the performance and keep the effectiveness, on multicore processors. We implemented the MD-Approach using Phoenix [13], which is a shared memory MapReduce implementation written in C language. Phoenix is a robust and efficient platform to performance application in shared memory architectures. The main advantage of the Phoenix is the low communication overhead without excessive data copying. The dataset is whole copies to main memory and pointers are used for moving data. To enable a better comparison with our approach, we also implemented a shared memory version of BTO-BK algorithm [15] on Phoenix MapReduce platform. 5.1 Measurements & Definitions For each dataset, we ran MD-Approach and BTO-BK on 1, 2, 3, and 4 cores. For MD-Approach, we used the Jaro- Winkler similarity distance metric with threshold δ= 0.9 and the size of sliding window function as 1 (defined experimentally). For BTO-BK, we used Jaccard similarity metric and defined experimentally the threshold δ= 0.5. We used synthetic datasets generated by Febrl data generator (Dsgen) [4]. The synthetic dataset allows full control of record numbers and duplicates. Synthetic dataset allows measuring the exactly precision of the application. The datasets ranged from 1 million (1M), 2 million (2M) and 4 million (4M). We defined 10% of duplicates in each dataset by introducing different errors. We have used a Intel Core 2 Quad processor (2,6 GHz) running Ubuntu 9.10 64 bits. The machine had 8 Gb RAM. We compare the effectiveness using recall, precision and f-measure. Recall evaluated the portion of correctly duplicated found, relative to all known duplicates in the data set. Precision measures the portion of correct duplicated, relative to all duplicated found. The f-measure is the harmonic mean of precision and recall. To compute speedup, we ran the MD-Approach and BTO- BK in 1,2,3, and 4 cores. Let us call Tp as runtime in p processors. The speedup was measured by computing the ratio of T 1 with its parallel time. We measured the wallclock time and computed the average over three runs. 5.2 Effectiveness 1031 Table 1: Effectiveness comparison Precision Recall F-Measure MD-Approach 90,49 98,69 94,41 BTO-BK 100,00 70,2 82,49 In this section we compare the effectiveness of BTO-BK and MD-Approach. We identify experimentally the best input parameters of both approaches 1. To evaluate the effectiveness of our blocking approach, we compare the runtime of MD-Approach using the first blocking step, without identification of unbalanced blocking and using the two blocking step (balanced blocking). Figure 4 shows both runtime of blocking methods. The number of pairs performed by balanced blocking is almost 400 million less than unbalanced blocking, using the 1M dataset. Consequently, the runtime of balanced blocking is almost 3 times lower than the unbalanced method. It is explained by high number of false positive pairs generates by unbalanced blocking. Table 1 shows the recall, precision and f-measure of BTO- BK and MD-Approach. It is important to notice that, MD-Approach had better f-measure than BTO-BK. Interestingly, BTO-BK achieved precision almost 100% due to the set of filters proposed by PPJoin approach that prune false positive pairs. The recall of BTO-BK is result of the signature-based algorithm that block only exactly keys. Our approach achieved a recall of 98,69% due to use of the two blocking step and similarity windows function that improved the quality of blocks. This clearly demonstrates the superiority effectiveness of our approach over BTO-BK. 5.3 Speedup and Scalability In this section, we compared MD-Approach with BTO-PK in terms of dataset size and runtime. Figure 5 shows the comparison of runtime of each dataset when the dataset is increased. The runtime of MD-Approach was almost 2 times less than BOT-BK. The runtime of dataset 4M showed 2.1 times less. These results demonstrate the efficiency of our approach, achieved by combining the data parallel model and the two blocking steps. The effective blocking approach reduces significantly the number false positive pairs to be performed in the match step. Figure 6 shows the speedup of MD-Approach and OPT- BK in three datasets. In 1M and 2M datasets, both ap- 1 Details are omitted due to space restriction.

Figure 5: Runtime of MD-Approach and BTO-BK Figure 6: Speedup comparison proaches achieved the same speedup due to both explores the MapReduce parallel platform. When we increase the dataset to 4M, the speedup of our approach has a little increase due to direct blocking process and the creation of balanced blocks. The speedup of BTO-BK was kept the same due to the large number of the false positive pairs created by the signature-based blocking process. 6. CONCLUSION In this work, we have shown a fast solution for parallel deduplication on multicore processors. Our solution, named MD-Approach, improve multicore processors assuring high performance. MD-approach combines a robust data parallel programming model with an effective blocking method. The blocking method consists on the two steps. The first blocking step creates large blocks to maximize the number of pairs. The second blocking step fragments large blocks in small blocks, assuring the precision of the sub-blocks. By combining the two blocking step and MapReduce data parallel model, we specify a new efficient approach for deduplication. An empirical evaluation shown that our deduplication approach is faster and more effective than the state-of-art of BTO-BK approach. As a future work, we intend to extend our experiments with real datasets. Additionally, we plan to include other blocking approaches in order to minimize the redundant number of created keys by the blocking method. 7. ACKNOWLEDGMENTS This research is partially supported by the Brazilian National Institute for Web Research (grant number 573871/2008-6) and CNPq/Brazil. 8. REFERENCES 1032 [1] A. Aizawa and K. Oyama. A fast linkage detection scheme for multi-source information integration. In Proceedings of the International Workshop on Challenges in Web Information Retrieval and Integration, pages 30 39, Washington, DC, USA, 2005. [2] R. Baxter, P. Christen, and C. F. Epidemiology. A comparison of fast blocking methods for record linkage. In Proceedings of Workshop Data Cleaning, Record Linkage, and Object Consolidation, 2003. [3] O. Benjelloun, H. Garcia-Molina, H. Gong, H. Kawai, T. E. Larson, D. Menestrina, and S. Thavisomboon. D-swoosh: A family of algorithms for generic, distributed entity resolution. Distributed Computing Systems, International Conference on, 0:37, 2007. [4] P. Christen. Probabilistic data generation for deduplication and data linkage. In Intelligent Data Engineering and Automated Learning - IDEAL 2005, pages 109 116. Springer Berlin / Heidelberg, 2005. [5] P. Christen, T. Churches, and M. Hegland. Febrl - a parallel open source data linkage system. In PAKDD, pages 638 647, 2004. [6] T. de Vries, H. Ke, S. Chawla, and P. Christen. Robust record linkage blocking using suffix arrays. In CIKM 09: Proceeding of the 18th ACM conference on Information and knowledge management, pages 305 314, New York, NY, USA, 2009. ACM. [7] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. In Proceedings of the 6th conference on Symposium on Operating Systems Design and Implementation, Berkeley, CA, USA, 2004. [8] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107 113, 2008. [9] A. K. Elmagarmid, P. G. Ipeirotis, Vassilios, and S. Verykios. Duplicate record detection: A survey. Transactions on Knowledge and Data Engineering. [10] H. Kim and D. Lee. Parallel linkage. In CIKM 07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, 2007. [11] A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD 00: ACM SIGKDD. ACM, 2000. [12] C. Peter. Performance and scalability of fast blocking techniques for deduplication and data linkage. Proc. VLDB Endow., 1(2):1253 1264, 2007. [13] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating mapreduce for multi-core and multiprocessor systems. In HPCA 07, pages 13 24, 2007. [14] W. Santos, T. Teixeira, C. Machado, W. M. Jr., R. Ferreira, D. Guedes, and A. S. D. Silva. A scalable parallel deduplication algorithm. Computer Architecture and High Performance Computing, Symposium on, 0:79 86, 2007. [15] R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using mapreduce. In SIGMOD 10: Proceedings of the 2010 international conference on Management of data, 2010.