Performance and scalability of fast blocking techniques for deduplication and data linkage


Peter Christen
Department of Computer Science, The Australian National University, Canberra ACT 0200, Australia

ABSTRACT

An increasingly important step in many data management and analysis projects is the matching and aggregation of records from several databases, as linked data can contain information that is not available otherwise, or that would be too expensive to collect. Similarly important is finding duplicate records in a single database. A main challenge when linking large databases is the complexity of the linkage process: potentially each record in one database has to be compared with all records in the other database. For deduplication, each record potentially has to be compared to all others. Various techniques, collectively known as blocking, have been developed in the past to deal with this quadratic complexity. In this paper, we discuss several recently developed blocking techniques and empirically evaluate them within a common framework with regard to their performance (quality of the candidate record pairs generated) and scalability (processing times and main memory required). As our experiments show, some techniques require large main memory data structures or have long processing times, which makes their application to very large databases problematic.

A related paper, titled "Improving data linkage and deduplication quality through nearest-neighbour based blocking" [9], which describes similar experiments, has been submitted by the author to KDD'07. It does not, however, address complexity and scalability issues at all and is focussed on modifications to existing blocking techniques that improve the quality of the candidate record pairs generated. The paper can be downloaded from:

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM. VLDB '07, September 23-28, 2007, Vienna, Austria. Copyright 2007 VLDB Endowment, ACM.

1. INTRODUCTION

With many businesses, government agencies and research projects collecting massive amounts of data, techniques that allow efficient processing, analysing and mining of large databases have in recent years attracted interest from both academia and industry. An increasingly important task in the data pre-processing step of many data management and analysis projects is detecting and removing duplicate records that relate to the same entity within one database. Similarly, matching records relating to the same entity from several databases is often required, as information from multiple sources needs to be integrated, combined or linked in order to enrich data and allow more detailed data mining studies. The aim of such linkages is to match and aggregate all records relating to the same entity, such as a patient, a customer, a business, a product, or a genome sequence. Data (or record) linkage and deduplication are also called object identification, merge/purge processing, data matching, data cleansing, duplicate detection, entity resolution, list washing or ETL (extraction, transformation and loading).
They can be used to improve data quality and integrity, to allow re-use of existing data sources for new studies, and to reduce costs and efforts in data acquisition [6, 7]. In the health sector, for example, linked data might contain information that is needed to improve health policies, and that traditionally has been collected with time consuming and expensive survey methods [3]. Statistical agencies routinely link census data for further analysis [8, 9], while businesses often deduplicate their databases to compile mailing lists, or link them for collaborative e-commerce projects or when integrating databases from different sources. Within taxation offices and departments of social security, data linkage is used to identify people who register for assistance multiple times or who work and collect unemployment benefits. Another application of current interest is the use of data linkage in crime and terror detection. Security agencies and crime investigators increasingly rely on the ability to quickly access files for a particular individual, which may help to prevent crimes and terror by early intervention.

The problem of finding similar entities does not only apply to records that refer to persons. In bioinformatics, data linkage can help find genome sequences in large data collections that are similar to a new, unknown sequence at hand. The removal of duplicates in the results returned by Web search engines and automatic text indexing systems, where copies of documents (for example bibliographic citations) have to be identified and filtered out before being presented to the user, is increasingly important. Finding and comparing consumer products from several online stores is another application of growing interest [4]. As product descriptions are often slightly different, linking them becomes difficult.

If unique entity identifiers (or keys) are available in all the databases to be linked, then the problem of linking at the entity level becomes trivial: a simple database join is all that is required [6].

Figure 1: General data linkage process. The output of the blocking step is candidate record pairs, while the comparison step produces vectors with numerical matching weights.

However, in most cases no unique keys are shared by all databases, and more sophisticated linkage techniques need to be applied. These techniques can be broadly classified into deterministic, probabilistic, and modern approaches [, 8]. A general schematic outline of the data linkage process is shown in Figure 1. As most real-world data collections contain noisy, incomplete and incorrectly formatted information, data cleaning and standardisation are important pre-processing steps for successful data linkage (and also before data can be loaded into data warehouses or used for further analysis [6]). A lack of good quality data can be one of the biggest obstacles to successful data linkage and deduplication [3]. The main task of data cleaning and standardisation is the conversion of the raw input data into well defined, consistent forms, as well as the resolution of inconsistencies in the way information is represented and encoded [].

If two databases, A and B, are to be linked, potentially each record from A has to be compared with all records from B. The total number of potential record pair comparisons thus equals the product of the sizes of the two databases, $|A| \cdot |B|$, with $|\cdot|$ denoting the number of records in a database. Similarly, when deduplicating a database A, the total number of potential comparisons is $|A|(|A|-1)/2$, as each record potentially has to be compared with all others. The performance bottleneck in a data linkage or deduplication system is usually the expensive detailed comparison of fields (or attributes) between pairs of records [3, ], making it unfeasible to compare all pairs when the databases are large. For example, linking two databases with 100,000 records each would result in $10^{10}$ (ten billion) record pair comparisons. On the other hand, the maximum number of true matches corresponds to the number of records in the smaller database (assuming there are no duplicate records in the databases, and one record in database A can be matched with only one record in database B, and vice versa). Therefore, while the computational efforts increase quadratically, the number of potential true matches only increases linearly when linking larger databases. This also holds for deduplication, where the number of duplicate records is always less than the number of records in a database [].

To reduce the large number of potential record pair comparisons, traditional data linkage techniques [7, 9] employ blocking [3]: a single record attribute, or a combination of attributes (called the blocking variable or blocking key), is used to split the databases into blocks. All records having the same value in the blocking key will be inserted into the same block, and candidate record pairs are then generated only from records within the same block. While the aim of blocking is to reduce the number of record pair comparisons made as much as possible (by eliminating pairs of records that obviously are not matches), it is important that no potential true matches are removed by the blocking process. Two main issues have to be considered when blocking keys are defined.
First, the error characteristics of the attributes used will influence the quality of the generated candidate record pairs. Ideally, attributes containing the fewest errors or missing values should be chosen, as any error in an attribute value in a blocking key will potentially result in a record being inserted into the wrong block, thus missing true matches. The second issue to consider is the frequency distribution of values in the attributes used as blocking keys, as this will affect the size of the blocks that will be generated. For a linkage, if $m$ records are in a block from database A and $n$ records in the same block from database B, then $m \cdot n$ record pairs will be generated from this block; while for a deduplication, if $m$ records are in a block, $m(m-1)/2$ record pairs will be generated (note that blocks containing only one record will not contribute any record pairs). The largest blocks will dominate the execution time of the comparison step, as they will contribute very large numbers of record pairs. When defining blocking keys, there is also a trade-off to be considered. Having a large number of smaller blocks will mean that fewer candidate record pairs will be generated, but will likely result in more true matches being missed; while blocking keys that result in larger blocks will generate more record pairs that will likely cover more of the true matches, but at the cost of having to compare many more pairs. Some of the blocking techniques discussed in this paper allow the size of blocks to be controlled directly through parameters, while for other techniques the block sizes mainly depend upon characteristics of the data.

All the candidate record pairs generated by the blocking process are compared using a variety of comparison functions applied to one or more (or a combination of) record attributes. These functions can be as simple as an exact string or numerical comparison, can take variations and typographical errors into account [, 8], or can be as complex as a distance comparison based on look-up tables of geographic locations (longitudes and latitudes). Each comparison returns a numerical value (called a matching weight), often positive for agreeing and negative for disagreeing values. A vector is formed for each compared record pair containing all the values calculated by the different comparison functions. These vectors are then used to classify record pairs into matches, non-matches, and possible matches (depending upon the decision model used) [, 7]. Record pairs that were removed by the blocking process are classified as non-matches without being compared explicitly.
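As a small illustration of this comparison step (illustrative Python only, not the Febrl comparison functions; all names and weights are ours), a candidate record pair can be turned into a vector of matching weights as follows:

```python
from difflib import SequenceMatcher

def exact_weight(v1, v2):
    """Simple agree/disagree comparison: positive weight for agreement,
    negative weight for disagreement."""
    return 1.0 if v1 == v2 else -1.0

def approx_weight(v1, v2):
    """Approximate string comparison scaled into [-1, 1], so that small
    typographical variations still yield a (reduced) positive weight."""
    return 2.0 * SequenceMatcher(None, v1, v2).ratio() - 1.0

def comparison_vector(rec_a, rec_b):
    """One weight per compared attribute; a classifier decides on the vector."""
    return [
        approx_weight(rec_a['surname'], rec_b['surname']),
        approx_weight(rec_a['given_name'], rec_b['given_name']),
        exact_weight(rec_a['postcode'], rec_b['postcode']),
    ]

rec_a = {'surname': 'christen', 'given_name': 'peter', 'postcode': '2602'}
rec_b = {'surname': 'kristen',  'given_name': 'peter', 'postcode': '2612'}
# [0.6, 1.0, -1.0]: surname partly agrees, given name agrees, postcode differs.
print(comparison_vector(rec_a, rec_b))
```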

Current research into blocking can be categorised into two areas. The first is developing new, and improving existing, blocking techniques with the aim of making them more scalable (i.e. allowing the linkage of very large databases with many millions of records) while enabling high-quality linkage results, by generating (i.e. not removing) as many true matches as possible. Besides the standard blocking approach, new techniques developed in the past decade include the sorted neighbourhood approach [], q-gram based blocking [3, 6,, ], high-dimensional overlapping clustering [4, 4], mapping strings into multi-dimensional space followed by similarity joins [], and suffix-array based blocking []. These techniques will be presented in detail in Section 2 and evaluated experimentally in Section 3. The second area of current blocking research is into approaches that learn how to optimally choose blocking keys. Traditionally, the choice of blocking keys is made manually by domain and data linkage experts. Two approaches based on supervised learning of blocking keys have recently been presented. They employ predicate-based formulations of learnable blocking functions [5] and the sequential covering algorithm, which discovers disjunctive sets of rules [5], respectively. The aim of both approaches is to find blocking keys such that the number of true matches in the candidate record pairs is maximised, while keeping the total number of candidate pairs as small as possible. They rely on training examples, i.e. pairs of true matched and true non-matched record pairs, which are often not available in real world situations, or have to be prepared manually (an expensive and time consuming process).

1.1 Contributions

We have recently presented an overview and evaluation of six blocking techniques in [9], where we implemented a modification of two techniques, replacing global threshold values with nearest-neighbour parameters. This resulted in significantly more true matched record pairs being generated by the blocking process, and also much improved robustness of these blocking techniques to changes in parameter settings. We discussed the quality of the candidate record pairs generated by the various evaluated blocking techniques.

In this paper, we focus on the performance and scalability of various blocking techniques. Specifically, we evaluate how processing times and amounts of main memory required scale with the size of the databases to be linked or deduplicated. To the best of our knowledge, no such experimental evaluation of different blocking techniques within a common framework has been published. We are only aware of one earlier study [3], similar to our experiments presented in [9], that compared the quality of candidate record pairs produced by three blocking techniques.

2. BLOCKING TECHNIQUES

When linking large databases, blocking is essential in order to make data linkage and deduplication possible at all, and to improve the efficiency of the linkage process. Blocking, however, will reduce the linkage quality [], as it is very likely that some true matched record pairs will be removed by the blocking process if records are not inserted into the correct block (or blocks) due to variations and errors in their blocking key values. The blocking step from Figure 1 can be split into the following two sub-steps.

1. Build: All records from the database (or databases) are read, the blocking key values are created, and the records are inserted into a suitable index data structure. For most blocking techniques, a basic inverted index [3] can be used. The blocking key values will become the keys of the inverted index, and the record identifiers of all records that have the same blocking key value will be inserted into the same inverted index list.
The record attribute values required in the comparison step will be inserted into another index data structure, for example a hash-table with record identifiers as keys (which can be main memory or disk based). Additional data structures will be built as required by a blocking technique (as described below).

2. Retrieve: Record identifiers are retrieved from the index data structure block by block, and the corresponding record attribute values required for the comparisons are retrieved. Candidate record pairs are then generated from these records. For a linkage, all records in a block from one database will be paired with all records in the same block from the other database, while for a deduplication each record in a block will be paired with all other records in the same block. The generated candidate record pairs are then compared, and the resulting vectors containing the numerical comparison values are given to a classifier.

In this paper, we are interested in the performance and scalability of the blocking process, i.e. how many candidate record pairs are generated, how many of these are true matches, how much memory is required by the index data structures, and how long it takes to generate the candidate record pairs. For some blocking techniques, the above two steps, together with the record pair comparisons, can be performed in an overlapping fashion, as discussed in detail in Section 2.6 below. Others require that the Build step is completed before the Retrieve step can be started.

We will use the following notation: $n_A$ and $n_B$ are the total number of records in databases A and B. The average length in characters of blocking key values is denoted with $c$, while $r$ is the length of a record identifier, $p$ the length of a pointer, and $f$ the length of a floating-point number (all in bytes). $B_A$ and $B_B$ are the sets of unique blocking key values generated by a blocking technique from the records in databases A and B, respectively. The function $block(A, b_i)$ returns the record identifiers from the block with blocking key value $b_i \in B_A$ from database A. The number of elements in a set is denoted with $|\cdot|$.

2.1 Standard blocking

This technique has been used in data linkage for many years [7]. All records having the same value in a blocking key are inserted into the same block, and only records within the same block are then compared with each other. Each record will be inserted into one block only, and thus $n_A = \sum_{b_i \in B_A} |block(A, b_i)|$ and $n_B = \sum_{b_j \in B_B} |block(B, b_j)|$. The number of record pairs generated by standard blocking for a linkage is
$$\sum_{b_i \in (B_A \cap B_B)} |block(A, b_i)| \cdot |block(B, b_i)|,$$
and for a deduplication it is
$$\sum_{b_i \in B_A} |block(A, b_i)| \, (|block(A, b_i)| - 1)/2.$$
Standard blocking can be implemented efficiently using an inverted index [3], as described in the Build step in Section 2 above. In the Retrieve step, all record identifiers from a single inverted index list are extracted and record pairs are generated from them.
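To make the Build and Retrieve steps concrete, the following minimal Python sketch (illustrative only, not the Febrl implementation; function and variable names are ours) builds an inverted index on a blocking key and generates candidate pairs for a deduplication:

```python
from collections import defaultdict
from itertools import combinations

def build_index(records, blocking_key):
    """Build step: map each blocking key value to the identifiers of all
    records that have this value."""
    index = defaultdict(list)
    for rec_id, rec in records.items():
        index[blocking_key(rec)].append(rec_id)
    return index

def candidate_pairs_dedup(index):
    """Retrieve step: pair all records that share a blocking key value.
    A block with m records contributes m(m-1)/2 candidate pairs."""
    for rec_ids in index.values():
        yield from combinations(sorted(rec_ids), 2)

records = {
    'r1': {'surname': 'christen', 'postcode': '2602'},
    'r2': {'surname': 'christen', 'postcode': '2612'},
    'r3': {'surname': 'kristen',  'postcode': '2602'},
}
# Blocking key: first four surname characters plus first two postcode digits.
index = build_index(records, lambda r: r['surname'][:4] + r['postcode'][:2])
print(list(candidate_pairs_dedup(index)))   # [('r1', 'r2')]
```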

The two major drawbacks of standard blocking are errors in the blocking key values that will result in records being inserted into the wrong block, and the sizes of the blocks generated. As these sizes depend upon the frequency distributions of the blocking key values, it is difficult to predict the total number of candidate record pairs generated by this blocking technique. In the ideal situation, where all blocks contain the same number of records, i.e. $|block(A, b_i)| = n_A/|B_A|$ for all $b_i \in B_A$ (and similarly for database B), the number of candidate record pairs generated for a linkage will be [5]
$$|B_A \cap B_B| \cdot \frac{n_A}{|B_A|} \cdot \frac{n_B}{|B_B|},$$
and for a deduplication it will be
$$|B_A| \cdot \frac{n_A}{|B_A|} \left( \frac{n_A}{|B_A|} - 1 \right) / 2.$$
This situation will be very hard to achieve in practice, and thus the number of pairs generated will likely be larger, as larger blocks will generate an increased number of pairs. The memory requirement for the inverted index in standard blocking is $(c+p)(|B_A| + |B_B|) + r(n_A + n_B)$ for a linkage and $(c+p)|B_A| + r\,n_A$ for a deduplication.

2.2 Sorted neighbourhood

First proposed in the mid 1990s [], the basic idea behind this technique is to sort the blocking key values once the basic inverted index has been created, and to move a window of fixed size $w$ sequentially over these sorted values. Sorting the blocking key values (which will require $O(|B_A| \log |B_A|)$ and $O(|B_B| \log |B_B|)$ steps) and using a window of size $w > 1$ will insert records into the same block that have similar, not just exactly the same, blocking key values. Blocks are formed from the record identifiers in the inverted index lists of all blocking key values in the current window. (Note that our implementation of this technique is different from the one originally presented [], as we only sort the blocking key values rather than all the records in the database.)

We denote the union of all blocking key values from two databases A and B with $B_{AB} = B_A \cup B_B$, and the elements $i$ to $j$ (with $i \leq j$) in the sorted blocking key values from both databases with $B_{AB}[i, j]$. The number of candidate record pairs generated by this blocking technique with a window size $w$ for a linkage is
$$\sum_{k=1}^{|B_{AB}| - w + 1} \Big| \bigcup_{b_i \in B_{AB}[k, k+w-1]} block(A, b_i) \Big| \cdot \Big| \bigcup_{b_j \in B_{AB}[k, k+w-1]} block(B, b_j) \Big|,$$
and for a deduplication it is
$$\sum_{k=1}^{|B_A| - w + 1} \Big| \bigcup_{b_i \in B_A[k, k+w-1]} block(A, b_i) \Big| \cdot \Big( \Big| \bigcup_{b_i \in B_A[k, k+w-1]} block(A, b_i) \Big| - 1 \Big) / 2.$$
If the window size is $w = 1$, the sorted neighbourhood technique becomes standard blocking as described above. Therefore, for all window sizes $w > 1$, the generated record pairs will be a super-set of the pairs generated by standard blocking. In general, for two window sizes $w_i$ and $w_j$, with $w_i < w_j$, all record pairs generated with window size $w_i$ will also be in the pairs generated with $w_j$. However, the larger the window size, the larger the blocks become. A disadvantage of this method is that, similarly to standard blocking, larger blocks will dominate the performance, as large numbers of record pairs will be generated. In the ideal situation, where all blocks contain the same number of records, the number of candidate record pairs generated for a linkage will be [3, 5]
$$(|B_{AB}| - w + 1) \left( w \frac{n_A}{|B_A|} \right) \left( w \frac{n_B}{|B_B|} \right),$$
and for a deduplication it will be
$$(|B_A| - w + 1) \left( w \frac{n_A}{|B_A|} \right) \left( w \frac{n_A}{|B_A|} - 1 \right) / 2.$$
The other major disadvantage is that the sorting process assumes that the beginning of blocking key values is error-free (especially the initial character), as otherwise sorting will move similar values away from each other. For example, if the blocking key values are given names, christina and kristina will very likely be too far away in the sorted values to be covered by the same window, even though they are very similar. As with standard blocking, it is good practice to define several blocking keys, resulting in several sorted lists of values [], and to use pre-processing, like phonetic encodings [8], to bring similar values close together. The memory requirement for the inverted index in the sorted neighbourhood blocking technique is the same as with standard blocking for both linkage and deduplication.
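As a minimal sketch of this window-based approach (illustrative Python, not the Febrl code; it assumes an inverted index as built for standard blocking):

```python
from itertools import combinations

def sorted_neighbourhood_pairs_dedup(index, w):
    """Slide a window of size w over the sorted blocking key values and
    pair all records whose key values fall into the same window position."""
    sorted_keys = sorted(index)
    pairs = set()
    for k in range(max(len(sorted_keys) - w + 1, 1)):
        window_keys = sorted_keys[k:k + w]
        # Union of the inverted index lists of all keys in the current window.
        rec_ids = sorted(set().union(*(index[key] for key in window_keys)))
        pairs.update(combinations(rec_ids, 2))
    return pairs

# With w = 1 this reduces to standard blocking; a larger w also pairs records
# whose blocking key values are merely close to each other in sort order.
index = {'christen': ['r1', 'r2'], 'christina': ['r3'], 'kristina': ['r4']}
print(sorted(sorted_neighbourhood_pairs_dedup(index, w=2)))
```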
2.3 Canopy clustering

The idea behind this recently developed technique [4, 4] is to use a computationally cheap similarity measure to efficiently construct high-dimensional, overlapping clusters, called canopies, and to then extract blocks from these clusters. As cheap similarity measures, the Jaccard or TF-IDF/cosine similarities (as used in information retrieval) [3] can be used. Both are based on q-grams (or, more generally, tokens [4]) and can be implemented efficiently using an inverted index with the q-grams (rather than the blocking key values) as index keys. For both similarity measures, in the Build step the blocking key values are first converted into q-gram lists, and then each q-gram is inserted into an inverted index. For TF-IDF/cosine similarity, additional information has to be calculated and stored: for each unique q-gram the number of records that contain this q-gram, i.e. its term frequency (TF); and within the inverted index the document frequency (DF) for each q-gram in each record (i.e. the frequency of a q-gram in a blocking key value). Once all records in a database have been read and processed, the TF and DF values can be normalised and the inverse document frequency (IDF) can be calculated. No such frequency information or normalisation is required for Jaccard similarity.

In the Retrieve step, all records are initially inserted into a pool of candidate records. Canopy clusters are then generated by randomly selecting a record (which will become the centroid of the cluster) from the pool, and adding all records from the pool into the cluster that are similar to this centroid record.

In the traditional approach [4, 4], all records closer to the centroid than a loose similarity threshold, $t_{loose}$, are inserted into the canopy. Of these, all records within a tight similarity threshold $t_{tight}$ (with $t_{tight} \geq t_{loose}$) are removed from the candidate pool of records. This process is repeated until no candidate records are left in the pool. Note that if both thresholds $t_{loose}$ and $t_{tight}$ are set to a value of 1.0 (i.e. only exact similarity), the canopy clustering technique reduces to standard blocking.

Similar to standard blocking and the sorted neighbourhood approach, canopy clustering with global thresholds $t_{loose}$ and $t_{tight}$ will result in blocks (i.e. canopy clusters) of different sizes (even though the TF-IDF/cosine similarity measure to some degree adjusts similarity values according to the frequency of the q-grams in the blocking key values). We have modified [9] the canopy clustering approach by replacing the two global thresholds with two nearest-neighbour based parameters: $n_{tight}$ and $n_{loose}$ (with $n_{tight} \leq n_{loose}$), both being integer numbers. $n_{loose}$ is the number of closest records to the randomly chosen centroid record (according to the similarities of their blocking key values) that will be inserted into the canopy cluster, and the $n_{tight}$ closest records of these will then be removed from the pool of candidate records. This approach results in blocks of similar sizes, with the maximum size known beforehand as $n_{loose}$. This not only prevents very large blocks, but also allows an estimate of the number of record pairs generated. Experiments [9] showed that using such nearest-neighbour parameters can result in a significant increase in the quality of the candidate record pairs generated (i.e. more true matches) compared to global thresholds, and also in a higher robustness of the canopy clustering blocking technique with regard to changes in parameter settings.

Predicting the number of candidate record pairs generated by the global threshold based approach is very difficult, as it depends upon the actual values in the blocking keys as well as the chosen threshold values. In one extreme case, one can imagine that all values are very similar (for example having only one character difference) and thus all records are inserted into the same canopy, while on the other hand all blocking key values can be very different from each other and result in similarities below $t_{loose}$, so that each record forms its own block. Using nearest-neighbour based parameters, on the other hand, allows a better estimation of the number of record pairs generated, as the number of blocks (canopy clusters) corresponds to $|B_A|/n_{tight}$ and $|B_B|/n_{tight}$, and each block will contain $n_{loose}$ records. If we denote the set of canopies common to databases A and B as $C$, with $|C| \leq \min(|B_A|/n_{tight}, |B_B|/n_{tight})$, and assume each canopy contains $n_{loose}$ records from both databases, then the number of candidate record pairs generated for a linkage is
$$|C| \cdot n_{loose}^2,$$
and for a deduplication it is
$$\frac{|B_A|}{n_{tight}} \cdot n_{loose}(n_{loose} - 1)/2.$$
The memory requirement for the inverted index in canopy clustering depends upon the number of unique q-grams. A blocking key value of length $l$ characters contains $(l - q + 1)$ q-grams, and if $Q$ is the set of all unique q-grams generated, the amount of memory required for Jaccard similarity is $|Q|(p + q) + r(l - q + 1)(n_A + n_B)$ for a linkage and $|Q|(p + q) + r(l - q + 1)\,n_A$ for a deduplication. The TF-IDF/cosine similarity additionally requires storage of the DF and IDF values, which amounts to $f(|Q| + (l - q + 1)(n_A + n_B))$ for a linkage and $f(|Q| + (l - q + 1)\,n_A)$ for a deduplication.
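The following illustrative Python sketch (not the Febrl implementation; all names are ours) shows the nearest-neighbour variant of the Retrieve step, using q-gram Jaccard similarity as the cheap measure:

```python
import random

def qgrams(value, q=2):
    """Return the set of q-grams of a blocking key value."""
    return {value[i:i + q] for i in range(len(value) - q + 1)}

def jaccard(g1, g2):
    return len(g1 & g2) / len(g1 | g2) if g1 | g2 else 0.0

def canopy_blocks_nn(bkv_of, n_loose=8, n_tight=4, q=2):
    """Nearest-neighbour canopy clustering over blocking key values.
    bkv_of maps record identifiers to blocking key values."""
    grams = {rid: qgrams(bkv, q) for rid, bkv in bkv_of.items()}
    pool = set(grams)
    blocks = []
    while pool:
        centroid = random.choice(sorted(pool))
        # Rank the pool by similarity to the centroid's q-gram set.
        ranked = sorted(pool, key=lambda rid: jaccard(grams[centroid], grams[rid]),
                        reverse=True)
        blocks.append(ranked[:n_loose])    # the n_loose most similar records
        pool -= set(ranked[:n_tight])      # remove the n_tight closest from the pool
    return blocks

print(canopy_blocks_nn({'r1': 'christen', 'r2': 'kristen', 'r3': 'smith'},
                       n_loose=2, n_tight=1))
```

With $n_{tight} = n_{loose}$ the canopies become disjoint; a smaller $n_{tight}$ leaves records in the pool after they have been placed into a canopy, which is what produces the overlapping blocks described above.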
2.4 String map based blocking

This technique [] is based on the idea of mapping the blocking key values (assumed to be strings) to objects in a multi-dimensional Euclidean space, such that similarities (or distances, like edit distance [8]) between pairs of strings are preserved, followed by finding pairs of objects in this space that are similar to each other. In [], the authors modify the FastMap [6] algorithm into StringMap, which has a linear complexity in the number of strings to be mapped. This algorithm iterates over $d$ dimensions; for each it finds two pivot strings, then forms orthogonal directions and calculates the coordinates of all other strings on these directions. In the second step, the authors use R-trees as the multi-dimensional data structure, in combination with a queue, to efficiently retrieve pairs of similar strings []. Choosing an appropriate dimensionality $d$ is done using a heuristic approach that tries a range of dimensions and selects the one that minimises a cost function (dimensions between 15 and 25 typically seem to achieve good results) [].

In our implementation, we replaced the R-tree data structure with a grid based index [], as most tree-based multi-dimensional index structures degrade rapidly with increasing dimensionality. It is reported [] that with more than 15 to 20 dimensions, in most tree based indices all objects in an index will be accessed when performing similarity searches. Our grid based index works by having a regular grid of dimensionality $d$ implemented as an inverted index in each dimension (i.e. all objects mapped into the same grid cell in a dimension are inserted into the same inverted index list). The Retrieve step is implemented in a similar way as in canopy clustering described above. An object (blocking key value) is randomly picked from the pool of (initially all) objects, and the objects in the same, as well as in neighbouring, grid cells are retrieved from the index. Similar to canopy clustering, two thresholds $t_{loose}$ and $t_{tight}$ can be used to insert similar objects into clusters. Alternatively, nearest-neighbour parameters $n_{loose}$ and $n_{tight}$ can be used in the same way as described for canopy clustering above [9].

Predicting the number of candidate record pairs generated by the global threshold based approach is very difficult, as it depends upon the distribution of the objects in the multi-dimensional space (which depends upon the values of the blocking keys), as well as the chosen threshold values. For the nearest-neighbour based approach, however, the same estimations as described in Section 2.3 can be used. The memory requirements for string map based blocking include the inverted index (which is the same as in standard blocking), plus the $d$-dimensional array containing the multi-dimensional mappings of the blocking key values, and the index grid. The total memory requirement is $(c + p + fd + cd)(|B_A| + |B_B|) + r(n_A + n_B)$ for a linkage and $(c + p + fd + cd)|B_A| + r\,n_A$ for a deduplication.
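A minimal sketch of such a grid based index, in illustrative Python (the StringMap coordinates are assumed to be given; names and the cell size are ours):

```python
from collections import defaultdict

def build_grid_index(coords, cell_size=1.0):
    """One inverted index per dimension: grid cell number -> object identifiers.
    coords maps object identifiers to their d-dimensional coordinates."""
    d = len(next(iter(coords.values())))
    grid = [defaultdict(set) for _ in range(d)]
    for obj_id, point in coords.items():
        for dim in range(d):
            grid[dim][int(point[dim] // cell_size)].add(obj_id)
    return grid

def grid_neighbours(grid, coords, obj_id, cell_size=1.0):
    """Candidates are objects that fall into the same or a neighbouring grid
    cell in every dimension (to be re-checked with the actual distance)."""
    point = coords[obj_id]
    candidates = None
    for dim, index in enumerate(grid):
        cell = int(point[dim] // cell_size)
        in_dim = set().union(*(index.get(c, set()) for c in (cell - 1, cell, cell + 1)))
        candidates = in_dim if candidates is None else candidates & in_dim
    candidates.discard(obj_id)
    return candidates

coords = {'christen': (0.1, 0.2), 'kristen': (0.3, 0.4), 'smith': (3.0, 4.5)}
grid = build_grid_index(coords)
print(grid_neighbours(grid, coords, 'christen'))   # {'kristen'}
```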

Table 1: Number of candidate record pairs generated for the deduplication and linkage data sets of increasing size (smallest numbers per column in bold), for standard blocking, Sorted-Window-3, Sorted-Window-7, Suffix-Array, String-Map, Canopy-NN-Jaccard, Canopy-NN-TFIDF, Canopy-Thres-Jaccard and Canopy-Thres-TFIDF.

2.5 Suffix array based blocking

This technique has recently been proposed as an efficient, domain independent method for multi-source information integration []. The basic idea is to insert the blocking key values and their suffixes into a suffix array based inverted index. A suffix array contains strings or sequences and their suffixes in sorted order. Blocking based on suffix arrays has successfully been used on both English and Japanese bibliographic databases [], where suffix arrays were created using both English names and Japanese characters.

For this blocking technique, only suffixes down to a minimum length, min_len, are inserted into the suffix array. For example, for a blocking key value christen and a minimum length min_len = 4, the values christen, hristen, risten, isten, and sten will be inserted, and the identifiers of all records that have this blocking key value will be inserted into the corresponding inverted index lists. Similar to canopy clustering, the identifier of a record will be inserted into several blocks (i.e. inverted index lists), according to the length of its blocking key value. A blocking key value of length $l$ characters will be inserted into $(l - min\_len + 1)$ index lists. In order to limit the maximum block size, only values in the suffix array inverted index are used that have less than a maximum number, max_block, of record identifiers in their corresponding list. For example, if the suffix array value ten has more than max_block record identifiers in its corresponding inverted index list, while the value sten has only 5, then ten is considered to be too general and is not used in the Retrieve step (when blocks are extracted and record pairs are generated), as it would generate too many pairs.

Assuming $B_A$ is the set of unique blocking key values generated from a database A, the minimum length of the suffix array is $(|B_A| + l - min\_len)$, which is the (unlikely) case that all blocking key values (assumed to be $l$ characters long) only differ in their first character and have all of their suffixes in common (for example gillian, jillian and killian all have the same suffixes illian, llian, and so on). On the other hand, the maximum length of the suffix array is $|B_A|(c - min\_len + 1)$, with $c$ the average length of the blocking key values. This is the case where all suffixes (down to min_len) of all blocking key values are different. Both these cases are unlikely to happen in practice, so the length of the suffix array will be somewhere in between. The maximum number of candidate record pairs will be generated if all inverted index lists contain max_block records, which means every entry in the suffix array has to appear in max_block records. If $S_A$ and $S_B$ are the suffix arrays generated from the blocking key values $B_A$ and $B_B$ from databases A and B, respectively, and $s_i$ denotes an entry in these suffix arrays, then the number of candidate record pairs generated by suffix array based blocking is
$$\sum_{s_i \in (S_A \cap S_B)} |block(A, s_i)| \cdot |block(B, s_i)|$$
for a linkage, and for a deduplication it is
$$\sum_{s_i \in S_A} |block(A, s_i)| \, (|block(A, s_i)| - 1)/2.$$
With an average length of $c$ characters of the original blocking key values, the average length of the suffix array entries becomes $(c + min\_len)/2$, as the shortest entries will be $min\_len$ characters long. The memory requirement for suffix array based blocking therefore is $((c + min\_len)/2 + p)(|S_A| + |S_B|) + r(c - min\_len + 1)(n_A + n_B)$ for a linkage and $((c + min\_len)/2 + p)|S_A| + r(c - min\_len + 1)\,n_A$ for a deduplication.
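A minimal illustrative Python sketch of this technique (not the Febrl code; parameter defaults are ours) is:

```python
from collections import defaultdict
from itertools import combinations

def build_suffix_index(bkv_of, min_len=4):
    """Insert every blocking key value and its suffixes (down to min_len
    characters) into an inverted index: suffix -> record identifiers."""
    index = defaultdict(set)
    for rec_id, bkv in bkv_of.items():
        for start in range(len(bkv) - min_len + 1):
            index[bkv[start:]].add(rec_id)
    return index

def suffix_pairs_dedup(index, max_block=6):
    """Generate candidate pairs, skipping suffixes that are too general
    (i.e. that occur in max_block or more records)."""
    pairs = set()
    for suffix, rec_ids in index.items():
        if len(rec_ids) < max_block:
            pairs.update(combinations(sorted(rec_ids), 2))
    return pairs

bkv_of = {'r1': 'christen', 'r2': 'kristen', 'r3': 'christine'}
index = build_suffix_index(bkv_of)
# r1 and r2 share the suffixes 'risten', 'isten' and 'sten', so they are paired.
print(sorted(suffix_pairs_dedup(index)))
```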
2.6 Implementation issues

For standard blocking and the sorted neighbourhood technique, the blocking sub-steps Build and Retrieve, together with the record pair comparison and classification steps, can be implemented in an overlapping fashion. For a deduplication, an inverted index data structure is built while records are read from a database, their blocking key values are extracted and then inserted into the index. The current record is compared with all previously read and indexed records having the same blocking key value. For a linkage, the BigMatch [3] approach can be taken, where first the smaller database is loaded and the inverted index data structures are built in main memory, including all record attribute values required in the comparison step.

Table 2: Reduction ratio results in percentages for the deduplication and linkage data sets (largest values per column in bold), for the same blocking techniques and data set sizes as in Table 1.

Each record of the large database is then read, its blocking key values are extracted, and all records in the same block from the small database are retrieved from the index data structure and compared with the current record. This approach performs only a single pass over the large database and does not require indexing, sorting or storing of any of its records. The canopy clustering, string map and suffix array based blocking techniques do not allow such an overlapping implementation, as they require their data structures to be completely built before blocks can be retrieved. An advantage of these three blocking techniques is, however, that their index data structures can either be kept in main memory or be stored on disk. This allows trading run times against memory resources, as well as persistent storage of the blocking indices (and thus performing the Retrieve step many times without having to repeat the Build step).

3. EXPERIMENTAL EVALUATION

The aim of the experiments described here was to empirically evaluate the performance (quality of the candidate record pairs generated) and scalability (processing times and main memory required) of several recently developed blocking techniques with regard to the size of the database to be deduplicated or linked. Using a common framework allowed us to investigate questions like: What are the memory requirements of blocking techniques? Which techniques are scalable to very large databases? Which technique performs best for databases with certain error characteristics? Here we will present results for the total number and quality of the candidate record pairs generated, as well as the time and main memory required by the different blocking techniques. We have chosen parameter settings that have achieved good experimental results earlier [9].

We have evaluated the blocking techniques described in Sections 2.1 to 2.5 above using data sets of different sizes and with different error characteristics, as described below. All blocking techniques were implemented in the Febrl [] record linkage system, which is written in the Python programming language. The experiments were carried out on a Dell Optiplex GX280 with an Intel Pentium 3 GHz CPU and 2 Gigabytes of main memory, running Linux 2.6.8 and using Python.

3.1 Test data sets and blocking key definition

In order to evaluate how the quality (number of true matches in the candidate record pairs) and scalability (total number of candidate record pairs) of a blocking technique is affected by the size and error characteristics of the databases to be linked, we created synthetic data sets containing from 1,000 to 50,000 records using the Febrl data set generator [7].
This generator works by first creating original records based on frequency tables containing real world names (given- and surname) and addresses (street number, name and type; postcode; suburb and state names), followed by the random generation of duplicates of these records based on modifications (like inserting, deleting or substituting characters, and swapping, removing, inserting, splitting or merging words), also based on real error characteristics. We generated two data sets for deduplication and two for linkage for each size, with different error characteristics as described below. For the linkage data sets, the original and duplicate records were separated into two files.

Clean data sets: 80% original and 20% duplicate records; up to three duplicates for one original record, maximum one modification per attribute, and maximum three modifications per record.

Dirty data sets: 60% original and 40% duplicate records; up to nine duplicates for one original record, maximum three modifications per attribute, and maximum ten modifications per record.

Table 3: Pairs completeness results in percentages for the deduplication and linkage data sets (largest values per column in bold), for the same blocking techniques and data set sizes as in Table 1.

We ran experiments with various blocking key definitions, and here we present results from the best performing definition [9], where three different blocking keys were defined, corresponding to three runs (or iterations) in traditional blocking [7]. The first are the Soundex [8] encoded surnames, the second are the first four characters of given names concatenated with the first two digits of postcodes, and the third are the last two postcode digits concatenated with Double-Metaphone [8] encoded suburb names.

3.2 Quality and complexity measures

The quality and complexity of blocking techniques have traditionally been evaluated using the pairs completeness and reduction ratio measures [, 5], as defined below. Following [5], let $n_M$ and $n_U$ be the total number of matched and un-matched record pairs, respectively, such that $n_M + n_U = |A| \cdot |B|$ for the linkage of two databases A and B, and $n_M + n_U = |A|(|A| - 1)/2$ for the deduplication of a database A. Next, let $s_M$ and $s_U$ be the number of true matched and true non-matched record pairs generated by a blocking technique, respectively, with usually $(s_M + s_U) \ll (n_M + n_U)$. Then
$$PC = \frac{s_M}{n_M}, \qquad RR = 1 - \frac{s_M + s_U}{n_M + n_U}.$$
Pairs completeness (PC) is the number of true matched record pairs generated by a blocking technique divided by the total number of true matched pairs. It measures how effective a blocking technique is in generating true matched record pairs. Pairs completeness corresponds to the recall measure as used in information retrieval [3]. The reduction ratio (RR) measures the reduction of the comparison space, i.e. the more record pairs are removed by a blocking technique, the higher the reduction ratio value becomes. However, reduction ratio does not take the quality of the generated candidate record pairs into account (how many are true matches or not). Besides RR and PC, we also report the total number of candidate record pairs generated, the total processing times used, and the amount of main memory required by a technique for both the blocking and comparison steps. None of these computational resources is taken into account by either RR or PC.
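Both measures are straightforward to compute once the candidate pairs and the true match status are known; an illustrative Python sketch (names are ours):

```python
def pairs_completeness(candidate_pairs, true_matches):
    """PC = s_M / n_M: fraction of all true matches covered by the candidates."""
    s_m = len(candidate_pairs & true_matches)
    return s_m / len(true_matches)

def reduction_ratio(candidate_pairs, num_records, linkage_size=None):
    """RR = 1 - (s_M + s_U) / (n_M + n_U): fraction of the full comparison
    space removed by blocking. For a deduplication the full space is
    n(n-1)/2 pairs, for a linkage it is |A| * |B| pairs."""
    if linkage_size is None:
        total = num_records * (num_records - 1) // 2
    else:
        total = num_records * linkage_size
    return 1.0 - len(candidate_pairs) / total

candidates = {('r1', 'r2'), ('r1', 'r3')}
true_matches = {('r1', 'r2'), ('r4', 'r5')}
print(pairs_completeness(candidates, true_matches))   # 0.5
print(reduction_ratio(candidates, num_records=5))     # 1 - 2/10 = 0.8
```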
3.3 Experimental results

We ran blocking experiments with the four different data set series described above on all blocking techniques presented in Section 2. Tables 1 to 5 show the number of candidate record pairs generated, reduction ratio, pairs completeness, amount of main memory required, and times used. We selected parameter values that are either commonly used (like 2-grams) or that showed good results in our preliminary experiments compared to other settings for the same technique [9]. For the sorted neighbourhood approach, we chose window sizes w = 3 and w = 7; for the suffix array based approach we selected the minimum value length min_len = 3 and a maximum block size max_block; and for string map blocking we chose a dimensionality d = 15 and two nearest-neighbour based parameter values n_loose = 8 and n_tight = 4. All four canopy clustering based techniques presented are based on 2-grams (q = 2), with nearest-neighbour parameter values n_loose = 8 and n_tight = 4, and threshold values t_loose = 0.8 and t_tight = 0.9. The best result (largest percentages and smallest absolute values) in each table column is highlighted in boldface. Missing results indicate that these experiments were computationally not feasible, as more than one million record pairs were generated and more than 2 Gigabytes of main memory were required, resulting in swapping.

Table 4: Main memory required in Megabytes for the deduplication and linkage data sets (smallest numbers per column in bold), for the same blocking techniques and data set sizes as in Table 1.

4. DISCUSSION

Looking at Table 1 it is clear that the various blocking techniques generate very different numbers of candidate record pairs. With the exception of suffix array based blocking, all recently developed techniques generate more pairs than standard blocking, which is due to them inserting a record identifier into more than one block. This can best be seen with the sorted neighbourhood approach, which generates more pairs as larger window sizes are used. With canopy clustering, using Jaccard similarity results in significantly fewer record pairs compared to using TF-IDF/cosine similarity, for both global threshold and nearest-neighbour parameters, with the global threshold method generally generating fewer pairs than the nearest-neighbour approach.

The reduction ratio (RR) results are correlated with the number of record pairs. For all blocking techniques, and for both deduplication and linkage, the RR values are very high for all data set sizes. Having similar RR values for all data set sizes means that the number of candidate record pairs generated is increasing at a quadratic rate, making a blocking technique not scalable. For most techniques, however, RR values are actually increasing slightly with larger data sets, which means these blocking techniques are reducing the complexity of the blocking process to below quadratic.

The quality of the candidate record pairs generated is shown in Table 3 using the pairs completeness (PC) measure. With increasingly large data sets, the PC results for many blocking techniques decrease, especially for the dirty data sets that contain more modifications. For the suffix array and sorted neighbourhood techniques these drops in PC are significant, and indicate that they are not capable of achieving good quality linkage results with large databases that contain many errors and modifications. Canopy clustering with global thresholds seems to be the only technique that achieves high PC results for all data set sizes, with TF-IDF/cosine similarity slightly outperforming Jaccard similarity. Surprisingly, the standard blocking approach also results in fairly good PC results over all data set sizes.

The main memory and timing results presented in Tables 4 and 5 are the totals for both the blocking and comparison steps and include the memory used by the record pair comparison vectors. They thus give an upper limit of the resources required, as the comparison vectors could be stored on disk and be re-loaded for the classification step (resulting in longer run times). The main memory required correlates with the number of candidate record pairs generated, and increases more than linearly with increasing data set sizes. The sorted neighbourhood technique requires the most memory (increasing with window size), as it generates the largest numbers of candidate record pairs, followed by string map based blocking. For canopy clustering, TF-IDF/cosine similarity uses more memory due to the extra storage required for term and inverse document frequencies of q-grams.
For deduplication, suffix array based blocking is by far the fastest method compared to all others (it does, however, generate the smallest number of record pairs and has the lowest PC results), followed by standard blocking and threshold based canopy clustering using Jaccard similarity; while for linkage, standard blocking is the fastest technique, followed by threshold based canopy clustering using Jaccard similarity and suffix array based blocking. String map based blocking is significantly slower than all other techniques, especially for larger data sets. Jaccard similarity outperforms TF-IDF/cosine similarity based canopy clustering in all cases, as it allows pruning of candidate records early in the similarity calculation compared to TF-IDF/cosine similarity [3]. It would thus be the preferred method, compared to TF-IDF/cosine similarity, as both have comparable PC results. To investigate the scalability of the presented blocking techniques, in Figures 2 and 3 we show results for the time required for a single record pair comparison (i.e. the timing results from Table 5 divided by the number of candidate record pairs generated from Table 1). As can be seen, most


More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Sorted Neighborhood Methods Felix Naumann

Sorted Neighborhood Methods Felix Naumann Sorted Neighborhood Methods 2.7.2013 Felix Naumann Duplicate Detection 2 Number of comparisons: All pairs 3 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 1 2 3 4 5 6 7 8 9 10 11 400 comparisons 12

More information

Review of the Robust K-means Algorithm and Comparison with Other Clustering Methods

Review of the Robust K-means Algorithm and Comparison with Other Clustering Methods Review of the Robust K-means Algorithm and Comparison with Other Clustering Methods Ben Karsin University of Hawaii at Manoa Information and Computer Science ICS 63 Machine Learning Fall 8 Introduction

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

Use of Synthetic Data in Testing Administrative Records Systems

Use of Synthetic Data in Testing Administrative Records Systems Use of Synthetic Data in Testing Administrative Records Systems K. Bradley Paxton and Thomas Hager ADI, LLC 200 Canal View Boulevard, Rochester, NY 14623 brad.paxton@adillc.net, tom.hager@adillc.net Executive

More information

Collective Entity Resolution in Relational Data

Collective Entity Resolution in Relational Data Collective Entity Resolution in Relational Data I. Bhattacharya, L. Getoor University of Maryland Presented by: Srikar Pyda, Brett Walenz CS590.01 - Duke University Parts of this presentation from: http://www.norc.org/pdfs/may%202011%20personal%20validation%20and%20entity%20resolution%20conference/getoorcollectiveentityresolution

More information

Record Linkage. with SAS and Link King. Dinu Corbu. Queensland Health Health Statistics Centre Integration and Linkage Unit

Record Linkage. with SAS and Link King. Dinu Corbu. Queensland Health Health Statistics Centre Integration and Linkage Unit Record Linkage with SAS and Link King Dinu Corbu Queensland Health Health Statistics Centre Integration and Linkage Unit Presented at Queensland Users Exploring SAS Technology QUEST 4 June 2009 Basics

More information

A Session-based Ontology Alignment Approach for Aligning Large Ontologies

A Session-based Ontology Alignment Approach for Aligning Large Ontologies Undefined 1 (2009) 1 5 1 IOS Press A Session-based Ontology Alignment Approach for Aligning Large Ontologies Editor(s): Name Surname, University, Country Solicited review(s): Name Surname, University,

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Learning independent, diverse binary hash functions: pruning and locality

Learning independent, diverse binary hash functions: pruning and locality Learning independent, diverse binary hash functions: pruning and locality Ramin Raziperchikolaei and Miguel Á. Carreira-Perpiñán Electrical Engineering and Computer Science University of California, Merced

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

The Data Mining Application Based on WEKA: Geographical Original of Music

The Data Mining Application Based on WEKA: Geographical Original of Music Management Science and Engineering Vol. 10, No. 4, 2016, pp. 36-46 DOI:10.3968/8997 ISSN 1913-0341 [Print] ISSN 1913-035X [Online] www.cscanada.net www.cscanada.org The Data Mining Application Based on

More information

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models

More information

Tuning Large Scale Deduplication with Reduced Effort

Tuning Large Scale Deduplication with Reduced Effort Tuning Large Scale Deduplication with Reduced Effort Guilherme Dal Bianco, Renata Galante, Carlos A. Heuser Universidade Federal do Rio Grande do Sul Porto Alegre, RS, Brazil {gbianco,galante,heuser}@inf.ufrgs.br

More information

A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises

A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises 308-420A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises Section 1.2 4, Logarithmic Files Logarithmic Files 1. A B-tree of height 6 contains 170,000 nodes with an

More information

Ranking Clustered Data with Pairwise Comparisons

Ranking Clustered Data with Pairwise Comparisons Ranking Clustered Data with Pairwise Comparisons Kevin Kowalski nargle@cs.wisc.edu 1. INTRODUCTION Background. Machine learning often relies heavily on being able to rank instances in a large set of data

More information

Clustering & Classification (chapter 15)

Clustering & Classification (chapter 15) Clustering & Classification (chapter 5) Kai Goebel Bill Cheetham RPI/GE Global Research goebel@cs.rpi.edu cheetham@cs.rpi.edu Outline k-means Fuzzy c-means Mountain Clustering knn Fuzzy knn Hierarchical

More information

A Two Stage Similarity aware Indexing for Large Scale Real time Entity Resolution

A Two Stage Similarity aware Indexing for Large Scale Real time Entity Resolution A Two Stage Similarity aware Indexing for Large Scale Real time Entity Resolution Shouheng Li Supervisor: Huizhi (Elly) Liang u4713006@anu.edu.au The Australian National University Outline Introduction

More information

Data Mining and Data Warehousing Classification-Lazy Learners

Data Mining and Data Warehousing Classification-Lazy Learners Motivation Data Mining and Data Warehousing Classification-Lazy Learners Lazy Learners are the most intuitive type of learners and are used in many practical scenarios. The reason of their popularity is

More information

So, we want to perform the following query:

So, we want to perform the following query: Abstract This paper has two parts. The first part presents the join indexes.it covers the most two join indexing, which are foreign column join index and multitable join index. The second part introduces

More information

Document Clustering: Comparison of Similarity Measures

Document Clustering: Comparison of Similarity Measures Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Similarity Search: A Matching Based Approach

Similarity Search: A Matching Based Approach Similarity Search: A Matching Based Approach Anthony K. H. Tung Rui Zhang Nick Koudas Beng Chin Ooi National Univ. of Singapore Univ. of Melbourne Univ. of Toronto {atung, ooibc}@comp.nus.edu.sg rui@csse.unimelb.edu.au

More information

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection Zhenghui Ma School of Computer Science The University of Birmingham Edgbaston, B15 2TT Birmingham, UK Ata Kaban School of Computer

More information

Entity Resolution over Graphs

Entity Resolution over Graphs Entity Resolution over Graphs Bingxin Li Supervisor: Dr. Qing Wang Australian National University Semester 1, 2014 Acknowledgements I would take this opportunity to thank my supervisor, Dr. Qing Wang,

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information

A Case for Merge Joins in Mediator Systems

A Case for Merge Joins in Mediator Systems A Case for Merge Joins in Mediator Systems Ramon Lawrence Kirk Hackert IDEA Lab, Department of Computer Science, University of Iowa Iowa City, IA, USA {ramon-lawrence, kirk-hackert}@uiowa.edu Abstract

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Collaborative Filtering using a Spreading Activation Approach

Collaborative Filtering using a Spreading Activation Approach Collaborative Filtering using a Spreading Activation Approach Josephine Griffith *, Colm O Riordan *, Humphrey Sorensen ** * Department of Information Technology, NUI, Galway ** Computer Science Department,

More information

3. Data Preprocessing. 3.1 Introduction

3. Data Preprocessing. 3.1 Introduction 3. Data Preprocessing Contents of this Chapter 3.1 Introduction 3.2 Data cleaning 3.3 Data integration 3.4 Data transformation 3.5 Data reduction SFU, CMPT 740, 03-3, Martin Ester 84 3.1 Introduction Motivation

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

2. Data Preprocessing

2. Data Preprocessing 2. Data Preprocessing Contents of this Chapter 2.1 Introduction 2.2 Data cleaning 2.3 Data integration 2.4 Data transformation 2.5 Data reduction Reference: [Han and Kamber 2006, Chapter 2] SFU, CMPT 459

More information

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE Sundari NallamReddy, Samarandra Behera, Sanjeev Karadagi, Dr. Anantha Desik ABSTRACT: Tata

More information

Exploring Performance Tradeoffs in a Sudoku SAT Solver CS242 Project Report

Exploring Performance Tradeoffs in a Sudoku SAT Solver CS242 Project Report Exploring Performance Tradeoffs in a Sudoku SAT Solver CS242 Project Report Hana Lee (leehana@stanford.edu) December 15, 2017 1 Summary I implemented a SAT solver capable of solving Sudoku puzzles using

More information

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Scene Completion Problem The Bare Data Approach High Dimensional Data Many real-world problems Web Search and Text Mining Billions

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

School of Computer and Information Science

School of Computer and Information Science School of Computer and Information Science CIS Research Placement Report Multiple threads in floating-point sort operations Name: Quang Do Date: 8/6/2012 Supervisor: Grant Wigley Abstract Despite the vast

More information

An Ensemble Approach for Record Matching in Data Linkage

An Ensemble Approach for Record Matching in Data Linkage Digital Health Innovation for Consumers, Clinicians, Connectivity and Community A. Georgiou et al. (Eds.) 2016 The authors and IOS Press. This article is published online with Open Access by IOS Press

More information

Lecture 24: Image Retrieval: Part II. Visual Computing Systems CMU , Fall 2013

Lecture 24: Image Retrieval: Part II. Visual Computing Systems CMU , Fall 2013 Lecture 24: Image Retrieval: Part II Visual Computing Systems Review: K-D tree Spatial partitioning hierarchy K = dimensionality of space (below: K = 2) 3 2 1 3 3 4 2 Counts of points in leaf nodes Nearest

More information

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017 CPSC 340: Machine Learning and Data Mining Finding Similar Items Fall 2017 Assignment 1 is due tonight. Admin 1 late day to hand in Monday, 2 late days for Wednesday. Assignment 2 will be up soon. Start

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Clustering in Data Mining

Clustering in Data Mining Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Robust PDF Table Locator

Robust PDF Table Locator Robust PDF Table Locator December 17, 2016 1 Introduction Data scientists rely on an abundance of tabular data stored in easy-to-machine-read formats like.csv files. Unfortunately, most government records

More information

Spatial Outlier Detection

Spatial Outlier Detection Spatial Outlier Detection Chang-Tien Lu Department of Computer Science Northern Virginia Center Virginia Tech Joint work with Dechang Chen, Yufeng Kou, Jiang Zhao 1 Spatial Outlier A spatial data point

More information

A Probabilistic Deduplication, Record Linkage and Geocoding System

A Probabilistic Deduplication, Record Linkage and Geocoding System A Probabilistic Deduplication, Record Linkage and Geocoding System http://datamining.anu.edu.au/linkage.html Peter Christen 1 and Tim Churches 2 1 Department of Computer Science, Australian National University,

More information

COLLABORATIVE LOCATION AND ACTIVITY RECOMMENDATIONS WITH GPS HISTORY DATA

COLLABORATIVE LOCATION AND ACTIVITY RECOMMENDATIONS WITH GPS HISTORY DATA COLLABORATIVE LOCATION AND ACTIVITY RECOMMENDATIONS WITH GPS HISTORY DATA Vincent W. Zheng, Yu Zheng, Xing Xie, Qiang Yang Hong Kong University of Science and Technology Microsoft Research Asia WWW 2010

More information

Edge Classification in Networks

Edge Classification in Networks Charu C. Aggarwal, Peixiang Zhao, and Gewen He Florida State University IBM T J Watson Research Center Edge Classification in Networks ICDE Conference, 2016 Introduction We consider in this paper the edge

More information

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS TRIE BASED METHODS FOR STRING SIMILARTIY JOINS Venkat Charan Varma Buddharaju #10498995 Department of Computer and Information Science University of MIssissippi ENGR-654 INFORMATION SYSTEM PRINCIPLES RESEARCH

More information

Static Pruning of Terms In Inverted Files

Static Pruning of Terms In Inverted Files In Inverted Files Roi Blanco and Álvaro Barreiro IRLab University of A Corunna, Spain 29th European Conference on Information Retrieval, Rome, 2007 Motivation : to reduce inverted files size with lossy

More information

A Re-examination of Limited Discrepancy Search

A Re-examination of Limited Discrepancy Search A Re-examination of Limited Discrepancy Search W. Ken Jackson, Morten Irgens, and William S. Havens Intelligent Systems Lab, Centre for Systems Science Simon Fraser University Burnaby, B.C., CANADA V5A

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Pedestrian Detection Using Correlated Lidar and Image Data EECS442 Final Project Fall 2016

Pedestrian Detection Using Correlated Lidar and Image Data EECS442 Final Project Fall 2016 edestrian Detection Using Correlated Lidar and Image Data EECS442 Final roject Fall 2016 Samuel Rohrer University of Michigan rohrer@umich.edu Ian Lin University of Michigan tiannis@umich.edu Abstract

More information

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing Unsupervised Data Mining: Clustering Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 1. Supervised Data Mining Classification Regression Outlier detection

More information

Evaluating Record Linkage Software Using Representative Synthetic Datasets

Evaluating Record Linkage Software Using Representative Synthetic Datasets Evaluating Record Linkage Software Using Representative Synthetic Datasets Benmei Liu, Ph.D. Mandi Yu, Ph.D. Eric J. Feuer, Ph.D. National Cancer Institute Presented at the NAACCR Annual Conference June

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

UNIT III BALANCED SEARCH TREES AND INDEXING

UNIT III BALANCED SEARCH TREES AND INDEXING UNIT III BALANCED SEARCH TREES AND INDEXING OBJECTIVE The implementation of hash tables is frequently called hashing. Hashing is a technique used for performing insertions, deletions and finds in constant

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Entity Resolution, Clustering Author References

Entity Resolution, Clustering Author References , Clustering Author References Vlad Shchogolev vs299@columbia.edu May 1, 2007 Outline 1 What is? Motivation 2 Formal Definition Efficieny Considerations Measuring Text Similarity Other approaches 3 Clustering

More information

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Indexing Week 14, Spring 2005 Edited by M. Naci Akkøk, 5.3.2004, 3.3.2005 Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Overview Conventional indexes B-trees Hashing schemes

More information

Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage

Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage Zhe Wang Princeton University Jim Gemmell Microsoft Research September 2005 Technical Report MSR-TR-2006-30 Microsoft Research Microsoft

More information

1-Nearest Neighbor Boundary

1-Nearest Neighbor Boundary Linear Models Bankruptcy example R is the ratio of earnings to expenses L is the number of late payments on credit cards over the past year. We would like here to draw a linear separator, and get so a

More information

CHAPTER 4 K-MEANS AND UCAM CLUSTERING ALGORITHM

CHAPTER 4 K-MEANS AND UCAM CLUSTERING ALGORITHM CHAPTER 4 K-MEANS AND UCAM CLUSTERING 4.1 Introduction ALGORITHM Clustering has been used in a number of applications such as engineering, biology, medicine and data mining. The most popular clustering

More information

A GENETIC ALGORITHM FOR CLUSTERING ON VERY LARGE DATA SETS

A GENETIC ALGORITHM FOR CLUSTERING ON VERY LARGE DATA SETS A GENETIC ALGORITHM FOR CLUSTERING ON VERY LARGE DATA SETS Jim Gasvoda and Qin Ding Department of Computer Science, Pennsylvania State University at Harrisburg, Middletown, PA 17057, USA {jmg289, qding}@psu.edu

More information