Performance and scalability of fast blocking techniques for deduplication and data linkage


Peter Christen
Department of Computer Science, The Australian National University, Canberra ACT 0200, Australia

ABSTRACT

An increasingly important step in many data management and analysis projects is the matching and aggregation of records from several databases, as linked data can contain information that is not available otherwise, or that would be too expensive to collect. Similarly important is finding duplicate records in a single database. A main challenge when linking large databases is the complexity of the linkage process: potentially each record in one database has to be compared with all records in the other database. For deduplication, each record potentially has to be compared to all others. Various techniques, collectively known as blocking, have been developed in the past to deal with this quadratic complexity. In this paper, we discuss several recently developed blocking techniques and empirically evaluate them within a common framework with regard to their performance (quality of the candidate record pairs generated) and scalability (processing times and main memory required). As our experiments show, some techniques require large main memory data structures or have long processing times, which makes their application to very large databases problematic.

A related paper, titled "Improving data linkage and deduplication quality through nearest-neighbour based blocking" [9], which describes similar experiments, has been submitted by the author to KDD'07. It does not, however, address complexity and scalability issues at all and is focussed on modifications to existing blocking techniques that improve the quality of the candidate record pairs generated. The paper can be downloaded from:

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM. VLDB '07, September 23-28, 2007, Vienna, Austria. Copyright 2007 VLDB Endowment, ACM.

1. INTRODUCTION

With many businesses, government agencies and research projects collecting massive amounts of data, techniques that allow efficient processing, analysing and mining of large databases have in recent years attracted interest from both academia and industry. An increasingly important task in the data pre-processing step of many data management and analysis projects is detecting and removing duplicate records that relate to the same entity within one database. Similarly, matching records relating to the same entity from several databases is often required, as information from multiple sources needs to be integrated, combined or linked in order to enrich data and allow more detailed data mining studies. The aim of such linkages is to match and aggregate all records relating to the same entity, such as a patient, a customer, a business, a product, or a genome sequence. Data (or record) linkage and deduplication are also called object identification, merge/purge processing, data matching, data cleansing, duplicate detection, entity resolution, list washing or ETL (extraction, transformation and loading).
They can be used to improve data quality and integrity, to allow re-use of existing data sources for new studies, and to reduce costs and efforts in data acquisition [6, 7]. In the health sector, for example, linked data might contain information that is needed to improve health policies, and that traditionally has been collected with time consuming and expensive survey methods [3]. Statistical agencies routinely link census data for further analysis [8, 9], while businesses often deduplicate their databases to compile mailing lists, or link them for collaborative e-commerce projects or when integrating databases from different sources. Within taxation offices and departments of social security, data linkage is used to identify people who register for assistance multiple times or who work and collect unemployment benefits. Another application of current interest is the use of data linkage in crime and terror detection. Security agencies and crime investigators increasingly rely on the ability to quickly access files for a particular individual, which may help to prevent crimes and terror by early intervention.

The problem of finding similar entities does not only apply to records that refer to persons. In bioinformatics, data linkage can help find genome sequences in large data collections that are similar to a new, unknown sequence at hand. The removal of duplicates in the results returned by Web search engines and automatic text indexing systems, where copies of documents (for example bibliographic citations) have to be identified and filtered out before being presented to the user, is increasingly important. Finding and comparing consumer products from several online stores is another application of growing interest [4]. As product descriptions are often slightly different, linking them becomes difficult.

If unique entity identifiers (or keys) are available in all the databases to be linked, then the problem of linking at the entity level becomes trivial: a simple database join is all that is required [6].

Figure 1: General data linkage process. The output of the blocking step is candidate record pairs, while the comparison step produces vectors with numerical matching weights.

However, in most cases no unique keys are shared by all databases, and more sophisticated linkage techniques need to be applied. These techniques can be broadly classified into deterministic, probabilistic, and modern approaches [, 8]. A general schematic outline of the data linkage process is shown in Figure 1. As most real-world data collections contain noisy, incomplete and incorrectly formatted information, data cleaning and standardisation are important pre-processing steps for successful data linkage (and also before data can be loaded into data warehouses or used for further analysis [6]). A lack of good quality data can be one of the biggest obstacles to successful data linkage and deduplication [3]. The main task of data cleaning and standardisation is the conversion of the raw input data into well defined, consistent forms, as well as the resolution of inconsistencies in the way information is represented and encoded [].

If two databases, A and B, are to be linked, potentially each record from A has to be compared with all records from B. The total number of potential record pair comparisons thus equals the product of the sizes of the two databases, $|A| \cdot |B|$, with $|\cdot|$ denoting the number of records in a database. Similarly, when deduplicating a database A, the total number of potential comparisons is $|A|(|A|-1)/2$, as each record potentially has to be compared with all others. The performance bottleneck in a data linkage or deduplication system is usually the expensive detailed comparison of fields (or attributes) between pairs of records [3, ], making it unfeasible to compare all pairs when the databases are large. For example, linking two databases with 100,000 records each would result in $10^{10}$ (ten billion) record pair comparisons. On the other hand, the maximum number of true matches corresponds to the number of records in the smaller database (assuming there are no duplicate records in the databases, and one record in database A can be matched with only one record in database B, and vice versa). Therefore, while the computational efforts increase quadratically, the number of potential true matches only increases linearly when linking larger databases. This also holds for deduplication, where the number of duplicate records is always less than the number of records in a database [].

To reduce the large number of potential record pair comparisons, traditional data linkage techniques [7, 9] employ blocking [3]: a single record attribute, or a combination of attributes (called the blocking variable or blocking key), is used to split the databases into blocks. All records having the same value in the blocking key will be inserted into the same block, and candidate record pairs are then generated only from records within the same block. While the aim of blocking is to reduce the number of record pair comparisons made as much as possible (by eliminating pairs of records that obviously are not matches), it is important that no potential true matches are removed by the blocking process. Two main issues have to be considered when blocking keys are defined.
First, the error characteristics of the attributes used will influence the quality of the generated candidate record pairs. Ideally, attributes containing the fewest errors or missing values should be chosen, as any error in an attribute value in a blocking key will potentially result in a record being inserted into the wrong block, thus missing true matches. The second issue to consider is the frequency distribution of values in the attributes used as blocking keys, as this will affect the size of the blocks that will be generated. For a linkage, if $m$ records are in a block from database A and $n$ records in the same block from database B, then $m \cdot n$ record pairs will be generated from this block; while for a deduplication, if $m$ records are in a block, $m(m-1)/2$ record pairs will be generated (note that blocks containing only one record will not contribute any record pairs). The largest blocks will dominate the execution time of the comparison step, as they will contribute very large numbers of record pairs. When defining blocking keys, there is also a trade-off to be considered. Having a large number of smaller blocks will mean that fewer candidate record pairs will be generated, but will likely result in more true matches being missed; while blocking keys that result in larger blocks will generate more record pairs that will likely cover more of the true matches, but at the cost of having to compare many more pairs. Some of the blocking techniques discussed in this paper allow the size of blocks to be controlled directly through parameters, while for other techniques the block sizes mainly depend upon characteristics of the data.

All the candidate record pairs generated by the blocking process are compared using a variety of comparison functions applied to one or more (or a combination of) record attributes. These functions can be as simple as an exact string or numerical comparison, can take variations and typographical errors into account [, 8], or can be as complex as a distance comparison based on look-up tables of geographic locations (longitudes and latitudes). Each comparison returns a numerical value (called a matching weight), often positive for agreeing and negative for disagreeing values. A vector is formed for each compared record pair containing all the values calculated by the different comparison functions. These vectors are then used to classify record pairs into matches, non-matches, and possible matches (depending upon the decision model used) [, 7]. Record pairs that were removed by the blocking process are classified as non-matches without being compared explicitly.
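As a small illustration of this comparison step (illustrative Python only, not the Febrl comparison functions; all names and weights are ours), a candidate record pair can be turned into a vector of matching weights as follows:

```python
from difflib import SequenceMatcher

def exact_weight(v1, v2):
    """Simple agree/disagree comparison: positive weight for agreement,
    negative weight for disagreement."""
    return 1.0 if v1 == v2 else -1.0

def approx_weight(v1, v2):
    """Approximate string comparison scaled into [-1, 1], so that small
    typographical variations still yield a (reduced) positive weight."""
    return 2.0 * SequenceMatcher(None, v1, v2).ratio() - 1.0

def comparison_vector(rec_a, rec_b):
    """One weight per compared attribute; a classifier decides on the vector."""
    return [
        approx_weight(rec_a['surname'], rec_b['surname']),
        approx_weight(rec_a['given_name'], rec_b['given_name']),
        exact_weight(rec_a['postcode'], rec_b['postcode']),
    ]

rec_a = {'surname': 'christen', 'given_name': 'peter', 'postcode': '2602'}
rec_b = {'surname': 'kristen',  'given_name': 'peter', 'postcode': '2612'}
# [0.6, 1.0, -1.0]: surname partly agrees, given name agrees, postcode differs.
print(comparison_vector(rec_a, rec_b))
```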

Current research into blocking can be categorised into two areas. The first is developing new, and improving existing, blocking techniques with the aim of making them more scalable (i.e. allowing the linkage of very large databases with many millions of records) while enabling high-quality linkage results, by generating (i.e. not removing) as many true matches as possible. Besides the standard blocking approach, new techniques developed in the past decade include the sorted neighbourhood approach [], q-gram based blocking [3, 6,, ], high-dimensional overlapping clustering [4, 4], mapping strings into multi-dimensional space followed by similarity joins [], and suffix-array based blocking []. These techniques will be presented in detail in Section 2 and evaluated experimentally in Section 3. The second area of current blocking research is into approaches that learn how to optimally choose blocking keys. Traditionally, the choice of blocking keys is made manually by domain and data linkage experts. Two approaches based on supervised learning of blocking keys have recently been presented. They employ predicate-based formulations of learnable blocking functions [5] and the sequential covering algorithm, which discovers disjunctive sets of rules [5], respectively. The aim of both approaches is to find blocking keys such that the number of true matches in the candidate record pairs is maximised, while keeping the total number of candidate pairs as small as possible. They rely on training examples, i.e. pairs of true matched and true non-matched record pairs, which are often not available in real world situations, or have to be prepared manually (an expensive and time consuming process).

1.1 Contributions

We have recently presented an overview and evaluation of six blocking techniques in [9], where we implemented a modification of two techniques, replacing global threshold values with nearest-neighbour parameters. This resulted in significantly more true matched record pairs being generated by the blocking process, and also much improved robustness of these blocking techniques to changes in parameter settings. We discussed the quality of the candidate record pairs generated by the various evaluated blocking techniques.

In this paper, we focus on the performance and scalability of various blocking techniques. Specifically, we evaluate how processing times and amounts of main memory required scale with the size of the databases to be linked or deduplicated. To the best of our knowledge, no such experimental evaluation of different blocking techniques within a common framework has been published. We are only aware of one earlier study [3], similar to our experiments presented in [9], that compared the quality of candidate record pairs produced by three blocking techniques.

2. BLOCKING TECHNIQUES

When linking large databases, blocking is essential in order to make data linkage and deduplication possible at all, and to improve the efficiency of the linkage process. Blocking, however, will reduce the linkage quality [], as it is very likely that some true matched record pairs will be removed by the blocking process if records are not inserted into the correct block (or blocks) due to variations and errors in their blocking key values. The blocking step from Figure 1 can be split into the following two sub-steps.

1. Build: All records from the database (or databases) are read, the blocking key values are created, and the records are inserted into a suitable index data structure. For most blocking techniques, a basic inverted index [3] can be used. The blocking key values will become the keys of the inverted index, and the record identifiers of all records that have the same blocking key value will be inserted into the same inverted index list.
The record attribute values required in the comparison step will be inserted into another index data structure, for example a hash-table with record identifiers as keys (which can be main memory or disk based). Additional data structures will be built as required by a blocking technique (as described below).

2. Retrieve: Record identifiers are retrieved from the index data structure block by block, and the corresponding record attribute values required for the comparisons are retrieved. Candidate record pairs are then generated from these records. For a linkage, all records in a block from one database will be paired with all records in the same block from the other database, while for a deduplication each record in a block will be paired with all other records in the same block. The generated candidate record pairs are then compared, and the resulting vectors containing the numerical comparison values are given to a classifier.

In this paper, we are interested in the performance and scalability of the blocking process, i.e. how many candidate record pairs are generated, how many of these are true matches, how much memory is required by the index data structures, and how long it takes to generate the candidate record pairs. For some blocking techniques, the above two steps, together with the record pair comparisons, can be performed in an overlapping fashion, as discussed in detail in Section 2.6 below. Others require that the Build step is completed before the Retrieve step can be started.

We will use the following notation: $n_A$ and $n_B$ are the total number of records in databases A and B. The average length in characters of blocking key values is denoted with $c$, while $r$ is the length of a record identifier, $p$ the length of a pointer, and $f$ the length of a floating-point number (all in bytes). $B_A$ and $B_B$ are the sets of unique blocking key values generated by a blocking technique from the records in databases A and B, respectively. The function $block(A, b_i)$ returns the record identifiers from the block with blocking key value $b_i \in B_A$ from database A. The number of elements in a set is denoted with $|\cdot|$.

2.1 Standard blocking

This technique has been used in data linkage for many years [7]. All records having the same value in a blocking key are inserted into the same block, and only records within the same block are then compared with each other. Each record will be inserted into one block only, and thus $n_A = \sum_{b_i \in B_A} |block(A, b_i)|$ and $n_B = \sum_{b_j \in B_B} |block(B, b_j)|$. The number of record pairs generated by standard blocking for a linkage is
$$\sum_{b_i \in (B_A \cap B_B)} |block(A, b_i)| \cdot |block(B, b_i)|,$$
and for a deduplication it is
$$\sum_{b_i \in B_A} |block(A, b_i)| \, (|block(A, b_i)| - 1)/2.$$
Standard blocking can be implemented efficiently using an inverted index [3], as described in the Build step in Section 2 above. In the Retrieve step, all record identifiers from a single inverted index list are extracted and record pairs are generated from them.
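To make the Build and Retrieve steps concrete, the following minimal Python sketch (illustrative only, not the Febrl implementation; function and variable names are ours) builds an inverted index on a blocking key and generates candidate pairs for a deduplication:

```python
from collections import defaultdict
from itertools import combinations

def build_index(records, blocking_key):
    """Build step: map each blocking key value to the identifiers of all
    records that have this value."""
    index = defaultdict(list)
    for rec_id, rec in records.items():
        index[blocking_key(rec)].append(rec_id)
    return index

def candidate_pairs_dedup(index):
    """Retrieve step: pair all records that share a blocking key value.
    A block with m records contributes m(m-1)/2 candidate pairs."""
    for rec_ids in index.values():
        yield from combinations(sorted(rec_ids), 2)

records = {
    'r1': {'surname': 'christen', 'postcode': '2602'},
    'r2': {'surname': 'christen', 'postcode': '2612'},
    'r3': {'surname': 'kristen',  'postcode': '2602'},
}
# Blocking key: first four surname characters plus first two postcode digits.
index = build_index(records, lambda r: r['surname'][:4] + r['postcode'][:2])
print(list(candidate_pairs_dedup(index)))   # [('r1', 'r2')]
```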

The two major drawbacks of standard blocking are errors in the blocking key values that will result in records being inserted into the wrong block, and the sizes of the blocks generated. As these sizes depend upon the frequency distributions of the blocking key values, it is difficult to predict the total number of candidate record pairs generated by this blocking technique. In the ideal situation, where all blocks contain the same number of records, i.e. $|block(A, b_i)| = n_A/|B_A|$ for all $b_i \in B_A$ (and similarly for database B), the number of candidate record pairs generated for a linkage will be [5]
$$|B_A \cap B_B| \cdot \frac{n_A}{|B_A|} \cdot \frac{n_B}{|B_B|},$$
and for a deduplication it will be
$$|B_A| \cdot \frac{n_A}{|B_A|} \left( \frac{n_A}{|B_A|} - 1 \right) / 2.$$
This situation will be very hard to achieve in practice, and thus the number of pairs generated will likely be larger, as larger blocks will generate an increased number of pairs. The memory requirement for the inverted index in standard blocking is $(c+p)(|B_A| + |B_B|) + r(n_A + n_B)$ for a linkage and $(c+p)|B_A| + r\,n_A$ for a deduplication.

2.2 Sorted neighbourhood

First proposed in the mid 1990s [], the basic idea behind this technique is to sort the blocking key values once the basic inverted index has been created, and to move a window of fixed size $w$ sequentially over these sorted values. Sorting the blocking key values (which will require $O(|B_A| \log |B_A|)$ and $O(|B_B| \log |B_B|)$ steps) and using a window of size $w > 1$ will insert records into the same block that have similar, not just exactly the same, blocking key values. Blocks are formed from the record identifiers in the inverted index lists of all blocking key values in the current window. (Note that our implementation of this technique is different from the one originally presented [], as we only sort the blocking key values rather than all the records in the database.)

We denote the union of all blocking key values from two databases A and B with $B_{AB} = B_A \cup B_B$, and the elements $i$ to $j$ (with $i \leq j$) in the sorted blocking key values from both databases with $B_{AB}[i, j]$. The number of candidate record pairs generated by this blocking technique with a window size $w$ for a linkage is
$$\sum_{k=1}^{|B_{AB}| - w + 1} \Big| \bigcup_{b_i \in B_{AB}[k, k+w-1]} block(A, b_i) \Big| \cdot \Big| \bigcup_{b_j \in B_{AB}[k, k+w-1]} block(B, b_j) \Big|,$$
and for a deduplication it is
$$\sum_{k=1}^{|B_A| - w + 1} \Big| \bigcup_{b_i \in B_A[k, k+w-1]} block(A, b_i) \Big| \cdot \Big( \Big| \bigcup_{b_i \in B_A[k, k+w-1]} block(A, b_i) \Big| - 1 \Big) / 2.$$
If the window size is $w = 1$, the sorted neighbourhood technique becomes standard blocking as described above. Therefore, for all window sizes $w > 1$, the generated record pairs will be a super-set of the pairs generated by standard blocking. In general, for two window sizes $w_i$ and $w_j$, with $w_i < w_j$, all record pairs generated with window size $w_i$ will also be in the pairs generated with $w_j$. However, the larger the window size, the larger the blocks become. A disadvantage of this method is that, similarly to standard blocking, larger blocks will dominate the performance, as large numbers of record pairs will be generated. In the ideal situation, where all blocks contain the same number of records, the number of candidate record pairs generated for a linkage will be [3, 5]
$$(|B_{AB}| - w + 1) \left( w \frac{n_A}{|B_A|} \right) \left( w \frac{n_B}{|B_B|} \right),$$
and for a deduplication it will be
$$(|B_A| - w + 1) \left( w \frac{n_A}{|B_A|} \right) \left( w \frac{n_A}{|B_A|} - 1 \right) / 2.$$
The other major disadvantage is that the sorting process assumes that the beginning of blocking key values is error-free (especially the initial character), as otherwise sorting will move similar values away from each other. For example, if the blocking key values are given names, christina and kristina will very likely be too far away in the sorted values to be covered by the same window, even though they are very similar. As with standard blocking, it is good practice to define several blocking keys, resulting in several sorted lists of values [], and to use pre-processing, like phonetic encodings [8], to bring similar values close together. The memory requirement for the inverted index in the sorted neighbourhood blocking technique is the same as with standard blocking for both linkage and deduplication.
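As a minimal sketch of this window-based approach (illustrative Python, not the Febrl code; it assumes an inverted index as built for standard blocking):

```python
from itertools import combinations

def sorted_neighbourhood_pairs_dedup(index, w):
    """Slide a window of size w over the sorted blocking key values and
    pair all records whose key values fall into the same window position."""
    sorted_keys = sorted(index)
    pairs = set()
    for k in range(max(len(sorted_keys) - w + 1, 1)):
        window_keys = sorted_keys[k:k + w]
        # Union of the inverted index lists of all keys in the current window.
        rec_ids = sorted(set().union(*(index[key] for key in window_keys)))
        pairs.update(combinations(rec_ids, 2))
    return pairs

# With w = 1 this reduces to standard blocking; a larger w also pairs records
# whose blocking key values are merely close to each other in sort order.
index = {'christen': ['r1', 'r2'], 'christina': ['r3'], 'kristina': ['r4']}
print(sorted(sorted_neighbourhood_pairs_dedup(index, w=2)))
```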
2.3 Canopy clustering

The idea behind this recently developed technique [4, 4] is to use a computationally cheap similarity measure to efficiently construct high-dimensional, overlapping clusters, called canopies, and to then extract blocks from these clusters. As cheap similarity measures, the Jaccard or TF-IDF/cosine similarities (as used in information retrieval) [3] can be used. Both are based on q-grams (or, more generally, tokens [4]) and can be implemented efficiently using an inverted index with the q-grams (rather than the blocking key values) as index keys. For both similarity measures, in the Build step the blocking key values are first converted into q-gram lists, and then each q-gram is inserted into an inverted index. For TF-IDF/cosine similarity, additional information has to be calculated and stored: for each unique q-gram the number of records that contain this q-gram, i.e. its term frequency (TF); and within the inverted index the document frequency (DF) for each q-gram in each record (i.e. the frequency of a q-gram in a blocking key value). Once all records in a database have been read and processed, the TF and DF values can be normalised and the inverse document frequency (IDF) can be calculated. No such frequency information or normalisation is required for Jaccard similarity.

In the Retrieve step, all records are initially inserted into a pool of candidate records. Canopy clusters are then generated by randomly selecting a record (which will become the centroid of the cluster) from the pool, and adding all records from the pool into the cluster that are similar to this centroid record.

In the traditional approach [4, 4], all records closer to the centroid than a loose similarity threshold, $t_{loose}$, are inserted into the canopy. Of these, all records within a tight similarity threshold $t_{tight}$ (with $t_{tight} \geq t_{loose}$) are removed from the candidate pool of records. This process is repeated until no candidate records are left in the pool. Note that if both thresholds $t_{loose}$ and $t_{tight}$ are set to a value of 1.0 (i.e. only exact similarity), the canopy clustering technique reduces to standard blocking.

Similar to standard blocking and the sorted neighbourhood approach, canopy clustering with global thresholds $t_{loose}$ and $t_{tight}$ will result in blocks (i.e. canopy clusters) of different sizes (even though the TF-IDF/cosine similarity measure to some degree adjusts similarity values according to the frequency of the q-grams in the blocking key values). We have modified [9] the canopy clustering approach by replacing the two global thresholds with two nearest-neighbour based parameters: $n_{tight}$ and $n_{loose}$ (with $n_{tight} \leq n_{loose}$), both being integer numbers. $n_{loose}$ is the number of closest records to the randomly chosen centroid record (according to the similarities of their blocking key values) that will be inserted into the canopy cluster, and the $n_{tight}$ closest records of these will then be removed from the pool of candidate records. This approach results in blocks of similar sizes, with the maximum size known beforehand as $n_{loose}$. This not only prevents very large blocks, but also allows an estimate of the number of record pairs generated. Experiments [9] showed that using such nearest-neighbour parameters can result in a significant increase in the quality of the candidate record pairs generated (i.e. more true matches) compared to global thresholds, and also in a higher robustness of the canopy clustering blocking technique with regard to changes in parameter settings.

Predicting the number of candidate record pairs generated by the global threshold based approach is very difficult, as it depends upon the actual values in the blocking keys as well as the chosen threshold values. In one extreme case, one can imagine that all values are very similar (for example having only one character difference) and thus all records are inserted into the same canopy, while on the other hand all blocking key values can be very different from each other and result in similarities below $t_{loose}$, so that each record forms its own block. Using nearest-neighbour based parameters, on the other hand, allows a better estimation of the number of record pairs generated, as the number of blocks (canopy clusters) corresponds to $|B_A|/n_{tight}$ and $|B_B|/n_{tight}$, and each block will contain $n_{loose}$ records. If we denote the set of canopies common to databases A and B as $C$, with $|C| \leq \min(|B_A|/n_{tight}, |B_B|/n_{tight})$, and assume each canopy contains $n_{loose}$ records from both databases, then the number of candidate record pairs generated for a linkage is
$$|C| \cdot n_{loose}^2,$$
and for a deduplication it is
$$\frac{|B_A|}{n_{tight}} \cdot n_{loose}(n_{loose} - 1)/2.$$
The memory requirement for the inverted index in canopy clustering depends upon the number of unique q-grams. A blocking key value of length $l$ characters contains $(l - q + 1)$ q-grams, and if $Q$ is the set of all unique q-grams generated, the amount of memory required for Jaccard similarity is $|Q|(p + q) + r(l - q + 1)(n_A + n_B)$ for a linkage and $|Q|(p + q) + r(l - q + 1)\,n_A$ for a deduplication. The TF-IDF/cosine similarity additionally requires storage of the DF and IDF values, which amounts to $f(|Q| + (l - q + 1)(n_A + n_B))$ for a linkage and $f(|Q| + (l - q + 1)\,n_A)$ for a deduplication.
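The following illustrative Python sketch (not the Febrl implementation; all names are ours) shows the nearest-neighbour variant of the Retrieve step, using q-gram Jaccard similarity as the cheap measure:

```python
import random

def qgrams(value, q=2):
    """Return the set of q-grams of a blocking key value."""
    return {value[i:i + q] for i in range(len(value) - q + 1)}

def jaccard(g1, g2):
    return len(g1 & g2) / len(g1 | g2) if g1 | g2 else 0.0

def canopy_blocks_nn(bkv_of, n_loose=8, n_tight=4, q=2):
    """Nearest-neighbour canopy clustering over blocking key values.
    bkv_of maps record identifiers to blocking key values."""
    grams = {rid: qgrams(bkv, q) for rid, bkv in bkv_of.items()}
    pool = set(grams)
    blocks = []
    while pool:
        centroid = random.choice(sorted(pool))
        # Rank the pool by similarity to the centroid's q-gram set.
        ranked = sorted(pool, key=lambda rid: jaccard(grams[centroid], grams[rid]),
                        reverse=True)
        blocks.append(ranked[:n_loose])    # the n_loose most similar records
        pool -= set(ranked[:n_tight])      # remove the n_tight closest from the pool
    return blocks

print(canopy_blocks_nn({'r1': 'christen', 'r2': 'kristen', 'r3': 'smith'},
                       n_loose=2, n_tight=1))
```

With $n_{tight} = n_{loose}$ the canopies become disjoint; a smaller $n_{tight}$ leaves records in the pool after they have been placed into a canopy, which is what produces the overlapping blocks described above.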
2.4 String map based blocking

This technique [] is based on the idea of mapping the blocking key values (assumed to be strings) to objects in a multi-dimensional Euclidean space, such that similarities (or distances, like edit distance [8]) between pairs of strings are preserved, followed by finding pairs of objects in this space that are similar to each other. In [], the authors modify the FastMap [6] algorithm into StringMap, which has a linear complexity in the number of strings to be mapped. This algorithm iterates over $d$ dimensions; for each it finds two pivot strings, then forms orthogonal directions and calculates the coordinates of all other strings on these directions. In the second step, the authors use R-trees as the multi-dimensional data structure, in combination with a queue, to efficiently retrieve pairs of similar strings []. Choosing an appropriate dimensionality $d$ is done using a heuristic approach that tries a range of dimensions and selects the one that minimises a cost function (dimensions between 15 and 25 typically seem to achieve good results) [].

In our implementation, we replaced the R-tree data structure with a grid based index [], as most tree-based multi-dimensional index structures degrade rapidly with increasing dimensionality. It is reported [] that with more than 15 to 20 dimensions, in most tree based indices all objects in an index will be accessed when performing similarity searches. Our grid based index works by having a regular grid of dimensionality $d$ implemented as an inverted index in each dimension (i.e. all objects mapped into the same grid cell in a dimension are inserted into the same inverted index list). The Retrieve step is implemented in a similar way as in canopy clustering described above. An object (blocking key value) is randomly picked from the pool of (initially all) objects, and the objects in the same, as well as in neighbouring, grid cells are retrieved from the index. Similar to canopy clustering, two thresholds $t_{loose}$ and $t_{tight}$ can be used to insert similar objects into clusters. Alternatively, nearest-neighbour parameters $n_{loose}$ and $n_{tight}$ can be used in the same way as described for canopy clustering above [9].

Predicting the number of candidate record pairs generated by the global threshold based approach is very difficult, as it depends upon the distribution of the objects in the multi-dimensional space (which depends upon the values of the blocking keys), as well as the chosen threshold values. For the nearest-neighbour based approach, however, the same estimations as described in Section 2.3 can be used. The memory requirements for string map based blocking include the inverted index (which is the same as in standard blocking), plus the $d$-dimensional array containing the multi-dimensional mappings of the blocking key values, and the index grid. The total memory requirement is $(c + p + fd + cd)(|B_A| + |B_B|) + r(n_A + n_B)$ for a linkage and $(c + p + fd + cd)|B_A| + r\,n_A$ for a deduplication.
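A minimal sketch of such a grid based index, in illustrative Python (the StringMap coordinates are assumed to be given; names and the cell size are ours):

```python
from collections import defaultdict

def build_grid_index(coords, cell_size=1.0):
    """One inverted index per dimension: grid cell number -> object identifiers.
    coords maps object identifiers to their d-dimensional coordinates."""
    d = len(next(iter(coords.values())))
    grid = [defaultdict(set) for _ in range(d)]
    for obj_id, point in coords.items():
        for dim in range(d):
            grid[dim][int(point[dim] // cell_size)].add(obj_id)
    return grid

def grid_neighbours(grid, coords, obj_id, cell_size=1.0):
    """Candidates are objects that fall into the same or a neighbouring grid
    cell in every dimension (to be re-checked with the actual distance)."""
    point = coords[obj_id]
    candidates = None
    for dim, index in enumerate(grid):
        cell = int(point[dim] // cell_size)
        in_dim = set().union(*(index.get(c, set()) for c in (cell - 1, cell, cell + 1)))
        candidates = in_dim if candidates is None else candidates & in_dim
    candidates.discard(obj_id)
    return candidates

coords = {'christen': (0.1, 0.2), 'kristen': (0.3, 0.4), 'smith': (3.0, 4.5)}
grid = build_grid_index(coords)
print(grid_neighbours(grid, coords, 'christen'))   # {'kristen'}
```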

Table 1: Number of candidate record pairs generated for the deduplication and linkage data sets of increasing size (smallest numbers per column in bold), for standard blocking, Sorted-Window-3, Sorted-Window-7, Suffix-Array, String-Map, Canopy-NN-Jaccard, Canopy-NN-TFIDF, Canopy-Thres-Jaccard and Canopy-Thres-TFIDF.

2.5 Suffix array based blocking

This technique has recently been proposed as an efficient, domain independent method for multi-source information integration []. The basic idea is to insert the blocking key values and their suffixes into a suffix array based inverted index. A suffix array contains strings or sequences and their suffixes in sorted order. Blocking based on suffix arrays has successfully been used on both English and Japanese bibliographic databases [], where suffix arrays were created using both English names and Japanese characters.

For this blocking technique, only suffixes down to a minimum length, min_len, are inserted into the suffix array. For example, for a blocking key value christen and a minimum length min_len = 4, the values christen, hristen, risten, isten, and sten will be inserted, and the identifiers of all records that have this blocking key value will be inserted into the corresponding inverted index lists. Similar to canopy clustering, the identifier of a record will be inserted into several blocks (i.e. inverted index lists), according to the length of its blocking key value. A blocking key value of length $l$ characters will be inserted into $(l - min\_len + 1)$ index lists. In order to limit the maximum block size, only values in the suffix array inverted index are used that have less than a maximum number, max_block, of record identifiers in their corresponding list. For example, if the suffix array value ten has more than max_block record identifiers in its corresponding inverted index list, while the value sten has only 5, then ten is considered to be too general and is not used in the Retrieve step (when blocks are extracted and record pairs are generated), as it would generate too many pairs.

Assuming $B_A$ is the set of unique blocking key values generated from a database A, the minimum length of the suffix array is $(|B_A| + l - min\_len)$, which is the (unlikely) case that all blocking key values (assumed to be $l$ characters long) only differ in their first character and have all of their suffixes in common (for example gillian, jillian and killian all have the same suffixes illian, llian, and so on). On the other hand, the maximum length of the suffix array is $|B_A|(c - min\_len + 1)$, with $c$ the average length of the blocking key values. This is the case where all suffixes (down to min_len) of all blocking key values are different. Both these cases are unlikely to happen in practice, so the length of the suffix array will be somewhere in between. The maximum number of candidate record pairs will be generated if all inverted index lists contain max_block records, which means every entry in the suffix array has to appear in max_block records. If $S_A$ and $S_B$ are the suffix arrays generated from the blocking key values $B_A$ and $B_B$ from databases A and B, respectively, and $s_i$ denotes an entry in these suffix arrays, then the number of candidate record pairs generated by suffix array based blocking is
$$\sum_{s_i \in (S_A \cap S_B)} |block(A, s_i)| \cdot |block(B, s_i)|$$
for a linkage, and for a deduplication it is
$$\sum_{s_i \in S_A} |block(A, s_i)| \, (|block(A, s_i)| - 1)/2.$$
With an average length of $c$ characters of the original blocking key values, the average length of the suffix array entries becomes $(c + min\_len)/2$, as the shortest entries will be $min\_len$ characters long. The memory requirement for suffix array based blocking therefore is $((c + min\_len)/2 + p)(|S_A| + |S_B|) + r(c - min\_len + 1)(n_A + n_B)$ for a linkage and $((c + min\_len)/2 + p)|S_A| + r(c - min\_len + 1)\,n_A$ for a deduplication.
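A minimal illustrative Python sketch of this technique (not the Febrl code; parameter defaults are ours) is:

```python
from collections import defaultdict
from itertools import combinations

def build_suffix_index(bkv_of, min_len=4):
    """Insert every blocking key value and its suffixes (down to min_len
    characters) into an inverted index: suffix -> record identifiers."""
    index = defaultdict(set)
    for rec_id, bkv in bkv_of.items():
        for start in range(len(bkv) - min_len + 1):
            index[bkv[start:]].add(rec_id)
    return index

def suffix_pairs_dedup(index, max_block=6):
    """Generate candidate pairs, skipping suffixes that are too general
    (i.e. that occur in max_block or more records)."""
    pairs = set()
    for suffix, rec_ids in index.items():
        if len(rec_ids) < max_block:
            pairs.update(combinations(sorted(rec_ids), 2))
    return pairs

bkv_of = {'r1': 'christen', 'r2': 'kristen', 'r3': 'christine'}
index = build_suffix_index(bkv_of)
# r1 and r2 share the suffixes 'risten', 'isten' and 'sten', so they are paired.
print(sorted(suffix_pairs_dedup(index)))
```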
2.6 Implementation issues

For standard blocking and the sorted neighbourhood technique, the blocking sub-steps Build and Retrieve, together with the record pair comparison and classification steps, can be implemented in an overlapping fashion. For a deduplication, an inverted index data structure is built while records are read from a database, their blocking key values are extracted and then inserted into the index. The current record is compared with all previously read and indexed records having the same blocking key value. For a linkage, the BigMatch [3] approach can be taken, where first the smaller database is loaded and the inverted index data structures are built in main memory, including all record attribute values required in the comparison step.

Table 2: Reduction ratio results in percentages for the deduplication and linkage data sets (largest values per column in bold), for the same blocking techniques and data set sizes as in Table 1.

Each record of the large database is then read, its blocking key values are extracted, and all records in the same block from the small database are retrieved from the index data structure and compared with the current record. This approach performs only a single pass over the large database and does not require indexing, sorting or storing of any of its records. The canopy clustering, string map and suffix array based blocking techniques do not allow such an overlapping implementation, as they require their data structures to be completely built before blocks can be retrieved. An advantage of these three blocking techniques is, however, that their index data structures can either be kept in main memory or be stored on disk. This allows trading run times against memory resources, as well as persistent storage of the blocking indices (and thus performing the Retrieve step many times without having to repeat the Build step).

3. EXPERIMENTAL EVALUATION

The aim of the experiments described here was to empirically evaluate the performance (quality of the candidate record pairs generated) and scalability (processing times and main memory required) of several recently developed blocking techniques with regard to the size of the database to be deduplicated or linked. Using a common framework allowed us to investigate questions like: What are the memory requirements of blocking techniques? Which techniques are scalable to very large databases? Which technique performs best for databases with certain error characteristics? Here we will present results for the total number and quality of the candidate record pairs generated, as well as the time and main memory required by the different blocking techniques. We have chosen parameter settings that have achieved good experimental results earlier [9].

We have evaluated the blocking techniques described in Sections 2.1 to 2.5 above using data sets of different sizes and with different error characteristics, as described below. All blocking techniques were implemented in the Febrl [] record linkage system, which is written in the Python programming language. The experiments were carried out on a Dell Optiplex GX280 with an Intel Pentium 3 GHz CPU and 2 Gigabytes of main memory, running Linux 2.6.8 and using Python.

3.1 Test data sets and blocking key definition

In order to evaluate how the quality (number of true matches in the candidate record pairs) and scalability (total number of candidate record pairs) of a blocking technique is affected by the size and error characteristics of the databases to be linked, we created synthetic data sets containing from 1,000 to 50,000 records using the Febrl data set generator [7].
This generator works by first creating original records based on frequency tables containing real world names (given- and surname) and addresses (street number, name and type; postcode; suburb and state names), followed by the random generation of duplicates of these records based on modifications (like inserting, deleting or substituting characters, and swapping, removing, inserting, splitting or merging words), also based on real error characteristics. We generated two data sets for deduplication and two for linkage for each size, with different error characteristics as described below. For the linkage data sets, the original and duplicate records were separated into two files.

Clean data sets: 80% original and 20% duplicate records; up to three duplicates for one original record, maximum one modification per attribute, and maximum three modifications per record.

Dirty data sets: 60% original and 40% duplicate records; up to nine duplicates for one original record, maximum three modifications per attribute, and maximum ten modifications per record.

Table 3: Pairs completeness results in percentages for the deduplication and linkage data sets (largest values per column in bold), for the same blocking techniques and data set sizes as in Table 1.

We ran experiments with various blocking key definitions, and here we present results from the best performing definition [9], where three different blocking keys were defined, corresponding to three runs (or iterations) in traditional blocking [7]. The first are the Soundex [8] encoded surnames, the second are the first four characters of given names concatenated with the first two digits of postcodes, and the third are the last two postcode digits concatenated with Double-Metaphone [8] encoded suburb names.

3.2 Quality and complexity measures

The quality and complexity of blocking techniques have traditionally been evaluated using the pairs completeness and reduction ratio measures [, 5], as defined below. Following [5], let $n_M$ and $n_U$ be the total number of matched and un-matched record pairs, respectively, such that $n_M + n_U = |A| \cdot |B|$ for the linkage of two databases A and B, and $n_M + n_U = |A|(|A| - 1)/2$ for the deduplication of a database A. Next, let $s_M$ and $s_U$ be the number of true matched and true non-matched record pairs generated by a blocking technique, respectively, with usually $(s_M + s_U) \ll (n_M + n_U)$. Then
$$PC = \frac{s_M}{n_M}, \qquad RR = 1 - \frac{s_M + s_U}{n_M + n_U}.$$
Pairs completeness (PC) is the number of true matched record pairs generated by a blocking technique divided by the total number of true matched pairs. It measures how effective a blocking technique is in generating true matched record pairs. Pairs completeness corresponds to the recall measure as used in information retrieval [3]. The reduction ratio (RR) measures the reduction of the comparison space, i.e. the more record pairs are removed by a blocking technique, the higher the reduction ratio value becomes. However, reduction ratio does not take the quality of the generated candidate record pairs into account (how many are true matches or not). Besides RR and PC, we also report the total number of candidate record pairs generated, the total processing times used, and the amount of main memory required by a technique for both the blocking and comparison steps. None of these computational resources is taken into account by either RR or PC.
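Both measures are straightforward to compute once the candidate pairs and the true match status are known; an illustrative Python sketch (names are ours):

```python
def pairs_completeness(candidate_pairs, true_matches):
    """PC = s_M / n_M: fraction of all true matches covered by the candidates."""
    s_m = len(candidate_pairs & true_matches)
    return s_m / len(true_matches)

def reduction_ratio(candidate_pairs, num_records, linkage_size=None):
    """RR = 1 - (s_M + s_U) / (n_M + n_U): fraction of the full comparison
    space removed by blocking. For a deduplication the full space is
    n(n-1)/2 pairs, for a linkage it is |A| * |B| pairs."""
    if linkage_size is None:
        total = num_records * (num_records - 1) // 2
    else:
        total = num_records * linkage_size
    return 1.0 - len(candidate_pairs) / total

candidates = {('r1', 'r2'), ('r1', 'r3')}
true_matches = {('r1', 'r2'), ('r4', 'r5')}
print(pairs_completeness(candidates, true_matches))   # 0.5
print(reduction_ratio(candidates, num_records=5))     # 1 - 2/10 = 0.8
```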
3.3 Experimental results

We ran blocking experiments with the four different data set series described above on all blocking techniques presented in Section 2. Tables 1 to 5 show the number of candidate record pairs generated, reduction ratio, pairs completeness, amount of main memory required, and times used. We selected parameter values that are either commonly used (like 2-grams) or that showed good results in our preliminary experiments compared to other settings for the same technique [9]. For the sorted neighbourhood approach, we chose window sizes w = 3 and w = 7; for the suffix array based approach we selected the minimum value length min_len = 3 and a maximum block size max_block; and for string map blocking we chose a dimensionality d = 15 and two nearest-neighbour based parameter values n_loose = 8 and n_tight = 4. All four canopy clustering based techniques presented are based on 2-grams (q = 2), with nearest-neighbour parameter values n_loose = 8 and n_tight = 4, and threshold values t_loose = 0.8 and t_tight = 0.9. The best result (largest percentages and smallest absolute values) in each table column is highlighted in boldface. Missing results indicate that these experiments were computationally not feasible, as more than one million record pairs were generated and more than 2 Gigabytes of main memory were required, resulting in swapping.

Table 4: Main memory required in Megabytes for the deduplication and linkage data sets (smallest numbers per column in bold), for the same blocking techniques and data set sizes as in Table 1.

4. DISCUSSION

Looking at Table 1 it is clear that the various blocking techniques generate very different numbers of candidate record pairs. With the exception of suffix array based blocking, all recently developed techniques generate more pairs than standard blocking, which is due to them inserting a record identifier into more than one block. This can best be seen with the sorted neighbourhood approach, which generates more pairs as larger window sizes are used. With canopy clustering, using Jaccard similarity results in significantly fewer record pairs compared to using TF-IDF/cosine similarity, for both global threshold and nearest-neighbour parameters, with the global threshold method generally generating fewer pairs than the nearest-neighbour approach.

The reduction ratio (RR) results are correlated with the number of record pairs. For all blocking techniques, and for both deduplication and linkage, the RR values are very high for all data set sizes. Having similar RR values for all data set sizes means that the number of candidate record pairs generated is increasing at a quadratic rate, making a blocking technique not scalable. For most techniques, however, RR values are actually increasing slightly with larger data sets, which means these blocking techniques are reducing the complexity of the blocking process to below quadratic.

The quality of the candidate record pairs generated is shown in Table 3 using the pairs completeness (PC) measure. With increasingly large data sets, the PC results for many blocking techniques decrease, especially for the dirty data sets that contain more modifications. For the suffix array and sorted neighbourhood techniques these drops in PC are significant, and indicate that they are not capable of achieving good quality linkage results with large databases that contain many errors and modifications. Canopy clustering with global thresholds seems to be the only technique that achieves high PC results for all data set sizes, with TF-IDF/cosine similarity slightly outperforming Jaccard similarity. Surprisingly, the standard blocking approach also results in fairly good PC results over all data set sizes.

The main memory and timing results presented in Tables 4 and 5 are the totals for both the blocking and comparison steps and include the memory used by the record pair comparison vectors. They thus give an upper limit of the resources required, as the comparison vectors could be stored on disk and be re-loaded for the classification step (resulting in longer run times). The main memory required correlates with the number of candidate record pairs generated, and increases more than linearly with increasing data set sizes. The sorted neighbourhood technique requires the most memory (increasing with window size), as it generates the largest numbers of candidate record pairs, followed by string map based blocking. For canopy clustering, TF-IDF/cosine similarity uses more memory due to the extra storage required for term and inverse document frequencies of q-grams.
For deduplication, suffix array based blocking is by far the fastest method compared to all others (it does, however, generate the smallest number of record pairs and has the lowest PC results), followed by standard blocking and threshold based canopy clustering using Jaccard similarity; while for linkage, standard blocking is the fastest technique, followed by threshold based canopy clustering using Jaccard similarity and suffix array based blocking. String map based blocking is significantly slower than all other techniques, especially for larger data sets. Jaccard similarity outperforms TF-IDF/cosine similarity based canopy clustering in all cases, as it allows pruning of candidate records early in the similarity calculation compared to TF-IDF/cosine similarity [3]. It would thus be the preferred method, compared to TF-IDF/cosine similarity, as both have comparable PC results. To investigate the scalability of the presented blocking techniques, in Figures 2 and 3 we show results for the time required for a single record pair comparison (i.e. the timing results from Table 5 divided by the number of candidate record pairs generated from Table 1). As can be seen, most


More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Sorted Neighborhood Methods Felix Naumann

Sorted Neighborhood Methods Felix Naumann Sorted Neighborhood Methods 2.7.2013 Felix Naumann Duplicate Detection 2 Number of comparisons: All pairs 3 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 1 2 3 4 5 6 7 8 9 10 11 400 comparisons 12

More information

Review of the Robust K-means Algorithm and Comparison with Other Clustering Methods

Review of the Robust K-means Algorithm and Comparison with Other Clustering Methods Review of the Robust K-means Algorithm and Comparison with Other Clustering Methods Ben Karsin University of Hawaii at Manoa Information and Computer Science ICS 63 Machine Learning Fall 8 Introduction

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

Use of Synthetic Data in Testing Administrative Records Systems

Use of Synthetic Data in Testing Administrative Records Systems Use of Synthetic Data in Testing Administrative Records Systems K. Bradley Paxton and Thomas Hager ADI, LLC 200 Canal View Boulevard, Rochester, NY 14623 brad.paxton@adillc.net, tom.hager@adillc.net Executive

More information

Collective Entity Resolution in Relational Data

Collective Entity Resolution in Relational Data Collective Entity Resolution in Relational Data I. Bhattacharya, L. Getoor University of Maryland Presented by: Srikar Pyda, Brett Walenz CS590.01 - Duke University Parts of this presentation from: http://www.norc.org/pdfs/may%202011%20personal%20validation%20and%20entity%20resolution%20conference/getoorcollectiveentityresolution

More information

Record Linkage. with SAS and Link King. Dinu Corbu. Queensland Health Health Statistics Centre Integration and Linkage Unit

Record Linkage. with SAS and Link King. Dinu Corbu. Queensland Health Health Statistics Centre Integration and Linkage Unit Record Linkage with SAS and Link King Dinu Corbu Queensland Health Health Statistics Centre Integration and Linkage Unit Presented at Queensland Users Exploring SAS Technology QUEST 4 June 2009 Basics

More information

A Session-based Ontology Alignment Approach for Aligning Large Ontologies

A Session-based Ontology Alignment Approach for Aligning Large Ontologies Undefined 1 (2009) 1 5 1 IOS Press A Session-based Ontology Alignment Approach for Aligning Large Ontologies Editor(s): Name Surname, University, Country Solicited review(s): Name Surname, University,

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Learning independent, diverse binary hash functions: pruning and locality

Learning independent, diverse binary hash functions: pruning and locality Learning independent, diverse binary hash functions: pruning and locality Ramin Raziperchikolaei and Miguel Á. Carreira-Perpiñán Electrical Engineering and Computer Science University of California, Merced

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

The Data Mining Application Based on WEKA: Geographical Original of Music

The Data Mining Application Based on WEKA: Geographical Original of Music Management Science and Engineering Vol. 10, No. 4, 2016, pp. 36-46 DOI:10.3968/8997 ISSN 1913-0341 [Print] ISSN 1913-035X [Online] www.cscanada.net www.cscanada.org The Data Mining Application Based on

More information

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models

More information

Tuning Large Scale Deduplication with Reduced Effort

Tuning Large Scale Deduplication with Reduced Effort Tuning Large Scale Deduplication with Reduced Effort Guilherme Dal Bianco, Renata Galante, Carlos A. Heuser Universidade Federal do Rio Grande do Sul Porto Alegre, RS, Brazil {gbianco,galante,heuser}@inf.ufrgs.br

More information

A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises

A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises 308-420A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises Section 1.2 4, Logarithmic Files Logarithmic Files 1. A B-tree of height 6 contains 170,000 nodes with an

More information

Ranking Clustered Data with Pairwise Comparisons

Ranking Clustered Data with Pairwise Comparisons Ranking Clustered Data with Pairwise Comparisons Kevin Kowalski nargle@cs.wisc.edu 1. INTRODUCTION Background. Machine learning often relies heavily on being able to rank instances in a large set of data

More information

Clustering & Classification (chapter 15)

Clustering & Classification (chapter 15) Clustering & Classification (chapter 5) Kai Goebel Bill Cheetham RPI/GE Global Research goebel@cs.rpi.edu cheetham@cs.rpi.edu Outline k-means Fuzzy c-means Mountain Clustering knn Fuzzy knn Hierarchical

More information

A Two Stage Similarity aware Indexing for Large Scale Real time Entity Resolution

A Two Stage Similarity aware Indexing for Large Scale Real time Entity Resolution A Two Stage Similarity aware Indexing for Large Scale Real time Entity Resolution Shouheng Li Supervisor: Huizhi (Elly) Liang u4713006@anu.edu.au The Australian National University Outline Introduction

More information

Data Mining and Data Warehousing Classification-Lazy Learners

Data Mining and Data Warehousing Classification-Lazy Learners Motivation Data Mining and Data Warehousing Classification-Lazy Learners Lazy Learners are the most intuitive type of learners and are used in many practical scenarios. The reason of their popularity is

More information

So, we want to perform the following query:

So, we want to perform the following query: Abstract This paper has two parts. The first part presents the join indexes.it covers the most two join indexing, which are foreign column join index and multitable join index. The second part introduces

More information

Document Clustering: Comparison of Similarity Measures

Document Clustering: Comparison of Similarity Measures Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Similarity Search: A Matching Based Approach

Similarity Search: A Matching Based Approach Similarity Search: A Matching Based Approach Anthony K. H. Tung Rui Zhang Nick Koudas Beng Chin Ooi National Univ. of Singapore Univ. of Melbourne Univ. of Toronto {atung, ooibc}@comp.nus.edu.sg rui@csse.unimelb.edu.au

More information

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection

K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection K-Nearest-Neighbours with a Novel Similarity Measure for Intrusion Detection Zhenghui Ma School of Computer Science The University of Birmingham Edgbaston, B15 2TT Birmingham, UK Ata Kaban School of Computer

More information

Entity Resolution over Graphs

Entity Resolution over Graphs Entity Resolution over Graphs Bingxin Li Supervisor: Dr. Qing Wang Australian National University Semester 1, 2014 Acknowledgements I would take this opportunity to thank my supervisor, Dr. Qing Wang,

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information

A Case for Merge Joins in Mediator Systems

A Case for Merge Joins in Mediator Systems A Case for Merge Joins in Mediator Systems Ramon Lawrence Kirk Hackert IDEA Lab, Department of Computer Science, University of Iowa Iowa City, IA, USA {ramon-lawrence, kirk-hackert}@uiowa.edu Abstract

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Collaborative Filtering using a Spreading Activation Approach

Collaborative Filtering using a Spreading Activation Approach Collaborative Filtering using a Spreading Activation Approach Josephine Griffith *, Colm O Riordan *, Humphrey Sorensen ** * Department of Information Technology, NUI, Galway ** Computer Science Department,

More information

3. Data Preprocessing. 3.1 Introduction

3. Data Preprocessing. 3.1 Introduction 3. Data Preprocessing Contents of this Chapter 3.1 Introduction 3.2 Data cleaning 3.3 Data integration 3.4 Data transformation 3.5 Data reduction SFU, CMPT 740, 03-3, Martin Ester 84 3.1 Introduction Motivation

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

2. Data Preprocessing

2. Data Preprocessing 2. Data Preprocessing Contents of this Chapter 2.1 Introduction 2.2 Data cleaning 2.3 Data integration 2.4 Data transformation 2.5 Data reduction Reference: [Han and Kamber 2006, Chapter 2] SFU, CMPT 459

More information

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE Sundari NallamReddy, Samarandra Behera, Sanjeev Karadagi, Dr. Anantha Desik ABSTRACT: Tata

More information

Exploring Performance Tradeoffs in a Sudoku SAT Solver CS242 Project Report

Exploring Performance Tradeoffs in a Sudoku SAT Solver CS242 Project Report Exploring Performance Tradeoffs in a Sudoku SAT Solver CS242 Project Report Hana Lee (leehana@stanford.edu) December 15, 2017 1 Summary I implemented a SAT solver capable of solving Sudoku puzzles using

More information

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Scene Completion Problem The Bare Data Approach High Dimensional Data Many real-world problems Web Search and Text Mining Billions

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

School of Computer and Information Science

School of Computer and Information Science School of Computer and Information Science CIS Research Placement Report Multiple threads in floating-point sort operations Name: Quang Do Date: 8/6/2012 Supervisor: Grant Wigley Abstract Despite the vast

More information

An Ensemble Approach for Record Matching in Data Linkage

An Ensemble Approach for Record Matching in Data Linkage Digital Health Innovation for Consumers, Clinicians, Connectivity and Community A. Georgiou et al. (Eds.) 2016 The authors and IOS Press. This article is published online with Open Access by IOS Press

More information

Lecture 24: Image Retrieval: Part II. Visual Computing Systems CMU , Fall 2013

Lecture 24: Image Retrieval: Part II. Visual Computing Systems CMU , Fall 2013 Lecture 24: Image Retrieval: Part II Visual Computing Systems Review: K-D tree Spatial partitioning hierarchy K = dimensionality of space (below: K = 2) 3 2 1 3 3 4 2 Counts of points in leaf nodes Nearest

More information

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017 CPSC 340: Machine Learning and Data Mining Finding Similar Items Fall 2017 Assignment 1 is due tonight. Admin 1 late day to hand in Monday, 2 late days for Wednesday. Assignment 2 will be up soon. Start

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Clustering in Data Mining

Clustering in Data Mining Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

INF 4300 Classification III Anne Solberg The agenda today:

INF 4300 Classification III Anne Solberg The agenda today: INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Robust PDF Table Locator

Robust PDF Table Locator Robust PDF Table Locator December 17, 2016 1 Introduction Data scientists rely on an abundance of tabular data stored in easy-to-machine-read formats like.csv files. Unfortunately, most government records

More information

Spatial Outlier Detection

Spatial Outlier Detection Spatial Outlier Detection Chang-Tien Lu Department of Computer Science Northern Virginia Center Virginia Tech Joint work with Dechang Chen, Yufeng Kou, Jiang Zhao 1 Spatial Outlier A spatial data point

More information

A Probabilistic Deduplication, Record Linkage and Geocoding System

A Probabilistic Deduplication, Record Linkage and Geocoding System A Probabilistic Deduplication, Record Linkage and Geocoding System http://datamining.anu.edu.au/linkage.html Peter Christen 1 and Tim Churches 2 1 Department of Computer Science, Australian National University,

More information

COLLABORATIVE LOCATION AND ACTIVITY RECOMMENDATIONS WITH GPS HISTORY DATA

COLLABORATIVE LOCATION AND ACTIVITY RECOMMENDATIONS WITH GPS HISTORY DATA COLLABORATIVE LOCATION AND ACTIVITY RECOMMENDATIONS WITH GPS HISTORY DATA Vincent W. Zheng, Yu Zheng, Xing Xie, Qiang Yang Hong Kong University of Science and Technology Microsoft Research Asia WWW 2010

More information

Edge Classification in Networks

Edge Classification in Networks Charu C. Aggarwal, Peixiang Zhao, and Gewen He Florida State University IBM T J Watson Research Center Edge Classification in Networks ICDE Conference, 2016 Introduction We consider in this paper the edge

More information

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS TRIE BASED METHODS FOR STRING SIMILARTIY JOINS Venkat Charan Varma Buddharaju #10498995 Department of Computer and Information Science University of MIssissippi ENGR-654 INFORMATION SYSTEM PRINCIPLES RESEARCH

More information

Static Pruning of Terms In Inverted Files

Static Pruning of Terms In Inverted Files In Inverted Files Roi Blanco and Álvaro Barreiro IRLab University of A Corunna, Spain 29th European Conference on Information Retrieval, Rome, 2007 Motivation : to reduce inverted files size with lossy

More information

A Re-examination of Limited Discrepancy Search

A Re-examination of Limited Discrepancy Search A Re-examination of Limited Discrepancy Search W. Ken Jackson, Morten Irgens, and William S. Havens Intelligent Systems Lab, Centre for Systems Science Simon Fraser University Burnaby, B.C., CANADA V5A

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Pedestrian Detection Using Correlated Lidar and Image Data EECS442 Final Project Fall 2016

Pedestrian Detection Using Correlated Lidar and Image Data EECS442 Final Project Fall 2016 edestrian Detection Using Correlated Lidar and Image Data EECS442 Final roject Fall 2016 Samuel Rohrer University of Michigan rohrer@umich.edu Ian Lin University of Michigan tiannis@umich.edu Abstract

More information

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing Unsupervised Data Mining: Clustering Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 1. Supervised Data Mining Classification Regression Outlier detection

More information

Evaluating Record Linkage Software Using Representative Synthetic Datasets

Evaluating Record Linkage Software Using Representative Synthetic Datasets Evaluating Record Linkage Software Using Representative Synthetic Datasets Benmei Liu, Ph.D. Mandi Yu, Ph.D. Eric J. Feuer, Ph.D. National Cancer Institute Presented at the NAACCR Annual Conference June

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

UNIT III BALANCED SEARCH TREES AND INDEXING

UNIT III BALANCED SEARCH TREES AND INDEXING UNIT III BALANCED SEARCH TREES AND INDEXING OBJECTIVE The implementation of hash tables is frequently called hashing. Hashing is a technique used for performing insertions, deletions and finds in constant

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Entity Resolution, Clustering Author References

Entity Resolution, Clustering Author References , Clustering Author References Vlad Shchogolev vs299@columbia.edu May 1, 2007 Outline 1 What is? Motivation 2 Formal Definition Efficieny Considerations Measuring Text Similarity Other approaches 3 Clustering

More information

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Indexing Week 14, Spring 2005 Edited by M. Naci Akkøk, 5.3.2004, 3.3.2005 Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Overview Conventional indexes B-trees Hashing schemes

More information

Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage

Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage Zhe Wang Princeton University Jim Gemmell Microsoft Research September 2005 Technical Report MSR-TR-2006-30 Microsoft Research Microsoft

More information

1-Nearest Neighbor Boundary

1-Nearest Neighbor Boundary Linear Models Bankruptcy example R is the ratio of earnings to expenses L is the number of late payments on credit cards over the past year. We would like here to draw a linear separator, and get so a

More information

CHAPTER 4 K-MEANS AND UCAM CLUSTERING ALGORITHM

CHAPTER 4 K-MEANS AND UCAM CLUSTERING ALGORITHM CHAPTER 4 K-MEANS AND UCAM CLUSTERING 4.1 Introduction ALGORITHM Clustering has been used in a number of applications such as engineering, biology, medicine and data mining. The most popular clustering

More information

A GENETIC ALGORITHM FOR CLUSTERING ON VERY LARGE DATA SETS

A GENETIC ALGORITHM FOR CLUSTERING ON VERY LARGE DATA SETS A GENETIC ALGORITHM FOR CLUSTERING ON VERY LARGE DATA SETS Jim Gasvoda and Qin Ding Department of Computer Science, Pennsylvania State University at Harrisburg, Middletown, PA 17057, USA {jmg289, qding}@psu.edu

More information