A Fast Linkage Detection Scheme for Multi-Source Information Integration


Akiko Aizawa
National Institute of Informatics / The Graduate University for Advanced Studies
Hitotsubashi, Chiyoda-ku, Tokyo, Japan
aizawa@nii.ac.jp

Keizo Oyama
National Institute of Informatics / The Graduate University for Advanced Studies
Hitotsubashi, Chiyoda-ku, Tokyo, Japan
oyama@nii.ac.jp

Abstract

Record linkage refers to techniques for identifying records associated with the same real-world entities. Record linkage is not only crucial in integrating multi-source databases that have been generated independently, but is also considered one of the key issues in integrating heterogeneous Web resources. However, when targeting large-scale data, the cost of enumerating all possible linkages often becomes impracticably high. Against this background, this paper proposes a fast and efficient method for linkage detection. The features of the proposed approach are: first, it exploits a suffix array structure that enables linkage detection using variable-length n-grams; second, it dynamically generates blocks of possibly associated records using blocking keys extracted from already known reliable linkages. The paper also reports results from preliminary experiments in which the proposed method was applied to the integration of four bibliographic databases that scale up to more than 10 million records.

1. Introduction

Record linkage is the problem of identifying a group of records from different sources that refer to the same real-world entity. The term record linkage was used as early as the 1940s by Dunn [7] and Marshall [20], followed by the pioneering work of Newcombe et al. in 1959 [26]. Recently, the explosive growth of database resources, particularly the increase of legacy databases on the World Wide Web, has made the problem one of the central issues in data warehousing and data quality management. Many recent studies, such as [3], have pointed out that similar types of problems have been described differently in different research fields, for example: duplicate detection [24, 2] or deduplication [27], which emphasizes the elimination of duplications in a database; record matching [31] or approximate string join [10], which is an operation to integrate two separate tables; entity-name clustering and matching [4], object consolidation [23], or proper noun coreference analysis [22], which concern the coreference determination problem in text processing; and entity identification [18], object identification [30], entity matching [5], or entity reconciliation [6], which concern heterogeneous data integration.

Table 1 illustrates a case where two bibliographic records refer to the same article but with different attribute values. As this example shows, real-world records usually contain noise caused by such factors as notational variations, human input errors, or differences in editing policies. Therefore, constructing a practical linkage system often means implementing domain-specific knowledge to normalize the attribute values or to evaluate the degree of match between two records. In addition, the cost often becomes impracticably high, particularly for large databases that have been generated independently.

1. TAGUCHI Isao. Observation of Nonlinear Conduction in (NbSe4)3I: II. LOW TEMPERATURE PROPERTIES OF SOLIDS: Charge Density Waves. Japanese Journal of Applied Physics. Supplement, vol. 26(3, pt. 1), pp. … (1987)
2. TAGUCHI I. Observation of nonlinear conduction in (NbSe4)3I. Jpn J Appl Phys Suppl, no. 26-3, Pt. 1, pp. … (1987)

Table 1. Two bibliographic records that refer to the same article.

There has been remarkable progress in recent years in approaches from the information retrieval (IR) and machine learning (ML) fields [14]. Examples include the use of bigrams or q-grams for possible linkage detection [3, 12, 11], ranking and scoring using the tf-idf similarity measure [4], applying learnable edit distance in matching score calculation [2], and the adaptation of active learning to improve the performance of the linkage system iteratively [27]. It has been reported that these techniques enable less domain-dependent linkage while maintaining performance.

Our research was originally motivated by our experience with CiNii (Citation Information by National Institute of Informatics), where several bibliographic databases are integrated into a unified nationwide service for academic papers and their citations. Although the current implementation of CiNii employs modern techniques, including bigram-based search and support vector machines (SVMs), what has been learned in previous operations is that the system still requires considerable human review to maintain the quality of linkages. We seek a quality of (i) less than 0.1% false matches and (ii) less than 1% missing linkages for large-scale databases.

Based on this background, this paper aims to establish a methodology that uses and exploits recent IR and ML techniques for large-scale linkage problems with human review. Although our final goal is the linkage of general Web resources, for experimental purposes our focus in this paper is primarily on the linkage of legacy databases. In the future, we plan to combine the proposed method with existing text segmentation techniques, such as [29], to deal with a broader range of data resources.

The remainder of the paper is organized as follows. In Section 2, a general procedure for record linkage is introduced and, focusing on the preselection process (referred to as blocking in this paper), a brief overview of existing methods is given. In Section 3, we propose a new blocking scheme that is characterized as follows: first, it exploits a suffix array structure to enumerate possible linkage candidates rapidly using variable-length n-grams; second, it dynamically generates blocks of records that are possibly linked to each other using blocking keys extracted from already known reliable linkages. Section 4 presents the results of our preliminary experimental studies, where the proposed method was applied to the integration of four bibliographic databases that scale up to more than 10 million entries. Section 5 presents our conclusions.

2. Background Issues

2.1. General Procedure for Record Linkage

The records in this paper are the entries in databases with known, but not necessarily common, attributes. The two general requirements for record linkage systems are (i) efficiency and scalability to deal with large-scale data, and (ii) accuracy to maintain the high quality of the data. However, these two requirements are contradictory and cannot easily be satisfied by a single-pass method. It follows that record linkage systems in general employ multi-stage processing that includes (i) selection of possible linkages, (ii) comparison and judgment, and (iii) human review (Figure 1). Each is briefly explained below.

(1) Selection: Given a target dataset, record pairs that possibly refer to the same entities are enumerated first. This stage is also called blocking in this paper.

(2) Comparison and judgment: All the selected candidate pairs are labeled as match, possible match, or non-match based on the matching score calculated using a predefined matching function.
Here, match and non-match are decisions made with high confidence, and possible match indicates that the classifier is uncertain whether the pair represents a correct linkage.

(3) Human review: Record pairs that are labeled as possible match are examined by human reviewers.

After stage (3), the detected duplicates are either the pairs automatically labeled as match at stage (2) or the pairs identified as positive by human reviewers at stage (3).

Figure 1. General record linkage procedure: selection, comparison and judgment, and human review.

Throughout the process, the most time-consuming stage is the human review at stage (3). For example, in our system, each decision requires on the order of seconds from a trained reviewer, whether the result is positive or negative. It follows that the preceding stages should be carefully designed so that they significantly reduce the number of human judgments. For stage (2), applying conventional machine learning techniques such as SVMs to determine the optimal decision boundaries is straightforward. The critical issue then becomes reducing the number of candidate linkages at stage (1) while retaining as many positives as possible.
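As a minimal illustration of the three-way decision at stage (2), the labeling reduces to two score thresholds. The sketch below is ours; the function name, threshold names, and example scores are illustrative assumptions, not the matching function or boundaries used in our system.

```python
# Minimal sketch of the three-way decision at stage (2). The thresholds
# and example values are illustrative assumptions, not the boundaries
# used by the system described in this paper.

def classify(score: float, t_lower: float, t_upper: float) -> str:
    """Label a candidate pair by its matching score: confident decisions
    at both ends, human review for everything in between."""
    if score >= t_upper:
        return "match"
    if score <= t_lower:
        return "non-match"
    return "possible match"

# Only the middle band generates human review work.
for s in (0.95, 0.55, 0.10):
    print(s, classify(s, t_lower=0.3, t_upper=0.9))
```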

Note that applying the classifier to all possible combinations of records is obviously not feasible. For example, record linkage between two databases, each with 10^6 records, yields 10^6 × 10^6 = 10^12 comparisons of record pairs. Based on this, we focus specifically on the selection strategy at stage (1).

2.2. Overview of Existing Selection Methods

The basic strategy for candidate selection is to divide the target records into small blocks so that only pairs in the same block are considered at the later stages. Existing selection methods can be categorized into two groups: (a) sorting methods and (b) ranking methods (Figure 2). Each is briefly reviewed below.

Figure 2. Sorting and ranking methods. (a) Sorting method: the sorted list of all records is decomposed into blocks of records sharing the same sort key value. (b) Ranking method: a record is submitted as a query against the database, and a ranking of similar records yields the possible duplicates.

Sorting methods: Sorting methods are the traditional blocking schemes in which a sorted list of records is first generated and then decomposed into blocks (Figure 2(a)). The simplest is standard blocking [17], where records are first sorted using a user-specified key (for example, the top four characters of one's family name), and then records with the same key value are grouped into a single block for further consideration. Multiple-pass blocking [17] uses multiple keys for sorting. SNM (the sorted neighborhood method) slides a fixed-size window over the sorted list, assuming that neighbors on the sorted list are also similar [15][16]. SNM has variations such as clustering SNM and multi-pass SNM. Other sorting methods include the priority queue method [25], the k-way sorting method [9], and the adaptive filtering method [13].

Ranking methods: Ranking methods consider the records as plain text and generate clusters of similar records by applying conventional information retrieval techniques (Figure 2(b)). TI-similarity [28] first calculates the number of common characters for each attribute and then uses the summed value as a similarity measure. Bigram indexing [3] converts the key value into a set of character-based bigrams and then uses permutations of subsets as keys for generating blocks; the lower bound of the subset size is determined by a specified threshold value. Positional q-grams [10, 12, 11] enable fast approximate string joins between records within a specified edit distance threshold. Canopy clustering [4, 21] randomly selects a new record and generates a cluster of similar records using the tf-idf distance measure widely used in IR; the generated clusters are called canopies and constitute blocks for further examination.

2.3. Observations

The sorting methods require domain-dependent knowledge to select appropriate keys and, for SNM, the optimal window size. The ranking methods, in contrast, do not require specific domain knowledge because they treat all the key attribute values as text. Moreover, the performance of the sorting methods is not necessarily better than that of the ranking methods, because neighbors on the sorted list do not usually cover all the similar records, even when multiple-pass sorting is applied. A previous study [1] compared the performance of standard blocking, SNM, bigram indexing, and canopy clustering, and reported that the latter two, today's representative ranking methods, significantly outperform the former, traditional sorting methods.
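For concreteness, the following minimal sketch shows the two classic sorting-based schemes described above, standard blocking and SNM. The blocking key (a four-character name prefix), the window size, and the toy records are illustrative assumptions, not parameters from any of the cited systems.

```python
# Minimal sketches of standard blocking and the sorted neighborhood
# method (SNM). The blocking key (a four-character name prefix), the
# window size, and the toy records are illustrative assumptions.
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "taguchi isao"},
    {"id": 2, "name": "taguchi i"},
    {"id": 3, "name": "takasu atsuhiro"},
]

def standard_blocking(recs, key_len=4):
    """Group records with identical key values into one block and
    emit only within-block pairs for the later stages."""
    blocks = defaultdict(list)
    for r in recs:
        blocks[r["name"][:key_len]].append(r)
    return [p for b in blocks.values() for p in combinations(b, 2)]

def sorted_neighborhood(recs, key_len=4, window=3):
    """Slide a fixed-size window over the key-sorted list and pair
    each record with its neighbors inside the window."""
    srt = sorted(recs, key=lambda r: r["name"][:key_len])
    return [
        (r, other)
        for i, r in enumerate(srt)
        for other in srt[i + 1 : i + window]
    ]

print(standard_blocking(records))    # pairs only records 1 and 2
print(sorted_neighborhood(records))  # the window also pairs across keys
```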
The computational cost of the sorting methods is simply that of sorting all the target records. Moreover, the sorting operation is applied only once for the entire set of records. The cost of the ranking methods, on the other hand, is the cost of indexing, retrieving, and ranking the records. The retrieval operation must be applied iteratively for all the possible blocks, the number of which is approximately proportional to the total number of records. Therefore, even with today's search engine techniques, the cost of ranking methods becomes significant for large-scale data, due to the exponential property of frequencies known as Zipf's law.

Another important issue is the cost of human review. As an illustrative example, Figure 3 shows the histogram of the matching scores obtained from the actual data used in our experiments.

Figure 3. Distribution of the matching score: a log-scaled histogram of match and non-match pairs over the matching score, divided into true/false non-match, possible match, and true/false match regions.

Assume that we choose Tµ and Tλ as the boundaries of the three decision regions: non-match, possible match, and match. As described above, the candidates labeled as possible match require human review, and the left- and right-side colored regions represent the two types of linkage error, false non-match and false match, respectively. The figure makes clear that the best performance we can expect from a classifier is determined a priori by the score distribution after selection. Obviously, the total number of human judgments is most sensitive to the number of less-promising negative candidates (note that the vertical axis of Figure 3 is log-scaled and the majority of the samples are negative). However, neither conventional sorting nor ranking methods explicitly consider the elimination of these negative candidates.

3. The Proposed Method

3.1. Basic Idea

To summarize the above observations, the advantages and disadvantages of the existing methods are as follows: the sorting methods are scalable to the data size, but the ranking methods perform better in terms of the quality of the detection; and for both, the cost of human review cannot be significantly reduced using conventional classifiers. Based on this, we propose a new selection method that combines two newly developed blocking schemes: suffix array-based blocking and key-based blocking (Figure 4).

Figure 4. Basic idea of the proposed method. Step 1 (suffix array-based blocking): a fast blocking method identifies duplicate records with high confidence. Step 2 (key-based blocking): blocking keys (matching patterns) are extracted from the neighbors of the identified record pairs, and the corresponding blocks are generated.

Suffix array-based blocking provides a fast and domain-independent selection function by exploiting variable-length n-grams on a suffix array representation. This enables us to identify linkage candidate pairs quickly, though not exhaustively. The detected pairs are then evaluated using a specified scoring function to identify reliable positive pairs. Key-based blocking provides a way to resample the candidates using matching patterns, called blocking keys, obtained from the reliable pairs. Records that match the same blocking key are grouped into the same block for further comparison and judgment.

In the proposed method, suffix array-based blocking is used to extract domain-specific transformation rules for the key-based blocking, and key-based blocking is used to reactivate candidate pairs that were overlooked by the suffix array-based blocking. Through this combination, the majority of the negative candidates shown in Figure 3 can be eliminated.
Each blocking scheme is explained below.

3.2. Suffix Array-Based Blocking

A suffix array is a sorted list of all the token suffixes that appear in the text (Figure 5). Given a set of records for linkage, the corresponding suffix array structure is generated as follows (a code sketch follows the list).

(1) Tokenize all the key attribute values. Tokens are either characters or terms, depending on the types of the attributes and/or languages.

(2) Represent each record as a single token sequence. Here, the key attributes are arranged in a specified order with delimiters between them, and start and stop symbols are added at the beginning and the end of the sequence.

(3) Sort all the pointers to the tokens in alphabetical order of the token sequences they point to.
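As an illustration of steps (1)-(3), the following minimal sketch builds a token-level suffix array over two toy records. The whitespace tokenizer and the delimiter and start/stop symbols are illustrative assumptions.

```python
# Minimal sketch of steps (1)-(3): build a token-level suffix array
# over a record collection. The whitespace tokenizer and the
# delimiter/start/stop symbols are illustrative assumptions.

records = [
    "taguchi isao | observation of nonlinear conduction",
    "taguchi i | observation of nonlinear conduction",
]

# Steps (1)-(2): one token sequence per record, attributes separated
# by a delimiter, with start and stop symbols added.
sequences = [["<s>"] + r.split() + ["</s>"] for r in records]

# Step (3): a pointer (record id, token offset) for every suffix,
# sorted by the token sequence it points to.
pointers = [
    (rid, off)
    for rid, seq in enumerate(sequences)
    for off in range(len(seq))
]
pointers.sort(key=lambda p: sequences[p[0]][p[1]:])

# Records sharing the first n tokens of a suffix are now adjacent,
# so every n-gram's block can be read off by a linear scan.
for rid, off in pointers[:6]:
    print(rid, sequences[rid][off:off + 3])
```

Because the pointers are sorted by the token sequences they point to, records sharing any n-gram end up adjacent in the array, which is exactly the property the block enumeration described next relies on.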

Figure 5. Suffix array-based blocking: the sorted list of pointers forms a suffix array structure over unigrams, bigrams, trigrams, ..., n-grams; a block of records containing a given sequence (e.g., "adaptive duplicate detection") is filtered by the number of records in the block, the longest match, and the total match count.

Let (t_1, ..., t_n) be a token sequence of length n (i.e., an n-gram). Then, all the records that contain (t_1, ..., t_n) appear as adjacent members of the suffix array. Conversely, the adjacent members of the suffix array that share the same first n tokens (t_1, ..., t_n) catalog all the records in the database that contain (t_1, ..., t_n). It follows that we can exhaustively check token sequences of arbitrary length simply by iteratively scanning the suffix array structure.

With the proposed suffix array-based blocking (SAB), such adjacent members of the suffix array are considered blocks. We select only those blocks that satisfy the following conditions:

(1) The size of the block is between α_l and α_u.

(2) The length of the longest common token sequence is not less than β.

(3) The total number of common tokens is not less than γ.

Here, (3) is calculated by applying pairwise DP matching to the members of the block. The parameters α_l, α_u, β, and γ were determined heuristically in the experiments described later.

The computation time of the proposed SAB is basically that of the sort operation that generates the suffix array structure, that is, O(n log n) with the standard built-in quicksort for text with n tokens.

SAB has several notable features. First, unlike existing ranking methods, which require an index search for every block generation, SAB generates all the blocks collectively in a single pass. Second, unlike existing sorting methods that use predetermined sorting keys, SAB can dynamically select keys as arbitrary-length n-grams that satisfy the conditions above. These features are particularly advantageous when dealing with large-scale data.

3.3. Key-Based Blocking

Key-based blocking (KB) collects neighbors of the reliable linkages using blocking keys that represent correspondences between specific attribute values. Given a list of key attributes, the blocking keys are automatically extracted from the pairs detected by SAB (Figure 6). KB can be considered a variation of knowledge-based methods [15, 16, 19] that employ equivalence rules between corresponding attribute values. The difference is that KB automatically extracts the blocking keys, whereas most knowledge-based methods use equivalence rules given by human experts. Because each blocking key is connected to a distinctive block, the keys can easily be justified and evaluated based on the matching scores of the supporting record pairs, or verified by human reviewers.

The framework of KB is also similar to that of the active learning approach proposed in [27]. The difference is that KB utilizes highly reliable positive examples, whereas active learning investigates neutral examples so that the boundary of the classifier can be refined further.
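A minimal sketch of the key-based step, under a simplified data model of our own devising: records are dictionaries, the reliable pairs come from SAB, and each pair contributes one blocking key per listed key attribute.

```python
# Minimal sketch of key-based blocking (KB) under a simplified data
# model: records are dicts, reliable pairs come from SAB, and each
# pair contributes one blocking key per listed key attribute.
from collections import defaultdict

def extract_blocking_keys(reliable_pairs, key_attrs):
    """Collect corresponding attribute values from reliable linkages."""
    keys = set()
    for a, b in reliable_pairs:
        for attr in key_attrs:
            if a.get(attr) and b.get(attr) and a[attr] != b[attr]:
                keys.add((attr, a[attr], b[attr]))  # observed correspondence
    return keys

def key_based_blocks(records, keys):
    """Group records matching either side of the same blocking key."""
    blocks = defaultdict(list)
    for r in records:
        for key in keys:
            attr, va, vb = key
            if r.get(attr) in (va, vb):
                blocks[key].append(r)
    return blocks

# One reliable pair teaches a journal-title correspondence, which then
# reactivates another candidate record carrying the short form.
pair = (
    {"journal": "the journal of toxicological sciences"},
    {"journal": "j toxicol sci"},
)
keys = extract_blocking_keys([pair], ["journal"])
blocks = key_based_blocks(list(pair) + [{"journal": "j toxicol sci"}], keys)
print({k: len(v) for k, v in blocks.items()})
```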

DB1> <key attr1>str(1, 1)</key attr1> <key attr2>str(1, 2)</key attr2> <key attr3>str(1, 3)</key attr3>
DB2> <key attr1>str(2, 1)</key attr1> <key attr2>str(2, 2)</key attr2> <key attr3>str(2, 3)</key attr3>

Figure 6. Key-based blocking.

4. Experimental Results

4.1. Conditions for the Experiments

In our experiments, we used four bibliographic databases that were provided through our service operation in cooperation with other institutions. The databases contain periodicals published by Japanese academic societies. Table 2 shows the total number of records used in the experiments.

Database ID  Number of records
DB1          623,996
DB2          1,345,570
DB3          5,977,839
DB4          15,464,052

Table 2. Target databases.

Because the four databases come from different sources, a considerable number of records were registered in multiple databases. The objective of the experiment was to detect those duplicated entries so that the databases could be integrated into a unified service framework. The linkage process follows the general procedure shown in Figure 1: the proposed blocking methods were first applied; next, the detected candidate pairs were evaluated using a given matching function; then, the candidate pairs were automatically labeled as match, possible match, or non-match; and finally, all the pairs labeled as possible match were examined by human reviewers to ascertain whether they refer to the same article. In the experiments, the linkage was applied to the following six datasets: {(DB1-DB2), (DB1-DB3), (DB1-DB4), (DB2-DB3), (DB2-DB4), (DB3-DB4)}. (Part of the DB1-DB2 dataset (43,618 records) was used for an initial check and tuning and was subtracted from the remaining dataset.)

When preparing the data, the attribute values were first normalized using general substitution rules (for example, uppercase-to-lowercase conversion) and then segmented into tokens, either characters or words, depending on the attribute types and the description languages (for example, names written in Japanese characters were segmented into characters, and titles were segmented into words). In our experiments, we selected the following attributes: the name of the first author, the title of the article, the title of the journal, the volume and number of the issue, the pages, and the year. For the first three, both the English and the Japanese descriptions were considered.
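The preprocessing just described can be summarized in a short sketch. The concrete rules below (NFKC folding plus lowercasing, whitespace word segmentation, and the sample strings) are our illustrative reading of the description, not the exact rules used in the experiments.

```python
# Minimal sketch of the preprocessing described above: normalize each
# attribute value with general substitution rules, then segment it
# into characters or words depending on the attribute type. The NFKC
# folding and the sample strings are illustrative assumptions.
import unicodedata

def normalize(value: str) -> str:
    """General substitution rules, e.g. uppercase-to-lowercase."""
    return unicodedata.normalize("NFKC", value).lower().strip()

def tokenize(value: str, by_character: bool) -> list:
    """Segment a normalized value into characters or words."""
    value = normalize(value)
    return list(value.replace(" ", "")) if by_character else value.split()

# Japanese names -> characters; English titles -> words.
print(tokenize("田口 功", by_character=True))
print(tokenize("Observation of Nonlinear Conduction", by_character=False))
```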
In addition to these attributes, human reviewers may refer to other information sources as well, including the electronic or printed distribution of the articles. Also considered in human judgment were such factors as changes in journal titles, proceedings of joint workshops, and reprinted articles. The matching function used for comparison, as well as the threshold values of the classifier, was determined heuristically, because the dominant factor in performance here was the quality of the selection rather than the performance of the classifier.

Because different databases follow different notations and conventions, the target problem is not an easy task. To take a real example, of all the record pairs identified as duplicates by human reviewers, only 10% had authors and titles that matched exactly after normalization. Moreover, when the normalized authors and titles were used as sorting keys, only 30% of the known duplicates were detected, and 7% of these were false matches. Considering that our target performance is 0.1% false-match errors and 1% false-non-match errors, we concluded that simple conventional sorting methods were not applicable to these datasets. Furthermore, due to the large scale of the problem, applying conventional ranking methods was not feasible with the limited computing resources available in our experimental environment. For these reasons, this paper gives only preliminary results that show the effect of the proposed SAB and KB; detailed comparisons are left for future study.

4.2. Selection Methods

In the experiment, we investigated three alternative methods: Exact&KB, SAB&KB, and SAB-only (Figure 7). Each is described briefly below.

Exact&KB: With this method, all the pairs whose normalized authors and titles match exactly were first identified, and the corresponding blocking keys were extracted. The journal title and the volume and number of the issue were used as the key attributes for KB. Because only a limited number of pairs were identified by exact matching, we applied a simple, ad hoc generalization to the extracted keys: the volumes and numbers were substituted by variables. Using this generalization, the Exact&KB method could compensate for the scarcity of the samples.

Example of a generalized blocking key:
DB2> <journal>jsme international journal. Series C, Mechanical systems, machine elements and manufacturing</journal><voln>$X ($Y)</voln>
DB4> <journal>jsme Int Journal. Ser C. Mech Systems, Mach Elem Manuf</journal><voln>$X ($Y)</voln>

Later, the instantiated variables of the generalized blocking keys were checked, and inappropriate correspondences were excluded. The threshold values for KB to decide match, possible match, and non-match were adjusted so that (i) all the match cases were correct, and (ii) as many candidate pairs as possible were labeled as possible match so long as they were in the same block.

SAB&KB: With this method, the pairs extracted by SAB were also used to generate blocking keys. The parameters for SAB were α_l = 2, α_u = 4, β = 4, and γ = 12. The following is an example of a blocking key obtained for (DB2-DB4).

Example of a blocking key:
DB2> <journal>the Journal of Toxicological Sciences</journal><voln>21 (Supplement I)</voln>
DB4> <journal>j Toxicol Sci</journal><voln>21 (Suppl 1)</voln>

Note that there are approximately 100,000 journal titles in total, and even for a single title, manually encoding all the correspondence patterns can incur considerable cost; exceptional cases include proceedings of joint workshops, special and seasonal editions, and supplementary issues. The correspondences between the journal issues specified by the generated blocking keys were verified later by human reviewers.

SAB-only: For comparison purposes, we also examined the case where the candidate pairs were directly extracted using SAB. The parameters for SAB were, again, α_l = 2, α_u = 4, β = 4, and γ = 12.

In the experiment, SAB&KB was applied complementarily to Exact&KB, which means that only those pairs not detected by Exact&KB were reported for human review. On the other hand, the results from SAB-only contained so many negative pairs that human review was not feasible. Consequently, SAB&KB covers almost all the known duplicates. As part of the CiNii operation, bigram-based blocking using an XML database engine had been carried out separately for the dataset (DB2-DB4), the result of which is available for comparison.

Figure 7. Methods used in the experiments.

4.3. Results

In our evaluation, we examined: (i) the execution time of SAB; (ii) the number of blocking keys automatically extracted by KB; and (iii) the number of pairs that required human review for each method, as well as the errors in the detection.

Execution time of SAB: Table 3 shows the total number of records, the number of detected candidate record pairs, and the execution time of SAB for the six datasets. The execution was performed in memory on a single CPU of a Sun Fire V880 (900 MHz, 64 GB memory); the qsort function of the standard C library was used. The relationship between the execution time and the total number of records is also shown in Figure 8.

Dataset name  Total number of records  Number of candidate pairs  Time (min.)
DB1-DB2       1,969,566                1,168,…                    …
DB1-DB3       6,601,835                …                          …
DB1-DB4       16,088,048               …                          …
DB2-DB3       7,323,409                1,228,…                    …
DB2-DB4       16,809,622               2,295,…                    …
DB3-DB4       21,441,891               5,086,…                    …

Table 3. Execution time of SAB.
For comparison, it took approximately 0.2 seconds per target record for the XML database engine with a bigram index to generate a candidate block on the (DB3-DB4) dataset (using an HPC cluster PC with 128 GB of memory). This means that the blocking time totals approximately two weeks when all the target records are examined. Because the blocking time of SAB for the same dataset was approximately 10 hours, we concluded that SAB significantly reduces the computation cost.

Figure 8. Execution time of SAB for different dataset sizes.

Keys extracted by KB: Table 4 compares the numbers of blocking keys extracted by Exact&KB and SAB&KB, respectively. The figures in parentheses for Exact&KB are the numbers of manually defined patterns used for key generalization. A considerable number of additional keys were obtained by SAB.

Dataset name  Keys obtained by Exact  Additional keys obtained by SAB
DB1-DB2       148 (4)                 428
DB1-DB3       612 (6)                 5,552
DB1-DB4       806 (8)                 2,189
DB2-DB3       464 (10)                5,172
DB2-DB4       356 (9)                 900
DB3-DB4       6,016 (12)              38,614

Table 4. Number of blocking keys.

Performance comparison: Table 5 summarizes the performance of Exact&KB, SAB&KB, and SAB-only. The table shows: (i) the number of pairs identified as duplicates by the classifier; (ii) the number of pairs that required human review; (iii) the number of pairs falsely identified as duplicates; and (iv) the total number of correctly identified duplicate pairs. The figures in parentheses for (iii) and (iv) express the detected candidates as a percentage of the known duplicates found by SAB&KB. Note that human reviews were applied only to those pairs that had been found by SAB&KB.

For Exact&KB and SAB&KB, all the pairs labeled as match by the classifier were automatically considered correct; this was also confirmed by the human reviewers. For these cases, (ii) also includes the cost of checking the correspondences of the automatically extracted blocking keys. For SAB-only, (ii) represents the number of records with at least one duplication candidate, because human reviewers check those candidates simultaneously on the same screen.

For SAB-only, the following three idealized settings were compared. In the first, all the detected candidates were considered possible matches; the ratio of falsely identified pairs (p1) and the ratio of missed duplications (p2) were not specified, and the thresholds for match, possible match, and non-match were simply set to the highest and lowest scores of the negative and positive examples, respectively. In the second, p1 was set to 0.1%, and the threshold between match and possible match was optimized so that the number of possible matches was minimized given p1; p2 was not specified. In the third, p1 was kept at 0.1% and p2 was set to 5%; here, the thresholds for possible match and non-match were optimized so that the number of possible matches was minimized given p1 and p2.
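The threshold optimization for the SAB-only settings can be stated compactly. The sketch below is our own reading of that description: it widens the match and non-match regions greedily while the constraints on the false-match ratio (p1) and the missed-duplicate ratio (p2) still hold; the data model of scored, labeled pairs is an illustrative assumption.

```python
# Minimal sketch of the SAB-only threshold optimization: widen the
# match and non-match regions greedily while the false-match ratio
# stays within p1 and the missed-duplicate ratio stays within p2.
# The data model (scored, labeled pairs) is an illustrative assumption.

def optimize_thresholds(scored, p1=0.0, p2=0.0):
    """scored: list of (score, is_duplicate) pairs. Returns
    (t_lower, t_upper): scores >= t_upper are 'match', scores below
    t_lower are 'non-match', and the band in between goes to review."""
    positives = [s for s, dup in scored if dup]
    negatives = [s for s, dup in scored if not dup]
    candidates = sorted({s for s, _ in scored})

    # Lower t_upper while false matches stay within p1 of acceptances.
    t_upper = candidates[-1] + 1.0  # start with an empty match region
    for t in reversed(candidates):
        false = sum(1 for s in negatives if s >= t)
        accepted = false + sum(1 for s in positives if s >= t)
        if false > p1 * accepted:
            break
        t_upper = t

    # Raise t_lower while missed duplicates stay within p2.
    t_lower = candidates[0]  # start with an empty non-match region
    for t in candidates:
        missed = sum(1 for s in positives if s < t)
        if positives and missed > p2 * len(positives):
            break
        t_lower = t
    return t_lower, t_upper

scored = [(0.99, True), (0.95, True), (0.90, False), (0.40, True), (0.10, False)]
print(optimize_thresholds(scored, p1=0.0, p2=0.05))  # -> (0.4, 0.95)
```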
To summarize the results in Table 5, SAB&KB showed the best performance, demonstrating the effectiveness of combining the two blocking schemes. In addition, the following observations were made. SAB&KB successfully found additional duplicates that could not be found by Exact&KB; this is also reflected in Table 4, where considerable numbers of additional blocking keys were obtained by SAB&KB. Note that the performance of Exact&KB depends on the ad hoc generalization of the blocking keys; without this generalization, its performance was not comparable. The cost of human review was much smaller with SAB&KB and Exact&KB than with SAB-only. SAB-only by itself detected approximately 95%-99% of the candidate pairs found by SAB&KB; however, the number of human reviews was not drastically reduced with SAB-only, even when the quality requirement was relaxed.

Finally, the existing CiNii system found an additional 476 duplicates for the dataset (DB2-DB4) after 262,829 human judgments. This means that the estimated false-non-match ratio for SAB&KB was approximately 0.2%. Conversely, SAB&KB found 1,471 duplications that were not identified by the existing CiNii system after only 15,715 human judgments. These figures demonstrate the effectiveness of the proposed SAB&KB. In the future, we plan to carry out further comparisons with existing blocking methods using more comprehensive human review results.

Columns: automatically identified duplicates; total number of human judgments; falsely identified pairs (%); total number of correctly identified duplicates (%).

SAB-only (p1 = *, p2 = *)
  DB1-DB2  1,…      …,746      0 (0.0)      88,655 (97.3)
  DB1-DB3  1,…      …,890      0 (0.0)      265,175 (98.9)
  DB1-DB4  …        …,367      0 (0.0)      325,198 (99.2)
  DB2-DB3  2,…      …,587      0 (0.0)      184,390 (95.6)
  DB2-DB4  …        …,728      0 (0.0)      199,441 (98.4)
  DB3-DB4  1,500    2,981,888  0 (0.0)      1,339,823 (99.4)

SAB-only (p1 = 0.1%, p2 = *)
  DB1-DB2  60,…     …          … (0.1)      88,655 (97.3)
  DB1-DB3  21,…     …          … (0.1)      265,175 (98.9)
  DB1-DB4  3,…      …          … (0.1)      325,198 (99.2)
  DB2-DB3  3,…      …          … (0.1)      184,390 (95.6)
  DB2-DB4  44,…     …          … (0.1)      199,441 (98.4)
  DB3-DB4  14,736   2,968,652  1,348 (0.1)  1,339,823 (99.4)

SAB-only (p1 = 0.1%, p2 = 5%)
  DB1-DB2  60,503   27,…       … (0.1)      86,564 (95.0)
  DB1-DB3  21,…     …          … (0.1)      254,722 (95.0)
  DB1-DB4  3,…      …          … (0.1)      311,324 (95.0)
  DB2-DB3  3,…      …          … (0.1)      183,230 (95.0)
  DB2-DB4  44,…     …          … (0.1)      192,475 (95.0)
  DB3-DB4  14,736   1,295,445  1,348 (0.1)  1,280,674 (95.0)

Exact&KB
  DB1-DB2  71,856     4,561    0 (0.0)      76,124 (83.5)
  DB1-DB3  216,155    14,525   0 (0.0)      225,899 (84.3)
  DB1-DB4  299,356    13,863   0 (0.0)      312,810 (95.5)
  DB2-DB3  157,304    14,105   0 (0.0)      169,996 (88.1)
  DB2-DB4  179,920    13,134   0 (0.0)      192,729 (95.1)
  DB3-DB4  1,068,498  96,473   0 (0.0)      1,164,971 (86.4)

SAB&KB
  DB1-DB2  86,051     5,931    0 (0.0)      91,121 (100.0)
  DB1-DB3  254,713    21,877   0 (0.0)      268,129 (100.0)
  DB1-DB4  314,453    20,318   0 (0.0)      327,710 (100.0)
  DB2-DB3  182,779    17,418   0 (0.0)      192,874 (100.0)
  DB2-DB4  190,637    15,715   0 (0.0)      202,606 (100.0)
  DB3-DB4  1,256,…    …,507    0 (0.0)      1,348,078 (100.0)

Table 5. Performance comparison.

5. Conclusion

In this paper, we proposed a new blocking scheme for large-scale record linkage problems. Through preliminary experiments using real-world bibliographic databases, the effectiveness of the proposed method, both in computation time and in the cost of human review, has been demonstrated. Although the experiments in this paper mainly focus on records in traditional databases, we expect the proposed method to be an initial step towards the practical exploitation of Web resources, in that: (i) the matching results enabled us to generate large-scale dictionaries of entities with their notational variations and acronyms, and heuristic normalization rules can be identified automatically; such resources are in most cases crucial in identifying and analyzing descriptions of entities that appear in unformatted text on the Web; and (ii) because the proposed SAB method considers each record as plain text, the method is directly applicable to semi-structured or unstructured data, examples of which include the home pages of researchers and research organizations, references in full-text papers, and other bibliographic collections that constitute the academic infrastructure on the Web.

Future work includes incorporating machine learning techniques into the extraction of blocking keys and the automatic acquisition of normalization dictionaries from the verified linkages.

Acknowledgments

The work presented here has been partially supported by the Grant-in-Aid for Scientific Research on Priority Areas "Informatics" (Area #006) funded by MEXT, Japan. The authors would like to thank Jun Adachi, Atsuhiro Takasu, and other colleagues at NII for their useful comments and discussions.

References

[1] R. Baxter, P. Christen, and T. Churches. A comparison of fast blocking methods for record linkage. Proceedings of the ACM SIGKDD'03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, 2003.

[2] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2003), pages 39-48, 2003.

[3] P. Christen and T. Churches. Febrl: freely extensible biomedical record linkage. Computer Science Technical Report TR-CS-02-05, Australian National University, 2002.

[4] W. W. Cohen and J. Richman. Learning to match and cluster large high-dimensional data sets for data integration. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2002), 2002.

[5] D. Dey, S. Sarkar, and P. De. A probabilistic decision model for entity matching in heterogeneous databases. Management Science, 44(10), 1998.

[6] D. Dey, S. Sarkar, and P. De. A distance-based approach to entity reconciliation in heterogeneous databases. IEEE Transactions on Knowledge and Data Engineering, 14(3), 2002.

[7] H. L. Dunn. Record linkage. American Journal of Public Health, 36:1412-1416, 1946.

[9] A. Feekin and Z. Chen. Duplicate detection using k-way sorting method. Proceedings of the ACM Symposium on Applied Computing (SAC 2000), 2000.

[10] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. Proceedings of the 27th International Conference on Very Large Data Bases (VLDB 2001), 2001.

[11] L. Gravano, P. G. Ipeirotis, N. Koudas, and D. Srivastava. Text joins for data cleansing and integration in an RDBMS. Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE 2003), 2003.

[12] L. Gravano, P. G. Ipeirotis, N. Koudas, and D. Srivastava. Text joins in an RDBMS for Web data integration. Proceedings of the 12th International Conference on World Wide Web (WWW 2003), 2003.

[13] L. Gu and R. Baxter. Adaptive filtering for efficient record linkage. Proceedings of the 2004 SIAM International Conference on Data Mining (SDM 2004), 2004.

[14] L. Gu, R. Baxter, D. Vickers, and C. Rainsford. Record linkage: Current practice and future directions. CMIS Technical Report 03/83, CSIRO, 2003.

[15] M. A. Hernández and S. J. Stolfo. The merge/purge problem for large databases. Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD 1995), 1995.

[16] M. A. Hernández and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9-37, 1998.

[17] M. A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406), 1989.

[18] E. Lim, J. Srivastava, S. Prabhakar, and J. Richardson. Entity identification in database integration. Proceedings of the 9th IEEE International Conference on Data Engineering (ICDE 1993), 1993.

[19] W. L. Low, M. L. Lee, and T. W. Ling. A knowledge-based approach for duplicate elimination in data cleaning. Information Systems, 26(8), 2001.

[20] J. T. Marshall. Canada's national vital statistics index. Population Studies, 1(2), 1947.

[21] A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2000), 2000.

[22] A. McCallum and B. Wellner. Toward conditional models of identity uncertainty with application to proper noun coreference. Proceedings of the IJCAI Workshop on Information Integration on the Web (IIWeb-03), 2003.

[23] M. Michalowski, S. Thakkar, and C. A. Knoblock. Exploiting secondary sources for automatic object consolidation. Proceedings of the ACM SIGKDD'03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, 2003.

[24] A. E. Monge. Matching algorithms within a duplicate detection system. IEEE Data Engineering Bulletin, 23(4):14-20, 2000.

[25] A. E. Monge and C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. Proceedings of the ACM SIGMOD Workshop on Research Issues on Knowledge Discovery and Data Mining, 1997.

[26] H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records. Science, 130(3381):954-959, 1959.

[27] S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2002), 2002.

[28] S. Y. Sung, Z. Li, and P. Sun. A fast filtering scheme for large database cleansing. Proceedings of the 2002 ACM International Conference on Information and Knowledge Management (CIKM 2002), pages 76-83, 2002.

[29] A. Takasu. Bibliographic attribute extraction from erroneous references based on a statistical model. Proceedings of the 2003 Joint Conference on Digital Libraries (JCDL 2003), pages 49-60, 2003.

[30] S. Tejada, C. A. Knoblock, and S. Minton. Learning domain-independent string transformation weights for high accuracy object identification. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2002), 2002.

[31] V. S. Verykios, A. K. Elmagarmid, and E. N. Houstis. Automating the approximate record-matching process. Information Sciences, 126(1-4):83-98, 2000.

Comparison of Online Record Linkage Techniques

Comparison of Online Record Linkage Techniques International Research Journal of Engineering and Technology (IRJET) e-issn: 2395-0056 Volume: 02 Issue: 09 Dec-2015 p-issn: 2395-0072 www.irjet.net Comparison of Online Record Linkage Techniques Ms. SRUTHI.

More information

Rule-Based Method for Entity Resolution Using Optimized Root Discovery (ORD)

Rule-Based Method for Entity Resolution Using Optimized Root Discovery (ORD) American-Eurasian Journal of Scientific Research 12 (5): 255-259, 2017 ISSN 1818-6785 IDOSI Publications, 2017 DOI: 10.5829/idosi.aejsr.2017.255.259 Rule-Based Method for Entity Resolution Using Optimized

More information

Learning Blocking Schemes for Record Linkage

Learning Blocking Schemes for Record Linkage Learning Blocking Schemes for Record Linkage Matthew Michelson and Craig A. Knoblock University of Southern California Information Sciences Institute, 4676 Admiralty Way Marina del Rey, CA 90292 USA {michelso,knoblock}@isi.edu

More information

Information Integration of Partially Labeled Data

Information Integration of Partially Labeled Data Information Integration of Partially Labeled Data Steffen Rendle and Lars Schmidt-Thieme Information Systems and Machine Learning Lab, University of Hildesheim srendle@ismll.uni-hildesheim.de, schmidt-thieme@ismll.uni-hildesheim.de

More information

Unsupervised Detection of Duplicates in User Query Results using Blocking

Unsupervised Detection of Duplicates in User Query Results using Blocking Unsupervised Detection of Duplicates in User Query Results using Blocking Dr. B. Vijaya Babu, Professor 1, K. Jyotsna Santoshi 2 1 Professor, Dept. of Computer Science, K L University, Guntur Dt., AP,

More information

CHAPTER 7 CONCLUSION AND FUTURE WORK

CHAPTER 7 CONCLUSION AND FUTURE WORK CHAPTER 7 CONCLUSION AND FUTURE WORK 7.1 Conclusion Data pre-processing is very important in data mining process. Certain data cleaning techniques usually are not applicable to all kinds of data. Deduplication

More information

Object Matching for Information Integration: A Profiler-Based Approach

Object Matching for Information Integration: A Profiler-Based Approach Object Matching for Information Integration: A Profiler-Based Approach AnHai Doan Ying Lu Yoonkyong Lee Jiawei Han {anhai,yinglu,ylee11,hanj}@cs.uiuc.edu Department of Computer Science University of Illinois,

More information

Community Mining Tool using Bibliography Data

Community Mining Tool using Bibliography Data Community Mining Tool using Bibliography Data Ryutaro Ichise, Hideaki Takeda National Institute of Informatics 2-1-2 Hitotsubashi Chiyoda-ku Tokyo, 101-8430, Japan {ichise,takeda}@nii.ac.jp Kosuke Ueyama

More information

RLC RLC RLC. Merge ToolBox MTB. Getting Started. German. Record Linkage Software, Version RLC RLC RLC. German. German.

RLC RLC RLC. Merge ToolBox MTB. Getting Started. German. Record Linkage Software, Version RLC RLC RLC. German. German. German RLC German RLC German RLC Merge ToolBox MTB German RLC Record Linkage Software, Version 0.742 Getting Started German RLC German RLC 12 November 2012 Tobias Bachteler German Record Linkage Center

More information

Similarity Joins of Text with Incomplete Information Formats

Similarity Joins of Text with Incomplete Information Formats Similarity Joins of Text with Incomplete Information Formats Shaoxu Song and Lei Chen Department of Computer Science Hong Kong University of Science and Technology {sshaoxu,leichen}@cs.ust.hk Abstract.

More information

Record Linkage using Probabilistic Methods and Data Mining Techniques

Record Linkage using Probabilistic Methods and Data Mining Techniques Doi:10.5901/mjss.2017.v8n3p203 Abstract Record Linkage using Probabilistic Methods and Data Mining Techniques Ogerta Elezaj Faculty of Economy, University of Tirana Gloria Tuxhari Faculty of Economy, University

More information

Outline. Data Integration. Entity Matching/Identification. Duplicate Detection. More Resources. Duplicates Detection in Database Integration

Outline. Data Integration. Entity Matching/Identification. Duplicate Detection. More Resources. Duplicates Detection in Database Integration Outline Duplicates Detection in Database Integration Background HumMer Automatic Data Fusion System Duplicate Detection methods An efficient method using priority queue Approach based on Extended key Approach

More information

A Survey on Removal of Duplicate Records in Database

A Survey on Removal of Duplicate Records in Database Indian Journal of Science and Technology A Survey on Removal of Duplicate Records in Database M. Karthigha 1* and S. Krishna Anand 2 1 PG Student, School of Computing (CSE), SASTRA University, 613401,

More information

Constraint-Based Entity Matching

Constraint-Based Entity Matching Constraint-Based Entity Matching Warren Shen Xin Li AnHai Doan University of Illinois, Urbana, USA {whshen, xli1, anhai}@cs.uiuc.edu Abstract Entity matching is the problem of deciding if two given mentions

More information

A Review on Unsupervised Record Deduplication in Data Warehouse Environment

A Review on Unsupervised Record Deduplication in Data Warehouse Environment A Review on Unsupervised Record Deduplication in Data Warehouse Environment Keshav Unde, Dhiraj Wadhawa, Bhushan Badave, Aditya Shete, Vaishali Wangikar School of Computer Engineering, MIT Academy of Engineering,

More information

Comprehensive and Progressive Duplicate Entities Detection

Comprehensive and Progressive Duplicate Entities Detection Comprehensive and Progressive Duplicate Entities Detection Veerisetty Ravi Kumar Dept of CSE, Benaiah Institute of Technology and Science. Nagaraju Medida Assistant Professor, Benaiah Institute of Technology

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

Entity Resolution, Clustering Author References

Entity Resolution, Clustering Author References , Clustering Author References Vlad Shchogolev vs299@columbia.edu May 1, 2007 Outline 1 What is? Motivation 2 Formal Definition Efficieny Considerations Measuring Text Similarity Other approaches 3 Clustering

More information

Probabilistic Scoring Methods to Assist Entity Resolution Systems Using Boolean Rules

Probabilistic Scoring Methods to Assist Entity Resolution Systems Using Boolean Rules Probabilistic Scoring Methods to Assist Entity Resolution Systems Using Boolean Rules Fumiko Kobayashi, John R Talburt Department of Information Science University of Arkansas at Little Rock 2801 South

More information

A Learning Method for Entity Matching

A Learning Method for Entity Matching A Learning Method for Entity Matching Jie Chen Cheqing Jin Rong Zhang Aoying Zhou Shanghai Key Laboratory of Trustworthy Computing, Software Engineering Institute, East China Normal University, China 5500002@ecnu.cn,

More information

An Iterative Approach to Record Deduplication

An Iterative Approach to Record Deduplication An Iterative Approach to Record Deduplication M. Roshini Karunya, S. Lalitha, B.Tech., M.E., II ME (CSE), Gnanamani College of Technology, A.K.Samuthiram, India 1 Assistant Professor, Gnanamani College

More information

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Masaki Eto Gakushuin Women s College Tokyo, Japan masaki.eto@gakushuin.ac.jp Abstract. To improve the search performance

More information

Towards Scalable Real-Time Entity Resolution using a Similarity-Aware Inverted Index Approach

Towards Scalable Real-Time Entity Resolution using a Similarity-Aware Inverted Index Approach Towards Scalable Real-Time Entity Resolution using a Similarity-Aware Inverted Index Approach Peter Christen Ross Gayler 2 Department of Computer Science, The Australian National University, Canberra 2,

More information

Efficient Record De-Duplication Identifying Using Febrl Framework

Efficient Record De-Duplication Identifying Using Febrl Framework IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 10, Issue 2 (Mar. - Apr. 2013), PP 22-27 Efficient Record De-Duplication Identifying Using Febrl Framework K.Mala

More information

Sorted Neighborhood Methods Felix Naumann

Sorted Neighborhood Methods Felix Naumann Sorted Neighborhood Methods 2.7.2013 Felix Naumann Duplicate Detection 2 Number of comparisons: All pairs 3 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 1 2 3 4 5 6 7 8 9 10 11 400 comparisons 12

More information

Distance-based Outlier Detection: Consolidation and Renewed Bearing

Distance-based Outlier Detection: Consolidation and Renewed Bearing Distance-based Outlier Detection: Consolidation and Renewed Bearing Gustavo. H. Orair, Carlos H. C. Teixeira, Wagner Meira Jr., Ye Wang, Srinivasan Parthasarathy September 15, 2010 Table of contents Introduction

More information

AUTOMATICALLY GENERATING DATA LINKAGES USING A DOMAIN-INDEPENDENT CANDIDATE SELECTION APPROACH

AUTOMATICALLY GENERATING DATA LINKAGES USING A DOMAIN-INDEPENDENT CANDIDATE SELECTION APPROACH AUTOMATICALLY GENERATING DATA LINKAGES USING A DOMAIN-INDEPENDENT CANDIDATE SELECTION APPROACH Dezhao Song and Jeff Heflin SWAT Lab Department of Computer Science and Engineering Lehigh University 11/10/2011

More information

Supporting Fuzzy Keyword Search in Databases

Supporting Fuzzy Keyword Search in Databases I J C T A, 9(24), 2016, pp. 385-391 International Science Press Supporting Fuzzy Keyword Search in Databases Jayavarthini C.* and Priya S. ABSTRACT An efficient keyword search system computes answers as

More information

Adaptive Windows for Duplicate Detection

Adaptive Windows for Duplicate Detection Adaptive Windows for Duplicate Detection Uwe Draisbach #1, Felix Naumann #2, Sascha Szott 3, Oliver Wonneberg +4 # Hasso-Plattner-Institute, Potsdam, Germany 1 uwe.draisbach@hpi.uni-potsdam.de 2 felix.naumann@hpi.uni-potsdam.de

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

DATA POSITION AND PROFILING IN DOMAIN-INDEPENDENT WAREHOUSE CLEANING

DATA POSITION AND PROFILING IN DOMAIN-INDEPENDENT WAREHOUSE CLEANING DATA POSITION AND PROFILING IN DOMAIN-INDEPENDENT WAREHOUSE CLEANING C.I. Ezeife School of Computer Science, University of Windsor Windsor, Ontario, Canada N9B 3P4 cezeife@uwindsor.ca A. O. Udechukwu,

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

Efficient Entity Matching over Multiple Data Sources with MapReduce

Efficient Entity Matching over Multiple Data Sources with MapReduce Efficient Entity Matching over Multiple Data Sources with MapReduce Demetrio Gomes Mestre, Carlos Eduardo Pires Universidade Federal de Campina Grande, Brazil demetriogm@gmail.com, cesp@dsc.ufcg.edu.br

More information

A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods

A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods S.Anusuya 1, M.Balaganesh 2 P.G. Student, Department of Computer Science and Engineering, Sembodai Rukmani Varatharajan Engineering

More information

Deduplication of Hospital Data using Genetic Programming

Deduplication of Hospital Data using Genetic Programming Deduplication of Hospital Data using Genetic Programming P. Gujar Department of computer engineering Thakur college of engineering and Technology, Kandiwali, Maharashtra, India Priyanka Desai Department

More information

Approximate String Joins

Approximate String Joins Approximate String Joins Divesh Srivastava AT&T Labs-Research The Need for String Joins Substantial amounts of data in existing RDBMSs are strings There is a need to correlate data stored in different

More information

Improving effectiveness in Large Scale Data by concentrating Deduplication

Improving effectiveness in Large Scale Data by concentrating Deduplication Improving effectiveness in Large Scale Data by concentrating Deduplication 1 R. Umadevi, 2 K.Kokila, 1 Assistant Professor, Department of CS, Srimad Andavan Arts & Science College (Autonomous) Trichy-620005.

More information

R 2 D 2 at NTCIR-4 Web Retrieval Task

R 2 D 2 at NTCIR-4 Web Retrieval Task R 2 D 2 at NTCIR-4 Web Retrieval Task Teruhito Kanazawa KYA group Corporation 5 29 7 Koishikawa, Bunkyo-ku, Tokyo 112 0002, Japan tkana@kyagroup.com Tomonari Masada University of Tokyo 7 3 1 Hongo, Bunkyo-ku,

More information

Performance and scalability of fast blocking techniques for deduplication and data linkage

Performance and scalability of fast blocking techniques for deduplication and data linkage Performance and scalability of fast blocking techniques for deduplication and data linkage Peter Christen Department of Computer Science The Australian National University Canberra ACT, Australia peter.christen@anu.edu.au

More information

Information Discovery, Extraction and Integration for the Hidden Web

Information Discovery, Extraction and Integration for the Hidden Web Information Discovery, Extraction and Integration for the Hidden Web Jiying Wang Department of Computer Science University of Science and Technology Clear Water Bay, Kowloon Hong Kong cswangjy@cs.ust.hk

More information

Assessing Deduplication and Data Linkage Quality: What to Measure?

Assessing Deduplication and Data Linkage Quality: What to Measure? Assessing Deduplication and Data Linkage Quality: What to Measure? http://datamining.anu.edu.au/linkage.html Peter Christen and Karl Goiser Department of Computer Science, Australian National University,

More information

Object Identification with Attribute-Mediated Dependences

Object Identification with Attribute-Mediated Dependences Object Identification with Attribute-Mediated Dependences Parag Singla and Pedro Domingos Department of Computer Science and Engineering University of Washington Seattle, WA 98195-2350, U.S.A. {parag,

More information

Data Linkage Techniques: Past, Present and Future

Data Linkage Techniques: Past, Present and Future Data Linkage Techniques: Past, Present and Future Peter Christen Department of Computer Science, The Australian National University Contact: peter.christen@anu.edu.au Project Web site: http://datamining.anu.edu.au/linkage.html

More information

RECORD DEDUPLICATION USING GENETIC PROGRAMMING APPROACH

RECORD DEDUPLICATION USING GENETIC PROGRAMMING APPROACH Int. J. Engg. Res. & Sci. & Tech. 2013 V Karthika et al., 2013 Research Paper ISSN 2319-5991 www.ijerst.com Vol. 2, No. 2, May 2013 2013 IJERST. All Rights Reserved RECORD DEDUPLICATION USING GENETIC PROGRAMMING

More information

Overview of Classification Subtask at NTCIR-6 Patent Retrieval Task. Makoto Iwayama (Hitachi, Ltd.), Atsushi Fujii, and Noriko Kando.

Structure Mining for Intellectual Networks. Ryutaro Ichise, Hideaki Takeda (National Institute of Informatics), and Kosuke Ueyama.

A fast approach for parallel deduplication on multicore processors. Guilherme Dal Bianco, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre.

Simple Document-by-Document Search Tool "Fuwatto Search" using Web API. Masao Takaku (University of Tsukuba) and Yuka Egusa (National Institute for Educational Policy Research). doi: 10.1007/978-3-319-12823-8_32

Active Blocking Scheme Learning for Entity Resolution. Jingyu Shao and Qing Wang, Australian National University.

Link Mining & Entity Resolution. Lise Getoor, University of Maryland, College Park.

Mining Heterogeneous Transformations for Record Linkage. Matthew Michelson and Craig A. Knoblock, University of Southern California, Information Sciences Institute.

Sampling Selection Strategy for Large Scale Deduplication for Web Data Search. R. Lavanya, P. Saranya, and D. Viji, SRM University, Chennai.

Similarity Joins in MapReduce. Benjamin Coors, Kristian Hunt, and Alain Kaeslin, KTH Royal Institute of Technology.

Log Linear Model for String Transformation Using Large Data Sets. G. Lenin, B. Vanitha, and C. K. Vijayalakshmi, Podhigai College of Engineering & Technology.

Data Partitioning for Parallel Entity Matching. Toralf Kirsten, Lars Kolb, Michael Hartung, Anika Groß, Hanna Köpcke, and Erhard Rahm, University of Leipzig.

A Versatile Record Linkage Method by Term Matching Model Using CRF. Quang Minh Vu, Atsuhiro Takasu, and Jun Adachi, National Institute of Informatics, Tokyo.

CS 4501: Information Retrieval, Fall 2015 (closed-book exam). University of Virginia, Department of Computer Science.

A Hybrid Model Words-Driven Approach for Web Product Duplicate Detection. Marnix de Bakker, Flavius Frasincar, and Damir Vandic, Erasmus University Rotterdam.

Query-Time Entity Resolution. Indrajit Bhattacharya and Lise Getoor, University of Maryland, College Park.

Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News. Huey-Ming Lee, Pin-Jen Chen, and Tsung-Yen Lee, Chinese Culture University.

Automatic training example selection for scalable unsupervised record linkage. Peter Christen, The Australian National University, Canberra.

IRCE at the NTCIR-12 IMine-2 Task. Ximei Song (University of Tsukuba), Yuka Egusa (National Institute for Educational Policy Research), and Masao Takaku (University of Tsukuba).

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data. Ibrahiem M. M. El Emary et al., American Journal of Applied Sciences.

A Survey on Positive and Unlabelled Learning. Gang Li, University of Delaware.

UApriori: An Algorithm for Finding Sequential Patterns in Probabilistic Data. Metanat Hooshsadat, Samaneh Bayat, Parisa Naeimi, Mahdieh S. Mirian, and Osmar R. Zaïane.

Term Based Weight Measure for Information Filtering in Search Engines. Mu. Annalakshmi et al., Alagappa University, Karaikudi.

Compression of the Stream Array Data Structure. Radim Bača and Martin Pawlas, Technical University of Ostrava, Czech Republic.

A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup. Yan Sun and Min Sik Kim, Washington State University.

Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports. R. Uday Kiran and P. Krishna Reddy, International Institute of Information Technology-Hyderabad.

Predictive Indexing for Fast Search. Sharad Goel, John Langford, and Alex Strehl, Yahoo! Research, New York (MMDS 2008).

Data Mining: Commercial & Scientific Applications, Ongoing Research Activities, From Research to Technology Transfer. George Karypis, University of Minnesota, Minneapolis. http://www.cs.umn.edu/~karypis

Event Detection using Archived Smart House Sensor Data obtained using Symbolic Aggregate Approximation. Ayaka Onishi and Chiemi Watanabe, Ochanomizu University.

Toward Entity Retrieval over Structured and Text Data. Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, and ChengXiang Zhai, University of Illinois at Urbana-Champaign.

NUS-I2R: Learning a Combined System for Entity Linking. Wei Zhang, Yan Chuan Sim, Jian Su, and Chew Lim Tan, National University of Singapore and Institute for Infocomm Research.

Ontology-based Integration and Refinement of Evaluation-Committee Data from Heterogeneous Data Sources. Indian Journal of Science and Technology, Vol. 8(23), September 2015. doi: 10.17485/ijst/2015/v8i23/79342

An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction. Janakiramaiah Bonam et al., International Journal of Engineering Science Invention, Vol. 2, Issue 1, January 2013.

Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach. P. T. Shijili, Dr. Nallini Institute of Engineering & Technology, Dharapuram, Tamilnadu.

Measuring and Evaluating Dissimilarity in Data and Pattern Spaces. Irene Ntoutsi and Yannis Theodoridis, University of Piraeus, Greece.

Anomaly Detection on Data Streams with High Dimensional Data Environment. D. Gokul Prasath and Dr. R. Sivaraj, Velalar College of Engineering & Technology, Erode.

Robust Shape Retrieval Using Maximum Likelihood Theory. Naif Alajlan, Paul Fieguth, and Mohamed Kamel, University of Waterloo, Canada.

Effective Keyword Search over (Semi)-Structured Big Data. Mehdi Kargar, School of Computer Science, University of Windsor.

Big Data Integration.

Finding the k-closest pairs in metric spaces. Hisashi Kurasawa (The University of Tokyo) and Atsuhiro Takasu (National Institute of Informatics).

Evaluating Genetic Algorithms for selection of similarity functions for record linkage. Faraz Shaikh and Chaiyong Ragkhitwetsagul, Carnegie Mellon University, School of Computer Science.

Annotated Suffix Trees for Text Clustering. Ekaterina Chernyak and Dmitry Ilvovsky, National Research University Higher School of Economics, Moscow.

CS224W: Social and Information Network Analysis, Project Report: Edge Detection in Review Networks. Archana Sulebele, Usha Prabhu, and William Yang.

Self-Organizing Maps for cyclic and unbounded graphs. M. Hagenbuchner (University of Wollongong), A. Sperduti (University of Padova), and A. C. Tsoi.

Data Linkage Methods: Overview of Computer Science Research. Peter Christen, The Australian National University, Canberra.

Infrequent Weighted Item Set Mining Using Node Set Based Algorithm. G. Amlu, S. Chandralekha, and PraveenKumar, Anand Institute of Higher Technology, Chennai.

KDD, SEMMA and CRISP-DM: A Parallel Overview. Ana Azevedo and M. F. Santos.

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems. Anestis Gkanogiannis and Theodore Kalamboukis.

Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014. Sungbin Choi and Jinwook Choi, Seoul National University.

Active Sampling for Constrained Clustering. Masayuki Okabe and Seiji Yamada, Toyohashi University of Technology.

Collective Entity Resolution in Relational Data. I. Bhattacharya and L. Getoor, University of Maryland.

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery. Hong Cheng, Philip S. Yu, and Jiawei Han, University of Illinois at Urbana-Champaign and IBM T. J. Watson Research Center.

Fuzzy Cognitive Maps application for Webmining. Andreas Kakolyris (University of Ioannina) and George Stylios.

Joint Entity Resolution. Steven Euijong Whang and Hector Garcia-Molina, Stanford University.

Analyzing Dshield Logs Using Fully Automatic Cross-Associations. Anh Le, University of California, Irvine.

Entity Resolution with Heavy Indexing. Csaba István Sidló, Institute for Computer Science and Control, Hungarian Academy of Sciences.

Supervised classification of law area in the legal domain. Mees Fröberg; supervisors: Evangelos Kanoulas and Tjerk de Greef (BSc Artificial Intelligence graduation project, 2016).