A Learning Method for Entity Matching




Jie Chen, Cheqing Jin, Rong Zhang, Aoying Zhou
Shanghai Key Laboratory of Trustworthy Computing, Software Engineering Institute, East China Normal University, China
{cqjin, rzhang,

ABSTRACT
Entity matching, which aims at finding records that refer to the same entity, is critical in fields such as data cleaning and data integration. Existing methods fall mainly into two kinds: the classification-based method and the rule-based method. The latter is more popular because it is highly scalable, easy to implement and explainable. However, selecting the proper thresholds, distance functions and rules has been one of the main challenges for the rule-based method. This paper devises a method that chooses the proper rules automatically and also selects the appropriate distance functions and thresholds. Using the training data, we define a metric that quantifies the appropriateness of each rule, and we propose a heuristic method based on it. Experimental results on real data sets illustrate the high effectiveness and efficiency of our method.

1. INTRODUCTION
Nowadays, data is vital in many fields, including finance, industry, medicine and technology. However, for various subjective and objective reasons, such as differing schemas, misspellings and missing values, the data stored in data sources is often inaccurate, inconsistent and incomplete [9, 2, 4]. Under such circumstances, data quality is always a big issue. Poor data quality can impair the effectiveness of querying and analysis tasks in many applications, which makes it essential to take measures to clean and repair the incorrect data.

An important task in data cleaning and integration is entity matching, which aims at finding record pairs that refer to the same real-world entity. For example, the Cora data set records a number of citations that refer to different publications.
Table 1 illustrates a small fraction of the tuples in this data set, where four citations refer to two different publications. It is non-trivial to perform entity matching by analyzing the attributes, since the value of any attribute may be noisy or missing. For example, tuples 1 and 2 refer to the same entity though they have different titles; tuples 2 and 3 refer to different entities although they share the same author and title.

Entity matching has been widely studied for decades [6, 0,, 6, 2, 5, 8, 4, 7, 5, 3, 7, 3, 8]. Existing work can be classified mainly into two categories: the classification-based method and the rule-based method [7]. The former extracts feature vectors by using statistics, machine learning and artificial intelligence techniques, based on which record pairs are classified as matching or non-matching [, 6]. The latter builds rules to match records [5, 7, 5]. For instance, two citations in the Cora data set with a small distance on the author, title and date attributes can be roughly treated as referring to the same publication. In general, the rule-based method is easier to implement and scale out than the classification-based method, which makes it more widely used in real applications.

One significant challenge of the rule-based method is that it is difficult to choose proper thresholds, distance functions and rules. First, it is hard to set a proper threshold under a given distance function. For example, in Table 1, is it sufficient to claim that two citations refer to the same publication if their edit distance on the title attribute is below 0.3? If not, how about 0.1? Second, there exist many distance functions to measure the distance between the corresponding attributes of two citations, such as Jaccard distance, edit distance, Cosine distance, q-gram distance and so on. Choosing a proper distance function for each attribute is difficult.
Third, since there are multiple candidate rules and each rule involves a group of attributes, it is hard to decide which rule is proper for entity matching. Although these three factors can be decided by experience, doing so becomes very challenging when the data set is large.

The work most closely related to ours is SiFi [17]. Given several rules and a training data set, this method uses a hill-climbing algorithm to find optimal similarity functions and thresholds for each rule. However, this approach requires appropriate rules to be supplied in advance, without which its performance may degrade dramatically. In contrast, our goal is to generate proper rules automatically, and to choose proper distance functions and thresholds, from a training data set.

Table 1: An example in Cora dataset
Record ID | Entity ID | Author | Title | Address | Date
1 | | carla e. brodley and paul e. utgoff. | multivariate versus univariate decision trees. | amherst ma |
2 | | c. e. brodley and p. e. utgoff. | multivariate decision trees. | amherst massachusetts |
3 | | c. e. brodley and p. e. utgoff. | multivariate decision trees. | NULL |
4 | | carla e. brodley and paul e. utgoff. | multivariate decision trees. | NULL | 1995.

It is interesting to decide the proper threshold for any distance function. Here, we simply assume that two tuples refer to the same entity if their distance is below a given threshold value. However, when the threshold is set too small, too many matching pairs may be missed. Conversely, if the threshold is set relatively large, some non-matching pairs may be mistaken for matching pairs. It is worth noting that neither precision nor recall alone is sufficient to measure the appropriateness of a threshold: the former situation yields high precision and low recall, while the latter yields low precision and high recall. Hence, we use the F-score instead, which is the harmonic mean of precision and recall. Note that the F-score is a special case of the $F_\beta$ measure. The parameter $\beta$ balances the weight of precision and recall, and it equals 1 in the F-score. If a list of possible terrorists is crossed with the list of passengers of a given flight, higher recall is preferable to higher precision, and we can set $\beta$ greater than 1. In a marketing campaign, by contrast, higher precision is preferable to higher recall, and we can set $\beta$ less than 1. Thus $\beta$ can be tuned to the application. In this paper, we treat precision and recall equally. We review the three measures below.

\[ \text{precision} = \frac{\#\text{ of correctly identified duplicate pairs}}{\#\text{ of identified duplicate pairs}} \]

\[ \text{recall} = \frac{\#\text{ of correctly identified duplicate pairs}}{\#\text{ of true duplicate pairs}} \]

\[ \text{F-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]

Given a training data set, we treat the F-score as the target measure and, for any distance function, look for the threshold that maximizes it.
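The three measures can be computed directly from two sets of record pairs. The following is a minimal sketch, not the authors' implementation: the function name `precision_recall_fbeta`, the pair representation and the toy data are assumptions made for illustration. It uses the standard $F_\beta$ formula, which reduces to the F-score above when $\beta = 1$.

```python
def precision_recall_fbeta(identified, true_pairs, beta=1.0):
    """Return (precision, recall, F_beta) for two collections of record pairs.

    Pairs are treated as unordered, so (1, 2) and (2, 1) are the same pair.
    """
    identified = {frozenset(p) for p in identified}
    true_pairs = {frozenset(p) for p in true_pairs}
    correct = len(identified & true_pairs)
    precision = correct / len(identified) if identified else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    if correct == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Toy data (hypothetical): three identified pairs, four true duplicate
# pairs, two pairs in common.
identified = [(1, 2), (2, 3), (3, 4)]
true_pairs = [(2, 1), (3, 4), (5, 6), (7, 8)]
p, r, f1 = precision_recall_fbeta(identified, true_pairs)            # beta = 1
_, _, f2 = precision_recall_fbeta(identified, true_pairs, beta=2.0)  # favours recall
print(p, r, f1, f2)
```

On this toy data precision exceeds recall, so raising $\beta$ above 1 lowers the score, which matches the role of $\beta$ described in the text.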
Meanwhile, for any attribute, the distance function with the maximum F-score is chosen as the proper distance function, and the attribute with the maximum F-score is treated as the proper attribute. Moreover, since multiple attributes may work together in entity matching, we further explore a more elaborate method that groups multiple attributes to measure entity similarity. The group of attributes with the maximum F-score is chosen as the proper rule for entity matching. In this way, we can decide the thresholds, distance functions and rules automatically. Any record pair whose distance, computed with the proper distance functions under the proper rule, is below the proper threshold is treated as referring to the same entity. It is worth noting that our method also takes full account of the different capability of each attribute for entity matching.

To summarize, we make the following contributions.
• We define a new concept based on the F-score to measure how appropriate a choice of thresholds, distance functions and rules is.
• We propose a new method to find proper thresholds, distance functions and rules automatically, which overcomes a major challenge of rule-based entity matching.
• The experimental results on real data sets illustrate the high efficiency and effectiveness of our method.

The rest of the paper is organized as follows. We define the problem formally in Section 2. In Section 3, we propose our approach to entity matching and devise a heuristic algorithm to find an appropriate rule. Section 4 reports the experimental results. Section 5 reviews the related work. Finally, Section 6 concludes the paper briefly.

2. PROBLEM DEFINITION
Let $E[A_1, A_2, \ldots, A_k]$ be a relation containing n tuples, denoted $e_1, e_2, \ldots, e_n$. Here, $A_1, A_2, \ldots, A_k$ denote the k attributes of E. There exist various distance functions to describe the distance between a pair of attribute values.
Typical distance functions include edit distance, Jaro distance, Jaccard distance, Cosine distance, Euclidean distance, q-gram distance [11, 12, 20], soundex distance [9] and so on. Let $D = \{dis_1, dis_2, \ldots, dis_m\}$ denote m distance functions. For any $dis \in D$, let $dis(e_i.A, e_j.A)$ denote the distance on attribute A for a tuple pair $(e_i, e_j)$ under the function dis.

Given a distance function dis and an attribute A, we first define two concepts, namely Maximum F-score and Proper Threshold. The Maximum F-score is the maximum F-score achievable on attribute A by using function dis, while the Proper Threshold is a threshold at which the Maximum F-score is achieved. Both values can be learned from a training data set.

Definition 1 (Maximum F-score). Given a distance function dis and an attribute A, we define the Maximum F-score, $MAXF_{dis,A}(E)$, as the maximum F-score on the attribute A for a relation E by using function dis.

Definition 2 (Proper Threshold). Given a distance function dis and an attribute A, we define the Proper Threshold, $PT_{dis,A}(E)$, as a distance threshold at which the Maximum F-score $MAXF_{dis,A}(E)$ is achieved.

Since there exist multiple distance functions, we need to check which distance function is the most proper one for each attribute. For example, Table 2 shows the Maximum F-score and Proper Threshold, as well as the precision and recall, for the Jaccard distance and the edit distance on the title attribute in the Cora data set. In this case, the edit distance is more powerful than the Jaccard distance because of its greater $MAXF_{dis,A}(E)$ value. Hence, we define the attribute-level Maximum F-score for any attribute A below.

Definition 3 (Attribute-level Maximum F-score). Given an attribute A, the attribute-level Maximum F-score of a relation E, $MAXF_A(E)$, is the maximum $MAXF_{dis,A}(E)$ over all distance functions.

\[ MAXF_A(E) = \max_{dis \in D} MAXF_{dis,A}(E) \]
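Definitions 1 to 3 can be read as a threshold sweep over the labeled training pairs. The sketch below is one possible reading and makes several assumptions that are not in the paper: candidate thresholds are the distinct observed distances, a pair is predicted as matching when its distance is at or below the threshold, and the names `max_f_and_threshold` and `attribute_level` as well as the toy distances are hypothetical.

```python
def max_f_and_threshold(distances, labels):
    """Return (MAXF, PT) for one attribute under one distance function.

    distances[i] is the distance of training pair i; labels[i] is True
    when pair i is a true match.
    """
    total_true = sum(labels)
    order = sorted(zip(distances, labels))
    best_f, best_t = 0.0, None
    tp = 0
    for i, (d, is_match) in enumerate(order):
        tp += is_match
        # Evaluate only once all pairs sharing this distance are counted.
        if i + 1 < len(order) and order[i + 1][0] == d:
            continue
        if tp == 0:
            continue
        precision = tp / (i + 1)
        recall = tp / total_true
        f = 2 * precision * recall / (precision + recall)
        if f > best_f:
            best_f, best_t = f, d
    return best_f, best_t


def attribute_level(per_function_distances, labels):
    """Return (MAXF_A, PT_A, PD_A): the best result over all distance functions."""
    best = (0.0, None, None)
    for name, dists in per_function_distances.items():
        f, t = max_f_and_threshold(dists, labels)
        if f > best[0]:
            best = (f, t, name)
    return best


# Toy training data (hypothetical): five pairs, three of them true matches.
labels = [True, True, False, True, False]
per_function = {
    "edit": [0.1, 0.2, 0.3, 0.6, 0.7],
    "jaccard": [0.5, 0.4, 0.3, 0.2, 0.1],
}
print(attribute_level(per_function, labels))
```

Sorting once and sweeping keeps the cost at O(n log n) per attribute and distance function, instead of recomputing the F-score from scratch for every candidate threshold.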

Table 2: MAXF and PT for title in Cora data set
Distance Function | MAXF | PT | Precision | Recall
Jaccard distance | | | |
Edit distance | | | |

Table 3: MAXF, PT and PD in Cora data set
Attribute | MAXF | Proper Threshold | Proper Distance
Author | | | q-gram distance
Title | | | edit distance
Address | | | q-gram distance
Date | | | Jaro distance
Pages | | | q-gram distance
Volume | | | q-gram distance
Publisher | | | q-gram distance
Editor | | | q-gram distance
Journal | | | q-gram distance

At the same time, we also define the attribute-level Proper Threshold, $PT_A$, as the threshold used for the attribute-level Maximum F-score, and the attribute-level Proper Distance function, $PD_A$, as the distance function used for the attribute-level Maximum F-score.

As there exist multiple distance functions and multiple attributes, it is interesting to find the best distance function and attribute for entity matching. Hence, we define the relation-level Maximum F-score as the maximum attribute-level Maximum F-score, as shown below.

Definition 4 (Relation-level Maximum F-score). We define the relation-level Maximum F-score of a relation E, $MAXF(E)$, as the maximum $MAXF_A(E)$ over all attributes A.

\[ MAXF(E) = \max_{A \in E} MAXF_A(E) \]

Correspondingly, the relation-level Proper Threshold $PT(E)$ and the relation-level Proper Distance function $PD(E)$ can be defined in a similar way. Now, we can claim that a tuple pair refers to the same entity if the distance computed by $PD(E)$ is below $PT(E)$. Otherwise, the pair refers to different entities.

Table 3 lists the attribute-level MAXF, $PT_A$ and $PD_A$ for the Cora data set. We can also observe that the relation-level MAXF is 0.94, achieved by edit distance ($PD(E)$) on the title attribute at the threshold $PT(E)$.

An interesting but intuitive observation is that using multiple attributes to match a pair of entities is superior to using only a single attribute, since every attribute provides information on a different aspect. Consequently, we also use a group of selected attributes for entity matching at the same time.
We take Table 1 as an example. If we use only the title attribute to find matching records, we will unavoidably mistake tuples 2 and 3 for a matching pair. However, if we use the group of title and date attributes, it is possible to treat them as a non-matching pair. We now show how to define the distance when using a group of attributes.

Definition 5 (Attributes Group Distance). Let G denote an attribute group in E, $G = \{A^{(1)}, A^{(2)}, \ldots, A^{(|G|)}\}$, $G \subseteq E$. Given two tuples $e_i$ and $e_j$, the attribute group distance is computed below.

\[ GD_G(e_i, e_j) = \sum_{l=1}^{|G|} \left( \frac{MAXF_{A^{(l)}}}{\sum_{h=1}^{|G|} MAXF_{A^{(h)}}} \right) PD_{A^{(l)}}\left(e_i.A^{(l)}, e_j.A^{(l)}\right) \]

Remember that $MAXF_{A^{(l)}}$ and $PD_{A^{(l)}}$ are the attribute-level Maximum F-score and Proper Distance function for the attribute $A^{(l)}$ respectively. Since each attribute has a different capability for entity matching, we use $MAXF_{A^{(l)}}$, normalized over the group, as the weight of the proper distance function on attribute $A^{(l)}$. In general, a higher MAXF value means a stronger capability for this task. Given an attribute group G, we define the Group Maximum F-score as follows.

Definition 6 (Group Maximum F-score). Given an attribute group G, the Group Maximum F-score, $MAXF_G$, is the maximum F-score achieved under G.

Table 4: Attribute Group in Cora data set
Attribute Group | MAXF
Author | 0.60
Volume | 0.64
Author, Volume | 0.70
Author, Volume, Page | 0.73
Author, Volume, Page, Journal | 0.70

Example 1. Table 4 illustrates a small example in the Cora data set. Using the group of author, volume and page attributes together achieves the best performance among the listed groups.

For a data set containing k attributes, there exist $2^k$ attribute groups (including the groups having only one attribute), each with a different Group Maximum F-score value. We define the Proper Group as the group with the maximum value of $MAXF_G$.

Definition 7 (Proper Group). Given a relation E, we define the Proper Group, $PG(E)$, as the group in E that yields the maximum value of $MAXF_G$ over all attribute groups.
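The weighted distance of Definition 5 can be illustrated with a short sketch. The weights 0.60 and 0.64 are the single-attribute MAXF values of Table 4; everything else is an assumption made for illustration: the function names, the two stand-in distance functions (a token-level Jaccard distance and an exact-match distance, not necessarily the proper distances learned in the paper) and the two toy records.

```python
def jaccard_distance(a, b):
    """Token-level Jaccard distance between two strings, in [0, 1]."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def exact_distance(a, b):
    """0 when the two values are identical, 1 otherwise."""
    return 0.0 if a == b else 1.0


def group_distance(e_i, e_j, group):
    """Attribute group distance GD_G(e_i, e_j).

    group maps each attribute name to (MAXF_A, PD_A): the attribute-level
    maximum F-score, used as an unnormalized weight, and the distance function.
    """
    total = sum(maxf for maxf, _ in group.values())
    return sum(
        (maxf / total) * dist(e_i[attr], e_j[attr])
        for attr, (maxf, dist) in group.items()
    )


group = {
    "author": (0.60, jaccard_distance),  # weight from Table 4
    "volume": (0.64, exact_distance),    # weight from Table 4
}
e1 = {"author": "c. e. brodley", "volume": "19"}      # hypothetical record
e2 = {"author": "carla e. brodley", "volume": "19"}   # hypothetical record
print(group_distance(e1, e2, group))
```

Because the weights are normalized to sum to 1, the group distance stays in the same range as the individual distances, so a single threshold can be learned for it in the same way as for one attribute.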
It is worth mentioning that every attribute group stands for a rule, and the proper group corresponds to an optimal rule for entity matching. Moreover, our rule takes full account of the different capabilities of the attributes rather than treating them equally. Since finding the proper group for a relation is costly, we devise an approximate method, described in the next section.

3. OUR SOLUTION
In this section, we describe our solution in detail. The framework contains two phases, namely a training phase and a testing phase, as shown below.
1. Training phase: The goal of this phase is to use a training set to find an attribute group with a maximum MAXF value, denoted Ĝ.
2. Testing phase: For any tuple pair in a relation E, we compute the distance between the two tuples by using the proper distance functions for Ĝ. If the distance is below the proper threshold, we return the matching symbol. Otherwise, we return the non-matching symbol.
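The testing phase amounts to applying the learned rule to every tuple pair. The sketch below assumes the training phase has already produced a group with weights and distance functions plus a threshold; the function name `match_pairs`, the weights 0.9 and 0.5, the threshold 0.2, the exact-match distance and the three records are all hypothetical. A pair is labeled matching when its distance is at or below the threshold, which is also an assumption about how ties are handled.

```python
from itertools import combinations


def exact_distance(a, b):
    """Stand-in distance: 0 when the values are identical, 1 otherwise."""
    return 0.0 if a == b else 1.0


def match_pairs(records, group, threshold):
    """Label every tuple pair as "matching" or "non-matching".

    records maps a record ID to a dict of attribute values; group maps each
    attribute of the learned group to (weight, distance function).
    """
    total = sum(w for w, _ in group.values())
    results = []
    for (id_i, e_i), (id_j, e_j) in combinations(records.items(), 2):
        d = sum((w / total) * dist(e_i[a], e_j[a]) for a, (w, dist) in group.items())
        label = "matching" if d <= threshold else "non-matching"
        results.append((id_i, id_j, label))
    return results


# Hypothetical learned rule and records.
group = {"title": (0.9, exact_distance), "date": (0.5, exact_distance)}
records = {
    1: {"title": "multivariate decision trees", "date": "1995"},
    2: {"title": "multivariate decision trees", "date": "1995"},
    3: {"title": "multivariate decision trees", "date": "1992"},
}
for row in match_pairs(records, group, threshold=0.2):
    print(row)
```

In this toy run the title alone cannot separate record 3 from the other two, while the added date attribute pushes those pairs above the threshold.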

It is expensive to find the proper group when k is large, since the training phase would have to check $2^k$ attribute groups one by one. To reduce the processing cost, we design a heuristic method that avoids testing too many attribute groups. For example, let $G'$ and $G''$ denote two attribute groups with $G' \cap G'' = \emptyset$, and let G denote the union of these two groups, i.e., $G = G' \cup G''$. We now show how to test whether it is worth combining G with other groups later. First, we compute $MAXF_G$, $MAXF_{G'}$ and $MAXF_{G''}$ for these three groups. If $MAXF_G$ is greater than the other two values, G is treated as a potential group and will be used again later. Otherwise, if $MAXF_G$ is smaller than either value, we consider G to have a low probability of being part of the final proper group, and we do not generate any further group based on G. In this way, we test only a small number of attribute groups, and experimental results show that our result is still quite close to the optimal one.

Algorithm getattributegroup (Algorithm 1) gives the detailed steps to get an appropriate attribute group for E. Parameter c limits the maximum number of attributes in each group. For each h, $1 \le h \le c$, the variable $S_h$ denotes the set of all candidate attribute groups in which the number of attributes is h. The set S contains all candidate attribute groups. First, we test every attribute in E, and insert all triples of the form $(MAXF_A, PD_A, PT_A)$ into S (Lines 2-6). Subsequently, we generate the candidate sets $S_h$ incrementally by invoking the subroutine getcandidategroups. If no more candidate groups are generated, S is complete (Lines 7-14). Finally, the algorithm returns a group in S with the maximum MAXF value (Line 15).

Algorithm getcandidategroups (Algorithm 2) describes how to generate new candidate attribute groups given three parameters, $S_1$, $S_{h-1}$ and h.
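The pruning rule and the two routines described here can be sketched together in a few lines. This is an illustrative reading, not the authors' Java code: the function `maxf` stands for whatever procedure learns $MAXF_G$ for a group from training data and is passed in as a parameter, groups are tracked with their scores only (the proper distances and thresholds are omitted), and skipping a single attribute that already belongs to the group is an assumption of this sketch that follows the disjointness condition in the text. The scores for Author, Volume and their listed combinations come from Table 4; the scores for Page and Journal alone, and the default of 0.0 for unlisted groups, are hypothetical.

```python
def get_candidate_groups(s1, s_prev, maxf):
    """Extend each group of the previous level by one single attribute."""
    s_h = {}
    for g1, f1 in s1.items():
        for g2, f2 in s_prev.items():
            if g1 <= g2:          # attribute already in the group
                continue
            g = g1 | g2
            if g in s_h:          # already accepted via another combination
                continue
            f = maxf(g)
            if f > max(f1, f2):   # keep only groups that beat both parts
                s_h[g] = f
    return s_h


def get_attribute_group(attributes, c, maxf):
    """Return (group, MAXF) with the best score among all kept candidates."""
    s1 = {frozenset([a]): maxf(frozenset([a])) for a in attributes}
    s = dict(s1)
    s_prev = s1
    for _ in range(2, c + 1):
        s_h = get_candidate_groups(s1, s_prev, maxf)
        if not s_h:
            break
        s.update(s_h)
        s_prev = s_h
    best = max(s, key=s.get)
    return best, s[best]


scores = {
    frozenset({"author"}): 0.60,
    frozenset({"volume"}): 0.64,
    frozenset({"page"}): 0.50,      # hypothetical
    frozenset({"journal"}): 0.40,   # hypothetical
    frozenset({"author", "volume"}): 0.70,
    frozenset({"author", "volume", "page"}): 0.73,
    frozenset({"author", "volume", "page", "journal"}): 0.70,
}


def toy_maxf(group):
    """Hypothetical stand-in for learning MAXF_G from training data."""
    return scores.get(group, 0.0)


print(get_attribute_group(["author", "volume", "page", "journal"], 4, toy_maxf))
```

With these toy scores the search stops at size 4, because adding Journal to the group of Author, Volume and Page lowers the score, so the four-attribute group is never kept as a candidate.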
In general, $S_1$ is the set containing all the single attributes, while $S_{h-1}$ contains all candidate attribute groups that have h-1 attributes. The goal is to generate a candidate set in which every group has h attributes. As introduced above, we test every union $G' \cup G''$ where $G' \in S_1$ and $G'' \in S_{h-1}$. The new group is inserted into the candidate set only when its MAXF value is greater than those of both $G'$ and $G''$ (Lines 2-7).

Algorithm 1 getattributegroup(E, c)
 1: S = ∅;
 2: for each attribute A in E do
 3:   Compute MAXF_A, PD_A and PT_A;
 4:   S_1 ← S_1 ∪ {(MAXF_A, PD_A, PT_A)};
 5:   S ← S ∪ S_1;
 6: end for
 7: for attribute group size h = 2 to c do
 8:   S_h ← getcandidategroups(S_1, S_{h-1}, h);
 9:   if S_h = ∅ then
10:     break;
11:   else
12:     S ← S ∪ S_h;
13:   end if
14: end for
15: Let Ĝ be a group in S with maximum MAXF_Ĝ value;
16: return Ĝ;

Algorithm 2 getcandidategroups(S_1, S_{h-1}, h)
 1: S_h = ∅;
 2: for each G' ∈ S_1, G'' ∈ S_{h-1} do
 3:   G = G' ∪ G''
 4:   if MAXF_G > max(MAXF_G', MAXF_G'') then
 5:     S_h ← S_h ∪ {(MAXF_G, PD_G, PT_G)};
 6:   end if
 7: end for
 8: return S_h;

4. EXPERIMENTS

4.1 Experiment Setup
We wrote the code in Java and conducted the experiments on a Windows system. The CPU is an Intel Core at 2.0GHz and the physical memory is 2GB. We use two real data sets: the Cora data set and the Restaurant data set. The Cora data set contains 876 citations of 9 literatures and more than ten attributes. We extract nine attributes: author, title, address, date, page, volume, publisher, editor and journal. The Restaurant data set contains 864 tuples, which record the basic information of restaurants. It has four attributes, name, address, city and type, and all of them are used. We use 10-fold cross-validation to evaluate our method.

4.2 Single-Attribute Performance
We select five distance functions for the following tests: Cosine distance, Jaccard distance, Jaro distance, edit distance and q-gram distance. Figures 1 and 2 show the maximum F-scores for all attributes in the Cora and Restaurant data sets respectively.
We can observe that each attribute has a different capability for entity matching. Here, the title attribute with edit distance is the best choice for the Cora data set, and the name attribute with q-gram distance is the best choice for the Restaurant data set. Indeed, the single attribute title already achieves a high F-score on the Cora data set, and the distance functions show little difference in F-score on short strings. Meanwhile, the q-gram distance behaves well on most attributes.

Figure 1: The MAXF for the Cora data set (maximal F-score of the Cosine, Jaccard, Jaro, edit and q-gram distances on the Author, Title, Address, Date, Page, Volume, Publisher, Editor and Journal attributes)

4.3 Attribute Group Performance
Now we evaluate the performance of attribute groups. Tables 5 and 6 list the top-5 attribute groups obtained from training data for the Cora and Restaurant data sets respectively.

Figure 2: The MAXF for the Restaurant data set (maximal F-score of the Cosine, Jaccard, Jaro, edit and q-gram distances on the Name, Address, City and Type attributes)

In the Cora data set, the best attribute group with per-attribute weighting, i.e., the one that achieves the maximum $MAXF_G$ on training data, is the group of author, title, address and journal attributes. The top-3 groups achieve maximum F-score values around 0.94 on both training data and test data. The maximum F-score values of the 4th and 5th groups on training data are relatively low, and they are also lower than those of the top-3 groups on test data, which demonstrates that our method is stable. Meanwhile, the top-5 attribute groups without weighting are not satisfactory: ignoring the different capabilities of the attributes produces less proper attribute groups.

In the Restaurant data set, the best attribute group with per-attribute weighting is the group of name and address attributes. The maximum F-score value of the 5th attribute group is lower than those of the top-4 attribute groups on both training data and test data. However, because there are few matching record pairs in the training data, the maximum F-score values deviate between training data and test data. Again, the top-5 attribute groups without weighting are not satisfactory.

Figure 3 is the recall-precision curve under the best attribute group for the Cora and Restaurant data sets. If the distance threshold is set high, we get high recall and low precision. Conversely, if the distance threshold is set low, we get high precision and low recall. Neither situation meets the requirements of real applications. Fortunately, with our method there still exist threshold values at which both are high (the top-right corner of the curve).
In fact, in such situations the F-score is approximately maximized. Moreover, the result for the Cora data set is better than that for the Restaurant data set since the former has a greater $MAXF_G$ value.

Figure 3: The Recall-Precision Curve under the best attribute group

4.4 Efficiency of the Heuristic Algorithm
Now we evaluate the performance of the heuristic algorithm introduced in Section 3. We do not limit the attribute group size; that is, we set the maximal number of attributes to the number of all attributes. We compare it with a naive algorithm that enumerates all the attribute groups and then selects the one with the maximum F-score. Figure 4 gives the runtime needed to find the best attribute group on training data for the Cora data set by using the heuristic and naive approaches. Notice that the vertical axis is in powers of 10 and the horizontal axis is the number of attributes used in the current test. As the number of attributes increases, our heuristic algorithm outperforms the naive algorithm. This is because we prune many groups that have a low probability of being the proper attribute group, and avoid considering them further.

Figure 4: Comparison of Runtime under different k (runtime in milliseconds against the attribute number k, from k=1 to k=9, for the naive algorithm and the heuristic algorithm)

4.5 Comparison with Existing Techniques
Finally, we compare our method with Op-Trees [2] and SiFi-Hill [17]. We split the Cora and Restaurant data sets into 2 folds for cross-validation. Figure 5 shows the final results. Our method outperforms Op-Trees and SiFi-Hill. We achieve the highest F-score since we seek the maximum F-score under the appropriate attribute group. SiFi-Hill, however, depends strongly on the given rules: once the rules are chosen improperly, it can never achieve good results. Our method selects the appropriate rules automatically.
Meanwhile, as pointed out in [17], Op-Trees does not consider the redundancy among similarity functions, and SVM consumes a great deal of time for entity matching.

Table 5: Top-5 attribute groups for Cora data set
Rank | Attribute Group With Weighting | Training MaxF | Testing MaxF
1 | Author, Title, Address, Journal | |
2 | Title, Volume | |
3 | Title | |
4 | Author, Address, Page, Volume, Editor | |
5 | Author, Address, Date, Page, Publisher, Editor, Journal | |

Rank | Attribute Group Without Weighting | Training MaxF | Testing MaxF
1 | Title | |
2 | Author, Date, Page, Volume, Editor | |
3 | Author, Page, Volume, Publisher, Editor, Journal | |
4 | Author, Page, Volume, Editor | |
5 | Author, Page, Volume | |

Table 6: Top-5 attribute groups for Restaurant data set
Rank | Attribute Group With Weighting | Training MAXF | Testing MaxF | Attribute Group Without Weighting | Training MAXF | Testing MaxF
1 | Name, Address | | | Name, Address | |
2 | Name, Address, City | | | Name | |
3 | Name, Address, City, Type | | | Address | |
4 | Name, Address, Type | | | City, Type | |
5 | Name, City, Type | | | Type | |

Figure 5: Comparison with existing techniques (maximal F-score of Op-Trees, SiFi-Hill, Our_Method and SVM on the Cora and Restaurant data sets)

5. RELATED WORK
As mentioned in the introduction, there are mainly two categories of methods for entity matching. The first category is the classification-based method. Bilenko et al. [1] propose using an SVM classifier and achieve better results. At the attribute level, their approach considers two groups of similarity functions: character-based and vector-based similarity functions. The character-based group mainly uses the edit distance, while the vector-based group mainly uses the Cosine distance. At the entity level, it generates $m \times k$ features, where m is the number of attributes and k is the number of similarity functions. Once the feature vectors are obtained, an SVM is used to train the classifier. This approach has high accuracy. However, Wang et al. [17] show that it consumes much time on testing data and is difficult to explain and scale out.

The rule-based method is explainable and scalable. Wang et al. [17] first propose the problem of how similar is similar. Given a labeled data set, they observe that different similarity functions and thresholds are redundant. So they first prune redundant similarity functions and thresholds, and then devise three efficient algorithms to find matching entities. SiFi-Hill outperforms SiFi-Greedy and SiFi-Gradient as it considers the dependence among attributes.
In comparison, our approach finds a proper rule, namely a proper attribute group, together with a proper threshold and a proper distance function. Moreover, we find the proper threshold and proper distance function more quickly since we do not need to iterate to find an optimal one. Besides, Chaudhuri et al. [2] propose another explainable technique for record matching, which constructs an operator tree to solve the problem. Guo et al. [7] propose k-partite graph clustering. It assumes that the attribute values of the same entity should overlap. Hence, each node in the k-partite graph represents a value of an attribute, and there is an edge between two nodes if the two values are stored in the same record. After clustering the k-partite graph, the method can perform entity matching and data fusion simultaneously.

In recent years, research has also extended entity matching to temporal data and transactional data. Li et al. [13] apply time decay to account for the effect of time on entity matching. The main idea is that the same entity may change its attribute values over a long time, while different entities may share the same attribute value within a period of time. Therefore, when deciding whether records match, time decay should be used as a weight in computing the similarity. Yakout et al. [18] present an entity matching approach that uses entity behavior to merge transactional data. The main idea is that if the behavior becomes more regular and stable when two transactional records are merged, it is more likely that the two records refer to the same entity. This method also introduces a candidate generation phase to filter out most non-matching pairs.

6. CONCLUSION
In this paper, we present a new learning method for entity matching. First, we define a new concept based on the F-score to find the proper threshold and proper distance function for each attribute and the proper group for the relation.
Then we propose our approach to entity matching and devise a heuristic algorithm to find the appropriate attribute group. The experiments on two real data sets show that our method is efficient, effective and stable.

7. ACKNOWLEDGEMENT
The research of Cheqing Jin is supported by the National Basic Research Program of China (Grant No. 202CB36200), the Key Program of National Natural Science Foundation of China (Grant No ), National Natural Science Foundation of China (Grant No ). The research of Aoying Zhou is supported by National Science Foundation for Distinguished Young Scholars (Grant No ), and Natural Science Foundation of China (No ).

8. REFERENCES
[1] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In ACM SIGKDD, pages 39-48, Washington, ACM.
[2] S. Chaudhuri, B.-C. Chen, V. Ganti, and R. Kaushik. Example-driven design of efficient record matching queries. In VLDB, pages , Vienna, ACM.
[3] S. Chaudhuri, B.-C. Chen, V. Ganti, and R. Kaushik. Example-driven design of efficient record matching queries. In VLDB, pages , Vienna, ACM.
[4] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 9(): 6, January
[5] W. Fan, X. Jia, J. Li, and S. Ma. Reasoning about record matching rules. PVLDB, 2():407 48,
[6] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):83 20, 1969.
[7] S. Guo, X. Dong, D. Srivastava, and R. Zajac. Record linkage with uniqueness constraints and erroneous values. PVLDB, 3(0):47 428, 2010.
[8] M. A. Hernandez and S. J. Stolfo. The merge/purge problem for large databases. Sigmod Record, 24(2):27 38, 1995.
[9] D. O. Holmes and M. C. Mccabe. Improving precision and recall for soundex retrieval. In Proceedings of International Symposium on Information Technology, pages IEEE, 1995.
[10] N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: Similarity measures and algorithms. In ACM SIGMOD, pages , New York, ACM.
[11] H. Lee, R. T. Ng, and K. Shim. Extending q-grams to estimate selectivity of string matching with low edit distance. In VLDB, pages ACM,
[12] C. Li, B. Wang, and X. Yang. Vgram: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, pages ACM,
[13] P. Li, X. L. Dong, A. Maurino, and D. Srivastava. Linking temporal records. PVLDB, 4(): , 2011.
[14] E.-P. Lim, J. Srivastava, S. Prabhakar, and J. Richardson. Entity identification in database integration. In ICDE, pages , Washington, 1993. IEEE.
[15] V. Rastogi, N. N. Dalvi, and M. N. Garofalakis. Large-scale collective entity matching. PVLDB, 4(4):208 28, 2011.
[16] S. Sarawagi and A. Bhamidipaty. Iterative deduplication using active learning. In ACM SIGKDD, pages , New York, ACM.
[17] J. Wang, G. Li, J. X. Yu, and J. Feng. Entity matching: How similar is similar. PVLDB, 0(20): , 4.
[18] M. Yakout, A. K. Elmagarmid, H. Elmeleegy, M. Ouzzani, and A. Qi. Behavior based record linkage. PVLDB, 3(): , 2010.
[19] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB, 4(5): , February 2011.
[20] X. Yang, B. Wang, and C. Li. Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In ACM SIGMOD, pages ACM,
[21] Y. C. Yuan. Multiple imputation for missing data: Concepts and new development. In the 25th Annual SAS Users Group International Conference, 2002.
