A Learning Method for Entity Matching


Jie Chen, Cheqing Jin, Rong Zhang, Aoying Zhou
Shanghai Key Laboratory of Trustworthy Computing, Software Engineering Institute, East China Normal University, China
5500002@ecnu.cn, {cqjin, rzhang, ayzhou}@sei.ecnu.edu.cn

ABSTRACT

Entity matching, which aims at finding similar records referring to the same entity, is critical in various fields such as data cleaning and data integration. There mainly exist two kinds of methods for entity matching: the classification-based method and the rule-based method. The latter is more popular because it is highly scalable, easy to implement and explainable. However, the selection of proper thresholds, distance functions and rules has been one of the main challenges for the rule-based method. This paper focuses on devising a method to choose proper rules automatically, as well as selecting appropriate distance functions and thresholds. Based on the training data, we define a metric to quantify the appropriateness of each rule, and propose a heuristic method built on it. Experimental results on real data sets illustrate the high effectiveness and efficiency of our method.

1. INTRODUCTION

Nowadays, data is vital in many fields, including finance, industry, medicine and technology. However, due to various subjective and objective reasons, such as differing schemas, spelling mistakes and missing data values, the data stored in data sources is often inaccurate, inconsistent and incomplete [9, 2, 4]. Under such circumstances, data quality is always a big issue. Poor data quality can impair the effectiveness of querying and analysis tasks in many applications, which makes it essential to take measures to clean and repair the incorrect data.

An important task in data cleaning and integration is entity matching, which aims at finding record pairs referring to the same real-world entity. For example, the Cora data set (http://www.cs.umass.edu/~mccallum/data/cora-refs.tar.gz) records a number of citations referring to different publications. Table 1 illustrates a small fraction of the tuples in this data set, where four citations refer to two different publications. It is non-trivial to perform entity matching by analyzing the attributes, since the value of any attribute may be noisy or missing. For example, tuples 1 and 2 refer to the same entity although they have different titles; tuples 2 and 3 refer to different entities although they share the same author and title.

Entity matching has been widely studied for decades [6, 0,, 6, 2, 5, 8, 4, 7, 5, 3, 7, 3, 8]. Existing work can mainly be classified into two categories: the classification-based method and the rule-based method [17]. The former extracts feature vectors by using statistics, machine learning and artificial intelligence techniques, based on which record pairs are classified as matching or non-matching [1, 16]. The latter aims at building rules to match records [5, 7, 5]. For instance, two citations in the Cora data set with a small distance on the author, title and date attributes can roughly be treated as referring to the same publication. In general, the rule-based method is easier to implement and scale out than the classification-based method, making it more widely used in real applications.

One significant challenge of the rule-based method is that it is difficult to choose proper thresholds, distance functions and rules. First, it is hard to set a proper threshold under a given distance function.
For example, in Table 1, is it sufficient to claim that two citations refer to the same publication if the edit distance between their titles is below 0.3? If not, how about 0.1? Second, there exist many distance functions to measure the distance between the corresponding attributes of two citations, such as Jaccard distance, edit distance, Cosine distance, q-gram distance and so on, and it is difficult to choose a proper distance function for each attribute. Third, since there are multiple rules and each rule contains a group of attributes, it is quite hard to decide which rule is proper for entity matching. Although these three factors can be decided by experience, this becomes much more challenging when the volume of the data set is huge.

The work most related to ours is SiFi [17]. Given several rules and a training data set, this method uses a hill-climbing algorithm to find optimal similarity functions and thresholds for each rule. However, this approach requires some appropriate rules to be given in advance, without which the performance may decrease dramatically. In contrast, our goal is to generate proper rules automatically, as well as to choose proper distance functions and to set thresholds, from a training data set.

It is interesting to decide the proper threshold for any distance function. Here, we simply assume that two tuples refer to the same entity if their distance is below a given threshold value.

Table 1: An example from the Cora data set

Record ID | Entity ID | Author | Title | Address | Date
1 | 1 | carla e. brodley and paul e. utgoff. | multivariate versus univariate decision trees. | amherst ma | 1992
2 | 1 | c. e. brodley and p. e. utgoff. | multivariate decision trees. | amherst massachusetts | 1992
3 | 2 | c. e. brodley and p. e. utgoff. | multivariate decision trees. | NULL | 1995
4 | 2 | carla e. brodley and paul e. utgoff. | multivariate decision trees. | NULL | 1995

However, when such a threshold value is set too small, too many matching pairs may be missed; if the threshold value is set relatively big, some non-matching pairs may be mistaken for matching pairs. It is worth noting that precision and recall alone are insufficient to measure the appropriateness of a threshold: the former situation means high precision and low recall, while the latter means low precision and high recall. Hence, we use the F-score instead, which is the harmonic mean of precision and recall. Note that the F-score is the special case of the F_β measure; the parameter β balances the weights of precision and recall, and it equals 1 in the F-score. If a list of possible terrorists is crossed with the list of passengers of a given flight, it is better to have a higher recall than precision, and we can set β greater than 1. In a marketing campaign, on the other hand, it is better to have a higher precision than recall, and we can set β less than 1. Thus the parameter β can be tuned according to the application. In this paper, we treat precision and recall equally. We review these three measures below:

precision = (# of correctly identified duplicate pairs) / (# of identified duplicate pairs)
recall = (# of correctly identified duplicate pairs) / (# of true duplicate pairs)
F-score = 2 * precision * recall / (precision + recall)

Given a training data set, we treat the F-score as the target measure and find, for any distance function, a proper threshold with maximum F-score. Meanwhile, for any attribute, the distance function with the maximum F-score is chosen as the proper distance function, and the attribute with the maximum F-score is treated as the proper attribute. Moreover, since multiple attributes may have a cooperative effect for entity matching, we further explore a method that groups multiple attributes to measure entity similarity; the group of attributes with the maximum F-score is chosen as the proper rule for entity matching. In this way, we can decide the thresholds, distance functions and rules automatically. Hence, any record pair whose distance, computed by the proper distance functions under the proper rule, is below the proper threshold is treated as referring to the same entity. It is worth noting that our new method also fully considers the different capability of each attribute for entity matching.

To summarize, we make the following contributions. We define a new concept based on the F-score to measure the appropriateness of thresholds, distance functions and rules. We propose a new method to find proper thresholds, distance functions and rules automatically, which overcomes a major challenge of rule-based entity matching. The experimental results on real data sets illustrate the high efficiency and effectiveness of our method.

The rest of the paper is organized as follows. We define the problem formally in Section 2. In the next section, we propose our approach to entity matching and devise a heuristic algorithm to find an appropriate rule. Section 4 reports the experimental results.
Section 5 reviews the related work. Finally, we conclude the paper briefly in the last section.

2. PROBLEM DEFINITION

Let E[A_1, A_2, ..., A_k] be a relation containing n tuples, denoted as e_1, e_2, ..., e_n. Here, A_1, A_2, ..., A_k denote the k attributes of E. There exist various distance functions to describe the distance between a pair of attribute values. Typical distance functions include edit distance, Jaro distance, Jaccard distance, Cosine distance, Euclidean distance, q-gram distance [11, 12, 20], soundex distance [9] and so on. Let D = {dis_1, dis_2, ..., dis_m} denote m distance functions. For any dis ∈ D, let dis(e_i.A, e_j.A) denote the distance on attribute A for a tuple pair (e_i, e_j) under function dis.

Given a distance function dis and an attribute A, we first define two concepts, namely the Maximum F-score and the Proper Threshold. The Maximum F-score describes the maximum F-score on attribute A by using function dis, while the Proper Threshold describes a threshold at which the Maximum F-score is achieved. Both values can be learned from a training data set.

Definition 1 (Maximum F-score). Given a distance function dis and an attribute A, we define the Maximum F-score, MAXF_{dis,A}(E), as the maximum F-score on the attribute A for a relation E by using function dis.

Definition 2 (Proper Threshold). Given a distance function dis and an attribute A, we define the Proper Threshold, PT_{dis,A}(E), as a distance threshold at which the Maximum F-score MAXF_{dis,A}(E) is achieved.

Since there exist multiple distance functions, we need to check which distance function is the most proper one for each attribute. For example, Table 2 illustrates the Maximum F-score and Proper Threshold, as well as the precision and recall, comparing the Jaccard distance and the edit distance for the title attribute of the Cora data set. In this case, the edit distance is more powerful than the Jaccard distance because of its greater MAXF_{dis,A}(E) value. Hence, we define the attribute-level Maximum F-score for any attribute A below.

Definition 3 (Attribute-level Maximum F-score). Given an attribute A, the attribute-level Maximum F-score of a relation E, MAXF_A(E), is the maximum MAXF_{dis,A}(E) over all distance functions:

MAXF_A(E) = max_{dis ∈ D} MAXF_{dis,A}(E)
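The proper threshold of Definition 2 can be found by a simple sweep over candidate thresholds on the labeled training pairs. The following is a minimal Java sketch of that idea (our own illustration, not the authors' implementation; the LabeledPair record, the 0.01 threshold grid, and the assumption that distances are normalized to [0, 1] are assumptions of this sketch):

```java
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical container: one training pair of attribute values plus its ground-truth label.
record LabeledPair(String valueA, String valueB, boolean isDuplicate) {}

public final class ProperThresholdLearner {

    /** Result of the sweep: MAXF_{dis,A} and PT_{dis,A} of Definitions 1 and 2. */
    public record Result(double maxF, double properThreshold) {}

    /**
     * Sweeps a fixed threshold grid and keeps the threshold with the highest F-score.
     * 'distance' is any distance function normalized to [0, 1] (e.g. edit or Jaccard distance).
     */
    public static Result learn(List<LabeledPair> training,
                               ToDoubleBiFunction<String, String> distance) {
        double bestF = 0.0, bestT = 0.0;
        for (int step = 0; step <= 100; step++) {          // candidate thresholds 0.00 .. 1.00 (assumption)
            double t = step / 100.0;
            int tp = 0, fp = 0, fn = 0;
            for (LabeledPair p : training) {
                boolean predictedMatch = distance.applyAsDouble(p.valueA(), p.valueB()) <= t;
                if (predictedMatch && p.isDuplicate()) tp++;   // correctly identified duplicate pair
                else if (predictedMatch) fp++;                 // wrongly identified pair
                else if (p.isDuplicate()) fn++;                // missed true duplicate pair
            }
            double precision = (tp + fp) == 0 ? 0.0 : (double) tp / (tp + fp);
            double recall    = (tp + fn) == 0 ? 0.0 : (double) tp / (tp + fn);
            double f = (precision + recall) == 0 ? 0.0
                     : 2 * precision * recall / (precision + recall);
            if (f > bestF) { bestF = f; bestT = t; }
        }
        return new Result(bestF, bestT);
    }
}
```

Calling such a routine once per distance function in D and keeping the best result would yield the attribute-level MAXF_A(E), together with its PD_A and PT_A, as used in Definition 3 and in the tables below.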

Table 2: MAXF and PT for the title attribute in the Cora data set

Distance Function | MAXF | PT | Precision | Recall
Jaccard distance | 0 | 0.69 | 0.73 | 8
Edit distance | 0.94 | 0.28 | 0.94 | 0.95

Table 3: MAXF, PT and PD in the Cora data set

Attribute | MAXF | Proper Threshold | Proper Distance
Author | 0.60 | 0.32 | q-gram distance
Title | 0.94 | 0.28 | edit distance
Address | 0.3 | 0.45 | q-gram distance
Date | 0.33 | 0.50 | Jaro distance
Pages | 0.65 | 0.5 | q-gram distance
Volume | 0.53 | 0.67 | q-gram distance
Publisher | 0. | 0.56 | q-gram distance
Editor | 0.20 | 0.60 | q-gram distance
Journal | 0.37 | 0.23 | q-gram distance

At the same time, we also define the attribute-level Proper Threshold, PT_A, as the threshold used for the attribute-level Maximum F-score, and the attribute-level Proper Distance function, PD_A, as the distance function used for the attribute-level Maximum F-score.

As there exist multiple distance functions and multiple attributes, it is interesting to find the best distance function and attribute for entity matching. Hence, we define the relation-level Maximum F-score as the maximum attribute-level Maximum F-score, as shown below.

Definition 4 (Relation-level Maximum F-score). We define the relation-level Maximum F-score of a relation E, MAXF(E), as the maximum MAXF_A(E) over all attributes A:

MAXF(E) = max_{A ∈ E} MAXF_A(E)

Correspondingly, the relation-level Proper Threshold, PT(E), and the relation-level Proper Distance function, PD(E), can be defined in a similar way. Now we can claim that a tuple pair refers to the same entity if their distance computed by PD(E) is below PT(E); otherwise, the pair refers to different entities. Table 3 illustrates the attribute-level MAXF, PT_A and PD_A for the Cora data set. We can also observe that the relation-level MAXF is 0.94, achieved by the edit distance (PD(E)) on the title attribute, and that PT(E) is 0.28.

An interesting but intuitive observation is that using multiple attributes to match a pair of entities is superior to using only a single attribute, since every attribute provides information on a different aspect. Consequently, we also use a group of selected attributes for entity matching at the same time. Take Table 1 as an example. If we only use the single title attribute to find matching records, we will unavoidably mistreat tuples 2 and 3 as a matching pair. However, if we use the group of title and date attributes, it is possible to treat them as a non-matching pair. We now show how to define the distance when using a group of attributes.

Definition 5 (Attribute Group Distance). Let G denote an attribute group in E, G = {A^(1), A^(2), ..., A^(|G|)}, G ⊆ E. Given two tuples e_i and e_j, the attribute group distance is computed as

GD_G(e_i, e_j) = Σ_{l=1..|G|} ( MAXF_{A^(l)} / Σ_{h=1..|G|} MAXF_{A^(h)} ) × PD_{A^(l)}(e_i.A^(l), e_j.A^(l))

Remember that MAXF_{A^(l)} and PD_{A^(l)} denote the attribute-level Maximum F-score and the proper distance function of attribute A^(l), respectively. Since each attribute has a different capability for entity matching, we treat MAXF_{A^(l)} as the weight of the proper distance function of attribute A^(l); in general, a higher MAXF value means a more powerful capability for this task. Given an attribute group G, we define the Group Maximum F-score as follows.

Definition 6 (Group Maximum F-score). Given an attribute group G, the Group Maximum F-score, MAXF_G, is the maximum F-score achieved under G.

Table 4: Attribute groups in the Cora data set

Attribute Group | MAXF
Author | 0.60
Volume | 0.64
Author, Volume | 0.70
Author, Volume, Page | 0.73
Author, Volume, Page, Journal | 0.70

Example 1.
Table 4 illustrates a small example from the Cora data set. Using the group of author, volume and page attributes together achieves better performance than using any of them alone.

For a data set containing k attributes, there exist 2^k attribute groups (including the groups having only one attribute), each with a different Group Maximum F-score value. We define the Proper Group as the group with the maximum value of MAXF_G.

Definition 7 (Proper Group). Given a relation E, we define the Proper Group, PG(E), as the group in E which yields the maximum value of MAXF_G over all attribute groups.

It is worth mentioning that every attribute group stands for a rule, and the proper group corresponds to an optimal rule for entity matching. Meanwhile, our rule fully considers the different capabilities of the attributes and does not just treat them equally. Since acquiring the proper group for a relation is costly, we devise an approximate method for this issue, as described in the next section.

3. OUR SOLUTION

In this section, we describe our solution in detail. The framework contains two phases, namely a training phase and a testing phase, as shown below.

1. Training phase: The goal of this phase is to find an attribute group with a maximum MAXF value by using a training set; this group is denoted as Ĝ.

2. Testing phase: For any tuple pair in a relation E, we compute the distance between the two tuples by using the proper distance functions for Ĝ. If the distance is below the proper threshold, we return the matching symbol; otherwise, we return the non-matching symbol. (A code sketch of this decision is given below.)
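To make Definition 5 and the testing phase concrete, here is a minimal Java sketch (illustrative only; the GroupAttribute record, and the assumption that the weights MAXF_A, the proper distance functions PD_A and the proper threshold come from the training phase, are ours and not the authors' code):

```java
import java.util.List;
import java.util.function.ToDoubleBiFunction;

/**
 * One attribute A^(l) of the learned group Ĝ: its weight MAXF_A, its proper distance
 * function PD_A, and its column index in the tuple (all assumed to come from training).
 */
record GroupAttribute(double maxF, ToDoubleBiFunction<String, String> properDistance, int column) {}

public final class GroupMatcher {

    /** GD_G(e_i, e_j): the MAXF-weighted sum of the per-attribute proper distances. */
    public static double groupDistance(List<GroupAttribute> group, String[] ei, String[] ej) {
        double weightSum = group.stream().mapToDouble(GroupAttribute::maxF).sum();
        double distance = 0.0;
        for (GroupAttribute a : group) {
            double weight = a.maxF() / weightSum;   // MAXF_{A^(l)} / sum over h of MAXF_{A^(h)}
            distance += weight * a.properDistance().applyAsDouble(ei[a.column()], ej[a.column()]);
        }
        return distance;
    }

    /** Testing phase: report a match iff the group distance is below the proper threshold. */
    public static boolean matches(List<GroupAttribute> group, double properThreshold,
                                  String[] ei, String[] ej) {
        return groupDistance(group, ei, ej) < properThreshold;
    }
}
```

In this reading, a tuple pair is reported as matching exactly when its MAXF-weighted group distance falls below the threshold learned for Ĝ.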

It is expensive to find the proper group when the value of k is large, since it requires checking 2^k attribute groups one by one in the training phase. To reduce the processing cost, we design a heuristic method instead, which avoids testing too many attribute groups. For example, let G and G' denote two attribute groups with G ∩ G' = ∅, and let G'' denote the union of these two groups, i.e., G'' = G ∪ G'. We then test whether it is necessary to combine G'' with other groups in the future. First, we compute MAXF_{G''}, MAXF_G and MAXF_{G'} for these three groups. If MAXF_{G''} is greater than the other two values, G'' is treated as a potential group and will be used again later. Otherwise, if MAXF_{G''} is smaller than either value, we consider G'' to have a low probability of being part of the final proper group, and hence we will not generate any group based on G'' in the future. In this way, we only test a small number of attribute groups, and experimental results show that our result is still quite close to the optimal one.

Algorithm getAttributeGroup (Algorithm 1) illustrates the detailed steps to obtain an appropriate attribute group for E. Parameter c limits the maximum number of attributes in each group. For each h, 1 ≤ h ≤ c, the variable S_h denotes a set containing all candidate attribute groups in which the number of attributes is h; the set S contains all candidate attribute groups. First, we test every attribute in E and insert all triples of the form (MAXF_A, PD_A, PT_A) into S_1 (Lines 2-6). Subsequently, we generate the candidate sets S_h incrementally by invoking the subroutine getCandidateGroups; if no more candidate groups are generated, we have finished creating S (Lines 7-14). Finally, the algorithm returns a group in S with the maximum MAXF value (Lines 15-16).

Algorithm 1: getAttributeGroup(E, c)
1:  S_1 = ∅;
2:  for each attribute A in E do
3:      compute MAXF_A, PD_A and PT_A;
4:      S_1 ← S_1 ∪ {(MAXF_A, PD_A, PT_A)};
5:      S ← S_1;
6:  end for
7:  for attribute group size h = 2 to c do
8:      S_h ← getCandidateGroups(S_1, S_{h-1}, h);
9:      if S_h = ∅ then
10:         break;
11:     else
12:         S ← S ∪ S_h;
13:     end if
14: end for
15: let Ĝ be a group in S with maximum MAXF_Ĝ value;
16: return Ĝ;

Algorithm getCandidateGroups (Algorithm 2) describes how to generate new candidate attribute groups given three parameters, S_1, S' and h. Here, S_1 is the set containing all single attributes, while S' contains all candidate attribute groups that have h-1 attributes. The goal is to generate a candidate set in which the number of attributes in each group is h. As introduced above, we test every union G'' = G ∪ G' where G ∈ S' and G' ∈ S_1, and a new group is inserted into the candidate set only when its MAXF value is greater than that of both G and G' (Lines 2-7).

Algorithm 2: getCandidateGroups(S_1, S', h)
1: S_h = ∅;
2: for each G ∈ S', G' ∈ S_1 do
3:     G'' = G ∪ G';
4:     if MAXF_{G''} > max(MAXF_G, MAXF_{G'}) then
5:         S_h ← S_h ∪ {(MAXF_{G''}, PD_{G''}, PT_{G''})};
6:     end if
7: end for
8: return S_h;
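For readers who prefer code to pseudocode, the pruning step of Algorithm 2 can be sketched in Java roughly as follows (a paraphrase under our own assumptions, not the authors' implementation; evaluating MAXF on the training pairs is hidden behind a hypothetical evaluateMaxF callback, and the Candidate record is ours):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToDoubleFunction;

/** A candidate attribute group together with its group maximum F-score MAXF_G. */
record Candidate(Set<String> attributes, double maxF) {}

public final class CandidateGroups {

    /**
     * One round of Algorithm 2: combine every (h-1)-attribute candidate with every single
     * attribute and keep the union only if its MAXF beats both of its parts.
     * 'evaluateMaxF' is a hypothetical callback computing MAXF_G on the training pairs.
     */
    public static List<Candidate> getCandidateGroups(List<Candidate> singles,
                                                     List<Candidate> previous,
                                                     ToDoubleFunction<Set<String>> evaluateMaxF) {
        List<Candidate> next = new ArrayList<>();
        for (Candidate g : previous) {
            for (Candidate single : singles) {
                if (g.attributes().containsAll(single.attributes())) continue; // enforce G ∩ G' = ∅
                Set<String> union = new HashSet<>(g.attributes());
                union.addAll(single.attributes());
                double maxF = evaluateMaxF.applyAsDouble(union);
                if (maxF > Math.max(g.maxF(), single.maxF())) {                // pruning condition
                    next.add(new Candidate(union, maxF));
                }
            }
        }
        return next;
    }
}
```

Algorithm 1 would invoke this routine for h = 2, ..., c, stop as soon as an empty candidate set is returned, and finally report the candidate with the largest MAXF value as Ĝ; a real implementation would also deduplicate unions that are generated more than once.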
4. EXPERIMENTS

4.1 Experiment Setup

We wrote the code in Java and conducted the experiments on a Windows system. The CPU is an Intel Core 2.0GHz and the physical memory is 2GB. We use two real data sets: the Cora data set and the Restaurant data set (http://www.cs.utexas.edu/users/ml/riddle/data). The Cora data set contains 876 citations of 9 publications and more than ten attributes; we extract nine attributes: author, title, address, date, page, volume, publisher, editor and journal. The Restaurant data set contains 864 tuples recording the basic information of restaurants; it has four attributes, name, address, city and type, and all of them are used. We use 10-fold cross-validation to evaluate our method.

4.2 Single-Attribute Performance

We select five distance functions in the following tests: Cosine distance, Jaccard distance, Jaro distance, edit distance and q-gram distance. Figures 1 and 2 show the maximum F-scores for all attributes in the Cora and Restaurant data sets, respectively. We can observe that each attribute has a different capability for entity matching. Here, the title attribute with the edit distance is the best choice for the Cora data set, and the name attribute with the q-gram distance is the best choice for the Restaurant data set. Indeed, the single attribute title can already achieve a high F-score in the Cora data set, and the distance functions show little difference in F-score for short strings. Meanwhile, we can see that the q-gram distance behaves well on most attributes.

Figure 1: The MAXF for the Cora data set

4.3 Attribute Group Performance

Now we evaluate the performance of attribute groups. Tables 5 and 6 illustrate the top-5 attribute groups obtained from the training data for the Cora and Restaurant data sets, respectively.

Figure 2: The MAXF for the Restaurant data set

In the Cora data set, the best attribute group with weighting of each attribute, i.e., the group that achieves the maximum MAXF_G on the training data, is the group of author, title, address and journal. The top-3 groups achieve maximum F-score values around 0.94 on both the training data and the test data. The maximum F-score values of the 4th and 5th groups on the training data are relatively low; consequently, they are also lower than those of the top-3 groups on the test data, which demonstrates that our method is stable. Meanwhile, we can observe that the top-5 attribute groups without weighting are not satisfactory: ignoring the different capabilities of the attributes produces less proper attribute groups.

In the Restaurant data set, the best attribute group with weighting of each attribute is the group of name and address. We can also find that the maximum F-score value of the 5th attribute group is lower than those of the top-4 attribute groups on both the training data and the test data. However, due to the few matching record pairs in the training data, there is a deviation between the maximum F-score values on the training data and on the test data. Again, the top-5 attribute groups without weighting are not satisfactory.

Figure 3 shows the recall-precision curve under the best attribute group for the Cora and Restaurant data sets. If the distance threshold is set high, we get a large recall and a small precision; contrarily, if the distance threshold is set low, we get a large precision and a small recall. Neither situation meets the requirements of real applications. Fortunately, with our method there still exist appropriate threshold values for which both values are high (at the top-right corner); in fact, under such thresholds the F-score value is approximately maximized. Moreover, the result for the Cora data set is better than that for the Restaurant data set, since the former has a greater MAXF_G value.

Figure 3: The Recall-Precision Curve under the best attribute group

4.4 Efficiency of the Heuristic Algorithm

Now we evaluate the performance of the heuristic algorithm introduced in Section 3. We do not limit the attribute group size, and set the argument c (the maximal number of attributes per group) to the number of all attributes. We compare it with a naive algorithm which enumerates all the attribute groups and then selects the proper attribute group with the maximum F-score. Figure 4 gives the runtime used to find the best attribute group in the training data of the Cora data set by the heuristic and the naive approach. Notice that the vertical axis is on a logarithmic scale (powers of 10) and the horizontal axis is the number of attributes used in the current test. With an increasing number of attributes, our heuristic algorithm clearly outperforms the naive algorithm. This is because we prune many groups that have a low probability of being part of the proper attribute group, avoiding considering them further.
Figure 4: Comparison of runtime under different k

4.5 Comparison with Existing Techniques

Finally, we compare our method with Op-Trees [2] and SiFi-Hill [17]. We split the Cora and Restaurant data sets into 2 folds for cross-validation. Figure 5 shows the final results. We can observe that our method outperforms Op-Trees and SiFi-Hill. We achieve the highest F-score since we seek the maximum F-score under the appropriate attribute group. SiFi-Hill, in contrast, is strongly dependent on the given rules: once the rules are chosen improperly, it can never achieve good results, whereas our method selects the appropriate rules automatically. Meanwhile, as discussed in [17], Op-Trees does not consider the redundancy among similarity functions, and SVM consumes much time for entity matching.

Table 5: Top-5 attribute groups for the Cora data set

With weighting:
Rank | Attribute Group | Training MaxF | Testing MaxF
1 | Author, Title, Address, Journal | 0.94 | 0.94
2 | Title, Volume | 0.94 | 0.95
3 | Title | 0.94 | 0.93
4 | Author, Address, Page, Volume, Editor | 0.74 | 0.7
5 | Author, Address, Date, Page, Publisher, Editor, Journal | 0.74 | 0.7

Without weighting:
Rank | Attribute Group | Training MaxF | Testing MaxF
1 | Title | 0.94 | 0.93
2 | Author, Date, Page, Volume, Editor | 0.73 | 0.63
3 | Author, Page, Volume, Publisher, Editor, Journal | 0.73 | 0.6
4 | Author, Page, Volume, Editor | 0.73 | 0.6
5 | Author, Page, Volume | 0.73 | 0.6

Table 6: Top-5 attribute groups for the Restaurant data set

With weighting:
Rank | Attribute Group | Training MAXF | Testing MaxF
1 | Name, Address | 0.98 | 0.9
2 | Name, Address, City | 0.98 | 8
3 | Name, Address, City, Type | 0.97 | 0.90
4 | Name, Address, Type | 0.97 | 0.90
5 | Name, City, Type | 0.92 | 2

Without weighting:
Rank | Attribute Group | Training MAXF | Testing MaxF
1 | Name, Address | 0.98 | 0.9
2 | Name | 0.9 | 3
3 | Address | 8 | 0.63
4 | City, Type | 0.4 | 0.0
5 | Type | 0.09 | 0.003

Figure 5: Comparison with existing techniques

5. RELATED WORK

As mentioned in the introduction, there are mainly two categories of solutions to the entity matching problem. The first category is the classification-based method. Bilenko et al. [1] propose to use an SVM classifier and achieve good results. At the attribute level, this approach considers two groups of similarity functions, character-based and vector-based; the character-based group mainly uses edit distance, while the vector-based group mainly uses Cosine distance. At the entity level, it generates m × k features, where m is the number of attributes and k is the number of similarity functions. From the resulting feature vectors, an SVM classifier is trained. This approach has high accuracy; however, Wang et al. [17] show that it consumes much time on the test data and is difficult to explain and to scale out.

The rule-based method is explainable and scalable. Wang et al. [17] first propose the problem of how similar is similar. Given a labeled data set, they observe that different similarity functions and thresholds have redundancy, so they first prune redundant similarity functions and thresholds, and then devise three efficient algorithms to find matching entities. SiFi-Hill outperforms SiFi-Greedy and SiFi-Gradient because it considers the dependence among attributes. Relatively speaking, our approach can find a proper rule, namely the proper attribute group, as well as the proper threshold and proper distance function; moreover, we can find the proper threshold and proper distance function more quickly, since we do not need to iterate to find an optimal one. Besides, Chaudhuri et al. [2] propose another explainable technique for record matching, which constructs an operator tree to solve the problem. Guo et al. [7] propose a novel k-partite graph clustering approach, based on the idea that the attribute values of the same entity should overlap. Each node in the k-partite graph represents a value of an attribute, and there is an edge between two nodes if the corresponding values are stored in the same record. After clustering the k-partite graph, entity matching and data fusion can be performed simultaneously.

In recent years, research has also extended the entity matching problem to temporal and transactional data. Li et al. [13] apply time decay to model the effect of time on entity matching. The main idea is that the same entity may change its attribute values over a long period, while different entities may share the same attribute values during some period; therefore, when deciding whether entities match, time decay should be taken as a weight when calculating the similarity of entities. Yakout et al. [18]
present a new entity matching approach that uses entity behavior to merge transactional data. The main idea is that if the entity behavior becomes more regular and stable when two transactional records are merged, it is more likely that these two records refer to the same entity. This method also introduces a candidate generation phase to filter out most non-matching pairs.

6. CONCLUSION

In this paper, we present a new learning method for entity matching. We first define a new concept based on the F-score to find the proper threshold and proper distance function for each attribute, and the proper group for the relation. We then propose our approach to entity matching and devise a heuristic algorithm to find the appropriate attribute group. The experiments on two real data sets show that our method is efficient, effective and stable.

7. ACKNOWLEDGEMENT

The research of Cheqing Jin is supported by the National Basic Research Program of China (Grant No. 202CB36200),

the Key Program of the National Natural Science Foundation of China (Grant No. 6093300), and the National Natural Science Foundation of China (Grant No. 6070052). The research of Aoying Zhou is supported by the National Science Foundation for Distinguished Young Scholars (Grant No. 60925008) and the Natural Science Foundation of China (No. 602004).

8. REFERENCES

[1] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In ACM SIGKDD, pages 39-48, Washington, 2003.
[2] S. Chaudhuri, B.-C. Chen, V. Ganti, and R. Kaushik. Example-driven design of efficient record matching queries. In VLDB, pages 327-338, Vienna, 2007.
[3] S. Chaudhuri, B.-C. Chen, V. Ganti, and R. Kaushik. Example-driven design of efficient record matching queries. In VLDB, pages 327-338, Vienna, 2007.
[4] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1-16, January 2007.
[5] W. Fan, X. Jia, J. Li, and S. Ma. Reasoning about record matching rules. PVLDB, 2(1):407-418, 2009.
[6] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183-1210, 1969.
[7] S. Guo, X. Dong, D. Srivastava, and R. Zajac. Record linkage with uniqueness constraints and erroneous values. PVLDB, 3(1):417-428, 2010.
[8] M. A. Hernandez and S. J. Stolfo. The merge/purge problem for large databases. SIGMOD Record, 24(2):127-138, 1995.
[9] D. O. Holmes and M. C. McCabe. Improving precision and recall for soundex retrieval. In Proceedings of the International Symposium on Information Technology, pages 22-26. IEEE, 1995.
[10] N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: Similarity measures and algorithms. In ACM SIGMOD, pages 802-803, New York, 2006.
[11] H. Lee, R. T. Ng, and K. Shim. Extending q-grams to estimate selectivity of string matching with low edit distance. In VLDB, pages 195-206, 2007.
[12] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, pages 303-314, 2007.
[13] P. Li, X. L. Dong, A. Maurino, and D. Srivastava. Linking temporal records. PVLDB, 4(11):956-967, 2011.
[14] E.-P. Lim, J. Srivastava, S. Prabhakar, and J. Richardson. Entity identification in database integration. In ICDE, pages 294-301, Washington, 1993.
[15] V. Rastogi, N. N. Dalvi, and M. N. Garofalakis. Large-scale collective entity matching. PVLDB, 4(4):208-218, 2011.
[16] S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In ACM SIGKDD, pages 269-278, New York, 2002.
[17] J. Wang, G. Li, J. X. Yu, and J. Feng. Entity matching: How similar is similar. PVLDB, 4(10):622-633, 2011.
[18] M. Yakout, A. K. Elmagarmid, H. Elmeleegy, M. Ouzzani, and A. Qi. Behavior based record linkage. PVLDB, 3(1):439-448, 2010.
[19] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB, 4(5):279-289, 2011.
[20] X. Yang, B. Wang, and C. Li. Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In ACM SIGMOD, pages 353-364, 2008.
[21] Y. C. Yuan. Multiple imputation for missing data: Concepts and new development. In the 25th Annual SAS Users Group International Conference, 2002.