A Learning Method for Entity Matching
Jie Chen, Cheqing Jin, Rong Zhang, Aoying Zhou
Shanghai Key Laboratory of Trustworthy Computing, Software Engineering Institute, East China Normal University, China
{cqjin, rzhang,

ABSTRACT

Entity matching, which aims at finding records that refer to the same real-world entity, is critical in various fields such as data cleaning and data integration. There mainly exist two kinds of methods for entity matching: the classification-based method and the rule-based method. The latter is more popular because it is highly scalable, easy to implement and explainable. However, selecting proper thresholds, distance functions and rules has been one of the main challenges for the rule-based method. This paper focuses on devising a method to choose proper rules automatically, as well as selecting appropriate distance functions and thresholds. Based on training data, we define a metric to quantify the appropriateness of each rule, on top of which a heuristic method is proposed. Experimental results on real data sets illustrate the high effectiveness and efficiency of our method.

1. INTRODUCTION

Nowadays, data is vital in many fields, including finance, industry, medicine and technology. However, due to various subjective and objective reasons, such as differing schemas, spelling mistakes and missing data values, the data stored in data sources is often inaccurate, inconsistent and incomplete [9, 2, 4]. Under such circumstances, data quality is always a big issue. Poor data quality can impair the effectiveness of querying and analysis tasks in many applications, which makes it essential to take measures to clean and repair incorrect data. An important task in data cleaning and integration is entity matching, which aims at finding record pairs that refer to the same real-world entity. For example, the Cora data set records a number of citations referring to different literatures.
Table 1 illustrates a small fraction of tuples in this data set, where four citations refer to two different literatures. It is non-trivial to perform entity matching by analyzing the attributes, since the value of any attribute may be noisy or missing. For example, tuples 1 and 2 refer to the same entity though they have different titles; tuples 2 and 3 refer to different entities although they share the same author and title. Entity matching has been widely studied for decades [6, 0,, 6, 2, 5, 8, 4, 7, 5, 3, 7, 3, 8]. Existing work can mainly be classified into two categories: the classification-based method and the rule-based method [17]. The former extracts feature vectors using statistics, machine learning and artificial intelligence techniques, based on which record pairs are classified as matching or non-matching [, 6]. The latter aims at building rules to match records [5, 7, 5]. For instance, two citations in the Cora data set with small distances in the author, title and date attributes can roughly be treated as referring to the same literature. In general, the rule-based method is easier to implement and scale out than the classification-based method, making it more widely used in real applications. One significant challenge of the rule-based method is the difficulty of choosing proper thresholds, distance functions and rules. Firstly, it is hard to set a proper threshold under a given distance function. For example, in Table 1, is it sufficient to claim that two citations refer to the same literature if their edit distance on the title attribute is below 0.3? If not, how about 0.? Secondly, there exist many distance functions to measure the distance between the corresponding attributes of two citations, such as Jaccard distance, edit distance, Cosine distance, q-gram distance and so on. Choosing a proper distance function for each attribute is difficult.
Thirdly, since there are multiple rules and each rule contains a group of attributes, it is quite hard to decide which rule is proper for entity matching. Although these three factors can be decided by experience, this becomes much more challenging when the volume of the data set is huge. The work most related to ours is SiFi [17]. Given several rules and a training data set, this method uses a hill-climbing algorithm to find optimal similarity functions and thresholds for each rule. However, this approach requires some appropriate rules to be given in advance, without which the performance may decrease dramatically. In contrast, our goal is to generate proper rules automatically, as well as to choose proper distance functions and set thresholds, from a training data set. It is interesting to decide the proper threshold for any distance function. Here, we simply assume that two tuples refer to the same entity if their distance is below a given threshold
value. However, when the threshold value is set too small, too many matching pairs may be ignored; conversely, if the threshold value is set relatively large, some non-matching pairs may be mistaken for matching pairs. It is worth noting that precision or recall alone is insufficient to measure the appropriateness of a threshold: the former situation means high precision and low recall, while the latter means low precision and high recall. Hence, we use the F-score instead, which is the harmonic mean of precision and recall. Note that the F-score is the special case of the F_beta measure with beta = 1; the parameter beta balances the weight of precision and recall. If a list of possible terrorists is cross-checked against the list of passengers of a given flight, it is better to have a higher recall than precision, and we can set beta greater than 1. Another example is a marketing campaign, where it is better to have a higher precision than recall, and we can set beta less than 1. Thus we can tune the parameter beta according to the application. In this paper, we treat precision and recall equally. We review these three measures below.

precision = (# of correctly identified duplicate pairs) / (# of identified duplicate pairs)
recall = (# of correctly identified duplicate pairs) / (# of true duplicate pairs)
F-score = (2 * precision * recall) / (precision + recall)

Table 1: An example from the Cora data set

Record ID | Entity ID | Author | Title | Address | Date
1 | 1 | carla e. brodley and paul e. utgoff. | multivariate versus univariate decision trees. | amherst ma |
2 | 1 | c. e. brodley and p. e. utgoff. | multivariate decision trees. | amherst massachusetts |
3 | 2 | c. e. brodley and p. e. utgoff. | multivariate decision trees. | NULL |
4 | 2 | carla e. brodley and paul e. utgoff. | multivariate decision trees. | NULL | 1995.

Given a training data set, we treat the F-score as the target measure, and find for any distance function the threshold that maximizes the F-score.
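As a concrete illustration of these measures, here is a minimal sketch (function and variable names are ours, not from the paper) that computes the F_beta measure from pair counts; beta = 1 gives the F-score used throughout the paper.

```python
def f_beta(num_correct, num_identified, num_true, beta=1.0):
    """F_beta from counts of duplicate pairs.

    num_correct:    # of correctly identified duplicate pairs
    num_identified: # of identified duplicate pairs
    num_true:       # of true duplicate pairs
    """
    if num_identified == 0 or num_true == 0:
        return 0.0
    precision = num_correct / num_identified
    recall = num_correct / num_true
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# beta = 1 reduces to the harmonic mean of precision and recall.
print(f_beta(8, 10, 16))  # precision 0.8, recall 0.5 -> F1 ~ 0.615
```

Setting beta above 1 weights recall more heavily (the flight-passenger example), while beta below 1 weights precision (the marketing example).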
Meanwhile, for any attribute, the distance function with the maximum F-score is chosen as its proper distance function, and the attribute with the maximum F-score is treated as the proper attribute. Moreover, since multiple attributes may have cooperative effects for entity matching, we further explore a method that groups multiple attributes to measure entity similarity. The group of attributes with the maximum F-score is chosen as the proper rule for entity matching. In this way, we can decide the thresholds, distance functions and rules automatically. Hence, any record pair whose distance, computed with the proper distance function under the proper rule, is below the proper threshold is treated as referring to the same entity. It is worth noting that our new method also fully considers the different matching capability of each attribute. To summarize, we make the following contributions.

- We define a new concept based on the F-score to measure the appropriateness of thresholds, distance functions and rules.
- We propose a new method to find proper thresholds, distance functions and rules automatically, which overcomes a major challenge of the rule-based entity matching.
- Experimental results on real data sets illustrate the high efficiency and effectiveness of our method.

The rest of the paper is organized as follows. We define the problem formally in Section 2. In Section 3, we present our approach to entity matching and devise a heuristic algorithm to find an appropriate rule. Section 4 reports the experimental results. Section 5 reviews the related work. Finally, we conclude briefly in the last section.

2. PROBLEM DEFINITION

Let E[A_1, A_2, ..., A_k] be a relation containing n tuples, denoted e_1, e_2, ..., e_n. Here, A_1, A_2, ..., A_k denote the k attributes of E. There exist various distance functions to describe the distance between a pair of attribute values.
Typical distance functions include edit distance, Jaro distance, Jaccard distance, Cosine distance, Euclidean distance, q-gram distance [, 2, 20], soundex distance [9] and so on. Let D = {dis_1, dis_2, ..., dis_m} denote m distance functions. For any dis in D, let dis(e_i.A, e_j.A) denote the distance between the values of attribute A for a tuple pair (e_i, e_j) under function dis. Given a distance function dis and an attribute A, we first define two concepts, namely the Maximum F-score and the Proper Threshold. The Maximum F-score describes the maximum F-score achievable on attribute A by using function dis, while the Proper Threshold is a threshold at which the Maximum F-score is achieved. Both values can be learned from a training data set.

Definition 1 (Maximum F-score). Given a distance function dis and an attribute A, we define the Maximum F-score, MAXF_{dis,A}(E), as the maximum F-score on attribute A for a relation E by using function dis.

Definition 2 (Proper Threshold). Given a distance function dis and an attribute A, we define the Proper Threshold, PT_{dis,A}(E), as a distance threshold at which the Maximum F-score MAXF_{dis,A}(E) is achieved.

Since there exist multiple distance functions, we need to check which distance function is the most proper one for each attribute. For example, Table 2 illustrates the Maximum F-score and Proper Threshold, as well as precision and recall, comparing the Jaccard distance and the edit distance on the title attribute of the Cora data set. In this case, the edit distance is more powerful than the Jaccard distance because of its greater MAXF_{dis,A}(E) value. Hence, we define the attribute-level Maximum F-score for any attribute A below.

Definition 3 (Attribute-level Maximum F-score). Given an attribute A, the attribute-level Maximum F-score of a relation E, MAXF_A(E), is the maximum of MAXF_{dis,A}(E) over all distance functions:

MAXF_A(E) = max_{dis in D} MAXF_{dis,A}(E)
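Definitions 1 and 2 can be learned by a simple sweep over the training data. The sketch below (our own representation, not the paper's) takes labeled pairs annotated with their distance under some function dis on some attribute A, tries every observed distance as a candidate threshold, and returns the best F-score together with the threshold achieving it, i.e. MAXF_{dis,A} and PT_{dis,A}.

```python
def proper_threshold(pairs):
    """pairs: list of (distance, is_match) for labeled tuple pairs.
    A pair is predicted 'matching' when distance <= threshold.
    Returns (maximum F-score, proper threshold)."""
    num_true = sum(1 for _, m in pairs if m)
    best_f, best_t = 0.0, 0.0
    # Only distances that actually occur can change the prediction,
    # so it suffices to try each observed distance as a threshold.
    for t in sorted({d for d, _ in pairs}):
        identified = [m for d, m in pairs if d <= t]
        correct = sum(identified)
        if not identified or num_true == 0:
            continue
        p = correct / len(identified)
        r = correct / num_true
        f = 2 * p * r / (p + r) if p + r > 0 else 0.0
        if f > best_f:
            best_f, best_t = f, t
    return best_f, best_t

pairs = [(0.1, True), (0.2, True), (0.3, False), (0.6, True), (0.9, False)]
print(proper_threshold(pairs))  # best F-score at threshold 0.6
```

Running this once per (distance function, attribute) combination yields the MAXF_{dis,A} values compared in Definition 3.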
Table 2: MAXF and PT for title in the Cora data set

Distance Function | MAXF | PT | Precision | Recall
Jaccard distance | | | |
Edit distance | | | |

Table 3: MAXF, PT and PD in the Cora data set

Attribute | MAXF | Proper Threshold | Proper Distance
Author | | | q-gram distance
Title | | | edit distance
Address | | | q-gram distance
Date | | | Jaro distance
Pages | | | q-gram distance
Volume | | | q-gram distance
Publisher | | | q-gram distance
Editor | | | q-gram distance
Journal | | | q-gram distance

At the same time, we define the attribute-level Proper Threshold, PT_A, as the threshold used for the attribute-level Maximum F-score, and the attribute-level Proper Distance function, PD_A, as the distance function used for the attribute-level Maximum F-score. As there exist multiple distance functions and multiple attributes, it is interesting to find the best distance function and attribute for entity matching. Hence, we define the relation-level Maximum F-score as the maximum attribute-level Maximum F-score, as shown below.

Definition 4 (Relation-level Maximum F-score). We define the relation-level Maximum F-score of a relation E, MAXF(E), as the maximum of MAXF_A(E) over all attributes A:

MAXF(E) = max_{A in E} MAXF_A(E)

Correspondingly, the relation-level Proper Threshold (PT(E)) and relation-level Proper Distance function (PD(E)) can be defined in a similar way. Now, we can claim that a tuple pair refers to the same entity if their distance computed by PD(E) is below PT(E); otherwise, the pair refers to different entities. Table 3 illustrates the attribute-level MAXF, PT_A and PD_A for the Cora data set. We can also observe that the relation-level MAXF is 0.94, achieved by the edit distance (PD(E)) on the title attribute, with PT(E) as the corresponding threshold. An interesting but intuitive observation is that using multiple attributes to match a pair of entities is superior to using only a single attribute, since every attribute provides information on a different aspect. Consequently, we also use a group of selected attributes for entity matching at the same time.
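Definitions 3 and 4 are simple maximizations over a table of learned (MAXF, PT) values. A minimal sketch (the 0.94 for title/edit echoes the paper; the other numbers and all names are illustrative toys):

```python
# maxf[(attribute, distance_name)] = (max F-score, proper threshold),
# as produced by a threshold sweep on training data. Toy values.
maxf = {
    ("title", "edit"): (0.94, 0.35),
    ("title", "jaccard"): (0.90, 0.40),
    ("author", "qgram"): (0.60, 0.25),
}

def attribute_level(maxf, attribute):
    """Definition 3: best distance function for one attribute.
    Returns (MAXF_A, PD_A, PT_A)."""
    dis, (f, t) = max(((d, v) for (a, d), v in maxf.items() if a == attribute),
                      key=lambda kv: kv[1][0])
    return f, dis, t

def relation_level(maxf):
    """Definition 4: best attribute overall.
    Returns (attribute, MAXF(E), PD(E), PT(E))."""
    attrs = {a for a, _ in maxf}
    return max(((a,) + attribute_level(maxf, a) for a in attrs),
               key=lambda r: r[1])

print(attribute_level(maxf, "title"))  # (0.94, 'edit', 0.35)
print(relation_level(maxf))            # ('title', 0.94, 'edit', 0.35)
```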
We take Table 1 as an example. If we only use the single title attribute to find matching records, we will unavoidably mistake tuples 2 and 3 for a matching pair. However, if we use the group of the title and date attributes, it is possible to treat them as a non-matching pair. We now show how to define the distance when using a group of attributes.

Definition 5 (Attribute Group Distance). Let G denote an attribute group in E, G = {A^(1), A^(2), ..., A^(|G|)}, G is a subset of E. Given two tuples e_i and e_j, the attribute group distance is computed as

GD_G(e_i, e_j) = sum_{l=1}^{|G|} ( MAXF_{A^(l)} / sum_{h=1}^{|G|} MAXF_{A^(h)} ) * PD_{A^(l)}(e_i.A^(l), e_j.A^(l))

Recall that MAXF_{A^(l)} and PD_{A^(l)} denote the attribute-level Maximum F-score and the proper distance function of attribute A^(l), respectively. Since each attribute has a different capability for entity matching, we treat MAXF_{A^(l)} as the weight of the proper distance function of attribute A^(l). In general, a higher MAXF value means a more powerful capability for this task. Given an attribute group G, we define the Group Maximum F-score as follows.

Definition 6 (Group Maximum F-score). Given an attribute group G, the Group Maximum F-score, MAXF_G, is the maximum F-score achieved under G.

Table 4: Attribute groups in the Cora data set

Attribute Group | MAXF
Author | 0.60
Volume | 0.64
Author, Volume | 0.70
Author, Volume, Page | 0.73
Author, Volume, Page, Journal | 0.70

Example. Table 4 illustrates a small example from the Cora data set. Using the group of the author, volume and page attributes together achieves the best performance among the groups listed. For a data set containing k attributes, there exist 2^k attribute groups (including groups with only one attribute), each with a different Group Maximum F-score value. We define the Proper Group as the group with the maximum value of MAXF_G.

Definition 7 (Proper Group). Given a relation E, we define the Proper Group, PG(E), as the group in E with the maximum value of MAXF_G over all attribute groups.
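Definition 5 is a weighted sum of per-attribute proper distances, normalized by the total weight. A minimal sketch (data representation and names are ours; the per-attribute distance functions are passed in as black boxes):

```python
def group_distance(group, e_i, e_j, maxf_a, pd_a):
    """Attribute group distance GD_G (Definition 5).

    group:  list of attribute names in G
    e_i, e_j: tuples as dicts, attribute -> value
    maxf_a[A]: attribute-level maximum F-score of A (the weight)
    pd_a[A]: proper distance function of A, (value, value) -> distance
    """
    total_weight = sum(maxf_a[a] for a in group)
    return sum(maxf_a[a] / total_weight * pd_a[a](e_i[a], e_j[a])
               for a in group)

# Toy usage: a crude 0/1 'distance' stands in for the proper
# distance functions; the weights 0.60 and 0.94 are illustrative.
equal = lambda x, y: 0.0 if x == y else 1.0
maxf_a = {"author": 0.60, "title": 0.94}
pd_a = {"author": equal, "title": equal}
e1 = {"author": "brodley", "title": "multivariate decision trees"}
e2 = {"author": "brodley", "title": "multivariate versus univariate decision trees"}
print(group_distance(["author", "title"], e1, e2, maxf_a, pd_a))
```

Because the weights are normalized, an attribute with a high MAXF (here, title) dominates the group distance, matching the intuition that more discriminative attributes should count for more.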
It is worth mentioning that every attribute group stands for a rule, and the proper group corresponds to an optimal rule for entity matching. Meanwhile, our rules fully consider the different capabilities of the attributes instead of treating them equally. Since acquiring the proper group for a relation is costly, we devise an approximate method for this issue, as described in the next section.

3. OUR SOLUTION

In this section, we describe our solution in detail. The framework contains two phases, namely the training phase and the testing phase, as shown below.

1. Training phase: The goal of this phase is to find an attribute group with a maximum MAXF value by using a training set, denoted as Ĝ.
2. Testing phase: For any tuple pair in a relation E, we compute the distance between the two tuples by using the proper distance functions for Ĝ. If the distance is below the proper threshold, return the matching symbol; otherwise, return the non-matching symbol.
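A minimal sketch of such an approximate training-phase search over attribute groups, under the assumption that `group_maxf` is a given black box evaluating MAXF_G on the training data (all names are ours): new groups are built level by level from single attributes, and a group is kept only if it beats both of its parts.

```python
def get_attribute_group(attributes, group_maxf, c):
    """Heuristic search for a high-MAXF attribute group.

    attributes: list of attribute names
    group_maxf: frozenset of attributes -> MAXF_G on training data
    c: maximum number of attributes in a group
    """
    singles = [frozenset([a]) for a in attributes]
    maxf = {g: group_maxf(g) for g in singles}
    candidates = list(singles)   # all surviving groups
    current = list(singles)      # surviving groups of the previous size
    for _ in range(2, c + 1):
        new_level = []
        for g1 in singles:
            for g2 in current:
                if g1 & g2 or (g1 | g2) in maxf:
                    continue  # overlapping parts, or already generated
                g = g1 | g2
                f = group_maxf(g)
                # Pruning rule: keep g only if it beats both of its parts.
                if f > max(maxf[g1], maxf[g2]):
                    maxf[g] = f
                    new_level.append(g)
        if not new_level:
            break  # no candidate groups of this size survived
        candidates.extend(new_level)
        current = new_level
    return max(candidates, key=lambda g: maxf[g])
```

Groups that fail the pruning test are never extended, which is what keeps the number of evaluated groups far below 2^k.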
It is expensive to find the proper group when k is large, since it requires checking 2^k attribute groups one by one in the training phase. To reduce the processing cost, we design a heuristic method instead, which avoids testing too many attribute groups. For example, let G' and G'' denote two disjoint attribute groups, i.e., their intersection is empty, and let G denote their union, G = G' ∪ G''. We then decide whether it is necessary to combine G with other groups in the future. First, we compute MAXF_G, MAXF_G' and MAXF_G'' for these three groups. If MAXF_G is greater than the other two values, G is treated as a potential group and will be used again later. Otherwise, if MAXF_G is smaller than either value, we consider group G to have a low probability of being part of the final proper group; hence, we will not generate any group based on G in the future. In this way, we only test a small number of attribute groups, and experimental results show that our result is still quite close to the optimal one.

Algorithm getAttributeGroup (Algorithm 1) illustrates the detailed steps to obtain an appropriate attribute group for E. Parameter c limits the maximum number of attributes in each group. For each h, 1 <= h <= c, the variable S_h denotes a set containing all candidate attribute groups with h attributes. The set S contains all candidate attribute groups. First, we test every attribute in E, and insert all triples of the form (MAXF_A, PD_A, PT_A) into S_1 (Lines 2-6). Subsequently, we generate the candidate sets S_h incrementally by invoking the subroutine getCandidateGroups; if no more candidate groups are generated, we have finished creating S (Lines 7-14). Finally, the algorithm returns a group in S with the maximum MAXF value (Line 15).

Algorithm 1 getAttributeGroup(E, c)
1: S_1 = ∅;
2: for each attribute A in E do
3:   Compute MAXF_A, PD_A and PT_A;
4:   S_1 ← S_1 ∪ {(MAXF_A, PD_A, PT_A)};
5:   S ← S_1;
6: end for
7: for attribute group size h = 2 to c do
8:   S_h ← getCandidateGroups(S_1, S_{h-1}, h);
9:   if S_h = ∅ then
10:    break;
11:  else
12:    S ← S ∪ S_h;
13:  end if
14: end for
15: Let Ĝ be a group in S with maximum MAXF_Ĝ value;
16: return Ĝ;

Algorithm getCandidateGroups (Algorithm 2) describes how to generate new candidate attribute groups given three parameters, S_1, S' and h. Here, S_1 is a set containing all single attributes, while S' contains all candidate attribute groups with h-1 attributes. The goal is to generate a candidate set in which each group has h attributes. As introduced above, we test every union G = G' ∪ G'' where G' ∈ S_1 and G'' ∈ S'. A new group is inserted into the candidate set only when its MAXF value is greater than those of both G' and G'' (Lines 2-7).

Algorithm 2 getCandidateGroups(S_1, S', h)
1: S_h = ∅;
2: for each G' ∈ S_1, G'' ∈ S' do
3:   G = G' ∪ G'';
4:   if MAXF_G > max(MAXF_G', MAXF_G'') then
5:     S_h ← S_h ∪ {(MAXF_G, PD_G, PT_G)};
6:   end if
7: end for
8: return S_h;

4. EXPERIMENTS

4.1 Experiment Setup

We wrote the code in Java and conducted the experiments on a Windows system with an Intel Core 2.0GHz CPU and 2GB of physical memory. We use two real data sets: the Cora data set and the Restaurant data set. The Cora data set contains 876 citations of 9 literatures and more than ten attributes, of which we extract nine: author, title, address, date, page, volume, publisher, editor and journal. The Restaurant data set contains 864 tuples recording basic information about restaurants; it has four attributes, name, address, city and type, all of which are used. We use 10-fold cross-validation to evaluate our method.

4.2 Single-Attribute Performance

We select five distance functions in the following tests: Cosine distance, Jaccard distance, Jaro distance, edit distance and q-gram distance. Figures 1 and 2 show the maximum F-scores for all attributes in the Cora and Restaurant data sets, respectively.
We can observe that each attribute has a different capability for entity matching. Here, the title attribute with edit distance is the best choice for the Cora data set, and the name attribute with q-gram distance is the best choice for the Restaurant data set. Indeed, the single attribute title can already achieve a high F-score in the Cora data set, and the distance functions show little difference in F-scores for short strings. Meanwhile, we can see that the q-gram distance performs well on most attributes.

Figure 1: The MAXF for the Cora data set (maximal F-score of each attribute under the Cosine, Jaccard, Jaro, edit and q-gram distances)

4.3 Attribute Group Performance

Now we evaluate the performance of attribute groups. Tables 5 and 6 illustrate the top-5 attribute groups obtained from the training data for the Cora and Restaurant data sets, respectively.
Figure 2: The MAXF for the Restaurant data set (maximal F-score of the name, address, city and type attributes under the five distance functions)

In the Cora data set, the best attribute group with weighting of each attribute, i.e., the one achieving the maximum MAXF_G on the training data, is the group of the author, title, address and journal attributes. The top-3 groups achieve maximum F-score values around 0.94 on both the training and the test data. The maximum F-score values of the 4th and 5th groups on the training data are relatively low; consequently, they are also lower than those of the top-3 groups on the test data, which demonstrates that our method is stable. Meanwhile, we can observe that the top-5 attribute groups without weighting are not satisfactory: ignoring the different capabilities of the attributes produces less proper attribute groups.

In the Restaurant data set, the best attribute group with weighting of each attribute is the group of the name and address attributes. We can also find that the maximum F-score value of the 5th attribute group is lower than those of the top-4 groups on both the training and the test data. However, due to the few matching record pairs in the training data, there is a deviation between the maximum F-score values on the training and test data. Again, the top-5 attribute groups without weighting are not satisfactory.

Figure 3 shows the recall-precision curve under the best attribute group for the Cora and Restaurant data sets. If the distance threshold is set high, we get a large recall and a small precision; conversely, if the threshold is set low, we get a large precision and a small recall. Neither situation meets the requirements of real applications. Fortunately, with our method there still exist threshold values for which both measures are high (at the top-right corner of the curve).
In fact, under such thresholds, the F-score value is approximately maximized. Moreover, the result for the Cora data set is better than that for the Restaurant data set, since the former has a greater MAXF_G value.

Figure 3: The recall-precision curve under the best attribute group (Cora and Restaurant data sets)

4.4 Efficiency of the Heuristic Algorithm

Now we evaluate the performance of the heuristic algorithm introduced in Section 3. We do not limit the attribute group size, setting the argument for the maximum number of attributes to the number of all attributes. We compare it with the naive algorithm, which enumerates all attribute groups and then selects the proper attribute group with the maximum F-score. Figure 4 gives the runtime used to find the best attribute group in the training data of the Cora data set for the heuristic and naive approaches. Notice that the vertical axis is on a logarithmic (power-of-10) scale and the horizontal axis is the number of attributes used in the current test. With an increasing number of attributes, our heuristic algorithm increasingly outperforms the naive algorithm. This is because we prune many groups that have a low probability of being part of the proper attribute group, avoiding any further consideration of them.

Figure 4: Comparison of runtime (milliseconds) under different numbers of attributes k, naive vs. heuristic algorithm

4.5 Comparison with Existing Techniques

Finally, we compare our method with Op-Trees [2] and SiFi-Hill [17]. We split the Cora and Restaurant data sets into 2 folds for cross-validation. Figure 5 shows the final results. We can observe that our method outperforms Op-Trees and SiFi-Hill. We achieve the highest F-score since we seek the maximum F-score under the appropriate attribute group. SiFi-Hill, however, is strongly dependent on the given rules: once the rules are chosen improperly, it can never achieve good results, whereas our method selects the appropriate rules automatically.
Meanwhile, as reported in [17], Op-Trees does not consider the redundancy among similarity functions, and SVM consumes much time for entity matching.

5. RELATED WORK

As mentioned in the introduction, there are mainly two categories of methods for entity matching. The first category is the classification-based method. Bilenko et al. [1] propose to use an SVM classifier and achieve good results. At the attribute level, it considers two groups of
similarity functions: character-based and vector-based similarity functions. The character-based group mainly uses the edit distance, while the vector-based group mainly uses the Cosine distance. At the entity level, it generates m × k features, where m is the number of attributes and k is the number of similarity functions. From the resulting feature vectors, an SVM classifier is trained. This approach has high accuracy. However, Wang et al. [17] show that it consumes much time on test data and is difficult to explain and scale out. The rule-based method is explainable and scalable. Wang et al. [17] first propose the problem of how similar is similar. Given a labeled data set, they observe that different similarity functions and thresholds have redundancy, so they first prune redundant similarity functions and thresholds, and then devise three efficient algorithms to find matching entities. SiFi-Hill outperforms SiFi-Greedy and SiFi-Gradient as it considers attribute dependence.

Table 5: Top-5 attribute groups for the Cora data set

Rank | Attribute Group With Weighting | Training MaxF | Testing MaxF
1 | Author, Title, Address, Journal | |
2 | Title, Volume | |
3 | Title | |
4 | Author, Address, Page, Volume, Editor | |
5 | Author, Address, Date, Page, Publisher, Editor, Journal | |

Rank | Attribute Group Without Weighting | Training MaxF | Testing MaxF
1 | Title | |
2 | Author, Date, Page, Volume, Editor | |
3 | Author, Page, Volume, Publisher, Editor, Journal | |
4 | Author, Page, Volume, Editor | |
5 | Author, Page, Volume | |

Table 6: Top-5 attribute groups for the Restaurant data set

Rank | Attribute Group With Weighting | Training MaxF | Testing MaxF | Attribute Group Without Weighting | Training MaxF | Testing MaxF
1 | Name, Address | | | Name, Address | |
2 | Name, Address, City | | | Name | |
3 | Name, Address, City, Type | | | Address | |
4 | Name, Address, Type | | | City, Type | |
5 | Name, City, Type | | | Type | |

Figure 5: Comparison with existing techniques (maximal F-score of Op-Trees, SiFi-Hill, our method and SVM on the Cora and Restaurant data sets)
Relatively speaking, our approach can find a proper rule, namely a proper attribute group, as well as a proper threshold and proper distance function. Moreover, we can find the proper threshold and proper distance function more quickly, since we need not iterate to find an optimal one. Besides, Chaudhuri et al. [2] propose another explainable technique for record matching, which constructs operator trees to solve the problem. Guo et al. [7] propose a novel k-partite graph clustering approach, based on the assumption that the attribute values of the same entity should overlap. Each node in the k-partite graph represents a value of an attribute, and there is an edge between two nodes if the corresponding values are stored in the same record. After clustering the k-partite graph, entity matching and data fusion can be performed simultaneously. In recent years, research has also extended the entity matching problem to temporal and transactional data. Li et al. [13] apply time decay to model the effect of time on entity matching. The main idea is that the same entity may change its attribute values over a long time, while different entities may share the same attribute values during a period of time; therefore, when deciding whether two records match, time decay should be taken as a weight in calculating entity similarity. Yakout et al. [18] present a new entity matching approach that uses entity behavior to merge transactional data. The main idea is that if the entity behavior becomes more regular and stable when two transactional records are merged, it is more likely that the two records refer to the same entity. This method also introduces a candidate generation phase to filter out most non-matching pairs.

6. CONCLUSION

In this paper, we present a new learning method for entity matching. First, we define a new concept based on the F-score to find a proper threshold and proper distance function for each attribute and a proper group for the relation.
Then we propose our approach to entity matching and devise a heuristic algorithm to find the appropriate attribute group. The experiments on two real data sets show that our method is efficient, effective and stable.

7. ACKNOWLEDGEMENT

The research of Cheqing Jin is supported by the National Basic Research Program of China (Grant No. 202CB36200),
the Key Program of National Natural Science Foundation of China (Grant No. ), and National Natural Science Foundation of China (Grant No. ). The research of Aoying Zhou is supported by the National Science Foundation for Distinguished Young Scholars (Grant No. ), and the Natural Science Foundation of China (No. ).

8. REFERENCES

[1] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In ACM SIGKDD, pages 39-48, Washington. ACM.
[2] S. Chaudhuri, B.-C. Chen, V. Ganti, and R. Kaushik. Example-driven design of efficient record matching queries. In VLDB, Vienna. ACM.
[3] S. Chaudhuri, B.-C. Chen, V. Ganti, and R. Kaushik. Example-driven design of efficient record matching queries. In VLDB, Vienna. ACM.
[4] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1-16, January 2007.
[5] W. Fan, X. Jia, J. Li, and S. Ma. Reasoning about record matching rules. PVLDB, 2(1):407-418, 2009.
[6] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183-1210, 1969.
[7] S. Guo, X. Dong, D. Srivastava, and R. Zajac. Record linkage with uniqueness constraints and erroneous values. PVLDB, 3(1):417-428, 2010.
[8] M. A. Hernandez and S. J. Stolfo. The merge/purge problem for large databases. SIGMOD Record, 24(2):127-138, 1995.
[9] D. O. Holmes and M. C. McCabe. Improving precision and recall for soundex retrieval. In Proceedings of the International Symposium on Information Technology. IEEE.
[10] N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: Similarity measures and algorithms. In ACM SIGMOD, New York, 2006. ACM.
[11] H. Lee, R. T. Ng, and K. Shim. Extending q-grams to estimate selectivity of string matching with low edit distance. In VLDB, 2007. ACM.
[12] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, 2007. ACM.
[13] P. Li, X. L. Dong, A. Maurino, and D. Srivastava. Linking temporal records. PVLDB, 4(11), 2011.
[14] E.-P. Lim, J. Srivastava, S. Prabhakar, and J. Richardson. Entity identification in database integration. In ICDE, Washington, 1993. IEEE.
[15] V. Rastogi, N. N. Dalvi, and M. N. Garofalakis. Large-scale collective entity matching. PVLDB, 4(4):208-218, 2011.
[16] S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In ACM SIGKDD, New York, 2002. ACM.
[17] J. Wang, G. Li, J. X. Yu, and J. Feng. Entity matching: How similar is similar. PVLDB, 4(10):622-633, 2011.
[18] M. Yakout, A. K. Elmagarmid, H. Elmeleegy, M. Ouzzani, and A. Qi. Behavior based record linkage. PVLDB, 3(1), 2010.
[19] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB, 4(5), February 2011.
[20] X. Yang, B. Wang, and C. Li. Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In ACM SIGMOD, 2008. ACM.
[21] Y. C. Yuan. Multiple imputation for missing data: Concepts and new development. In the 25th Annual SAS Users Group International Conference, 2002.