Two-Phase Schema Matching in Real World Relational Databases


Nikolaos Bozovic #1, Vasilis Vassalos #2
# Department of Informatics, Athens University of Economics and Business, Greece
1 nbosovic@gmail.com
2 vassalos@aueb.gr

Abstract

We propose a new approach to the problem of schema matching in relational databases that merges the hybrid and composite approaches to combining multiple individual matching techniques. In particular, we propose assigning individual matchers to two categories: strong matchers, which provide a priori higher quality matches, and weak matchers, which may be more sensitive to the inputs and are less reliable but can still help generate some matches. Matching is correspondingly done in two phases: strong matches are produced by the strong matchers and combined using a simple voting combiner, and weak matchers then provide additional evidence for attributes left unmatched (again using a voting combiner). We observe that, while many recent advances in schema matching [2] [5] [7] [11] use composite schema matching and rely on the existence of training schemas to train combiners, in many real-world situations it is not feasible to employ learning techniques because training data (i.e., schemas or instance data) are unavailable. We hypothesize that weak matchers can often hurt overall accuracy if used in a single-phase composite matcher that does not employ learning techniques. We implement our two-phase approach in the ASID system and evaluate it using real-life schemas. The experiments validate our hypothesis regarding the negative effect of weak matchers and also show that ASID performs comparably to state-of-the-art systems while requiring no training schemas. We also demonstrate the benefits of a simple documentation-based matcher. Our experimental data included schemas ranging from 20 to 120 attributes. Note that schemas with 120 attributes are as large as or larger than those used in other published evaluations of relational schema matching.

I. INTRODUCTION

Continued high interest in information integration has led to a large body of research work on schema matching (e.g., [1] [2] [3]). Schema matching is a process that takes two (or more) schemas as input and produces a mapping between pairs of schema elements that correspond semantically to each other [1]. These numerous research efforts on schema matching have in turn led to work devoted to the evaluation and comparison of proposed solutions [1] [2], as well as to evaluations of the evaluation techniques [4]. At the same time, the focus of research and method evaluation has most recently been on internet and semistructured data and on ontology matching, even though the challenges of relational schema matching have not been fully addressed.

Recently, more and more approaches [2] [5] [16] [15] use machine learning to help produce matches. Learning techniques ranging from simple Naïve Bayes classifiers [21] to custom classifiers to neural networks have been used for matching on both the attribute and instance levels. While learning is shown to provide significant advantages, there are many cases where this approach is not feasible. In particular, neither large quantities of data nor many training schemas may be available. While it is true that large database systems contain more than enough instances to train instance-based schema learners, these data may not be accessible. If, as is often the case, the owners of the databases that need schema matching are governmental agencies storing sensitive population data, data instances may not be available for schema matching due to regulatory restrictions. Moreover, schema matching is often performed on systems that are not yet operational, where no data exists yet [4].

The availability of data is only one problem. Quality is the other. As every database administrator knows, all database systems contain inconsistent data even if every precaution is taken to avoid them.
A question that remains unanswered is the impact that such inconsistencies have on the learning and/or matching phase, as most existing work dismisses the potential impact of dirty data.

Techniques that require training on schemas are even less suitable for common enterprise schema matching tasks: more often than not, the matching of relational schemas in an enterprise is done between 2 or 3 schemas, for example during mergers, or when upgrading or replacing an existing system. In these common cases, there are no schemas to use for training. Schemas may be available for training when integrating large numbers of databases, e.g., on the Web, but even then, the work required to match every attribute of enough schemas to train the algorithms may be disproportionate to the usefulness of the final result. GTE Telephone Operations, based on work done for 40 applications, estimates an average of four hours per data element to extract and document data semantics when the task must be performed by someone other than the data owner [1]. This shows that approaches needing training schemas may suit hundreds of small schemas, such as those found in internet applications, but do not suit the large schemas in use in today's large organisations.

In environments where training combiners and using other learning techniques is not feasible, systems such as COMA++ [17] and Cupid [7] can be used, but a number of challenges remain.

IEEE, ICDE Workshop 2008

We try to answer the following two

questions in this paper. First, is it always better to use more individual matchers in a single-phase composite system when the match combiner is not a learning-based one? Or is it better to prioritize them appropriately (as we would in a hybrid system), while still using a composite, vote-based combiner in each phase? The two-phase process first performs the matching using a set of strong matchers, i.e., matchers that are known (e.g., from existing literature) to provide more reliable matches with less variance, to determine the easy matches, which are output directly. It then employs the help of weaker matchers, e.g., instance-level matchers that depend on the amount of available data, to discover more matches or to provide evidence for low-confidence matches discovered in the first phase. The hypothesis is that, while every matcher (and all of them together) can potentially help discover a match, using all matchers in a single-phase matching process with a non-learning-based combiner could seriously degrade the quality of the matches because of the sometimes erratic behaviour of the weak matchers.

The second question we try to address concerns the efficacy of a documentation-based matcher. Many recent research works downplay the usefulness of documentation, a representative remark being: "Extracting semantics information from data creators and documentation is often extremely cumbersome. [...] Documentation tends to be sketchy, incorrect, and outdated." [18] While that is correct for web forms or XML data, it is an overstatement when referring to relational databases. Large systems are usually documented, and in most cases this documentation is up to date and correct, as large organizations often take seriously the correctness and completeness of the documentation they receive with the systems they buy or develop. Apart from personal experience, this is corroborated by other sources: Out of 265 conceptual (ER) models from the U.S.
Department of Defence metadata registry that contain 13,049 elements (entities or relationships) and 163,736 attributes, the vast majority are documented, albeit with a single sentence [19]. This information is a valuable tool for schema matching, as we will demonstrate. As with most sources of information in schema matching, there is more than one way to exploit documentation. We use a simple technique, similar to one used in [2], to match textual descriptions of attributes, and experimentally show the robustness of this technique.

To explore the benefits of two-phase matching and to implement our documentation-based attribute matcher, we built the ASID schema-matching system. ASID (Another Schema Integration Daemon) uses variations of existing attribute matching techniques and methods (no structural matching), to make it simpler to isolate the impact of two-phase matching and to enable a more or less direct comparison of its performance with existing schema matching systems.

In the following sections, we first provide an architectural overview of ASID; in Section III we describe in more detail the two phases of matching, including the matchers used in each phase; and in Section IV we present an experimental study. Related work is presented in Section V.

II. THE ASID SYSTEM

A. A System that ensures matcher robustness

Most contemporary systems are composite systems; that is, more than one method is used to produce schema matches, and their results are combined to produce the final matches. Composite systems have proven to provide increased matching accuracy, but an important question remains: as some matching techniques are less reliable than others, can they unbalance a composite matching system? In ASID, we propose separating the available matching techniques into strong and weak ones. This distinction can be based on existing research results and trial-and-error experimentation.
Strong techniques are those that have consistently shown good results in schema matching and are less likely to be affected by differences in domains, implementation details and/or the size and quality of instance data. Techniques that do not exhibit this desired behaviour are classified as "weak": they can be used and can be helpful, but their credibility is less certain.

In ASID, matching is performed in two phases, and strong techniques take precedence. The first-phase combiner uses their results to produce plausible matches that are immediately output to the user. If there are unmatched attributes, or the score is too low for some match, these attributes are passed to the second-phase matching, where the weak techniques are also used. In the second phase, the results from all techniques are combined. This arrangement ensures that matches are not unbalanced by weak matching techniques, while the system can still benefit from the extra matching ability they provide.

FIGURE 1. THE ASID SYSTEM
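The two-phase arrangement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ASID implementation: the function names (`vote`, `two_phase_match`), the representation of a matcher as a function returning a score in [0, 1], and the choice of keeping only the best-scoring free target for each source attribute are all hypothetical. The averaging combiner and the 0.5 threshold follow the description of the combiners given later in the paper.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Assumption: a matcher scores one (source attribute, target attribute) pair in [0, 1].
Matcher = Callable[[str, str], float]


def vote(matchers: Sequence[Matcher], src: str, tgt: str) -> float:
    """Voting combiner: the plain average of the individual matcher scores."""
    return sum(m(src, tgt) for m in matchers) / len(matchers)


def two_phase_match(
    sources: Sequence[str],
    targets: Sequence[str],
    strong: Sequence[Matcher],
    weak: Sequence[Matcher],
    threshold: float = 0.5,
) -> Dict[str, Tuple[str, float]]:
    """Return {source attribute: (target attribute, combined score)}."""
    matches: Dict[str, Tuple[str, float]] = {}
    free_targets: List[str] = list(targets)

    # Phase 1 votes with the strong matchers only; phase 2 votes with all matchers.
    for matchers in (list(strong), list(strong) + list(weak)):
        if not matchers:
            continue
        for src in sources:
            if src in matches or not free_targets:
                continue
            # Assumption: consider only the best-scoring free target per source.
            scored = [(vote(matchers, src, tgt), tgt) for tgt in free_targets]
            score, best = max(scored, key=lambda pair: pair[0])
            if score > threshold:
                matches[src] = (best, score)
                free_targets.remove(best)  # matched pairs leave the pool
    return matches
```

In this sketch a pair that the strong matchers score below the threshold is not discarded; it is re-scored in the second pass, where evidence from the weak matchers can lift the average above the threshold.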

For the prototype implementation we created four matchers, two strong ones and two weak ones. In each phase, the combiner is a simple voting one: a match candidate is considered a match when its combined score is above a certain threshold. In the second phase, the weak methods help produce more high-quality matches by taking the unmatched results of the strong phase and possibly raising their scores above the threshold.

B. A word on data cleaning

As any database administrator knows, dirty data are present in every database containing a non-trivial amount of data. Dirty data are more common than most schema matching research acknowledges, and their impact on instance-level schema matching performance is unknown. ASID incorporates a simple data cleaning step before loading instance-level data. In the current implementation, the cleaning process is manual. The impact of data cleaning is only briefly addressed in this paper (see Section IV.C); for more information please see [20].

III. ASID SYSTEM MODULES

[Diagram components: Loader, Data Cleaner, Relevant instance selector, Strong Matching, Primary Evaluator, Weak Matching, Prediction Combiner]

The current ASID implementation includes two strong matching techniques: one based on string similarity between the names of the attributes and one based on TF/IDF-weighted cosine similarity on the descriptions of the attributes. The system may of course use additional techniques, or replace the existing ones. In principle, schema-based, rather than instance-based, techniques seem to be more appropriate for the strong matching module. More details on the techniques used are available in the following paragraphs.
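The string-similarity name matcher mentioned above relies on the Jaro metric, which the Name Matching subsection defines. As a rough sketch of that definition, assuming a greedy one-to-one assignment of common characters (a usual implementation choice that the paper does not spell out) and the window H = min(|s|, |t|) / 2 given in the text, the score could be computed as follows. The function name `jaro` is hypothetical.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if not s or not t:
        return 0.0

    # Window from the paper's definition; integer division is equivalent
    # here because character positions are integers.
    h = min(len(s), len(t)) // 2

    t_used = [False] * len(t)
    s_common = []  # characters of s that are common with t, in s order
    for i, ch in enumerate(s):
        lo = max(0, i - h)
        hi = min(len(t), i + h + 1)
        for j in range(lo, hi):
            # Assumption: each character of t may be matched at most once.
            if not t_used[j] and t[j] == ch:
                t_used[j] = True
                s_common.append(ch)
                break

    if not s_common:
        return 0.0

    # Characters of t that are common with s, in t order.
    t_common = [t[j] for j in range(len(t)) if t_used[j]]

    # T is half the number of positions where the two common sequences differ.
    mismatches = sum(a != b for a, b in zip(s_common, t_common))
    half_transpositions = mismatches / 2

    m = len(s_common)
    return (m / len(s) + m / len(t) + (m - half_transpositions) / m) / 3
```

Because common characters are assigned one-to-one in this sketch, the two common subsequences have the same length, so the first two terms of the formula share the numerator `m`.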
FIGURE 2. ASID 2-PHASE MATCHING

A. Strong Matching Module

The internal representation of two schemata (named source and target, according to their role) is fed into the system, and all available techniques are used independently to match each attribute of the source schema to the corresponding attributes of the target schema.

1) Name Matching

Name matching is a string matching technique applied to the names of the attributes, and it is used extensively in schema matching. We use the Jaro metric for this task, as recent research [14] has shown that it is one of the best algorithms for name matching. The Jaro metric is based on the number and order of the common characters between two strings. Given strings $s = a_1 \dots a_k$ and $t = b_1 \dots b_l$, define a character $a_i$ in $s$ to be common with $t$ if there is a $b_j = a_i$ in $t$ such that $i - H \le j \le i + H$, where $H = \frac{\min(|s|, |t|)}{2}$. Let $s' = a'_1 \dots a'_{k'}$ be the characters in $s$ which are common with $t$ (in the same order they appear in $s$) and let $t' = b'_1 \dots b'_{l'}$ be defined the same way for $t$. Now define a transposition for $s', t'$ to be a position $i$ such that $a'_i \ne b'_i$. Let $T_{s',t'}$ be half the number of transpositions for $s'$ and $t'$. The Jaro similarity metric for $s$ and $t$ is

$$\mathrm{Jaro}(s, t) = \frac{1}{3} \left( \frac{|s'|}{|s|} + \frac{|t'|}{|t|} + \frac{|s'| - T_{s',t'}}{|s'|} \right)$$

It is outside the scope of this text to analyse the algorithm further. For more information, the reader can consult [14]. In our implementation, scores range in [0, 1], with 1 being a total match.

2) Attribute Description Matching

This method tries to exploit simple forms of documentation that often exist in relational systems used in organizations. In most RDBMS deployments, each attribute has a description written in natural language that describes its contents, usually in a sentence or two.

FIGURE 3. CREATING THE CORPUS

Our matching method is a simple one: for each attribute of the source schema, we create a text corpus made of its description string and the description strings of all the

attributes of the target schema that are possible matches. Then we use TF/IDF weighting to compute vector similarity between the first member of the corpus and the remaining ones: the more similar the description, the higher the score. This method ensures that words appearing many times in the corpus, like "table", are not overly important to the computation of similarity, so that the truly important words dominate the result. Preliminary experiments with this method are very encouraging: more than 70% true matches when all attributes have descriptions. Further improvement can be obtained by using dictionaries and synonym tables to help identify descriptions with different wordings.

B. Primary Evaluator

This is a simple match combiner that uses voting to compute an overall score for each possible match. All scores available from the strong matching techniques are added and the sum is divided by their number (currently 2). If, after the total evaluation, a match is deemed highly probable (with a combined score greater than 0.5), then the evaluator produces a good match, presenting it to the user and removing the matched couple from the source and target schemata. The difference from existing composite systems is that those matches failing to meet the threshold criterion are redirected to the weak matching phase in the hope of finding a better match with its help. It is important to note that all possible (but not actual) matches proceed to be re-evaluated. We also tried promoting only the attribute pairs with the highest scores to the second phase, but experiments showed a comparatively larger number of false positives. Hence all non-matched attribute pairs proceed to the next phase.

C. Weak Matching Module

The weak matching module is almost identical in structure to the strong one, with one obvious difference: it uses different matchers.
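TF/IDF-weighted cosine similarity is used above for attribute descriptions and again below for instance data, so a small sketch may help. It is written in plain Python under stated assumptions: the tokenizer, the smoothed IDF formula, and the function names (`tokenize`, `tfidf_vectors`, `cosine`, `score_descriptions`) are illustrative choices, not details taken from ASID.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric tokens (assumed scheme)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build one sparse TF/IDF vector per document of the corpus."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Assumption: smoothed IDF, so words found in every document keep a
        # small weight instead of vanishing entirely.
        vectors.append({
            tok: count * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, count in tf.items()
        })
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_descriptions(source_desc: str, target_descs: Dict[str, str]) -> Dict[str, float]:
    """Score one source description against every candidate target description."""
    names = list(target_descs)
    # The corpus is the source description followed by the candidate targets.
    vectors = tfidf_vectors([source_desc] + [target_descs[n] for n in names])
    return {name: cosine(vectors[0], vec) for name, vec in zip(names, vectors[1:])}
```

For the instance-based variant, the same functions would apply if each "document" were the concatenated data values of an attribute rather than its description.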
Given that in our implementation the weak matchers are data instance-based, the module needs to access and use data instances. The two methods used in the current ASID implementation are described in more detail in the following paragraphs.

1) Naïve Bayes classifier

All available data from one schema are fed into a simple Naïve Bayes classifier; after the learning phase is complete, the sample data of the target attribute are classified. The total score for all attributes is normalized in [0, 1] to be compliant in range with the previously computed scores. The formula by which the normalization is achieved is a compromise between robustness and ease of implementation. The transformation formula used is

$$s_i' = \frac{s_i - s_{min}}{s_{max} - s_{min}}$$

where $s_i$ is the individual score of an attribute match and $s_{min}$ and $s_{max}$ are, respectively, the global minimum and maximum scores of said attribute. The intuition behind this is that the best score for an attribute will be transformed to 1 while the lowest will be 0. It is important to note that this procedure gives a boost to the best score that may lead to the production of false positives. When faced with the dilemma of producing some extra false positives or risking the loss of some good matches, we opted for the first choice, the rationale being that, in large-scale schema matching, a false positive is easier for a human to dismiss than it is for her to discover a missed true match. The same formula is used for the normalization of scores of the second weak matcher, described below.

2) TF-IDF matching

This matcher is inspired by the WHIRL system, which extends relational databases to reason about the similarity of text-valued fields using information-retrieval technology [6]. We use only the similarity function used in WHIRL to create a matcher for ASID. Specifically, data is inserted in the matcher and a collection of documents is created where each document consists of all the instances of any available target attribute.
A source document is created from the data available for the source attribute. We use TF/IDF weighting to compute vector similarity scores between the source document and the target documents in the collection. The more similar the source document is to a target document, the higher the score. The last step is to normalize the results using the formula described above.

D. Prediction Combiner

This is again a simple voting prediction combiner. All available scores (normalized in [0, 1]) are added and the sum is divided by their number. In the current implementation, the maximum number of scores is four, assuming the availability of data and descriptions. If a score is above 0.5, then we have a possible match. The matching couple is extracted from the pool of available attributes.

IV. EXPERIMENTS AND EVALUATION OF THE ASID SYSTEM

A. Overview

Two distinct and very different schema collections were used in the experiments. The aim was to see the system perform matches for both small (~30 attributes) and medium (~120 attributes) schemas. For the small schema test we used part of the data used in [2] to evaluate the system. The only transformation applied was in the way the data was made available to the system. For the medium schema experiments, we used real schemas from the Greek Cadastre. The Cadastre is a large project, many years in development, that involves a large and changing number of subcontractors and has already had more than one restart. The result is a plethora of different schemas, developed independently but designed to address the same needs. The quality of the available data for each schema varies as well. These schemas have up to 120 attributes, making them among the largest schemas used in the literature for the evaluation of relational schema matching techniques.

Experiments were performed on a Pentium IV 3.0GHz running Windows XP.

B. Small schema tests

The small schema dataset was kindly provided by AnHai Doan and is the relational dataset used in [2]. We first tested the performance of the ASID system, with an eye towards our two research questions. The input schemas ranged from 20 to 28 attributes. The correspondences between the attributes were sometimes complex (rather than one-to-one), although that was not often the case. The following table contains information on the four schemas used and the experiments each one participated in, as numbered in Figure 4 and Figure 5. Because the schemas were used as provided, with no further processing, there was no significant overlap among data instances except where such an overlap may be attributed to chance or natural language (e.g., common words in titles of books, or fake numbers), so instance-level techniques did not help much.

Schema number | Total attributes | Data instances | Experiments as source | Experiments as target
[...] | [...] | [...],45 KB | 1, 8, 11 | 4, 5, 6
[...] | [...] | [...],03 KB | 2, 5, 12 | 7, 8, 9
[...] | [...] | [...],48 KB | 3, 6, 9 | 10, 11, 12
[...] | [...] | [...],02 KB | 4, 7, 10 | 1, 2, 3

TABLE 1. SMALL SCHEMA INFORMATION

The system, using only 4 simple matchers in the two-phase configuration, produced more than 80% of the true matches. The following figure shows the performance of the system in each of the 12 experiments conducted. The experiments validate the hypothesis that using a one-phase system would significantly degrade performance. They also show a (small) improvement due to the addition of the weak matchers. In the following figure and in FIGURE 6, "Common" refers to the number of matching attributes in the two schemas as determined by human experts.
Moreover, ASID performance in symmetric test cases (for example 1 and 4, or 8 and 5) was quite similar, showing the system's robustness to the choice of source and target schemas.

[Chart series: Common, One Phase Matching, Only Strong Matchers, ASID]
FIGURE 4. NUMBER OF TRUE POSITIVES ON SMALL SCHEMAS

The accuracy achieved by ASID in the same experiments is summarized in the following figure.

FIGURE 5. SMALL SCHEMAS ASID ACCURACY (%)

C. Medium schema tests

The experiments were conducted using schemas created during the initial phase of the implementation of the Greek Cadastre. The schemas were designed to be used during the data collection processes. The data fed into the system were the distinct values of more than [...] tuples in each case. All schemas and data are real. To our knowledge, this is the first time results are published of a system evaluation against such a real schema set.

Our experiments measured first the accuracy of ASID schema matching. Given the existence of a large corpus of instance-level data of varying quality, we also tested the impact of dirty data on the quality of the matches. The following table describes the schemas used in the experiments.

Schema number | Total attributes | Data instances | Experiments as source | Experiments as target
[...] | [...] | [...],39 MB | 1, 2 | 3, 5
[...] | [...] | [...],41 MB | 3, 4 | 1, 6
[...] | [...] | [...],29 MB | 5, 6 | 2, 4

TABLE 2. MEDIUM SCHEMA INFORMATION

The results of the experiment show that, despite the increased complexity of real-world schemas, accuracy remains high at ~78%. The small reduction compared to the small schema experiments shows that ASID is robust with regard to increased schema size. Also notable is that the system remains robust to the choice of source and target schemas.

[Chart series: Common, ASID]
FIGURE 6. NUMBER OF TRUE POSITIVES ON MEDIUM SCHEMAS

FIGURE 7. MEDIUM SCHEMAS ASID ACCURACY (%)

To evaluate the benefit of two-phase matching, and the negative impact of adding the weak matchers in a single-phase composite matching system, we report the results of the experiments from FIGURE 6 when performed with a) a single-phase combination of all matchers and b) only the strong matchers. The results show that ASID outperforms the alternative strategies, as illustrated in FIGURE 8. Although the weak matchers contribute when used in the ASID system, especially when there are enough data available to train them, they often degrade the quality of the produced matches when one-phase matching is performed. Notice that this is despite the availability of large data sets for the instance-based matchers.

[Chart series: Common Attributes, One Phase Matching, ASID, Strong Only Matchers]
FIGURE 8. ASID'S PERFORMANCE COMPARED TO ALTERNATIVE STRATEGIES

To evaluate the impact of dirty data, we identified two experiments, namely experiments 1 and 3, where clean data with considerable overlap were available. In these cases ASID achieves over 90% accuracy. Excluding these two experiments, the average accuracy is 73%. While this experiment is not the last word on the subject, it does provide evidence for the effect of dirty data on the quality of schema matches and for the benefits of data cleaning.

D. Time and memory performance

An additional goal of our work is to demonstrate efficient and effective schema matching performance on run-of-the-mill hardware. In all small-schema experiments the matching was complete in under a minute, and the memory footprint of the system remained under 22MB. The most important factor for performance was the size of the data instances used in the second phase of the matching. The number of attributes is not as important: the most resource-intensive part of the system is the training of the instance-based matchers.
Maximum memory usage in the medium schema experiments was between 58 and 136 MB, and maximum CPU time was between 810 and 2012 seconds, as reported by Windows. The large data instances are the main reason for the increased resource consumption and time. These results confirm the feasibility of matching relatively large schemas (up to 120 attributes) using few computational resources in reasonable time.

V. RELATED WORK

In this section, we briefly examine five important schema matching systems that influenced the design of ASID: iMAP [11], LSD [5], COMA++ [9] [17], Similarity Flooding [12], and MKB (corpus-based) [2]. There are a number of other schema matching systems and techniques, e.g., [7] [8] [19], designed to address different aspects of the schema matching problem. The above five are the most relevant to ASID, as they use a wide variety of relevant approaches.

A. iMAP

The main goal of this system is to discover complex matches. A typical example is address = concat(city, state). The hardest problem to solve in complex matching is the unlimited number of possible matches one has to examine. To address this problem, the iMAP system treats the generation of complex matches as a search in the space of possible matches. The system achieves 43-92% success in matching attributes (1-to-1 or complex). This range shows that the system works but is not as stable as one would want for industrial schema matching.

B. Similarity Flooding

This is a system based on a graph matching algorithm also called Similarity Flooding (SF). The aim of the algorithm/system is to allow quick development of matchers for a broad spectrum of different scenarios. The system may not outperform custom matchers that are highly tuned for a specific domain. The system is based on the conversion of the matched schemas to directed labelled graphs and the application of a fixpoint computation to extract matches.
The strength of the algorithm (and also its weakness) is its lack of knowledge about the attributes being matched. This has the side effect of producing out-of-context matches that need to be filtered out.

C. COMA++

The basic idea behind COMA++ is to create a true composite system. This is a generic system, designed to be adaptable to many matching problems, achieving notable results not only in data integration but in ontology matching as well [10].

D. LSD

The LSD system was proposed in 2003 as a vehicle to demonstrate the idea of a multi-strategy learning approach in schema matching. While results show that LSD achieves

92% accuracy across the four domains mentioned in the papers, the LSD approach mandates the existence of a meta-learner and demands training on target schemata, something that is often not possible.

E. MKB

The MKB approach is based on the observation that schema matching tasks are often repetitive. While this is not true in most cases of enterprise integration, it is often the case when integrating web-based sources. The main component of the system is a training module that trains classifiers for each attribute classified by the system in the past. When a new element arrives, those trained classifiers are applied to it and the degree of similarity of the new element to the old ones is calculated. The system shows high accuracy, whether used autonomously or as an additional source of information for schema matching when past mappings are available. As mentioned, it requires a large amount of matched schemata, something not likely to be available in industrial RDBMS deployments.

VI. CONCLUSIONS

Even though there is a significant body of work on schema matching, some of the more recent and most successful approaches are geared more towards XML and web data integration than enterprise schema matching; they make unrealistic assumptions regarding the availability of training schemas and data instances, and disregard the existence of documentation. We describe ASID, a novel schema-matching system that displays robust accuracy and performance in real-life relational schema matching tasks without requiring matched schemas for training. ASID helps us evaluate the hypothesis that, in the absence of learning prediction combiners, single-phase composite schema matchers can be degraded by the inclusion of unreliable, weak schema matchers. The ASID solution is to prioritize known reliable matchers in a first phase of matching, and to use the less reliable or more instance-based matchers only in a second matching phase.
It also employs a simple matcher that exploits existing schema documentation for matching. Our experiments on both small and relatively large (120 attributes) real-life schemas validate our hypothesis and show the benefits of our solution, including the documentation-based matcher. In particular, they show high accuracy of attribute matches for ASID, and significantly lower accuracy for a single-phase schema matcher that combines all of ASID's individual attribute matchers. An additional contribution of ASID is a data cleaning step to enhance the performance of instance-based matchers; preliminary results confirm the sensitivity of instance-based matchers to dirty data and the benefits of feeding them with clean data.

ACKNOWLEDGMENT

We would like to thank the authors of [2] for providing us with the relational schema corpus used in their experiments. The second author is supported by a Marie Curie Outgoing International Fellowship. Research support by the PYTHAGORAS EPEAEK programme funded by the EU and the Greek Ministry of Education is gratefully acknowledged.

REFERENCES

[1] E. Rahm and P. Bernstein. A survey of approaches to automatic schema matching. VLDB Journal, 10(4), 2001.
[2] J. Madhavan, P. A. Bernstein, A. Doan, A. Y. Halevy. Corpus-based Schema Matching. Proceedings of ICDE Conf., 2005.
[3] P. Shvaiko and J. Euzenat. A Survey of Schema-Based Matching Approaches. In Journal on Data Semantics IV, LNCS 3730.
[4] H. Do, S. Melnik, and E. Rahm. Comparison of schema matching evaluations. In Proceedings of the 2nd Int. Workshop on Web Databases (German Informatics Society).
[5] A. Doan, P. Domingos, and A. Halevy. Learning to match the schemas of data sources: A multistrategy approach. Machine Learning, 50.
[6] W. Cohen and H. Hirsh. Joins that generalize: Text classification using WHIRL. In Proc. of the Fourth Int. Conf. on Knowledge Discovery and Data Mining (KDD).
[7] J. Madhavan, P. A. Bernstein, E. Rahm. Generic Schema Matching using Cupid. VLDB.
[8] H. H. Do and E.
Rahm. COMA - A System for Flexible Combination of Schema Matching Approaches. Proc. Intl. Conf. Very Large Databases (VLDB), 2002.
[9] D. Aumüller, H. H. Do, S. Massmann and E. Rahm. Schema and Ontology Matching with COMA++ (Software Demonstration). Proc. 24. ACM SIGMOD Intl. Conf. Management of Data, 2005.
[10] S. Massmann, D. Engmann, E. Rahm. COMA++: Results for the Ontology Alignment Contest OAEI 2006. International Workshop on Ontology Matching, collocated with the 5th ISWC-2006, Athens, Georgia, USA.
[11] R. Dhamankar, Y. Lee, A. Doan, A. Halevy, and P. Domingos. iMAP: Discovering complex matches between database schemas. In Proc. of the ACM SIGMOD Conf.
[12] S. Melnik, H. Garcia-Molina, E. Rahm. Similarity flooding - a versatile graph matching algorithm. In Proc. 18th Int. Conf. Data Eng., 2002.
[13] W. Li, C. Clifton. SemInt: a tool for identifying attribute correspondences in heterogeneous databases using neural network. Data Knowl. Eng. 33(1):49-84, 2000.
[14] M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive Name Matching in Information Integration. IEEE Intelligent Systems, vol. 18, no. 5, 2003.
[15] W. Li, C. Clifton. SemInt: a tool for identifying attribute correspondences in heterogeneous databases using neural network. Data Knowl. Eng. 33(1):49-84, 2000.
[16] V. Ventrone, S. Heiler. Some advice for dealing with semantic heterogeneity in federated database systems. In Proceedings of the Database Colloquium, San Diego, August 1994, Armed Forces Communications and Electronics Assc. (AFCEA).
[17] H. H. Do, E. Rahm. Matching large schemas: Approaches and evaluation. Information Systems 32(6), 2007.
[18] A. Doan and A. Halevy. Semantic-integration research in the database community: A Brief Survey. AI Magazine, 26-1, 2005.
[19] P. Mork, A. Rosenthal, L. Seligman, J. Korb, and K. Samuel. Integration Workbench: Integrating Schema Integration Tools. The MITRE Corporation, Case # [...], May 2006.
[20] N.
Bosovic. The ASID schema matching system. MSc Thesis, Department of Informatics, AUEB, 2007.
[21] J. Berlin, A. Motro. Database Schema Matching Using Machine Learning with Feature Selection. In A. Banks Pidduck et al. (Eds.): CAiSE 2002, LNCS 2348.
