Two-Phase Schema Matching in Real World Relational Databases


Nikolaos Bozovic #1, Vasilis Vassalos #2
# Department of Informatics, Athens University of Economics and Business, Greece
1 nbosovic@gmail.com
2 vassalos@aueb.gr

Abstract

We propose a new approach to the problem of schema matching in relational databases that merges the hybrid and composite approaches to combining multiple individual matching techniques. In particular, we propose assigning individual matchers to two categories: strong matchers, which provide a priori higher-quality matches, and weak matchers, which may be more sensitive to the inputs and are less reliable but can still help generate some matches. Matching is correspondingly done in two phases: strong matches are produced by combining the strong matchers with a simple voting combiner, and weak matchers then provide additional evidence for attributes left unmatched (again using a voting combiner). We observe that, while many recent advances in schema matching [2] [5] [7] [11] use composite schema matching and rely on the existence of training schemas to train combiners, in many real-world situations it is not feasible to employ learning techniques because training data (i.e., schemas or instance data) are unavailable. We hypothesize that weak matchers can often hurt overall accuracy if used in a single-phase composite matcher that does not employ learning techniques. We implement our two-phase approach in the ASID system and evaluate it using real-life schemas. The experiments validate our hypothesis regarding the negative effect of weak matchers and also show that ASID performs comparably to state-of-the-art systems while requiring no training schemas. We also demonstrate the benefits of a simple documentation-based matcher. Our experimental data included schemas ranging from 20 to 120 attributes. Note that schemas with 120 attributes are as large as or larger than those used in other published evaluations of relational schema matching.

I. INTRODUCTION

Continued high interest in information integration has led to a large body of research work on schema matching (e.g., [1] [2] [3]). Schema matching is a process that takes two (or more) schemas as input and produces a mapping between pairs of schema elements that correspond semantically to each other [1]. These numerous research efforts have in turn led to work devoted to the evaluation and comparison of proposed solutions [1] [2], as well as evaluations of the evaluation techniques [4]. At the same time, the focus of recent research and method evaluation has been on internet and semistructured data and ontology matching, even as the challenges of relational schema matching have not been fully addressed.

Recently, more and more approaches [2] [5] [16] [15] use machine learning to help produce matches. Learning techniques ranging from simple Naïve Bayes classifiers [21] to custom classifiers to neural networks have been used for matching at both the attribute and instance levels. While learning has been shown to provide significant advantages, there are many cases where this approach is not feasible. In particular, neither large quantities of data nor many training schemas may be available. While it is true that large database systems contain more than enough instances to train instance-based schema learners, these data may not be accessible. If, as is often the case, the owners of the databases that need schema matching are governmental agencies storing sensitive population data, data instances may not be available for schema matching due to regulatory restrictions. Moreover, schema matching is often performed on systems not yet operational, where no data exists yet [4].

The availability of data is only one problem. Quality is the other. As every database administrator knows, all database systems contain inconsistent data even if every precaution is taken to avoid them.
A question that remains unanswered is the impact that this has on the learning and/or matching phase, as most existing work dismisses the potential impact of dirty data.

Techniques that require training on schemas are even less suitable for common enterprise schema matching tasks: more often than not, the matching of relational schemas in an enterprise is done between 2 or 3 schemas, for example during mergers, or when upgrading or replacing an existing system. In these common cases, there are no schemas to use for training. Schemas may be available for training when integrating large numbers of databases, e.g. on the Web, but even then, the work required to match every attribute of enough schemas to train our algorithms may be disproportionate to the usefulness of the final result. GTE Telephone Operations, based on work done for 40 applications, estimates an average of four hours per data element to extract and document data semantics when the task must be performed by someone other than the data owner [1]. This shows that approaches that need training schemas may be good for hundreds of small schemas such as those we find in internet applications, but not for the large schemas that are in use in today's large organisations.

In environments where training combiners and using other learning techniques is not feasible, systems such as COMA++ [17] and Cupid [7] can be used, but a number of challenges remain. We try to answer the following two questions in this paper.

/08/$ IEEE 290 ICDE Workshop 2008

First, is it always better to use more individual matchers in a single-phase composite system when the match combiner is not a learning-based one? Or is it better to prioritize them appropriately (as we would in a hybrid system), while still using a composite, vote-based combiner in each phase? The two-phase process first performs the matching using a set of strong matchers, i.e., matchers that are known (e.g., from existing literature) to provide more reliable matches with less variance, to determine the easy matches, which are output directly. It then employs the help of weaker matchers, e.g., instance-level matchers dependent on the amount of available data, to discover more matches or provide evidence for low-confidence matches discovered in the first phase. The hypothesis is that, while every matcher (and all of them together) can potentially help discover a match, using all matchers in a single-phase matching process with a non-learning-based combiner could seriously degrade the quality of the matches due to the sometimes erratic behaviour of the weak matchers.

The second question we try to address concerns the efficacy of a documentation-based matcher. Many recent research works downplay the usefulness of documentation, a representative remark being: "Extracting semantics information from data creators and documentation is often extremely cumbersome. [...] Documentation tends to be sketchy, incorrect, and outdated." [18] While that is correct for web forms or XML data, it is an overstatement when referring to relational databases. Large systems usually are documented, and in most cases this documentation is up to date and correct, as large organizations often take seriously the correctness and completeness of the documentation they receive with the systems they buy or develop. Apart from personal experience, this is corroborated by other sources: out of 265 conceptual (ER) models from the U.S. Department of Defence metadata registry that contain 13,049 elements (entities or relationships) and 163,736 attributes, the vast majority are documented, albeit with a single sentence [19]. This information is a valuable tool for schema matching, as we will demonstrate. As with most sources of information in schema matching, there is more than one way to exploit documentation. We use a simple technique, similar to one used in [2], to match textual descriptions of attributes, and experimentally show the robustness of this technique.

To explore the benefits of two-phase matching and to implement our documentation-based attribute matcher, we built the ASID schema-matching system. ASID (Another Schema Integration Daemon) uses variations of existing attribute matching techniques and methods (no structural matching), to make it simpler to isolate the impact of two-phase matching and to enable more or less direct comparison of its performance to existing schema matching systems.

In the following sections, we first provide an architectural overview of ASID, then in Section III we describe in more detail the two phases of matching, including the matchers used in each phase, and in Section IV we present an experimental study. Related work is presented in Section V.

II. THE ASID SYSTEM

A. A System that ensures matcher robustness

Most contemporary systems are composite systems; that is, more than one method is used to produce schema matches, and their results are combined to produce the final matches. Composite systems have been shown to increase matching accuracy, but an important question remains: as some matching techniques are less reliable than others, can they unbalance a composite matching system? In ASID, we propose separating the available matching techniques into strong and weak ones. This distinction can be based on existing research results and trial-and-error experimentation.
Strong techniques are those that have consistently shown good results in schema matching and are less likely to be affected by differences in domains, implementation details and/or the size and quality of instance data. Other techniques not exhibiting such desired behaviour are classified as "weak": they can be used and be helpful, but their credibility is less certain.

In ASID, matching is performed in two phases. Strong techniques take precedence: the first-phase combiner uses their results to produce plausible matches that are immediately output to the user. If there are unmatched attributes or the score is too low for some match, these attributes are passed to the second-phase matching, where the weak techniques are also used. In the second phase, the results from all techniques are combined. This arrangement ensures that matches are not unbalanced by weak matching techniques, but that the system can still benefit from the extra matching ability they provide.

FIGURE 1. THE ASID SYSTEM
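The two-phase flow described above can be illustrated with a minimal Python sketch. It is not the ASID implementation: the function names (`combine`, `two_phase_match`), the greedy best-first acceptance of pairs, and the modelling of matchers as plain functions returning scores in [0, 1] are assumptions made for illustration. The averaging vote and the 0.5 threshold follow the description of the combiners in Section III.

```python
from itertools import product
from statistics import mean


def combine(pairs, matchers, threshold):
    """Voting combiner: average the matcher scores of each candidate pair and
    accept pairs whose combined score exceeds the threshold.
    Greedy best-first acceptance is an assumption of this sketch."""
    scored = sorted(
        ((mean(m(s, t) for m in matchers), s, t) for s, t in pairs),
        key=lambda item: item[0],
        reverse=True,
    )
    accepted, used_s, used_t = {}, set(), set()
    for score, s, t in scored:
        if score > threshold and s not in used_s and t not in used_t:
            accepted[s] = (t, score)
            used_s.add(s)
            used_t.add(t)
    return accepted


def two_phase_match(source, target, strong, weak, threshold=0.5):
    """Return (phase-1 matches, phase-2 matches) as dicts source -> (target, score)."""
    # Phase 1: only the strong matchers vote.
    phase1 = combine(product(source, target), list(strong), threshold)
    matched_targets = {t for t, _ in phase1.values()}
    rest_s = [s for s in source if s not in phase1]
    rest_t = [t for t in target if t not in matched_targets]
    # Phase 2: strong and weak matchers vote together on the leftovers.
    phase2 = combine(product(rest_s, rest_t), list(strong) + list(weak), threshold)
    return phase1, phase2


# Hypothetical toy matchers backed by lookup tables (illustration only).
def table(scores):
    return lambda s, t: scores.get((s, t), 0.0)


strong = [table({("id", "ID"): 1.0, ("owner", "holder"): 0.3}),
          table({("id", "ID"): 0.8, ("owner", "holder"): 0.5})]
weak = [table({("owner", "holder"): 0.9}),
        table({("owner", "holder"): 0.8})]
phase1, phase2 = two_phase_match(["id", "owner"], ["ID", "holder"], strong, weak)
print(phase1, phase2)
```

In this toy run the pair (id, ID) is accepted in the first phase on the strength of the strong matchers alone, while (owner, holder) falls below the threshold there and is accepted only in the second phase, once the weak matchers add their votes.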

For the prototype implementation we created four matchers, two strong ones and two weak ones. In each phase, the combiner is simply a voting one: a match candidate is considered a match when its combined score is above a certain threshold. In the second phase, the weak methods help produce more high-quality matches by taking the unmatched results of the strong phase and possibly raising their scores above the threshold level.

B. A word on data cleaning

As any database administrator knows, dirty data are present in every database containing a non-trivial amount of data. Dirty data are more common than most of the schema matching research acknowledges, and their impact on instance-level schema matching performance is unknown. ASID incorporates a simple data cleaning step before loading instance-level data. In the current implementation, the cleaning process is manual. The impact of data cleaning is only briefly addressed in this paper (see Section IV.C); for more information please see [20].

III. ASID SYSTEM MODULES

FIGURE 2. ASID 2-PHASE MATCHING (components shown: Loader, Strong Matching, Primary Evaluator, Data Cleaner, Relevant Instance Selector, Weak Matching, Prediction Combiner, Final Matches)

A. Strong Matching Module

The internal representation of two schemata (named source and target, according to their role) is fed into the system, and all available techniques are used independently to match each attribute of the source schema to the corresponding attributes of the target schema.

The current ASID implementation includes two matching techniques in this module: one based on string similarity between the names of the attributes and one based on TF/IDF-weighted cosine similarity on the descriptions of the attributes. The system may of course use additional techniques, or replace the existing ones. In principle, schema-based, rather than instance-based, techniques seem to be more appropriate for the strong matching module. More details on the techniques used are given in the following paragraphs.

1) Name Matching

Name matching is a string matching technique between the names of the attributes, used extensively in schema matching. We use the Jaro metric for this task, as recent research [14] has shown that it is one of the best algorithms for name matching. The Jaro metric is based on the number and order of the common characters between two strings. Given strings $s = a_1 \dots a_K$ and $t = b_1 \dots b_L$, define a character $a_i$ in $s$ to be common with $t$ if there is a $b_j = a_i$ in $t$ such that $i - H \le j \le i + H$, where $H = \min(|s|, |t|)/2$. Let $s' = a'_1 \dots a'_{K'}$ be the characters in $s$ which are common with $t$ (in the same order they appear in $s$) and let $t' = b'_1 \dots b'_{L'}$ be defined the same way for $t$. Now define a transposition for $s'$, $t'$ to be a position $i$ such that $a'_i \ne b'_i$. Let $T_{s',t'}$ be half the number of transpositions for $s'$ and $t'$. The Jaro similarity metric for $s$ and $t$ is

$$\mathrm{Jaro}(s, t) = \frac{1}{3}\left(\frac{|s'|}{|s|} + \frac{|t'|}{|t|} + \frac{|s'| - T_{s',t'}}{|s'|}\right)$$

It is out of the scope of this text to further analyse the algorithm used. For more information, the reader can consult [14]. In our implementation, scores range in [0, 1], with 1 being a total match.

2) Attribute Description Matching

This method tries to exploit simple forms of documentation that often exist in relational systems used in organizations. In most RDBMS deployments, each attribute has a description written in natural language that describes its contents, usually in a sentence or two.

FIGURE 3. CREATING THE CORPUS

Our matching method is a simple one: for each attribute of the source schema, we create a text corpus made of its description string and the description strings of all the attributes of the target schema that are possible matches. Then we use TF/IDF weighting to compute vector similarity between the first member of the corpus and the remaining ones: the more similar the description, the higher the score. This method ensures that words appearing many times in the corpus, like "table", are not overly important to the computation of similarity, so that the truly important words dominate the result. Preliminary experiments with this method show very good results: more than 70% true matches when all attributes have descriptions. Further improvement can be obtained by using dictionaries and synonym tables to aid the identification of descriptions with different wordings.

B. Primary Evaluator

This is a simple match combiner that uses voting to compute an overall score for each possible match. All scores available from the strong matching techniques are added and the sum is divided by their number (currently 2). If after the total evaluation a match is deemed highly probable (with a combined score greater than 0.5), then the evaluator produces a good match, presenting it to the user and removing the matched pair from the source and target schemata. The difference from existing composite systems is that candidate matches failing to meet the threshold criterion are redirected to the weak matching phase in the hope of finding a better match with its help. It is important to note that all candidate matches that were not accepted proceed to be re-evaluated. We also tried promoting only the attribute pairs with the highest scores to the second phase, but experiments showed a comparatively larger number of false positives. Hence all non-matched attribute pairs proceed to the next phase.

C. Weak Matching Module

The weak matching module is almost identical in structure to the strong one, with one obvious difference: it uses different matchers.
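Before the weak matchers are detailed, the two strong matchers of Section III.A can be illustrated with a minimal Python sketch. Several choices here are assumptions rather than details taken from ASID: the function names, the one-to-one greedy pairing of common characters in `jaro` (with the window $H = \min(|s|, |t|)/2$ rounded down), whitespace tokenization, raw term counts, and the idf variant $\log(N/\mathrm{df})$ in `description_scores`.

```python
import math
from collections import Counter


def jaro(s, t):
    """Jaro similarity following the definition in Section III.A.1."""
    if not s or not t:
        return 0.0
    h = min(len(s), len(t)) // 2           # search window H
    used = [False] * len(t)
    s_common = []
    for i, ch in enumerate(s):
        for j in range(max(0, i - h), min(len(t), i + h + 1)):
            if not used[j] and t[j] == ch:  # each b_j is paired at most once
                used[j] = True
                s_common.append(ch)
                break
    if not s_common:
        return 0.0
    t_common = [t[j] for j in range(len(t)) if used[j]]
    m = len(s_common)
    # T is half the number of positions where s' and t' disagree.
    half_transpositions = sum(a != b for a, b in zip(s_common, t_common)) / 2
    return (m / len(s) + m / len(t) + (m - half_transpositions) / m) / 3


def description_scores(source_desc, target_descs):
    """TF/IDF-weighted cosine similarity between the source description
    (first member of the corpus) and each target description."""
    docs = [d.lower().split() for d in [source_desc, *target_descs]]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))

    def vector(doc):
        # Words present in every document get weight log(1) = 0.
        return {w: c * math.log(n / df[w]) for w, c in Counter(doc).items()}

    def cosine(u, v):
        dot = sum(x * v.get(w, 0.0) for w, x in u.items())
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    vectors = [vector(d) for d in docs]
    return [cosine(vectors[0], v) for v in vectors[1:]]


print(jaro("MARTHA", "MARHTA"))
print(description_scores("name of the parcel owner",
                         ["owner name of the land parcel",
                          "date the record was created"]))
```

In the second call, the word "the" occurs in every description and therefore receives zero weight, so the unrelated second target scores 0 while the first target, which shares the informative words, scores highest.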
Given that in our implementation the weak matchers are data instance-based, the module needs to access and use data instances. The two methods used in the current ASID implementation are described in more detail in the following paragraphs.

1) Naïve Bayes classifier

All available data from one schema are fed into a simple Naïve Bayes classifier; after the learning phase is complete, the sample data of the target attribute are classified. The total score for all attributes is normalized to [0, 1] so that it is in the same range as the previously computed scores. The formula by which the normalization is achieved is a compromise between robustness and ease of implementation. The transformation formula used is

$$s_i' = \frac{s_i - s_{min}}{s_{max} - s_{min}}$$

where $s_i$ is the individual score of an attribute match and $s_{min}$ and $s_{max}$ are, respectively, the global minimum and maximum scores of said attribute. The intuition behind this is that the best score for an attribute will be transformed to 1 while the lowest will be transformed to 0. It is important to note that this procedure gives a boost to the best score that may lead to the production of false positives. When faced with the dilemma of producing some extra false positives or risking the loss of some good matches, we opted for the first choice, the rationale being that, in large-scale schema matching, a false positive is easier for a human to dismiss than it is for her to discover a missed true match. The same formula is used for the normalization of the scores of the second weak matcher, described below.

2) TF-IDF matching

This matcher is inspired by the WHIRL system, which extends relational databases to reason about the similarity of text-valued fields using information-retrieval technology [6]. We use only the similarity function used in WHIRL to create a matcher for ASID. Specifically, data is inserted in the matcher and a collection of documents is created, where each document consists of all the instances of any available target attribute.
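The normalization formula given above for the weak matchers can be illustrated with a short Python sketch. The function name and the handling of the degenerate case where all scores are equal (mapped to 0 here) are assumptions of this sketch, not details from ASID.

```python
def min_max_normalize(scores):
    """Map raw matcher scores for one attribute to [0, 1]:
    s_i' = (s_i - s_min) / (s_max - s_min)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: with no spread, no candidate stands out.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Raw scores that are all low and nearly identical.
print(min_max_normalize([0.10, 0.12, 0.11]))
```

The example makes the boost discussed above visible: the best raw score is mapped to 1 even though it is only marginally higher than the others, which is why the normalization can produce false positives.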
A source document is created from the data available for the source attribute. We use TF/IDF weighting to compute vector similarity scores between the source document and the target documents in the collection. The more similar the source document is to a target document, the higher the score. The last step is to normalize the results using the formula described above.

D. Prediction Combiner

This is again a simple voting prediction combiner. All available scores (normalized in [0, 1]) are added and the sum is divided by their number. In the current implementation, the maximum number of scores is four, assuming the availability of data and descriptions. If a combined score is above 0.5, then we have a possible match. The matching pair is extracted from the pool of available attributes.

IV. EXPERIMENTS AND EVALUATION OF THE ASID SYSTEM

A. Overview

Two distinct and very different schema collections were used in the experiments. The aim was to see the system perform matches for both small (~30 attributes) and medium (~120 attributes) schemas. For the small schema tests we used part of the data used in [2] to evaluate the system. The only transformation to the available data was in the way data was made available to the system. For the medium schema experiments, we used real schemas from the Greek cadastre. The Cadastre is a large project, many years in development, that involves a large and changing number of subcontractors and has already had more than one restart. The result is that there is a plethora of different schemas, developed independently but designed to address the same needs. The quality of the available data for each schema varies as well. These schemas have up to 120 attributes, making them among the largest schemas used in the literature for the evaluation of relational schema matching techniques.

Experiments were performed on a Pentium IV 3.0GHz running Windows XP.

B. Small schema tests

The small schema dataset was kindly provided by AnHai Doan and is the relational dataset used in [2]. We first tested the performance of the ASID system, with an eye towards our two research questions. The input schemas ranged from 20 to 28 attributes. The correspondences between the attributes were sometimes complex (rather than one-to-one), although that was not often the case. The following table contains information on the four schemas used and the experiments each one participated in, as numbered in Figure 4 and Figure 5. Because the schemas were used as provided with no further processing, there was no significant overlap among data instances except where such an overlap may be attributed to chance or natural language (i.e., common words in titles of books or fake numbers), so instance-level techniques did not help much.

Schema | Total attributes | Data instances | Part of experiment number
 | | ,45 KB | 1, 8, 11 as source; 4, 5, 6 as target
 | | ,03 KB | 2, 5, 12 as source; 7, 8, 9 as target
 | | ,48 KB | 3, 6, 9 as source; 10, 11, 12 as target
 | | ,02 KB | 4, 7, 10 as source; 1, 2, 3 as target

TABLE 1. SMALL SCHEMA INFORMATION

The system, using only 4 simple matchers in the two-phase configuration, produced more than 80% of the true matches. The following figure shows the performance of the system in each of the 12 experiments conducted. The experiments validate the hypothesis that using a one-phase system would significantly degrade performance. They also show a (small) improvement due to the addition of the weak matchers. In the following figure and in FIGURE 6, "Common" refers to the number of matching attributes in the two schemas as determined by human experts.
Moreover, ASID performance in symmetric test cases (for example 1 and 4, or 8 and 5) was very similar, showing the system's robustness to the choice of source and target schemas.

FIGURE 4. NUMBER OF TRUE POSITIVES ON SMALL SCHEMAS (series: Common, One Phase Matching, Only Strong Matchers, ASID)

The accuracy achieved by ASID in the same experiments is summarized in the following figure.

FIGURE 5. SMALL SCHEMAS ASID ACCURACY (%)

C. Medium schema tests

The experiments were conducted using schemas created during the initial phase of the implementation of the Greek cadastre. The schemas were designed to be used during the data collection processes. The data fed into the system were the distinct values of more than tuples in each case. All schemas and data are real. To our knowledge, this is the first time results are published of a system evaluation against such a real schema set.

Our experiments first measured the accuracy of ASID schema matching. Given the existence of a large corpus of instance-level data of varying quality, we also tested the impact of dirty data on the quality of the matches. The following table describes the schemas used in the experiments.

Schema number | Total attributes | Data instances | Part of experiment
 | | ,39 MB | 1, 2 as source; 3, 5 as target
 | | ,41 MB | 3, 4 as source; 1, 6 as target
 | | ,29 MB | 5, 6 as source; 2, 4 as target

TABLE 2. MEDIUM SCHEMA INFORMATION

The results of the experiment show that, despite the increased complexity of real-world schemas, accuracy remains high at ~78%. The small reduction compared to the small schema experiments shows that ASID is robust with regard to increased schema size. Also notable is that the system remains robust to the choice of source and target schemas.

FIGURE 6. NUMBER OF TRUE POSITIVES ON MEDIUM SCHEMAS (series: Common, ASID)

FIGURE 7. MEDIUM SCHEMAS ASID ACCURACY (%)

To evaluate the benefit of two-phase matching, and the negative impact of adding the weak matchers in a single-phase composite matching system, we report the results of the experiments from FIGURE 6 when performed with a) a single-phase combination of all matchers and b) only the strong matchers. Results show that ASID outperforms the alternative strategies, as illustrated in FIGURE 8. Although the weak matchers contribute when used in the ASID system, especially when there are enough data available to train them, they often degrade the quality of the produced matches when one-stage matching is performed. Notice that this is despite the availability of large data sets for the instance-based matchers.

FIGURE 8. ASID'S PERFORMANCE COMPARED TO ALTERNATIVE STRATEGIES (series: Common Attributes, One Phase Matching, ASID, Strong Only Matchers)

To evaluate the impact of dirty data, we identified two experiments, namely experiments 1 and 3, where clean data with considerable overlap were available. In these cases ASID achieves over 90% accuracy. Excluding these two experiments, the average accuracy is 73%. While this experiment is not the last word on the subject, it does provide evidence for the effect of dirty data on the quality of schema matches and the benefits of data cleaning.

D. Time and memory performance

An additional goal of our work is to demonstrate efficient and effective schema matching performance on run-of-the-mill hardware. In all small-schema experiments the matching was complete in under a minute, and the memory footprint of the system remained under 22 MB. The most important factor for performance was the size of the data instances used in the second phase of the matching. The number of attributes is not as important: the most resource-intensive part of the system is the training of the instance-based matchers.
Maximum memory usage in the medium schema experiments was between 58 and 136 MB and maximum CPU time was between 810 and 2012 seconds, as reported by Windows. The large data instances are the main reason for the increased resource consumption and time. These results confirm the feasibility of matching relatively large schemas (up to 120 attributes) using few computational resources in reasonable time.

V. RELATED WORK

In this section, we briefly examine five important schema matching systems that influenced the design of ASID: iMAP [11], LSD [5], COMA++ [9] [17], Similarity Flooding [12], and MKB (corpus-based) [2]. There are a number of other schema matching systems and techniques, e.g., [7] [8] [19], designed to address different aspects of the schema matching problem. The above five are the most relevant to ASID, as they use a wide variety of relevant approaches.

A. iMAP

The main goal of this system is to discover complex matches. A typical example may be that of address = concat(city, state). The hardest problem to solve in complex matching is the unlimited number of possible matches one has to examine. To address this problem, the iMAP system treats the generation of complex matches as a search in the space of possible matches. The system achieves 43-92% success in matching attributes (1-to-1 or complex). This range shows that the system works but is not as stable as one would want for industrial schema matching.

B. Similarity Flooding

This is a system based on a graph matching algorithm also called Similarity Flooding (SF). The aim of the algorithm/system is to allow quick development of matchers for a broad spectrum of different scenarios. The system may not outperform custom matchers that are highly tuned for a specific domain. The system is based on the conversion of the matched schemas to directed labelled graphs and the application of fixpoint computation to extract matches.
The strength of the algorithm (and also its weakness) is its lack of knowledge about the attributes being matched. This has the side effect of producing out-of-context matches that need to be filtered out.

C. COMA++

The basic idea behind COMA++ is to create a true composite system. This is a generic system, designed to be adaptable to many matching problems, achieving notable results not only in data integration but in ontology matching as well [10].

D. LSD

The LSD system was proposed in 2003 as a vehicle to demonstrate the idea of a multi-strategy learning approach in schema matching. While results show that LSD achieves 92% accuracy across the four domains mentioned in the papers, the LSD approach mandates the existence of a meta-learner and demands training on target schemata, something that often is not possible.

E. MKB

The MKB approach is based on the observation that schema matching tasks are often repetitive. While this is not true in most cases of enterprise integration, it is often the case when integrating web-based sources. The main component of the system is a training module that trains classifiers for each attribute classified by the system in the past. When a new element arrives, those trained classifiers are applied to it and the degree of similarity of the new element with the old ones is calculated. The system shows high accuracy, either used autonomously or as an additional source of information for schema matching when past mappings are available. As mentioned, it requires a large amount of matched schemata, something not likely to be available in industrial RDBMS deployments.

VI. CONCLUSIONS

Even though there is a significant body of work on schema matching, some of the more recent and most successful approaches are geared more towards XML and web data integration than enterprise schema matching; they make unrealistic assumptions regarding the availability of training schemas and data instances, and disregard the existence of documentation. We describe ASID, a novel schema-matching system that displays robust accuracy and performance in real-life relational schema matching tasks without requiring matched schemas for training. ASID helps us evaluate the hypothesis that, in the absence of learning prediction combiners, single-phase composite schema matchers can be degraded by the inclusion of unreliable, weak schema matchers. The ASID solution is to prioritize known reliable matchers in a first phase of matching, and only use the less reliable or more instance-based matchers in a second matching phase.
ASID also employs a simple matcher that exploits existing schema documentation for matching. Our experiments on both small and relatively large (120 attributes) real-life schemas validate our hypothesis and show the benefits of our solution, including the documentation-based matcher. In particular, they show high accuracy of attribute matches for ASID, and significantly lower accuracy for a single-phase schema matcher that combines all of ASID's individual attribute matchers. An additional contribution of ASID is a data cleaning step that enhances the performance of instance-based matchers; preliminary results confirm the sensitivity of instance-based matchers to dirty data and the benefits of feeding them clean data.

ACKNOWLEDGMENT

We would like to thank the authors of [2] for providing us with the relational schema corpus used in their experiments. The second author is supported by a Marie Curie Outgoing International Fellowship. Research support by the PYTHAGORAS EPEAEK programme funded by the EU and the Greek Ministry of Education is gratefully acknowledged.

REFERENCES

[1] E. Rahm and P. Bernstein. A survey of approaches to automatic schema matching. VLDB Journal, 10(4), 2001.
[2] J. Madhavan, P. A. Bernstein, A. Doan, and A. Y. Halevy. Corpus-based Schema Matching. Proceedings of ICDE Conf., 2005.
[3] P. Shvaiko and J. Euzenat. A Survey of Schema-Based Matching Approaches. Journal on Data Semantics IV, LNCS 3730.
[4] H. Do, S. Melnik, and E. Rahm. Comparison of schema matching evaluations. In Proceedings of the 2nd Int. Workshop on Web Databases (German Informatics Society).
[5] A. Doan, P. Domingos, and A. Halevy. Learning to match the schemas of data sources: A multistrategy approach. Machine Learning, 50.
[6] W. Cohen and H. Hirsh. Joins that generalize: Text classification using WHIRL. In Proc. of the Fourth Int. Conf. on Knowledge Discovery and Data Mining (KDD).
[7] J. Madhavan, P. A. Bernstein, and E. Rahm. Generic Schema Matching with Cupid. VLDB.
[8] H. H. Do and E. Rahm. COMA - A System for Flexible Combination of Schema Matching Approaches. Proc. Intl. Conf. Very Large Databases (VLDB), 2002.
[9] D. Aumüller, H. H. Do, S. Massmann, and E. Rahm. Schema and Ontology Matching with COMA++ (Software Demonstration). Proc. 24. ACM SIGMOD Intl. Conf. Management of Data, 2005.
[10] S. Massmann, D. Engmann, and E. Rahm. COMA++: Results for the Ontology Alignment Contest OAEI 2006. International Workshop on Ontology Matching, collocated with the 5th ISWC-2006, Athens, Georgia, USA.
[11] R. Dhamankar, Y. Lee, A. Doan, A. Halevy, and P. Domingos. iMAP: Discovering complex matches between database schemas. In Proc. of the ACM SIGMOD Conf.
[12] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding - a versatile graph matching algorithm. In Proc. 18th Int. Conf. Data Eng., 2002.
[13] W. Li and C. Clifton. SemInt: a tool for identifying attribute correspondences in heterogeneous databases using neural network. Data Knowl. Eng., 33(1):49-84, 2000.
[14] M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive Name Matching in Information Integration. IEEE Intelligent Systems, vol. 18, no. 5, 2003.
[15] W. Li and C. Clifton. SemInt: a tool for identifying attribute correspondences in heterogeneous databases using neural network. Data Knowl. Eng., 33(1):49-84, 2000.
[16] V. Ventrone and S. Heiler. Some advice for dealing with semantic heterogeneity in federated database systems. In Proceedings of the Database Colloquium, San Diego, August 1994, Armed Forces Communications and Electronics Assc. (AFCEA).
[17] H. H. Do and E. Rahm. Matching large schemas: Approaches and evaluation. Information Systems, 32(6), 2007.
[18] A. Doan and A. Halevy. Semantic-integration research in the database community: A Brief Survey. AI Magazine, 26(1), 2005.
[19] P. Mork, A. Rosenthal, L. Seligman, J. Korb, and K. Samuel. Integration Workbench: Integrating Schema Integration Tools. The MITRE Corporation, May 2006.
[20] N. Bosovic. The ASID schema matching system. MSc Thesis, Department of Informatics, AUEB, 2007.
[21] J. Berlin and A. Motro. Database Schema Matching Using Machine Learning with Feature Selection. In A. Banks Pidduck et al. (Eds.): CAISE 2002, LNCS 2348.


More information

Still Image Objective Segmentation Evaluation using Ground Truth

Still Image Objective Segmentation Evaluation using Ground Truth 5th COST 276 Workshop (2003), pp. 9 14 B. Kovář, J. Přikryl, and M. Vlček (Editors) Still Image Objective Segmentation Evaluation using Ground Truth V. Mezaris, 1,2 I. Kompatsiaris 2 andm.g.strintzis 1,2

More information

Ontology Matching Techniques: a 3-Tier Classification Framework

Ontology Matching Techniques: a 3-Tier Classification Framework Ontology Matching Techniques: a 3-Tier Classification Framework Nelson K. Y. Leung RMIT International Universtiy, Ho Chi Minh City, Vietnam nelson.leung@rmit.edu.vn Seung Hwan Kang Payap University, Chiang

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

imap: Discovering Complex Semantic Matches between Database Schemas

imap: Discovering Complex Semantic Matches between Database Schemas imap: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan Department of Computer Science University of Illinois, Urbana-Champaign, IL, USA {dhamanka,ylee11,anhai}@cs.uiuc.edu

More information

Performance Analysis of Data Mining Classification Techniques

Performance Analysis of Data Mining Classification Techniques Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal

More information

A Hybrid Face Detection System using combination of Appearance-based and Feature-based methods

A Hybrid Face Detection System using combination of Appearance-based and Feature-based methods IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.5, May 2009 181 A Hybrid Face Detection System using combination of Appearance-based and Feature-based methods Zahra Sadri

More information

Reducing the Cost of Validating Mapping Compositions by Exploiting Semantic Relationships

Reducing the Cost of Validating Mapping Compositions by Exploiting Semantic Relationships Reducing the Cost of Validating Mapping Compositions by Exploiting Semantic Relationships Eduard Dragut 1 and Ramon Lawrence 2 1 Department of Computer Science, University of Illinois at Chicago 2 Department

More information

Improving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries

Improving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,

More information

MERGING BUSINESS VOCABULARIES AND RULES

MERGING BUSINESS VOCABULARIES AND RULES MERGING BUSINESS VOCABULARIES AND RULES Edvinas Sinkevicius Departament of Information Systems Centre of Information System Design Technologies, Kaunas University of Lina Nemuraite Departament of Information

More information