Quality of Matching in large scale scenarios

Sana Sellami: Université de Lyon, CNRS INSA-Lyon, LIRIS, UMR5205, F-69621, France. sana.sellami@insa-lyon.fr
Nabila Benharkat: Université de Lyon, CNRS INSA-Lyon, LIRIS, UMR5205, F-69621, France. nabila.benharkat@insa-lyon.fr
Youssef Amghar: Université de Lyon, CNRS INSA-Lyon, LIRIS, UMR5205, F-69621, France.

ABSTRACT. Matching techniques are becoming a very attractive research topic. With the development and use of a large variety of data (e.g. DB schemas, XML schemas, ontologies) in many domains (e.g. semantic web, e-business), matching techniques are called upon to overcome the challenge of aligning these different data. In this paper, we study the quality of large scale matching systems. We define a Quality of Matching (QoM) that can be used to evaluate large scale matching systems, and we survey the techniques, called optimization techniques, used in existing matching approaches to improve this quality. This domain is very active, and large scale matching still needs considerable advances. We therefore show how quality evaluation can be integrated in our scalable matching system PLASMA.

KEYWORDS: Matching, Quality of Matching (QoM), Large Scale, Optimization techniques.

1. Introduction

Recently, business and scientific applications have witnessed an explosive growth of data. Many databases and information sources are available through the web, covering different domains: semantic Web, deep Web, e-business, biology, digital libraries, etc. In such domains, the data generated are heterogeneous and voluminous; e.g. schemas with several thousand elements are common in e-business applications. The presence of vast heterogeneous collections of data raises one of the greatest challenges in the data integration field. Matching techniques address it by automatically searching for correspondences between these data in order to obtain useful information. Schema matching has attracted considerable interest in both research and practice. Matching is an operation that takes data as input (e.g. XML schemas, ontologies, relational database schemas) and returns the semantic similarity values of their elements. One of the challenges of the matching community is to efficiently search for correspondences between numerous and voluminous schemas. Matching these data at large scale is a laborious process: the standard approach of matching the complete input schemas often leads to performance problems. Various schema matching systems have been developed to solve the problem semi-automatically. Since schema matching is a semi-automatic task, efficient implementations are required to support interactive user feedback. In this context, scalable matching becomes a problem to be solved. Our main motivation is therefore to optimize and improve scalable matching algorithms in terms of efficiency and of the quality of the matching solutions produced. We begin by defining the quality of matching (QoM). We propose a QoM classification in terms of factors and metrics that can be used to evaluate matching systems and to ensure high-quality outcomes for large scale matching.
Moreover, we present in detail the techniques proposed in the literature, in both pairwise and holistic approaches, to improve the quality of matching and to deal with scalability and performance problems. This analysis of state-of-the-art techniques allows us to draw some conclusions and observations about the quality of existing matching systems. Based on these observations, we describe how quality of matching can be integrated in our scalable schema matching system PLASMA (PLatform for LArge Scale MAtching). The goal of our paper is to show the importance of quality of matching in ensuring an efficient scalable matching system. The paper is organized as follows. In section 2, we define Quality of Matching (QoM), propose a classification in terms of metrics and factors to evaluate this quality, and analyse the existing techniques in the literature to improve it. Section 3 presents the role of quality in our scalable matching system PLASMA. Finally, we conclude and discuss future work.

2. Quality of Matching (QoM)

In the large scale context, we define a Quality of Matching (QoM) which represents the reliability and robustness of large scale matching systems. We consider it important to relate the quality aspect to scalable matching techniques: quality assessment helps users obtain the solution best suited to their needs. For us, QoM therefore means an optimization of the large scale matching system. In this section we analyse two main aspects of QoM: how to evaluate the quality of matching, and how to improve it in large scale matching scenarios.

2.1. Quality of matching evaluation

Evaluations of schema matching systems have been studied in depth in (Do et al., 2002), which discusses various aspects (input, output, match quality measures, effort) that contribute to the match quality obtained as the result of an evaluation. The quality concept has been used in several domains as an important phase of evaluation in current information systems. There are a variety of approaches to studying the quality of data in information integration and data search. However, little work tackles the quality aspect of the matching process at large scale. The authors of (Bernstein et al., 2004) test their system taking into account scalability and extensibility criteria. Practically all matchers are evaluated using precision and recall measures. For example, the authors of (Smiljanic et al., 2006) propose to evaluate quality in terms of performance. The performance of a schema matching system consists of efficiency (which expresses how much faster one system performs than another) and effectiveness (expressed through precision and recall). In (Duchateau et al., 2007), the authors propose quality measures of matching using a number of scoring functions.

More specifically, the Quality of Matching (QoM) is based on the use of quality measures to evaluate the matching system. First, we need to identify which quality factors are to be evaluated. The selection of the appropriate quality factors implies the selection of metrics and the implementation of evaluation algorithms that measure and estimate these factors. In this respect, a metric is a specific instrument that can be used to measure a given quality factor. We distinguish between two aspects (figure 1): the factors that influence the quality, and the metrics used to evaluate and measure the quality of the matching techniques.
The factors we propose mainly depend on the quality of the context (input data and the characteristics of the domain) and on the features of the matching systems and algorithms. The metrics (performance, accuracy, scalability, etc.) are defined in terms of the characteristics of the matching process that builds the resulting data from the sources.

Quality factors in large scale matching

The factors that influence quality at large scale are essentially related to the context (input data and domain) and to the matching systems or algorithms. We summarize these quality factors below.

Factors related to the context

Input data: Quality of matching depends on the internal quality of the data (their coherence, completeness, freshness, etc.) and on the confidence placed in the producers of these data. Moreover, the type, representation and structure of the data used (schemas, ontologies, query interfaces, etc.) should be determined, since these characteristics influence the quality of matching.

Domain: Data reside at different sources and are consequently extracted from different domains. Data managed by different sources are typically heterogeneous, and may be incorrect, incomplete or noisy, that is, of poor quality. It is therefore important to determine whether the data sources come from the same domain or from different ones, the characteristics of these domains, etc.

Factors related to the matching systems/algorithms

Techniques: In a context where information is produced by sophisticated algorithms, quality measurement requires detailed knowledge of the process that computes this information. Moreover, the use of these algorithms and techniques (i.e. the type of matchers implemented (schema vs. instance level, element vs. structural level, language vs. constraint based, etc.), auxiliary information, optimization techniques, etc.) can be very expensive.

Runtime performance needs: The quality of matching solutions is measured in terms of how long applications take to run to completion when their tasks are allocated to nodes based on the decisions of matching algorithms. This duration is called the execution time. Efficient matching algorithms must keep execution times to a minimum.

Complexity: The matching problem is an extreme case in terms of size and complexity. The schema matching problem is a combinatorial problem with exponential complexity. This complexity is due to the large number and size of the data (number of schemas/components) and to the expensive computation of semantic similarity (e.g. using auxiliary resources). Consequently, naive matching algorithms are prohibitively inefficient for large schemas, and complexity is a property that affects the quality of matching algorithms.

Human interaction (Wang et al., 2007): The matching operation cannot be entirely automated; it is still largely conducted by hand, in a labor-intensive and error-prone process. Manual matching has now become a key bottleneck in building large-scale information management systems. User or designer input is therefore necessary to generate correct matchings.

Quality metrics in large scale matching

We define the metrics that are used individually in the evaluations of existing large scale matching systems. Our classification (figure 1) could serve as a support for QoM.
We propose the metrics (performance, accuracy, scalability, etc.) in terms of the characteristics of the matching process that builds the resulting data from the sources.

Performance: Performance is measured in terms of efficiency and pertinence.
Efficiency: The time the system needs to solve a matching problem.
Pertinence: Evaluates the relevance of the matching results. This metric can be calculated from precision and recall values (Do et al., 2002). To compare the quality of matchings, a manual matching is established as a reference. The results obtained by the automatic matching are then checked separately against three quality measurements: Precision, Recall and Overall.

Accuracy: Also called Overall, this measure was proposed by (Melnik et al., 2002) specifically for the schema matching context. It accounts for the post-match effort needed, and depends on both the Recall and Precision measures.

Manual effort (Wang et al., 2007): It is very important to specify the kind of manual effort required during the pre-matching and post-matching processes. This metric is a cost metric that estimates the human part of the cost; it is typically measured in person-days or person-months (time spent correcting and improving the matching output).

Scalability: The property of a system to keep functioning correctly as new elements are added. A system whose performance improves after adding hardware, proportionally to the capacity added, is said to be scalable. In our context, an algorithm is said to scale if it remains suitably efficient and practical when applied to large situations (e.g. a large input data set, or a large number of participating nodes in the case of a distributed system). For a given matching algorithm implementation on a given machine, let T(A,S) be the execution time of algorithm A performed on S schemas, and T(A,S') the execution time of algorithm A performed on S' schemas, with S' > S. If the number of schemas increases from S to S' and the efficiency E is conserved, we can define the scalability metric, or time scale, as follows:

\[ \text{time-scale}_E(S, S') = \frac{T(A,S)}{T(A,S')} \]

A value of 1 would mean that the execution time is constant in the number of schemas, which no algorithm in the literature currently achieves; in practice the value of time-scale_E(S,S') is therefore less than 1. A large value means that the algorithm scales well to a large number of schemas, and a small value means poor scalability.

Adaptability (Bharadwaj et al., 2004): Refers to the degree to which adjustments in the practices, processes or structures of systems are possible in response to projected or actual changes in their environment.
This criterion could measure the degree of change that a system can support.

Extensibility: Means that the system has been architected so that its design includes all the hooks and mechanisms needed to expand or enhance it with new capabilities without major changes to the system infrastructure. Matching systems should therefore be extensible with new matching techniques, algorithms, or customized data structures and operators.
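The measurable metrics above can be illustrated with a minimal Python sketch. It assumes the usual definitions from (Do et al., 2002) and (Melnik et al., 2002) for Precision, Recall and Overall, computed against a manual reference matching, and the time-scale ratio T(A,S)/T(A,S') for scalability. The names `MatchCounts`, `missed`, `correct`, `wrong` and `time_scale` are illustrative choices, not part of any existing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchCounts:
    """Counts obtained by comparing automatic matches with a manual reference.

    missed  (A): true correspondences the matcher did not return
    correct (B): true correspondences the matcher returned
    wrong   (C): returned correspondences that are not true
    """
    missed: int
    correct: int
    wrong: int


def precision(c: MatchCounts) -> float:
    """Share of returned correspondences that are true: B / (B + C)."""
    returned = c.correct + c.wrong
    return c.correct / returned if returned else 0.0


def recall(c: MatchCounts) -> float:
    """Share of true correspondences that were returned: B / (A + B)."""
    real = c.missed + c.correct
    return c.correct / real if real else 0.0


def overall(c: MatchCounts) -> float:
    """Post-match effort measure: 1 - (A + C) / (A + B).

    The value is negative when more than half of the returned
    correspondences are wrong.
    """
    real = c.missed + c.correct
    return 1.0 - (c.missed + c.wrong) / real if real else 0.0


def time_scale(time_small: float, time_large: float) -> float:
    """Scalability ratio T(A,S) / T(A,S') for S' > S.

    time_small is the execution time on S schemas and time_large the
    execution time of the same algorithm on S' schemas.
    """
    if time_large <= 0:
        raise ValueError("execution time on the larger input must be positive")
    return time_small / time_large


if __name__ == "__main__":
    counts = MatchCounts(missed=2, correct=8, wrong=2)
    print(precision(counts), recall(counts), overall(counts))
    print(time_scale(10.0, 40.0))
```

A time-scale value close to 1 indicates that the execution time barely grows when schemas are added, while a value close to 0 indicates poor scalability.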

Figure 1. Quality of matching (QoM): factors and metrics

2.2 Quality of matching improvement in large scale

The goal of this section is to analyse the strategies and techniques used in existing matching approaches (pair-wise and holistic) to improve the quality of matching. To this end, we underline the importance and usefulness of these techniques and strategies, called optimization techniques.

QoM improvement in Pair-wise matching

Matching has mainly been approached by finding pair-wise attribute correspondences in order to construct an integrated schema for two sources. Several pair-wise matching approaches over schemas and ontologies have been developed.

Schema matching

As a central process for several research topics such as data integration, data transformation and schema evolution, schema matching (figure 2) has attracted much attention from the research community (Avesani et al, 2005; Bernstein et al, 2004; Do et al, 2007; Lu & Wang, 2005; Smiljanic et al, 2006). We present the main strategies dealing with quality problems such as scalability. These strategies are an effective attempt to resolve the large scale matching problem, and the techniques they use aim at improving the quality of matching:

Fragment-based strategy (Rahm et al, 2004): This is a divide-and-conquer approach which decomposes a large matching problem into smaller sub-problems by matching at the level of schema fragments. This approach has been implemented in the COMA++ matching tool (Do et al, 2007). The fragment-based approach is an effective solution for handling large schemas and improving the performance of matching algorithms.

Extraction of common structures (Lu & Wang, 2005): The main goal of this approach is to extract a disjoint set of the largest approximate common substructures between two trees. This set of common structures represents the most likely matches between substructures in the two schemas. Identifying these structures aims at improving the efficiency of the matching process. However, there is no proof of correctness for this approach.

Clustered schema matching strategy (Smiljanic et al, 2006): This is a technique for improving the efficiency of schema matching by means of clustering. In this approach, matching is performed between a small schema and a schema repository. Clustering is introduced after the generation of matching elements, and is used to quickly identify regions of the schema repository that are likely to include good matchings for the smaller schema. The clustering relies on an adaptation of the k-means algorithm (Xu & Wunsch, 2005). The Bellflower system implements this technique. The improved efficiency, however, comes at the cost of losing some matchings, mostly among those that rank low. In addition, there is no measure of a cluster's quality that can be used to decide which clusters have better chances of producing good matchings.

Figure 2. Pair-wise schema matching

Ontology matching

Ontology matching (figure 3) is a promising solution to the semantic heterogeneity problem. It finds correspondences between semantically related entities of the ontologies. These correspondences can be used for various tasks, such as ontology merging, query answering, data translation, or navigation on the semantic web. Matching ontologies thus enables the knowledge and data expressed in the matched ontologies to interoperate. The increasing awareness of the benefits of ontologies for information processing has led to the creation of a number of large ontologies about real world domains. The size of these ontologies causes serious problems in managing them.
Many approaches (Hu & Qu, 2006; Hu et al, 2006; Qu et al, 2006; Stuckenschmidt & Klein, 2004; Wang et al, 2006a; Wang et al, 2006b) have been proposed in the literature to study the large ontology matching problem. We describe here the existing approaches and techniques that aim to improve the quality of large scale ontology matching:

Partitioning strategy (Hu & Qu, 2006): This strategy was introduced as a method for partition-based block matching that is appropriate for large class hierarchies, one of the most common kinds of large-scale ontologies. The two large class hierarchies are each partitioned a priori into small blocks, based on both structural affinities and linguistic similarities. The matching process is then performed between blocks by combining the two kinds of relatedness found via predefined anchors and virtual documents between them. The partitioning process is based on the ROCK (Robust Clustering Using Links) algorithm (Xu & Wunsch, 2005). However, this approach is not completely applicable to large ontologies, and it partitions the two large class hierarchies separately without considering the correspondences between them. In addition, it only assumes matchings between classes, so it is not a general solution for ontology matching. To cope with large ontology matching, Hu & Qu (2006) then propose a partitioning-based approach to address the block matching problem. Qu et al (2006) consider both linguistic and structural characteristics of domain entities, based on virtual documents, for the relatedness measure. Ontologies are partitioned by a hierarchical bisection algorithm to provide block mappings.

Modularization strategy: Wang et al (2006) propose this approach to deal with large and complex ontologies. The authors propose a Modularization-based Ontology Matching approach (MOM). This is a divide-and-conquer strategy which decomposes a large matching problem into smaller sub-problems by matching at the level of ontology modules. The approach includes sub-steps for large ontology partitioning, finding similar modules, module matching and result combination. The method uses the ε-connection to transform the input ontology into an ε-connection with the largest possible number of connected knowledge bases.

Figure 3. Pair-wise ontology matching

QoM improvement in Holistic matching

Traditional schema matching research has been based on the pair-wise approach. Recently, holistic schema matching has received much attention owing to its scalability and its efficiency in exploring contextual information. Holistic matching (figure 4) matches multiple schemas at the same time to find attribute correspondences among all the schemas at once. These schemas are usually extracted from web query interfaces in the deep Web.
The deep Web refers to World Wide Web content that is not part of the surface Web indexed by search engines. The data sources in the deep Web are structured and accessible only via dynamic queries instead of static URL links. Several current approaches to holistic schema matching (He et al, 2006; He et al, 2004; He et al, 2003; He et al, 2005; Madhavan et al, 2005; Pei et al, 2006a; Pei et al, 2006b; Su et al, 2006a; Su et al, 2006b) rely on a large amount of data to discover semantic correspondences between attributes. We describe the most important strategies proposed in the literature and highlight the techniques they use to improve the quality of holistic matching.

Statistical strategy: This approach was introduced in (He et al, 2004; He et al, 2003) with the MGS framework (for hypothesis modeling, generation, and selection) and the DCM (Dual Correlation Mining) framework. The MGS framework is an approach for global evaluation, building upon the hypothesis that a hidden schema model probabilistically generates the observed schemas. This evaluation estimates all possible models, where a model expresses all attribute matchings. However, this approach does not take complex mappings into consideration. The DCM framework was proposed for local evaluation, based on the observation that co-occurrence patterns across schemas often reveal the complex relationships between attributes. Both approaches suffer from noisy data. The works in (Chen et al, 2005; He et al, 2006) outperform (He et al, 2004; He et al, 2003) by adding sampling and voting techniques inspired by bagging predictors. Specifically, this approach creates a set of matchers by randomizing the input schema data into many independently downsampled trials, executing the same matcher on each trial, and then aggregating their ranked results by majority voting.

HSM (Holistic Schema Matching) (Su et al, 2006a) and PSM (Parallel Schema Matching) (Su et al, 2006b) have been proposed to find matching attributes across a set of Web database schemas of the same domain. HSM integrates several steps: a matching score calculation that measures the probability of two attributes being synonymous, and a grouping score calculation that estimates whether two attributes are grouping attributes. PSM forms parallel schemas by comparing two schemas and deleting their common attributes. HSM and PSM are purely based on the occurrence patterns of attributes and require neither domain knowledge nor user interaction.

Clustering-based approach: This approach has been presented in (Pei et al, 2006a; Pei et al, 2006b). First, schemas are clustered based on their contextual similarity. Second, the attributes of the schemas that are in the same schema cluster are clustered to find attribute correspondences between these schemas. Third, attributes are clustered across different schema clusters, using statistical information gleaned from the existing attribute clusters, to find attribute correspondences between more schemas. The k-means algorithm is used in these three clustering tasks, and a resampling method is proposed to extract stable attributes from a collection of data.

Figure 4. Holistic schema matching

3. Quality of Matching in PLASMA (Platform for LArge Scale schema MAtching)

The goal of this section is to show how quality of matching can be integrated in our scalable schema matching system PLASMA (PLatform for LArge Scale MAtching) and how this quality can be evaluated. In PLASMA, quality of matching is assessed in each phase in order to evaluate each treatment. The final goal of QoM is to ensure the scalability, adaptability and extensibility of our system.

System Description: The architecture of PLASMA (figure 5) is organized in three phases: Pre-Matching, Matching and Post-Matching.

Figure 5. Architecture of PLASMA

Pre-Matching: This phase is a pre-treatment of voluminous schemas. Its focus is to find the common and similar characteristics between various XML schemas in an automated manner, in order to facilitate the matching process. It includes an XML schema parser that analyses XML schemas and transforms them into trees, a thesaurus to address synonyms and abbreviation similarities, a holistic module to find the most similar sub-schemas, and a QoM module. To improve the quality of matching, we propose the use of tree mining algorithms, whose goal is to find all common and similar sub-structures. The result of the pre-matching phase is a set of the most similar sub-structures, ready for matching. The interest of this approach is to reduce the complexity of large scale matching and to improve the performance of matching algorithms: large schemas are decomposed into smaller ones, and matching is then performed between these small schemas. The QoM module then evaluates:

Human interaction: Because a thesaurus is used, the user effort must be evaluated. The metric used is the manual effort. An intuitive formula for the effort deviation, called Ed, is computed as a function of RPD, the Real Person Days, and PPD, the Planned Person Days:

\[ E_d = \frac{RPD - PPD}{PPD} \times 100 \]

Mining techniques: We determine the performance, execution times and scalability of the tree mining algorithms used.
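The effort deviation used in the pre-matching phase can be sketched in a few lines of Python. This sketch reads Ed as the relative gap between real and planned person-days expressed as a percentage, that is (RPD - PPD) / PPD * 100; the function and parameter names are illustrative assumptions and do not come from PLASMA.

```python
def effort_deviation(real_person_days: float, planned_person_days: float) -> float:
    """Effort deviation Ed in percent, assuming Ed = (RPD - PPD) / PPD * 100.

    A positive value means the manual work took longer than planned,
    a negative value means it took less time than planned.
    """
    if planned_person_days <= 0:
        raise ValueError("planned person-days must be positive")
    return 100.0 * (real_person_days - planned_person_days) / planned_person_days


if __name__ == "__main__":
    # Hypothetical figures: 10 person-days planned, 12 actually spent.
    print(effort_deviation(real_person_days=12, planned_person_days=10))
```

Tracking this value per matching task gives a simple indication of whether the manual part of pre-matching stays within its planned budget.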

Matching: The resulting common sub-schemas are matched in this phase. We apply a structural matcher (pair-wise module) to these sub-schemas instead of matching all the original input schemas, so that matching large schemas is reduced to matching much smaller ones. We apply our matching algorithm (Chukmol et al., 2005) to discover structural correspondences between pairs of schema elements. This similarity considers the context of the elements. The quality of matching module evaluates the execution times of the matching algorithms and the quality of the resulting matchings. To calculate the match quality measures, we define the following metrics. Precision and Recall are widely used in the field of information retrieval, and they are also used in the evaluations of matching systems in (Do et al., 2002). Overall was developed specifically in the context of schema matching. It measures the post-matching effort necessary to add the missed correspondences (false negatives) and remove the false positives. In the formulas below, A is the number of false negatives (true correspondences that were not found), B the number of true correspondences found, and C the number of false positives:

\[ \text{Precision} = \frac{B}{B + C} \]

is the proportion of true correspondences found (B) among those returned (B + C).

\[ \text{Recall} = \frac{B}{A + B} \]

is the proportion of true correspondences found (B) among the total of true correspondences (A + B).

\[ \text{Overall} = 1 - \frac{A + C}{A + B} = \frac{B - C}{A + B} = \text{Recall} \times \left(2 - \frac{1}{\text{Precision}}\right) \]

represents the effort needed to correct the results of an automatic matching (i.e. adding the false negatives and removing the false positives).

Post-Matching: This module combines the structural and linguistic matchings. We select the highly ranked matchings, which represent the most pertinent results. We use different measures to select the best correspondences for an element from a set of possible matchings. The output is the set of element correspondences and the most similar schemas. These results are saved for later use.
To this end, we evaluate the quality of the selected matchings to ensure that only good and valuable results are reused.

4. Conclusion and future works

This paper presented a broad overview of quality of matching characteristics and of the importance of QoM in large scale matching scenarios. Since quality is essential for evaluating matching systems, we have proposed metrics to measure the Quality of Matching (QoM) and defined the factors that influence this quality. We have carried out a state-of-the-art study covering strategies to improve QoM. Building on these existing techniques, we have proposed an approach based on tree mining algorithms to improve QoM, together with an evaluation of quality in every phase of the architecture of PLASMA, our large scale matching system. In the future, we plan to implement all the proposed metrics in our system in order to evaluate the quality of the matching results and to test the scalability of the system.

5. References

Avesani, P., Giunchiglia, F., & Yatskevich, M. (2005). A Large Scale Taxonomy Mapping Evaluation. In Proceedings of the 4th International Semantic Web Conference (ISWC). Galway, Ireland.
Bernstein, P. A., Melnik, S., Petropoulos, M., & Quix, C. (2004). Industrial-Strength Schema Matching. In ACM SIGMOD Record.
Chukmol, U., Rifaieh, R., & Benharkat, A. (2005). EXSMAL: EDI/XML semi-automatic Schema Matching Algorithm. In the 7th International IEEE Conference on E-Commerce Technology (CEC).
Do, H.H., Melnik, S., & Rahm, E. (2002). Comparison of Schema Matching Evaluations. In GI-Workshop Web and Databases. Erfurt, Germany.
Do, H.H., & Rahm, E. (2007). Matching large schemas: Approaches and evaluation. In Journal of Information Systems.
He, B., & Chen-Chuan Chang, K. (2006). Automatic Complex Schema Matching Across Web Query Interfaces: A Correlation Mining Approach. In ACM Transactions on Database Systems (TODS). ACM Press, New York.
Duchateau, F., Bellahsene, Z., & Hunt, E. (2007). XBenchMatch: a benchmark for XML schema matching tools. In VLDB.
He, B., Chen-Chuan Chang, K., & Han, J. (2004). Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). ACM Press, New York, NY.
He, B., & Chen-Chuan Chang, K. (2003). Statistical Schema Matching across Web Query Interfaces. In Proceedings of the ACM SIGMOD International Conference on Management of Data. San Diego, California.
He, H., Meng, W., Yu, C., & Wu, Z. (2005). WISE-Integrator: A System for Extracting and Integrating Complex Web Search Interfaces of the Deep Web. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB). Trondheim, Norway.
Hu, W., & Qu, Y. (2006). Block Matching for Ontologies. In Proceedings of the 5th International Semantic Web Conference (ISWC). Athens, GA, USA.
Hu, W., Zhao, Y., & Qu, Y. (2006). Partition-Based Block Matching of Large Class Hierarchies. In Proceedings of the First Asian Semantic Web Conference (ASWC). Beijing, China.
Lu, J., Wang, S., & Wang, J. (2005). An Experiment on the Matching and Reuse of XML Schemas. In Proceedings of the 5th International Conference on Web Engineering (ICWE). Sydney, Australia.
Madhavan, J., Bernstein, P. A., Doan, A., & Halevy, A.Y. (2005). Corpus-based Schema Matching. In Proceedings of the 21st International Conference on Data Engineering (ICDE). Tokyo, Japan.
Melnik, S., Garcia-Molina, H., & Rahm, E. (2002). Similarity Flooding: A Versatile Graph Matching Algorithm and Its Application to Schema Matching. In Proceedings of the ICDE 02 conference. San Jose, CA, 26 February - 1 March.
Pei, J., Hong, J., & Bell, D.A. (2006a). A Novel Clustering-based Approach to Schema Matching. In Proceedings of the 4th International Conference on Advances in Information Systems (ADVIS). Izmir, Turkey.
Pei, J., Hong, J., & Bell, D.A. (2006b). A Robust Approach to Schema Matching over Web Query Interfaces. In Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDE Workshops). Atlanta, GA.

Qu, Y., Hu, W., & Cheng, G. (2006). Constructing Virtual Documents for Ontology Matching. In Proceedings of the 15th International Conference on World Wide Web (WWW). ACM Press, Edinburgh, Scotland.
Rahm, E., Do, H.H., & Maßmann, S. (2004). Matching Large XML Schemas. In SIGMOD Record. ACM Press, New York, NY.
Shvaiko, P., & Euzenat, J. (2005). A Survey of Schema-based Matching Approaches. Journal on Data Semantics IV 3730.
Smiljanic, M., Keulen, M., & Jonker, W. (2006). Using Element Clustering to Increase the Efficiency of XML Schema Matching. In Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDE Workshops).
Stuckenschmidt, H., & Klein, M. (2004). Structure-based Partitioning of Large Concept Hierarchies. In Proceedings of the 3rd International Semantic Web Conference (ISWC). Hiroshima, Japan.
Su, W., Wang, J., & Lochovsky, F. (2006a). Holistic Schema Matching for Web Query Interface. In Proceedings of the 10th International Conference on Extending Database Technology (EDBT). Munich, Germany.
Su, W., Wang, J., & Lochovsky, F. (2006b). Holistic Query Interface Matching using Parallel Schema Matching. In Proceedings of the 22nd International Conference on Data Engineering (ICDE). Atlanta, GA.
Wang, Z., Wang, Y., Zhang, S., Shen, G., & Du, T. (2006). Effective Large Scale Ontology Mapping. In Proceedings of the First International Conference Knowledge Science, Engineering and Management (KSEM). Guilin, China.
Wang, Z., Wang, Y., Zhang, S., Shen, G., & Du, T. (2006). Matching Large Scale Ontology Effectively. In Proceedings of the First Asian Semantic Web Conference (ASWC). Beijing, China.
Wang, G., Rifaieh, R., Goguen, J., Zavesov, V., Rajasekar, A., & Miller, M. (2007). Towards user centric schema mapping platform. In International Workshop on Semantic Data and Service Integration. Vienna, Austria.
Xu, R., & Wunsch, D. (2005). Survey of clustering algorithms. Neural Networks, IEEE Transactions, 16:
