Charles University in Prague Faculty of Mathematics and Physics HABILITATION THESIS. Martin Nečaský, Ph.D.


Charles University in Prague
Faculty of Mathematics and Physics

HABILITATION THESIS

Martin Nečaský, Ph.D.

Conceptual Modeling and Evolution of XML Schemas

Prague, January 2014

Conceptual Modeling and Evolution of XML Schemas
Habilitation thesis
Martin Nečaský
January necasky
Charles University in Prague, Faculty of Mathematics and Physics, Department of Software Engineering
Malostranské nám , Prague 1, Czech Republic

This thesis contains copyrighted material. The copyright holders are: © Springer-Verlag, © Elsevier B.V., © Australian Computer Society, © IEEE


Acknowledgments

Since the very nature of scientific work is cooperation, I would like to thank especially all my colleagues and co-authors. In the first place, my thanks go to the members and Ph.D. students of the XML and Web Engineering Research Group (XRG), namely to Jakub Klímek, Jakub Malý, Irena Holubová (Mlýnková), Jaroslav Pokorný and Karel Richta, as well as to all the excellent students whose Master theses and software projects form significant parts of our research results. For helpful comments, suggestions, and ideas that enabled us to improve our results, I must express my gratitude to all the anonymous reviewers of our papers, the audience at our presentations, and all the scientists I had the pleasure to meet and talk to. And, last but not least, I am very thankful to my wife Veronika. Her support and understanding enabled me to create this work.


Contents

1 Commentary
    1.1 Introduction
    1.2 Motivation
    1.3 Five-Level Framework for Design, Evolution and Integration of XML Formats
        Logical, Operational and Extensional Level
        Platform-Independent and Platform-Specific Levels
    1.4 Methodology for Design and Evolution of XML Schemas
        Forward-Engineering of XML Formats
        Reverse-Engineering of XML Formats
        Evolution of XML Formats
    1.5 Comparative Analysis
    1.6 Thesis Structure and Author's Contributions
2 When Conceptual Model Meets Grammar: A Dual Approach to XML Data Modeling
3 Translation of Structural Constraints from Conceptual Model for XML to Schematron
4 Semi-automatic Integration of Web Service Interfaces
5 Evolution and Change Management of XML-Based Systems
6 Formal Evolution of XML Schemas with Inheritance
7 Model-driven Approach to modeling and validating integrity constraints for XML with OCL and Schematron
8 Methodology Evaluation - Case Studies
    Case Study - ehealth
    Case Study - Public Procurement
    Basic PIM and PSM Schemas
    Improving Readability, Integrability and Adaptability
9 Conclusion

Preface

The proposed thesis presents selected results of the author's research in the area of conceptual modeling and evolution of XML schemas. The research has been carried out at the Faculty of Mathematics and Physics of the Charles University in Prague in years , mainly within the XML and Web Engineering Research Group (XRG) led by Prof. RNDr. Jaroslav Pokorný, CSc. The results are presented as a collection of the six most important papers of the author published in recent years, together with some complementary material. The work can be logically divided into four parts. The first part shows how a designer can save time and avoid errors when designing a set of XML schemas for her software system. Savings are achieved by applying a conceptual modeling approach instead of designing the XML schemas directly in some XML schema language (e.g. XML Schema [48]). The first part consists of Chapters 2, 3, and 4. The most valuable paper of this part is [38], presented in Chapter 2. The paper was published in the Data & Knowledge Engineering journal, whose editor in chief is Peter P. Chen, the author of the very first paper on the Entity-Relationship Model, which forms a base for all papers in the area of conceptual modeling. Therefore, this paper is of special value for the author of this habilitation work. The other two papers [28, 44] in Chapters 3 and 4 were published at the Asia-Pacific Conference on Conceptual Modelling (APCCM) (2013) and the International Conference on Web Services (ICWS) (2010). Both conferences are recognized in their fields and are very relevant for presenting results in the area of conceptual modeling and XML schemas. The former paper is an extended version of the conference paper. It extends the conceptual modeling technique proposed in [38] so that XML schemas expressed in Schematron [22] can be designed. The other paper extends the conceptual modeling technique with a technique for deriving a conceptual schema from existing

XML schemas, a so-called reverse-engineering technique. The second part is related to the area of XML schema evolution. It assumes that a designer has already designed her XML schemas with the techniques proposed in the first part of this work. It shows how the designer can save time and avoid errors when she needs to change something in the XML schemas. A change can be made at the structural level (e.g., renaming an XML element) or at the semantic level (e.g., replacing an XML element with two new XML elements which have the same meaning as the former one but express the reality in more detail). The second part consists of Chapters 5 and 6. In this part, the most valuable paper is [37], presented in Chapter 5. The paper was published in the Journal of Systems and Software, which is a recognized journal in the field of software engineering with a long tradition dating back to the 1980s. The other paper [27] extends the conceptual modeling technique proposed in [38] with new constructs for modeling inheritance and studies the evolution of XML schemas in the presence of these new constructs. It has been published at the International Conference on Web Services (ICWS) (2012). The third part extends the conceptual modeling technique proposed in [38] so that it is also possible to express more complex integrity constraints at the conceptual level with the Object Constraint Language (OCL). These conceptual-level integrity constraints are then (semi-)automatically translated to expressions in XML languages suitable for expressing integrity constraints at the XML schema level (Schematron and XSLT [24]). This part is covered by Chapter 7. It presents a paper [30] published in the Information Systems Frontiers journal. The last part complements the previous parts with two real-world case studies. They demonstrate how the proposed techniques can be applied in practice. An interested reader can run their own experiments thanks to a tool which implements all the above-mentioned techniques. It is called exolutio and can be downloaded from a site dedicated to the tool. The first case study is from the field of public procurement. It shows how XML schemas of an existing public procurement software system can be modeled and evolved using our approach. The second study is from the field of electronic health (e-health) and shows how a real-world XML communication standard can be modeled and evolved. The unifying commentary is provided in Chapter 1. Prior to summarizing

the papers, the commentary provides a motivation and briefly surveys related state-of-the-art results. In Chapter 9 we conclude. The research included in the selected papers has been supported by several grants, namely Information Society 1ET , GAČR 201/06/0756, GAČR 201/09/0990, GAČR P202/11/P455, GAČR P202/10/0573, and TAČR TA

Prague, January 2014
Martin Nečaský


Chapter 1 Commentary

1.1 Introduction

The eXtensible Markup Language (XML) is currently a de facto standard for data representation and, together with accompanying standards such as XML Schema [48], XQuery [11], and XSLT [24], it becomes a powerful tool. Consequently, the number and complexity of software systems that utilize XML and/or selected XML-based standards and technologies for information exchange and storage grow very fast. The systems represent information in the form of XML documents. One of the crucial parts of such systems are XML formats, which describe the syntax of the XML documents. XML schemas, expressed in a selected XML schema language, e.g. XML Schema [48] or Relax NG [16], are used to express the formats in a machine-interpretable notation. Usually, a system does not use only a single XML format, but a set of different XML formats, each in a particular logical execution part. The XML formats usually represent particular views on the application domain of the software system. For example, a software system for customer relationship management (CRM) exploits different XML formats for purchase orders, customer details, product catalogs, etc. All these XML formats represent different views on the CRM domain. We can, therefore, speak about a family of XML formats used by a software system. Having such a system, XML schema engineers face the problem of designing XML schemas of the XML formats used by the system and of later evolving the XML schemas when various change requests arrive from different system

stakeholders (e.g., users, investors, administrators, etc.). When designing XML schemas, current designers use XML schema languages like XML Schema [48] or Relax NG [16]. Each separate XML format is described with a separate XML schema. Because there are usually several XML formats, the designer has to design many XML schemas for the system. As we have described, XML schemas represent different views of the same application domain. In other words, a single concept, e.g. a customer, can be represented in different XML schemas in different ways. In one XML schema, a customer can be a part of a purchase order and only basic customer information is considered. In another XML schema, the full detail of the customer can be considered, but without the purchase orders (which are represented elsewhere). This brings three basic problems (we will denote them P1-3) which make designing larger sets of XML schemas an error-prone and time-consuming activity.

P1: First, any description of the application domain based on XML schemas is scattered across those various XML schemas. There is no single point, a single schema, which would describe the domain as a whole, i.e. its important concepts (e.g., customer, purchase order, product) and the relationships between them. Therefore, it is hard to understand the domain only from the XML schemas.

P2: Second, it is hard to decide whether the description of the application domain is correct and complete. If the designer needs to check whether a given concept (e.g., a price) is represented with the same level of granularity and with the same elements, he has to read through all the XML schemas and, therefore, can easily overlook something. Moreover, if a concept has been identified during the domain analysis but is not currently used in any XML schema, it is not represented anywhere and may be easily forgotten.

P3: And, third, the XML schemas may need to be evolved whenever user requirements or the surrounding environment change. Each such change may influence multiple different XML schemas in the set. Without a proper technique, the designer has to identify the XML schemas affected by the change manually and ensure that they are evolved coherently with each other and with the rest of the system. When the XML schemas have already been deployed, there will also be XML documents which might become invalid and will require appropriate modification.

Evolution brings a challenge for research, and its main objective is to minimize the user interaction and, hence, the expensive and error-prone work. Naturally, we cannot cut out the user completely; there still remain cases

where user decisions are necessary; however, automatic management of evolution enables one to identify all the affected parts of the XML formats and perform the user-selected changes correctly and efficiently. In our research group, we have focused on the area of efficient and correct management of a set of XML schemas coexisting in a given software system. This includes techniques for designing XML schemas and for their correct evolution. The work of the research group consists of tens of journal and conference proceedings papers. The results presented in the papers are also implemented in a software tool exolutio 1. The author of this habilitation thesis coordinated all the work of the group in the area. He is the main author of many of those publications. Most importantly, he is the main author of two extensive publications [38][37] which form the base of all the work of the research group in the area. The former one introduces the formal background of the conceptual model for XML. The latter one introduces the formal background for XML schema evolution. The papers are presented in this thesis in Chapter 2 and Chapter 5, respectively. The other papers presented in this thesis were also co-authored by the author of this thesis. He contributed the theoretical and evaluation parts of these papers. Their other authors were his Ph.D. students, who contributed mainly to the implementation and experimental parts and to the algorithmic specifications. The first chapter of this thesis summarizes the papers presented in this thesis and relates them to each other. Section 1.2 contains a simple example which serves as a motivation for our approach. It explains the basic problems related to XML schema design and evolution which we cover in this thesis. Section 1.3 summarizes our work from the architectural viewpoint. It describes the so-called five-level framework, which consists of five levels of abstraction of XML documents. The top-most level is conceptual and represents the basic point of our approach - a conceptual schema of the application domain from which all other levels are derived. As we will show, this architecture helps to solve the above-mentioned problems. Section 1.4 summarizes our work from the methodological viewpoint. It shows how an XML schema designer should work when using our proposed techniques. This section serves as a methodology for the designer. It is complemented with extensive examples which explain the methodology in detail. It has three parts: deriving a new XML schema on the basis of a

1 See for download.

conceptual schema, integrating an existing XML schema which is not derived from the conceptual schema, and, finally, implementing a change to existing XML schemas derived from the conceptual schema. The first part is useful when designing a new system from scratch. In that case, the designer has the option to use our techniques from the beginning. However, there are many existing systems where only XML schemas exist, without an explicitly expressed conceptual schema. In these cases, the second part of the methodology can be useful. Section 1.5 provides a comparative analysis of our approach to XML schema design and evolution with other existing approaches. However, this is only a brief comparison. A detailed analysis of the related work is provided by each particular paper presented in this thesis. Therefore, we recommend that a reader interested in a particular topic covered by this thesis read the related work section of the respective paper. Section 1.6 summarizes the structure of the rest of this thesis, i.e. the particular papers of the author presented in this thesis. It puts them into the context of the work of the whole research group and also highlights the author's contributions to each of the papers.

1.2 Motivation

A typical life cycle of a more complex software system starts with its correct design based on the discussion between future users and analysts. The two groups need a common tool that enables them to communicate and to specify the target requirements: a kind of conceptual model that describes the problem domain. Naturally, the first proposal of the system is gradually extended and modified until an agreement is reached and implementation of the system can start. However, it is necessary that modifications and new extensions are consistent with all previously modeled parts, or that these parts are evolved in accordance with the proposed changes. Later, when the system is implemented and deployed, the user requirements sooner or later change and all the affected parts need to be adapted accordingly. In this case, the changes should not be that wide, assuming that the analytical part was performed conscientiously. But still, changes can occur that may influence several parts of the system. And since the system is already deployed, identification of all these parts and their correct modification is crucial. Last but not least, a deployed implementation can face another big change which occurs when

the system must be extended with a new part or integrated with another system. In this case, the integration process must ensure that the resulting system works the same way as if the application had been proposed as a whole from the beginning. So, correct mapping of related parts and establishing the respective relationships is a key aspect. As a demonstration of the problem of design and evolution of a set of XML schemas coexisting in a single software system, let us consider a company that receives purchase orders, and let us focus on the part of the system that processes purchases. Let the messages used in the system be XML messages formatted according to a family of different XML schemas. Consider the two sample XML documents in Figure 1.1. The former one is formatted according to an XML schema for a list of customers. The latter one is formatted according to a different XML schema for purchase requests. There are also other XML schemas in the set (e.g., customer details, purchase responses, purchase transport details, etc.). All share the same application domain (customer relationship management). On the other hand, the same part of the domain may be represented by various XML schemas in different ways because of the different purposes for which the XML schemas are present in the system. For example, the concept of customer is represented in each of our sample XML schemas in a different way. In the first one, different kinds of customers are distinguished (private and corporate customers). We need the full detail for each customer in the XML document. For private customers, elements name, address and phone are present. For corporate customers, elements name, different addresses (headquarters, storage and secretary), and phone are present. In the second XML document, we do not distinguish different types of customers. We have only element cust with child elements name, , ship-to and bill-to. The last two represent addresses. We need a unified representation of shipping and billing addresses in purchase requests for all kinds of customers. Note that in a closed system that is under our control, it would be better to model the addresses using inheritance to share common parts and then use a unified address format in all the XML schemas used. However, the specific XML schemas for addresses can be defined by various parties and cannot always be unified, as is the case in our example. We may need to adhere to various XML schemas when communicating with various business partners using XML messages. They have their own schemas and it may be up to us to respect them in our system. That is why we need to take into account the possibility of having to maintain different address formats in various XML

schemas.

    <custlist version="1.3">
      <cust>
        <name>Martin Necasky</name>
        <address>Vaclavske nam. 123, Prague</address>
        <phone> </phone>
      </cust>
      <cust>
        <name>Charles University</name>
        <hq>Malostranske nam. 25, Prague</hq>
        <storage>Ke Karlovu 3, Prague</storage>
        <secretary>Ke Karlovu 5, Prague</secretary>
        <phone> </phone>
      </cust>
    </custlist>

(a) List of customers XML document

    <purchaserq version="1.0">
      <cust>
        <name>Charles University</name>
        <bill-to>Malostranske 25, Prague</bill-to>
        <ship-to>Ke Karlovu 3, Prague</ship-to>
      </cust>
      <items>
        <item code="p045"><price>17</price></item>
        <item code="p332"><price>34</price></item>
      </items>
    </purchaserq>

(b) Purchase order request XML document

Figure 1.1: Sample XML documents formatted according to two different XML schemas

This simple example nicely demonstrates the problems which arise when a designer describes the application domain only with XML schemas. First, the description of the domain is scattered across various XML schemas. Therefore, it is hard to check, e.g., in which XML schemas the concept of address is represented and how. The designer must go through all the XML schemas, check whether they somehow represent the concept of address, and find it. It is also hard to keep track of concepts which are important for the problem domain but have not been represented in any XML schema yet (but probably will be in the future). For example, the concept of discount is important since it has been identified by our domain experts. However, XML schema designers did not have time to include it in the XML schemas. Without a conceptual schema (where this information could be recorded), they have to keep all this information in their minds or in their textual notes. Let us now consider a new user requirement that an address should no longer be represented as a simple string. Instead, it should be divided into elements street, city, zip, etc. Such a situation would require a skilled domain expert to identify all the XML schemas which involve an address and correct them accordingly. In a complex system comprising tens or even

hundreds of XML schemas (possibly some of them integrated from external partners and, hence, out of our control), this is a difficult and error-prone task. Even identifying the affected parts of an XML schema is not an easy and straightforward process. For example, we may need to make the modification only for addresses that represent a place to ship the goods (which are the elements address and storage in the XML schema of the first document and the element ship-to in the XML schema of the second document). We do not want to modify addresses that represent headquarters, etc. Hence, we need to be able to preserve a kind of semantic relationship between the represented parts of the system.

1.3 Five-Level Framework for Design, Evolution and Integration of XML Formats

Our five-level framework allows for design and later maintenance (semantically coherent evolution) of a set of XML schemas which co-exist in a software system and share the same application domain. We established a formal base of the framework in [38] (Chapter 2) and first described its levels in [37] (Chapter 5). The architecture of the framework is depicted in Figure 1.2. It is partitioned both horizontally and vertically. Vertical partitions represent individual XML formats. Horizontal partitions represent different levels which characterize each of the XML formats from different viewpoints:

The extensional level contains XML documents formatted according to the XML format.
The operational level contains operations performed over the XML documents from the extensional level. These can be queries over the instances or transformations of the instances.
The logical level contains a logical XML schema which specifies the syntax of the XML format. It is expressed in an XML schema language.
The platform-specific level contains a schema which specifies the semantics of the XML format in terms of the platform-independent level.

The platform-independent level contains a common conceptual schema. It provides the information model of the system and covers the common semantics of the XML formats.

[Figure 1.2: Five-level XML evolution framework. Levels from top to bottom: Platform-Independent Level (PIM diagram); Platform-Specific Level (PSM diagram 1 ... PSM diagram i ... PSM diagram n); Logical Level (XML schemas); Operational Level (XML queries); Extensional Level (XML documents).]

As we can see, the framework covers the syntax and semantics of the XML formats as well as their instances and the operations performed over the instances. However, the XML documents, queries and schemas at different horizontal levels are not the only first-class citizens of our framework. There are also mappings between the horizontal levels, depicted as solid lines. They are crucial for correct evolution. Evolution means that a change to any XML format made by a designer is correctly propagated to all other relevant affected parts so that all parts of the framework remain consistent. The relevancy results from a particular real-world application. For example, in the area of XML data the propagation from the extensional to the logical level is not used much, though there exist approaches dealing with this direction (see Section 1.5). Consistency is a relatively vague term. We formalize it as a state where all mappings between the horizontal levels required by the framework exist. Mappings between the extensional, operational and logical levels represent syntactic consistency. Syntactic consistency means that XML instances are valid against the XML schemas and operations work correctly with the structure

defined by the XML schema. Mappings between the logical, PSM and PIM levels represent semantic consistency. Semantic consistency means that there is a unified conceptual description of the problem domain (i.e. the PIM schema) and that the logical XML schemas of all XML formats in the family are unambiguously mapped to that description. The mapping specifies that the semantics behind each mapped component of a logical XML schema is given by the corresponding component in the conceptual schema. As Figure 1.2 shows, we consider that the designer makes a change at the logical, platform-specific or platform-independent level (the upward arrows). From there, it is propagated to all other parts of the framework (i.e. to all levels for all XML formats). Changes at the operational or extensional level can also appear. However, as we have mentioned, even though it would be theoretically possible to propagate them to the upper levels, it is usually not meaningful. For example, it is not very meaningful to propagate a change from the extensional level (i.e. a change made in a particular XML document) to the logical level (i.e. to the corresponding XML schema) and higher. A change in an XML document does not usually mean that the respective XML schema needs to be adapted. On the other hand, the framework can identify that the change in the XML document does not correspond to the XML schema and notify the designer.

Logical, Operational and Extensional Level

The lowest level, called the extensional level, represents particular XML schema instances that occur in the system. The instances are XML documents which are persistently stored in an XML database or exchanged as messages between parts of the system or between the system and other systems. The level one step higher, called the operational level, represents operations over the instances. These might be, e.g., XML queries over the instances expressed in XQuery [11] or transformations of the instances expressed in XSLT [24].
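To make the relationship between the extensional and operational levels concrete, the following minimal sketch runs two operations over the sample instances of Figure 1.1. It is only an illustration under stated assumptions: it uses Python's standard xml.etree.ElementTree instead of XQuery, the function names corporate_names and total_above are hypothetical, and the phone elements are omitted from the sample data.

```python
import xml.etree.ElementTree as ET

# Extensional level: sample instances taken from Figure 1.1 (phones omitted).
CUSTLIST = """
<custlist version="1.3">
  <cust>
    <name>Martin Necasky</name>
    <address>Vaclavske nam. 123, Prague</address>
  </cust>
  <cust>
    <name>Charles University</name>
    <hq>Malostranske nam. 25, Prague</hq>
    <storage>Ke Karlovu 3, Prague</storage>
    <secretary>Ke Karlovu 5, Prague</secretary>
  </cust>
</custlist>
"""

PURCHASE = """
<purchaserq version="1.0">
  <cust>
    <name>Charles University</name>
    <bill-to>Malostranske 25, Prague</bill-to>
    <ship-to>Ke Karlovu 3, Prague</ship-to>
  </cust>
  <items>
    <item code="p045"><price>17</price></item>
    <item code="p332"><price>34</price></item>
  </items>
</purchaserq>
"""


def corporate_names(xml_text):
    """Operational level: names of customers that have an hq child element."""
    root = ET.fromstring(xml_text)
    return [c.findtext("name") for c in root.iter("cust")
            if c.find("hq") is not None]


def total_above(xml_text, threshold):
    """Operational level: sum of item prices greater than the threshold."""
    root = ET.fromstring(xml_text)
    prices = (float(i.findtext("price")) for i in root.iter("item"))
    return sum(p for p in prices if p > threshold)


print(corporate_names(CUSTLIST))   # ['Charles University']
print(total_above(PURCHASE, 20))   # 34.0
```

Both functions depend on element names such as cust, hq, item and price, which is why a change of the XML schema may require the operations to be adapted as well.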
The level above, called the logical level, represents logical schemas that describe the structure of the instances. They are expressed in a selected XML schema language, e.g., XML Schema (XSD) [49], Relax NG [16], Schematron [22], etc. We demonstrate the three levels in Figure 1.3. It shows our two sample XML formats represented at the three levels. There are two kinds of mappings between the levels. The first kind are mappings of instance XML documents to their XML schemas. The instances are XML documents valid against the XML schema. An instance XML element or

(a) XML format for list of customers

Logical Level:

    <element name="custlist">
      <sequence>
        <element name="cust" type="customer".../>
      </sequence>
    </element>
    <complexType name="customer">
      <sequence>
        <element name="name" type="string"... />
        <choice>
          <element name="address" type="string" />
          <sequence>
            <element name="hq" type="string" />
            ...
          </sequence>
        </choice>
      </sequence>
    </complexType>

Operational Level:

    for $c in //cust
    where $c/hq
    return <corporate>{$c/name}</corporate>

Extensional Level: a stack of custlist XML documents (instances of the format).

(b) XML format for purchase requests

Logical Level:

    <element name="purchaserq">
      <sequence>
        <element name="cust" type="cust" />
        <element name="items">
          <sequence>
            <element name="item" type="item".../>
          </sequence>
        </element>
      </sequence>
    </element>
    <complexType name="cust">
      <sequence>
        <element name="name".../>
        <element name="code".../>
        <element name="ship-to".../>
        <element name="bill-to".../>
      </sequence>
    </complexType>

Operational Level:

    for $p in /purchaserq
    return fn:sum(
      for $it in $p//item
      where $it/price > 20
      return $price)

Extensional Level: a stack of purchaserq XML documents (instances of the format).

Figure 1.3: Two sample XML formats represented in the framework

attribute is mapped to its respective definition in the XML schema. The mapping is created automatically during validation. For example, XML elements cust in the instances of the XML format on the left of Figure 1.3 are mapped to the definition of the XML element cust in the XML schema. A valid instance is fully mapped to its XML schema. The other kind are mappings of operations to XML schemas. Operations are based on the XPath language, whose basic construct is a path comprising steps which select XML elements and attributes from the instance XML documents. The steps also specify the required hierarchical relationships between the selected XML elements and attributes (e.g., parent/child or ancestor/descendant). A path is mapped to a respective chain of XML element or attribute definitions in the XML schema. The definitions are in the structural relationship specified by the path steps. The mapping is created automatically during the validation of a path (similarly to the validation of XML documents). For example, there are the following paths in the query for the XML format on the left of Figure 1.3: //cust/hq and //cust/name. They are mapped to the corresponding XML element definitions as depicted by the arrows. A consistent XML query has all its paths mapped. When the structure of an XML schema changes, its instances and related queries must be adapted accordingly so that the validity of the instances and the correctness of the queries are preserved. Some changes can be propagated automatically. However, there are also changes where automatic propagation is not always possible. For example, suppose that we want to split XML elements name in the left-hand side XML format from Figure 1.3 into two XML elements firstname and lastname. This change can be automatically propagated to the mentioned query (as a corresponding split of path /cust/name to two paths /cust/firstname and /cust/lastname).
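The mapping of paths to element definitions and the automatic propagation of the split to a query can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of [37]: the schema is reduced to a hypothetical dictionary of parent/child relationships covering only the elements visible in Figure 1.3, the functions map_path and split_element are assumed names, and the difference between the / and // axes is ignored.

```python
# Toy logical schema: each element definition lists its allowed children.
# Simplified from the custlist schema in Figure 1.3; not a real XSD model.
SCHEMA = {
    "custlist": ["cust"],
    "cust": ["name", "address", "hq"],
}


def map_path(path, schema):
    """Map a path to a chain of element definitions, or return None.

    Every pair of consecutive steps must be in a parent/child relationship
    in the schema; the first step may match any known element.
    """
    steps = [s for s in path.split("/") if s]
    known = set(schema) | {c for kids in schema.values() for c in kids}
    if not steps or steps[0] not in known:
        return None
    for parent, child in zip(steps, steps[1:]):
        if child not in schema.get(parent, []):
            return None
    return steps


def split_element(schema, paths, parent, old, new):
    """Replace child `old` of `parent` by the elements `new`.

    Returns the evolved schema and the paths with the change propagated:
    each path ending in parent/old becomes one path per new element.
    """
    kids = schema[parent]
    i = kids.index(old)
    new_schema = dict(schema)
    new_schema[parent] = kids[:i] + list(new) + kids[i + 1:]

    new_paths = []
    for p in paths:
        steps = p.split("/")
        if steps[-2:] == [parent, old]:
            new_paths.extend("/".join(steps[:-1] + [n]) for n in new)
        else:
            new_paths.append(p)
    return new_schema, new_paths


query_paths = ["//cust/hq", "//cust/name"]
evolved_schema, evolved_paths = split_element(
    SCHEMA, query_paths, "cust", "name", ["firstname", "lastname"])
print(evolved_paths)  # ['//cust/hq', '//cust/firstname', '//cust/lastname']
```

After the split, the original path //cust/name no longer maps to the evolved schema, while all rewritten paths do, so the query stays consistent in the sense described above.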
However, the change cannot be automatically propagated to the instances until a designer specifies a function which splits a string with a full name into two substrings with the first and last name.

Platform-Independent and Platform-Specific Levels

If we consider only the three described levels, we have no explicit relationship between the vertical partitions, i.e. between the XML formats modeled by the framework. As we have already discussed, a change in one XML format

can trigger changes in the other XML formats to keep their consistency. Therefore, a change in one XML schema must be propagated to the other affected XML formats manually by a designer. This is, of course, a highly time-consuming and error-prone solution. The designer must be able to identify all the affected formats and propagate the change correctly. Often, (s)he is not able to do such complex work and needs the help of a domain expert who understands the problem domain but is, typically, a business expert rather than a technical XML format expert. Therefore, it is very hard for the domain expert to navigate the logical XML schemas, operations and instances.

To overcome these problems, we introduce two additional levels to the framework briefly described at the beginning of the section. They represent two additional levels of abstraction of the XML formats and are motivated by the MDA [31] principles. The platform-independent level comprises a single conceptual schema of the problem domain. We call it the PIM schema (= schema in a platform-independent model) and use the notation of UML [39][40] class diagrams to express it. A sample PIM schema modeling the domain of customers and their purchases is depicted in Figure 1.4.

The platform-specific level comprises an individual schema for each XML format. We call it a PSM schema (= schema in a platform-specific model) and use UML class diagrams as well. We introduced a few extensions to the UML notation to make it suitable for modeling XML formats. For their full description we refer to [38] (Chapter 2). A PSM schema has a hierarchical structure since it models an XML format. Two sample PSM schemas for our two XML formats are depicted in Figure 1.4.

A PSM schema can be viewed from two perspectives: conceptual and grammatical. The conceptual perspective models the semantics of the XML format in terms of the PIM schema. The semantics is modeled as an unambiguous mapping of the components of the PSM schema to the components of the PIM schema.
We demonstrate the mapping in Figure 1.4 on the right-hand side. Let us focus on PSM classes PrivateCus and CorporateCus. They both have the same attributes. The question is why we do not model them as a single class, e.g., Customer. The reason is that a single class would not allow us to distinguish their semantics. Therefore, we model them as two separate classes but with the same attributes. Let us note that in the corresponding XML schema (see the logical level below) both classes are expressed using the same construction (Cust complex type), which is correct because we do not distinguish the semantics at the logical level. There is depicted the mapping of PSM class PrivateCus to PIM class

Figure 1.4: Two sample XML formats represented at logical, PSM and PIM levels. (a) XML format for list of customers; (b) XML format for purchase requests.

PrivateCus in the figure. The PSM attributes name, code, shipto and billto are mapped to PIM attributes name, code and address; the PSM attributes shipto and billto are both mapped to the same PIM attribute address. Therefore, the semantics of this portion of the PSM schema is that the PrivateCus class models a private customer with a name and code. Both the shipping and the billing address are the same address recorded in the system for the customer. The PSM class CorporateCus is mapped similarly, but its PSM attributes shipto and billto are mapped to the PIM attributes storage and headquarters, respectively. In other words, the semantics of the shipto and billto attributes in the modeled XML formats is different for private and corporate customers. Associations are mapped as well. For example, both PSM associations going to PrivateCus and CorporateCus are mapped to the PIM association connecting the PIM classes Customer and Purchase. There can also be components which are not mapped. They are displayed in grey, e.g., class Items or PSM associations cust and items in the PSM schema on the right. These components have no semantics.

From the grammatical perspective, a PSM schema models a grammar of the respective XML format. In other words, it models the syntax of the XML format which is expressed at the logical level as an XML schema. The conversion of the PSM schema to a corresponding XML schema (or vice versa, see Section 1.4) is automatic. Briefly, a class models a sequence of XML element and attribute declarations. An association with a name models an XML element whose content is the sequence modeled by its child class. During the conversion, an unambiguous mapping of the XML schema components to the PSM schema components is automatically created. We depict a portion of a sample mapping in Figure 1.4.

Figure 1.4 also shows one of the extensions we introduced to the UML notation for the purposes of modeling XML schemas: choice content models.
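The conceptual perspective described above, i.e. the mapping of PSM components to PIM components, can be pictured with a small Python sketch. The dictionaries and function names are hypothetical and only mirror the mappings named in the text for the purchase request format; the framework itself defines the mapping formally in [38].

```python
from collections import defaultdict

# PSM class -> PIM class; None marks a component without semantics (drawn in grey).
CLASS_MAP = {"PrivateCus": "PrivateCus", "CorporateCus": "CorporateCus", "Items": None}

# PSM attribute -> PIM attribute, per PSM class (attribute names as used in the text).
ATTR_MAP = {
    "PrivateCus": {"name": "name", "code": "code",
                   "shipto": "address", "billto": "address"},
    "CorporateCus": {"name": "name", "code": "code",
                     "shipto": "storage", "billto": "headquarters"},
}

def shared_targets(psm_class):
    """PIM attributes to which more than one PSM attribute of the class is mapped."""
    groups = defaultdict(list)
    for psm_attr, pim_attr in ATTR_MAP.get(psm_class, {}).items():
        groups[pim_attr].append(psm_attr)
    return {pim: sorted(attrs) for pim, attrs in groups.items() if len(attrs) > 1}

def has_semantics(psm_class):
    """A PSM class has semantics only if it is mapped to some PIM class."""
    return CLASS_MAP.get(psm_class) is not None

for cls in ("PrivateCus", "CorporateCus"):
    print(cls, shared_targets(cls))
print("Items has semantics:", has_semantics("Items"))
```

The two classes have identical attribute names, so only the mapping tells them apart: for PrivateCus both address attributes point to one PIM attribute, whereas for CorporateCus they point to two different ones.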
Choice content models are depicted as grey ovals with a symbol in the middle. A choice content model models a choice in the XML content. For example, the choice content model in the left-hand side schema in our example models a choice between two possible XML contents modeled by classes PrivateCus and CorporateCus. However, choice content models do not model a choice in the XML content in every case (i.e. they differ from choice models known from XML schema languages like XSD or DTD). In the right-hand side schema of our sample, the choice content model does not model a choice in the XML content because both variants are equivalent from the grammatical perspective (but not from the conceptual one, as we have shown, so we have to distinguish

them in the PSM schema).

In [38] (Chapter 2) we proved that the expressive power of our PSM schemas is the same as the expressive power of regular tree grammars [33] (RTG). RTG is a notation which allows one to express an XML schema in a formal way and to reason about the expressive power of various XML schema languages. More about the expressive power of our PSM schemas can be found in [38] (Chapter 2).

The complete framework forms a hierarchy which interconnects all modeled XML formats in the family using the common PIM schema. The consistent evolution of the XML formats is realized using this common point. For instance, if a change occurs in a selected XML schema, it is first propagated to the respective XML schema, PSM schema and, finally, to the PIM schema. We speak about an upwards propagation, represented in Figure 1.2 by the upwards arrows. It enables one to identify the part of the problem domain that was affected. Then, we can invoke the downwards propagation. It enables one to propagate the change of the problem domain to all the affected XML formats. In Figure 1.2 it is denoted by the downwards arrows.

1.4 Methodology for Design and Evolution of XML Schemas

In this section, we introduce a methodology which guides XML schema designers in using our framework to (1) design new XML formats which are semantically consistent with already existing XML formats in the family, (2) integrate existing XML formats into the framework (e.g., XML formats defined by an industrial standardization organization), and (3) evolve the whole family of XML formats while preserving the achieved consistency. The methodology consists of various steps. Some of them must be done manually by the designer (and/or domain expert). Some of them are performed automatically by the framework or semi-automatically, i.e. the framework finds possible solutions and a human user selects the correct one.
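The upwards and downwards propagation through the hierarchy, on which the evolution part of the methodology relies, can be outlined as follows. This is a rough Python sketch under stated assumptions: the level ordering follows the figures, while the component identifiers, the FORMAT_TO_PIM table and the function plan_propagation are hypothetical and stand in for the mappings and algorithms described in [37].

```python
# Levels of the framework from top to bottom (ordering taken from the figures).
LEVELS = ["PIM", "PSM", "logical", "operational", "extensional"]

# Which PIM components each XML format is mapped to (illustrative identifiers).
FORMAT_TO_PIM = {
    "CustomerListSchema": {"Customer.name", "Customer.phone"},
    "PurchaseRQSchema": {"Customer.name", "Customer.code"},
}

def plan_propagation(origin, start_level, pim_component):
    """Return (upward, downward) lists of places a change has to visit.

    The change starts in format `origin` at `start_level` and concerns
    `pim_component` once it reaches the shared PIM schema.
    """
    start = LEVELS.index(start_level)
    # Upwards: every level above the start level, ending in the shared PIM schema.
    upward = ["PIM" if level == "PIM" else f"{origin}/{level}"
              for level in reversed(LEVELS[:start])]
    # Downwards: every format mapped to the component, down to the extensional level.
    downward = []
    for fmt in sorted(FORMAT_TO_PIM):
        if pim_component not in FORMAT_TO_PIM[fmt]:
            continue
        # In the origin format, the start level and the levels above it are already changed.
        first = start + 1 if fmt == origin else 1
        downward += [f"{fmt}/{level}" for level in LEVELS[first:]]
    return upward, downward

up, down = plan_propagation("CustomerListSchema", "PSM", "Customer.name")
print(up)
print(down)
```

The sketch only orders the places to visit; in the methodology each visit additionally involves an impact analysis and a choice among proposed propagation scenarios.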
In this section, we do not describe the algorithms for the automatic and semi-automatic steps. They are thoroughly described in [37] (Chapter 5). We only point out what needs to be done manually and what can be automated. The methodology is described in steps performed by the designer and/or the system. We denote the steps which must be performed manually by the designer with

(M). The steps which are performed automatically by the system are marked with (A). The steps which need the system to cooperate with the designer are denoted with (S-A).

Forward-Engineering of XML Formats

Let us first discuss a methodology which allows a designer to design a new XML format. The designer proceeds in the forward direction from the PIM level to the logical level and, therefore, we call the methodology forward-engineering of XML schemas. The results of the process are an XML schema, a PSM schema, a possible extension to the PIM schema, and the mappings between them. We demonstrate the methodology in Figure 1.5. Here, a designer is given the task to design a new XML format for a list of supplies of a given supplier. The designer initiates the following steps:

1. (M) The application domain is studied and described in the form of a PIM schema. It may happen that the PIM schema already exists but does not fully cover the semantics of the designed XML format. Hence, it must be extended. The designer cooperates with a domain expert. This is necessarily a manual process. In our sample scenario, the designer extends the PIM schema with the model of suppliers (class Supplier) and product supplies (class Supply) (Figure 1.5, step 1).
(A) An impact analysis of changes made at the PIM level to the existing XML formats modeled at the PSM level is performed automatically by the algorithms for change propagation. The results of the impact analysis are presented to the designer and domain expert. In our sample scenario, there is no impact on the XML formats; only new components are created in the PIM schema. No existing components are modified or removed.

2. For each XML format which needs to be newly designed:
(a) (M) In cooperation with the domain expert, the designer analyses what information (i.e. relevant concepts and relationships) must be represented in the instances of the target XML format (Figure 1.5, step 2(a)). This process is necessarily manual.

Figure 1.5: Demonstration of the forward-engineering methodology

(b) (M) The designer manually identifies the part of the PIM schema which models the concepts and relationships identified in the previous step. In our example, it means classes Product, Supply and Supplier (Figure 1.5, step 2(b)).
(c) (S-A) The selected part of the PIM schema is shaped into a PSM schema which models the aimed XML schema. The conversion is partly manual and partly automatic. The designer specifies the hierarchical structure manually. The mappings to the PIM schema are generated automatically as the designer builds the PSM schema from the PIM schema part. The designer must manually create additional PSM components which are not mapped to the PIM schema. In our example, the designer specifies that class Supplier will be the root with a nested class Supply which contains a product code (attribute code of class Product) and a supplied amount (attribute amount) (Figure 1.5, step 2(c)).
(d) (A) The resulting PSM schema is automatically converted to a logical XML schema expressed in the selected XML schema language. The mapping of the logical XML schema to the PSM schema is also created automatically (Figure 1.5, step 2(d)). The full translation algorithm was published in [38] (Chapter 2).

The methodology has several advantages. The designer works at the user-friendly level of UML class diagrams, abstracted from the details of XML schemas. It also saves time, because shaping the part of the PIM schema into the PSM schema is much easier than manual creation of the XML schema. And it avoids errors, since the overall description of the problem domain is given and, therefore, the designer does not miss important real-world concepts or relationships in the designed XML schema.
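To give a feeling for step 2(d), the following Python sketch converts a tiny PSM-like structure for the supplies example into an XSD document. It is a simplification under explicit assumptions: the PSMClass data class and the functions are hypothetical, every attribute is emitted as an XML element of type xs:string, the product code is flattened into Supply, and nested content is emitted as anonymous complex types. The real translation algorithm is the one published in [38] and handles far more than this.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

XS = "http://www.w3.org/2001/XMLSchema"

@dataclass
class PSMClass:
    """Hypothetical, heavily simplified stand-in for a PSM class."""
    name: str
    attributes: list = field(default_factory=list)    # emitted as XML elements here
    associations: list = field(default_factory=list)  # (element name, child class, maxOccurs)

def content_sequence(cls):
    """A class models a sequence of declarations for its attributes and associations."""
    seq = ET.Element(f"{{{XS}}}sequence")
    for attr in cls.attributes:
        ET.SubElement(seq, f"{{{XS}}}element", name=attr, type="xs:string")
    for elem_name, child, max_occurs in cls.associations:
        # A named association models an element whose content is the child's sequence.
        elem = ET.SubElement(seq, f"{{{XS}}}element", name=elem_name, maxOccurs=max_occurs)
        ET.SubElement(elem, f"{{{XS}}}complexType").append(content_sequence(child))
    return seq

def to_xsd(root_element, root_class):
    """Build an xs:schema with one global element for the root class."""
    ET.register_namespace("xs", XS)
    schema = ET.Element(f"{{{XS}}}schema")
    root = ET.SubElement(schema, f"{{{XS}}}element", name=root_element)
    ET.SubElement(root, f"{{{XS}}}complexType").append(content_sequence(root_class))
    return schema

supply = PSMClass("Supply", attributes=["code", "amount"])
supplier = PSMClass("Supplier", attributes=["code"],
                    associations=[("supply", supply, "unbounded")])
schema = to_xsd("supplier", supplier)
print(ET.tostring(schema, encoding="unicode"))
```

The generated schema declares a supplier element containing code and a repeatable supply element with code and amount, which corresponds in shape to the logical schema shown in Figure 1.5.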
Additionally, the designer does not add new concepts without seeing the impact on the overall PIM schema, which avoids introducing redundant or overlapping structures into the PIM schema.

Reverse-Engineering of XML Formats

The designer can also integrate existing XML schemas into his/her solution in a bottom-up manner, which we also call reverse-engineering. Such an XML schema might be a legacy XML schema or an XML schema prescribed by a standardization organization or used by another party that we want to integrate

Figure 1.6: Demonstration of the reverse-engineering methodology

into our system. We demonstrate the reverse-engineering methodology in a sample scenario depicted in Figure 1.6. A designer is given the task to integrate an existing XML schema of an XML format for reserves in a given store. The designer proceeds in the following steps:

1. (A) The XML schema is automatically converted to a corresponding PSM schema. The mapping of the XML schema to the PSM schema is also created automatically (Figure 1.6, step 1). The full translation algorithm was published in [38] (Chapter 2).

2. The designer maps the PSM schema to the PIM schema in the following steps:
(a) (A) Various algorithms for automatic mapping discovery [19] are applied to help the designer with creating the mappings (Figure 1.6, step 2(a)).
(b) (M) The designer can be required to augment the generated PSM schema manually. In our case, the generated class Reserve contains attributes code and amount. However, code belongs to PIM class Product from the conceptual perspective, while amount belongs to the reserve information which is not modeled in the PIM schema yet. Therefore, the designer moves code to PSM class Product, which will be mapped to PIM class Product. (S)he also creates an association connecting PSM classes Reserve and Product (Figure 1.6, step 2(b)).
(c) (M) (S)he might also be required to manually complement the PIM schema in case it does not cover the semantics of the imported XML schema. The impact of the performed changes in the PIM schema on the other XML formats integrated in the framework is analyzed and reported to the designer. In our case, the designer creates new PIM classes Reserve and Store to cover the new kind of information (Figure 1.6, step 2(c)).

The advantages are similar to those of the forward-engineering methodology. The main advantage for the designer is that (s)he works with UML class diagrams. It is much easier to map the PSM schema than the XML schema (because both the PSM and PIM schemas use the UML notation).
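Step 2(a) and the manual correction in step 2(b) can be illustrated with a deliberately naive Python sketch. It does not reproduce the mapping discovery algorithms of [19]; plain name similarity from the standard difflib module is swapped in as a stand-in, and the PIM excerpt, the threshold and the function names are assumptions.

```python
from difflib import SequenceMatcher

# Excerpt of the PIM schema: class -> attributes (taken from the running example).
PIM_CLASSES = {
    "Supplier": ["code", "name"],
    "Product": ["code", "title", "list-price"],
    "Supply": ["amount"],
    "Customer": ["code", "name", "phone"],
}

def similarity(a, b):
    """Case-insensitive string similarity in the range 0..1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def suggest_mappings(psm_attributes, pim_classes, threshold=0.8):
    """For each PSM attribute, list candidate (PIM class, PIM attribute) pairs, best first."""
    suggestions = {}
    for attr in psm_attributes:
        scored = [(similarity(attr, pim_attr), cls, pim_attr)
                  for cls, attrs in pim_classes.items() for pim_attr in attrs]
        best = sorted((s for s in scored if s[0] >= threshold), reverse=True)
        suggestions[attr] = [(cls, pim_attr) for _, cls, pim_attr in best]
    return suggestions

# Attributes of the generated PSM class Reserve.
candidates = suggest_mappings(["code", "amount"], PIM_CLASSES)
for attr, targets in candidates.items():
    print(attr, "->", targets)
```

Name similarity alone proposes three targets for code and proposes Supply.amount for amount, although the text explains that code belongs to Product and that the reserve amount is not modeled in the PIM schema yet. This is why the designer still has to select, reject and complement the suggested mappings manually.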
Let us note that it is also usual in practice that an XML schema is not available and only the instance documents (samples) are. In that case,

the first step of the methodology is preceded by an additional step which semi-automatically derives the XML schema from the provided set of sample documents. For this, various XML schema inference algorithms can be used, e.g., [25].

Evolution of XML Formats

When all required XML formats are designed or integrated into the framework, it becomes necessary at some point in time to evolve them consistently. We demonstrate the evolution methodology in a sample scenario depicted in Figure 1.7. The designer needs to split the single-valued name of a private customer into two values, first name and last name, in the PSM schema of the XML format for customer lists (the right-hand side PSM schema in Figure 1.7(a)). The scenario shows how the change is propagated to all affected parts of the framework. The designer proceeds in the following steps:

1. (M) The designer analyzes a new user requirement or a change in the system environment in cooperation with the domain expert. This is necessarily a manual process. (S)he decides whether the subject of the change is the whole application domain or only a particular XML format. In the former case, it is necessary to implement the change at the PIM level. In the latter case, the change will be made at the PSM level.

2. (M) The designer makes the initial change at the identified level. In our sample scenario, the initial change was made in the PSM schema on the right-hand side of Figure 1.7(b), as indicated by the grey arrow pointing to class PrivateCus with two new attributes fname and lname which replaced the original attribute name.

3. First, the upwards propagation is performed:
(a) (A) The impact analysis of the change to the upper level is computed automatically and presented to the designer (see [37] (Chapter 5) for a formal description of the impact analysis algorithms). The mappings between the levels are exploited. The analysis shows inconsistencies caused by the change.
The designer can roll back the initial change and the whole propagation process at this point.
(b) (S-A) Possible scenarios of propagation of the change to the upper level are identified automatically and proposed to the designer.

Figure 1.7: Demonstration of the evolution methodology. (a) Initial state; (b) Upward change propagation; (c) Downward change propagation.

Each scenario consists of a sequence of change operations which need to be performed at the upper level to restore the consistency between the current and the upper level. The designer selects one of the scenarios manually.
(c) (A) The chosen scenario is performed at the upper level automatically.
(d) If the PIM level is not reached, steps 3(a)-3(c) are repeated for each operation from the chosen scenario. The upper level becomes the current level.
In our sample scenario, the initial change was performed at the PSM level, so the upwards propagation has only one cycle: propagation to the PIM level. The result is highlighted in Figure 1.7(b) by the grey arrow pointing to PIM class PrivateCus with two new attributes fname and lname instead of the original attribute name.

4. Second, the downwards propagation is performed for each operation made at the PIM level. For each XML format, the propagation continues until the extensional level is reached:
(a) (A) The impact analysis of the change to the lower level is computed automatically and presented to the designer. The mappings between the levels are exploited. Similarly to step 3(a), the designer can roll back the whole propagation process here.
(b) (S-A) Similarly to step 3(b), possible scenarios of propagation of the change to the lower level are identified and the designer selects one of them manually.
(c) (A) The chosen scenario is performed at the lower level automatically.
(d) If the extensional level is not reached, steps 4(a)-4(c) are repeated for each operation from the chosen scenario. The lower level becomes the current level.
In our sample scenario, the downwards propagation has an impact on both XML formats, as highlighted in Figure 1.7(c) by the grey arrow.

As we can see, the change made by the designer is first propagated to the upper levels until the PIM schema is reached. From there, the change

is propagated down to all the other affected XML formats. Therefore, the designer can make an initial change at any level of any XML format in the framework. The approach naturally preserves the semantic consistency between the XML formats. It facilitates the work of the designer by performing the impact analysis first and then by providing possible propagation scenarios. The designer only selects from the offered possibilities or provides his/her own scenarios when the offered ones are not sufficient or no scenario can be offered. The propagation according to the selected scenario is then automatic.

1.5 Comparative Analysis

Naturally, our approach is not the first that deals with evolution and change management of XML applications. However, none of the existing approaches focuses on such a complex situation as we do, i.e. a set of overlapping and related XML schemas which can, in addition, be extended at any time with a newly arriving schema. All the current approaches solve only a part of this problem.

Figure 1.8: Other existing approaches in scope of the framework (panels (a)-(d), spanning the platform-specific, logical, operational and extensional levels)

We can divide the current approaches into several groups, depicted in the context of our framework in Figure 1.8. Approaches in the first group (Figure 1.8(a)) consider changes at the logical level and differ in the selected

XML schema language, i.e. DTD [6, 17] or XML Schema [12, 14, 15, 47]. In general, the transformations can be variously classified. For instance, paper [47] distinguishes migratory (e.g., movements of elements/attributes), structural (e.g., adding/removal of elements/attributes) and sedentary (e.g., modifications of simple data types) changes. The changes are expressed variously and more or less formally. For instance, in [14] a language called XSUpdate is described. The changes are then automatically propagated to the extensional level to ensure the validity of XML data. There are also approaches that deal with incremental validation [12] and correction of XML documents. There also exists an opposite approach that enables one to evolve XML documents and propagate the changes to their XML schema [13].

Approaches in the second and third group (Figure 1.8(b) and 1.8(c)) are similar, but they consider changes at some abstraction of the logical level: either a visualization [26] or a kind of UML diagram [18]. Both work at the PSM level, since they directly model XML schemas with their abstraction. However, no PIM schema is considered, since all three groups of approaches consider only a single separate XML schema being evolved.

Another open problem related to schema evolution is the adaptation of the respective XML queries, i.e. propagation to the operational level. Unfortunately, the number of existing works is relatively low. Paper [32] gives recommendations on how to write queries that do not need to be adapted for an evolving schema. In [21] the authors consider a subset of XPath 1.0 for which they study the impact of changes in an XML schema. In the scope of our framework, these approaches are depicted in Figure 1.8(d).

In all the papers cited, the authors consider only a single XML schema. In [41] multiple local XML schemas are considered and mapped to a global object-oriented schema. Then, the authors discuss possible operations with a local schema and their propagation to the global schema.
However, the global schema does not represent a common problem domain, but a common integrated schema; the changes are propagated only upwards and the operations are not defined rigorously. The need for a well-defined set of simple operations and their combination is clearly identified in Section 6 of a recent survey of schema matching and mapping [10].

In the field of commercial tools, every tool profiles itself either as a UML tool [1,3,8,45,46] or an XML tool [2,9]. Some of the UML tools offer features for translating UML class diagrams to XML schemas [45], but the translation algorithms are straightforward and not applicable to the situation where a family of schemas describes the common model.

In the state-of-the-art XML tools [2, 9], the user can use various views, representations and visualizations for editing, querying and validating XML schemas and documents, but they do not target schema evolution. Tools for XML schema comparison are available as well [7]. They operate at the logical level, and resolving the ambiguity (required in all comparison approaches) is left up to the user. None of the XML tools offers mapping a family of schemas to a common model and complex evolution with propagation.

Commercial databases such as Oracle Database, IBM DB2 or Microsoft SQL Server, which support schema-aware storage of XML, also provide facilities for schema evolution and subsequent document adaptation. They expect a new version of the evolved schema and an adaptation script, but obtaining them and verifying their correctness is up to the user. This gap can be bridged by our approach, which helps the user to evolve the schema and generates the adaptation script automatically.

Several approaches to reverse-engineering XML schemas into a conceptual model were developed, some automatic [20,23,50], others semi-automatic [42, 43]. These approaches let the user create a UML diagram from one or more XML schemas, but when the process is finished, there is no link between the created model and the schema. The results can help the user to better understand the schema, but further work, such as adding additional schemas or evolving the existing ones, is not supported.

1.6 Thesis Structure and Author's Contributions

The work presented above exceeds the possibilities of a single person. It has been produced by our research group, which currently consists of 5 researchers (3 of them are Ph.D. students). On the other hand, the whole research topic is built on the idea of a common conceptual model specified on top of a set of conceptually related XML schemas. This idea is the result of the original research of the author of this thesis.
He then coordinated the work of his colleagues from the research group. This section delimits those parts of the work in the research group covered by this thesis and distinguishes the contributions of the author.

Chapter 1 presents the methodology which has been published in [36].

Chapter 2 presents results published in [38]. This paper laid the foundations for the work of the whole group and this thesis in particular. It introduced the conceptual model for XML schemas and its complete formal notation. Several important properties were also formally proved in the paper, including the proof that the expressive power of the conceptual model equals the expressive power of regular tree grammars. This means that XML schemas expressed in all current XML schema languages can be modeled with the introduced conceptual model. All other publications comprising this thesis reuse the results from this chapter. The author of this thesis is the main author. He produced most of the content of this chapter. The coauthors helped with experimental implementation and testing, related work and proof-reading.

Chapter 3 presents recent results published in [44] and further extends them with some additional content. This chapter extends the results from the previous chapter, which studied only the relation between the conceptual model and regular tree grammars. However, only one group of XML schema languages is based on regular tree grammars. This group includes languages such as XSD [48] or Relax NG [16]. Even though the majority of XML schemas is expressed in these languages in practice, another kind of XML schema languages is emerging, the so-called pattern-based languages (see [34] for more details). This paper shows that schemas expressed in our conceptual model can also be translated to this kind of XML schema languages, namely to Schematron [22]. The author of this thesis coauthored this work together with his Ph.D. student Jakub Klímek and a master student Soběslav Benda. The author of this thesis contributed the formal notation and the formulation of translation principles. The content presented in this chapter has been submitted to the Journal of Universal Computer Science and is currently (Jan 2014) in the major revision status.

Chapter 4 presents results published in [28]. While the previous two chapters provide a formal and technical background for the forward-engineering part of the methodology presented in Section 1.4.1, this chapter is related to the reverse-engineering part of the methodology presented in Section 1.4. It supposes that a conceptual schema at the PIM level as well as several schemas at the PSM level already exist. It shows how a newly arriving XML schema can be integrated into the framework. It is necessary to map the new XML schema to the PIM schema via its automatically created PSM schema. This mapping is created by a designer, and our semi-automatic method assists her with mapping suggestions. The author of this thesis is the main author of this publication. It was co-authored with his Ph.D. student Jakub Klímek, who contributed an experimental implementation.

Chapter 5 presents results published in [37]. It is the other (besides Chapter 2 [38]) core publication included in this thesis. It completes the formal and technical background for the evolution part of the methodology presented in Section 1.4. It discusses possible changes at the PIM and PSM levels and shows how changes can be propagated between both levels. Propagation to and from the logical level is not studied, since this level is one-to-one mapped to the PSM level (which is shown in Chapter 2). Propagation to and from the extensional level of XML documents is not covered by this thesis (it was covered by the research of another Ph.D. student of the author of this thesis, Jakub Malý). The author of this thesis is the main author. He produced the content related to the formal background and the proofs of soundness and completeness of the approach. The main author's contribution is the integration of novel kinds of evolution operations (so-called synchronization operations) into the evolution framework, which significantly increased the power of the change propagation mechanism. These parts form most of the content of the paper. The coauthoring Ph.D. students helped with experimental implementation and testing. The coauthor Irena Mlýnková created the related work section and cooperated on the formulation of the basic ideas of the 5-level evolution framework and on the formulation of the evolution operations.

Chapter 6 presents results published in [27]. It further extends the results from the previous chapters with evolution of PIM and PSM schemas in the presence of inheritance. The author of this thesis coauthored this work together with his Ph.D. student Jakub Klímek. The author of this thesis contributed the formal notation and the formulation of evolution principles in the presence of inheritance.

Chapter 7 presents results published in [30]. It further extends Chapter 2 with the possibility of modeling integrity constraints expressed in the Object Constraint Language (OCL) [5]. OCL is a language used to express integrity constraints on top of UML class diagrams and is therefore suitable for our techniques. The chapter shows how to express integrity constraints at the PIM and PSM levels with OCL. It also shows how to translate OCL expressions between both levels and between the PSM level and Schematron expressions. The author of this thesis coauthored this work together with his Ph.D. student Jakub Malý. The author of this thesis is the author of the presented translation mechanism between the PIM and PSM levels.

Chapter 8 provides an evaluation of the methodology presented in Section 1.4 in the form of case studies from two domains: the public procurement domain and the eHealth domain. The first case study was published in [35]. Moreover, it was further extended for the purposes of evaluating results published in other papers. Some of these extensions are present in the previous chapters of this thesis. The author of this thesis is the only author of the experimental results presented in this chapter and also in the other chapters. The experiments were performed using the tool exolutio, which implements the results presented in the papers included in this thesis. The tool was developed by Ph.D. students Jakub Klímek and Jakub Malý.

As can be seen from the summary above, this thesis is a compilation of existing publications of the author. Therefore, the reader will find that certain parts of the following chapters repeat. In particular, each chapter repeats definitions of the conceptual model for XML schemas. There can be slight differences between these definitions. This is because each paper deals with a different problem and needs to reveal different details of the model, while other details are not important. This thesis presents only the most important papers published by the author of the thesis over several years in the area of conceptual modeling for XML.
In total, the complete work of the author in this area consists of 9 journal articles with an impact factor and more than 30 papers published in international conference proceedings.


Chapter 2

When Conceptual Model Meets Grammar: A Dual Approach to XML Data Modeling

Martin Nečaský, Irena Mlýnková, Jakub Klímek, Jakub Malý

Published in the International Journal on Data & Knowledge Engineering, volume 72, pages Elsevier, February DOI /j.datak ISSN X. Impact Factor: Year Impact Factor:



Data & Knowledge Engineering 72 (2012) 1-30

When conceptual model meets grammar: A dual approach to XML data modeling

Martin Nečaský, Irena Mlýnková, Jakub Klímek, Jakub Malý
Department of Software Engineering, Faculty of Mathematics and Physics, Charles University, Malostranske nam. 25, Praha 1, Czech Republic
E-mail addresses: necasky@ksi.mff.cuni.cz (M. Nečaský), mlynkova@ksi.mff.cuni.cz (I. Mlýnková), klimek@ksi.mff.cuni.cz (J. Klímek), maly@ksi.mff.cuni.cz (J. Malý)

Article history: Received 8 December 2010. Received in revised form 7 September 2011. Accepted 9 September 2011. Available online 22 September 2011.

Keywords: XML schema; Conceptual modeling; Regular tree grammars; Conceptual to XML schema transformation

Abstract

In this paper we introduce a novel approach to conceptual modeling for XML schemas. Compared to other approaches, it allows for modeling of a whole family of XML schemas related to a particular application domain. It is integrated into a well-established software-engineering methodology, namely Model-Driven Development (MDD). It allows software engineers to naturally model their application domain using a conceptual schema at the platform-independent level of the MDD hierarchy. From there they can design the desired XML schemas in the form of conceptual schemas at the platform-specific level of the MDD hierarchy. Schemas at the platform-specific level are then automatically translated to particular XML schemas. Besides this forward-engineering direction, the reverse-engineering direction, which integrates existing XML schemas into the MDD hierarchy, is supported as well. We provide several theoretical results which ensure correctness of the introduced approach. We exploit regular tree grammars to formalize XML schemas. We formalize the bindings between the schemas at the two MDD levels and between schemas at the platform-specific level and XML schemas. We prove that conceptual schemas specify the target XML schemas unambiguously. We also prove the expressive power of the conceptual schemas. And, finally, we prove correctness of the introduced translation algorithms between the platform-specific and XML schema levels.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

The Extensible Markup Language (XML) [18] is currently one of the most popular meta-formats for data exchange on the Web. To enable data exchange, it is crucial to restrict the allowed structure of exchanged XML documents so that each communicating party is able to understand them. The structure is restricted by a set of rules called an XML schema. In this paper, we address the problem of correct and user-friendly design of XML schemas.

At first glance the problem may seem to be solved. There are grammar-based XML schema languages (DTD [18], XML Schema [37] or RELAX NG [24]) which enable expressing XML schemas. Formally, they are based on regular tree grammars as described in [61]. Furthermore, so-called constraint-based XML schema languages (e.g. Schematron [43]) enable specification of integrity constraints. And, if we are not satisfied with their textual notation, we can use tools for XML schema visualization, e.g. Altova XML Spy [8].

Research literature shows that XML schema languages are not very user-friendly and, hence, extends them with conceptual modeling techniques. The idea is similar to the one in the world of relational databases. A designer creates a conceptual schema of the problem domain. It is then automatically converted to an XML schema. Some works extend the UML class model [2] (e.g. [15,25,30,62,75,77]), the ER model [22] (e.g. [5,11,50,54,55,72]) or Object-Role Modeling (ORM) [38] (e.g. [56]); others use ontologies [86] (e.g. [13,85]). There are also papers which introduce their own models (e.g. [28,74]). And, naturally, commercial tools exist as well, e.g. Enterprise Architect [81]. However, we have identified several drawbacks of these approaches:

1. They require a separate conceptual schema for each single XML schema. In most cases, however, a single XML schema is not suitable for all components of a particular software system. It is necessary to design a whole family of XML schemas. In these cases, a designer needs to create many conceptual schemas not explicitly related to each other.
2. They introduce a set of rules which define an XML schema modeled by a conceptual schema. However, they do not prove that each conceptual schema always models an XML schema. They also do not prove that an XML schema is specified unambiguously. If this is ensured, we say that the conceptual model is defined correctly. Finally, they do not prove the expressive power of their introduced conceptual model.
3. They do not prove that their algorithm for translating a conceptual schema to an XML schema follows the introduced rules. When this is proved, we say that the translation algorithm is defined correctly. Moreover, they focus solely on a single translation direction, but they do not show whether it can be reversed. In that case, it is also necessary to show that the composition of both translation directions always leads to the original XML schema. When this is ensured, we say that both directions are mutually consistent.
4. They do not sufficiently consider how designers work when they create conceptual or XML schemas. A designer may introduce unreachable components into the schemas (i.e. components which cannot be instantiated in XML documents). Some parts may also be unnecessarily complex. A technique which prevents these situations would make the schemas simpler and more readable.
5. They do not sufficiently study the problem of designing integrity constraints for XML.

To overcome the first of the mentioned drawbacks, we have already introduced a conceptual model for XML called XSEM [64,65,70]. It follows the principles of Model-Driven Development (MDD) [57]. MDD is a well-established software-engineering methodology and, therefore, most software engineers are familiar with it. First, a designer creates a conceptual schema in a platform-independent model (called a PIM schema). Second, (s)he designs a schema in a platform-specific model (called a PSM schema) for each desired XML schema on the basis of the PIM schema. It specifies how a selected part of the PIM schema is represented in the target XML schema, and it is automatically converted to an expression in a selected XML schema language.

In this work, we extend our approach with a complete formal model. On its basis we focus on the second, third and fourth mentioned drawbacks. (Note that we do not focus on the last one. We leave it for our future work.) We call our approach a dual approach, because it formally interconnects the conceptual and schema levels by applying the MDD principles. This provides a possibility to switch from one level to the other in both directions. As a consequence, it also naturally allows us to apply the well-known MDD methodologies called forward-engineering and reverse-engineering in the process of designing XML schemas. The results presented in this paper are fully implemented in a new tool exolutio.

Particular contributions of this work are outlined in Fig. 1. It shows the two MDD levels: platform-independent and platform-specific. The former one contains a single PIM schema. The latter one contains a PSM schema for each desired XML schema. They represent the conceptual level. The figure also depicts the logical level.

Fig. 1. Schematic visualization of paper contributions.
It consists of a schema level with XML schemas and an extensional level representing XML documents valid against the XML schemas. It is useful to express an XML schema in a machine-interpretable notation (e.g. for automatic validation of XML documents). There exist various languages for this purpose, as we discussed at the beginning. In this paper, however, we work with the formalism of regular tree grammars [61] instead of a particular XML schema language.

The basis of our solution is a formal model of relationships between PIM, PSM, regular tree grammars and XML documents. They are called interpretations and are depicted in the figure as solid bottom-up arrows. An interpretation is a mapping from a lower level to an upper level. For example, a PSM schema has an interpretation against the PIM schema, and an XML document has an interpretation against a regular tree grammar. Interpretations and regular tree grammars allow us to study and prove the described properties formally.

We have already published parts of the formalism in a conference paper [67]. This includes regular tree grammars, definitions of PIM and PSM schemas, and interpretations between PIM and PSM schemas and between PSM schemas and regular tree grammars. In this paper, we further extend these results. We present an extended and revised version of the definitions of PIM and PSM schemas and we describe the expressive power of our PSM schemas in much more detail. We further extend the formalism of interpretations. We introduce a full algorithm for translating PSM schemas to regular tree grammars and vice versa (depicted in Fig. 1 with a dashed arrow connecting the PSM schema 1 and the regular tree grammar RTG 1). The main theoretical result of this paper is the proof that the conceptual model and the translation algorithms in both directions are defined correctly.
The former one is ensured by Theorem 5.1 which says that any two regular tree grammars with an interpretation against the same PSM schema have the same set of valid XML documents (this is depicted in the figure by the regular tree grammar RTG X). The latter one is ensured by Theorems 6.2 and 6.3 which say that the introduced translation algorithms always create an interpretation between a PSM schema and a regular tree grammar. Finally, we introduce techniques for removing unreachable parts of PSM schemas (normalization) and for replacing too complex parts of PSM schemas with simpler ones (optimization). Outline. This paper is organized as follows: Section 2 provides a real-world example for our work. Section 3 introduces a basic formalism used throughout the paper including an XML data model and regular tree grammars. Section 4 introduces our conceptual model for XML. Section 5 formalizes bindings between PIM and PSM schemas and PSM schemas and regular tree grammars. It also proves that the conceptual model is defined correctly. Section 6 introduces algorithms for translating conceptual schemas to regular tree grammars and conversely regular tree grammars to conceptual schemas. It contains a proof that the algorithms are defined correctly. It also discusses the problems of normalization and optimization of PSM schemas. Section 7 briefly presents our experimental implementation of the proposed models and algorithms. Section 8 provides a more detailed study of related approaches and Section 9 discusses related open issues. Section 10 summarizes the work and concludes. 2. Motivation In this section, we present a real-world example of a portal of the Czech government the National Register for Public Procurement (NRPP) which demonstrates the applicability of the presented results. The NRPP receives information on public contracts encoded in XML from various authorities. It defines several XML schemas. We demonstrate two of them in Fig. 2. 
The XML document on the left contains a call for contract suppliers. The XML document on the right contains an announcement about a selected supplier for the contract. As can be seen, both XML schemas have some shared parts, some of which are equivalent semantically as well as structurally. For example, information about the public contracting authority is represented in both XML schemas with child elements of cont. They also share the contract title (element title). Some parts, however, are equivalent only semantically. For example, information about the estimated price is represented in both: on the left, element price_total is used, while on the right, there is element estim_price.

Fig. 2. Sample XML document with public contract announcement.

This brings several problems. The description of the public procurement domain is disseminated across various XML schemas. Therefore, it is hard to understand the domain only from the XML schemas. It is also hard to decide whether the description is complete and consistent. For example, we would like to check whether the estimated price of a public contract is represented with the same level of granularity and with the same elements. On the other hand, if a concept or relationship is not represented in any XML schema, it is not designed anywhere and may be forgotten. For example, both depicted XML schemas do not represent a contact person for a public contract. However, this entity is an important part of the application domain and should be modeled. Our introduced approach helps to solve these problems. It extends the XML schemas with a common conceptual schema and all XML schemas are bound to the conceptual schema.

3. Preliminary notation and definitions

In this section, we will introduce a common notation used in the rest of the paper. First, we present symbols denoting frequently used sets. Then we introduce the XML data model and the formalism of regular tree grammars. Finally, we discuss the relationship between regular tree grammars and current XML schema languages.

Common notation

We use $\mathcal{L}$ to denote an infinite set of string labels. $\mathcal{D}$ denotes a finite set of data types. A data type $D \in \mathcal{D}$ is a possibly infinite set of data values. $\mathcal{D}^*$ denotes the union of all considered data types ($\mathcal{D}^* = \bigcup_{D \in \mathcal{D}} D$). This paper is independent of a concrete data type system (i.e. set of all considered data types) and we, therefore, do not introduce specific data types. $\mathcal{C} = \{m..n : m \in \mathbb{N}_0 \wedge n \in (\mathbb{N}_0 \cup \{*\}) \wedge (m \le n \vee n = *)\}$ denotes an infinite set of cardinality constraints, where $\mathbb{N}_0$ denotes the set of natural numbers including 0. And, $2^X$ and $2^{(X)}$ denote the set of all subsets and ordered subsequences of a set $X$, respectively.
Last but not least we need to formally define the notion of regular expressions.

Definition 3.1. A regular expression $re$ over a set $N$ is an expression generated by the following rule $RE$:

$RE ::= Z \mid RE^{m..n} \mid (RE) \mid RE, RE \mid RE|RE \mid \{RE\}$

where $Z \in N$ and $m..n$ stands for a cardinality from $\mathcal{C}$. A regular expression of the form $RE, RE$, $RE|RE$ or $\{RE\}$ is called a sequence, choice or content model, respectively.

XML data model

We represent XML documents as XML trees. There are two approaches to defining XML trees in the current literature. The graph-theory approach handles an XML document as a graph with nodes representing XML elements and attributes (and possibly text nodes and other kinds of XML constructs) and edges representing the hierarchical structure of the XML document. The grammar-theory approach handles an XML document as a string expression with a tree structure. The required structure of the string expression is described by a set of grammatical rules which generate the expression. In this paper, we will use the grammar-theory approach.

Definition 3.2. An XML tree $\tau$ is an expression with structure described by the following rule $T$:

$T ::= l[\{F_a\}, (F_e)] \mid l[v]$
$F_a ::= F_a, F_a \mid @l[v] \mid ()$
$F_e ::= F_e, F_e \mid T \mid ()$

where $()$ is an empty expression, $v \in \mathcal{D}^*$ and $l \in \mathcal{L}$. $@l[v]$ is called an XML attribute with a name $l$ and value $v$; $l[\{a_1, \ldots, a_m\}, (e_1, \ldots, e_n)]$ is called an XML element with a complex content with name $l$, XML attributes $a_1, \ldots, a_m$ with distinct names and child XML elements $e_1, \ldots, e_n$; $l[v]$ is called an XML element with a simple content with name $l$ and simple content $v$.

Example 3.1. A sample XML tree is depicted in Fig. 3(b). Fig. 3(a) shows the corresponding XML document.

Definition 3.2 unifies the notions of XML tree and XML element: an XML tree $\tau$ is in fact an expression which is an XML element at the same time. This XML element is called the root XML element of $\tau$. Conversely, each XML element which is a subexpression of $\tau$ is also an XML tree on its own.
We will call it XML sub-tree of τ. Note that Definition 3.2 does not support elements with a mixed content. An element with a mixed content contains not only child elements but also values from D* arbitrarily mixed. Moreover, an element having a simple content and attributes at once is not allowed. This is restrictive but not fundamental for our work.
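The three forms of Definition 3.2 can be mirrored directly as data types. The following is only a minimal sketch under our own assumptions: the class names `Attribute`, `SimpleElement` and `ComplexElement`, the helper `subtrees` and the sample element names are hypothetical and are not taken from the paper or from its tool. Values are kept as plain strings, and mixed content is not representable, in line with the restrictions stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    """XML attribute @l[v]: a name l and a value v."""
    name: str
    value: str


@dataclass(frozen=True)
class SimpleElement:
    """XML element with a simple content, l[v]."""
    name: str
    value: str


@dataclass(frozen=True)
class ComplexElement:
    """XML element with a complex content, l[{a_1..a_m}, (e_1..e_n)]."""
    name: str
    attributes: tuple = ()   # unordered in the definition; a tuple is used for simplicity
    children: tuple = ()     # ordered child XML elements

    def __post_init__(self):
        # Definition 3.2 requires the attributes of one element to have distinct names.
        names = [a.name for a in self.attributes]
        if len(names) != len(set(names)):
            raise ValueError("attribute names must be distinct")


def subtrees(tree):
    """Yield the tree itself and every XML element nested in it (its XML sub-trees)."""
    yield tree
    if isinstance(tree, ComplexElement):
        for child in tree.children:
            yield from subtrees(child)


# Hypothetical sample data, not the tree of Fig. 3.
sample = ComplexElement(
    "cust",
    attributes=(Attribute("code", "c1"),),
    children=(
        SimpleElement("name", "Smith"),
        ComplexElement("addr", children=(SimpleElement("city", "Prague"),)),
    ),
)
print([t.name for t in subtrees(sample)])
```

Because an XML tree and its root XML element are the same object in this encoding, every value yielded by `subtrees` can itself be passed wherever a tree is expected, which is the unification the definition describes.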

Fig. 3. Sample XML document (a) and its XML tree (b).

Regular tree grammars

As we have already mentioned, XML designers may use various XML schema languages to express their XML schemas. However, the key aim of this paper is to describe the expressive power and other properties of the conceptual model of our approach from the formal point of view. Therefore, we need to formalize the XML schema languages to abstract from technical and syntactical details. For this purpose we exploit the acknowledged formalization using regular tree grammars [61], which is usual in research literature [78,84]. Naturally, the formalism is not intended for practical usage; in such a case the XML schema languages should be used.

Definition 3.3. A regular tree grammar is a 4-tuple $G = (N, T, S, P)$, where:
- $N$ is a finite set of non-terminals.
- $T \subseteq \mathcal{L}$ is a finite set of terminals.
- $P$ is a set of production rules of one of the following forms:
  - $Z \to @t[D]$, where $Z \in N$, $t \in T$ and $D \in \mathcal{D}$. $Z$ is called an XML attribute declaration.
  - $Z \to t[D]$, where $Z \in N$, $t \in T$ and $D \in \mathcal{D}$. $Z$ is called an XML element declaration with a simple content.
  - $Z \to t[re]$, where $Z \in N$, $t \in T$, and $re$ is a regular expression. $Z$ is called an XML element declaration with a complex content.
  For each production rule, the part on the right or left of $\to$ is called the right-hand side or left-hand side of the production rule, respectively. For each non-terminal $Z$ there is one and only one production rule with $Z$ on the left-hand side. $N_a \subseteq N$ denotes the set of all XML attribute declarations and $N_e \subseteq N$ denotes the set of all XML element declarations (with a simple or complex content) in $N$.
- $S \subseteq N_e$ is a set of XML element declarations called initial non-terminals.

Definition 3.3 is a slight extension of the original one [61]. First, we distinguish XML element and XML attribute declarations.
Second, we distinguish multiple data types for contents of XML elements and attributes instead of a single generic data type PCDATA borrowed from DTD.

Example 3.2. A sample regular tree grammar is depicted in Fig. 4. It describes the structure of the XML tree from Fig. 3. For example, non-terminal Pur is an XML element declaration with a complex content, non-terminal City is an XML element declaration with a simple content, and non-terminal Code1 is an XML attribute declaration.

Fig. 4. Sample regular tree grammar.

In Definition 3.3, we used the notion of regular expression (Definition 3.1). Note that the introduced language for regular expressions over a set $N$ is still not fully sufficient. It does not allow for describing XML elements with a mixed content and XML elements which have both simple content and attributes. This is consistent with the restrictions of Definition 3.2.

Classification. Now, we can define classes of grammars that correspond to particular XML schema languages. We recall definitions of these classes from [61]. In particular, there are local tree grammars that correspond to DTD and single-type tree grammars that correspond to XML Schema. RELAX NG corresponds to general regular tree grammars. [61] shows that local tree grammars are weaker than single-type tree grammars which are, in turn, weaker than regular tree grammars. First, we define the necessary notion of competing non-terminals.

Definition 3.4. Let $G = (N, T, S, P)$ be a regular tree grammar. Two non-terminals $Z_1, Z_2 \in N$ are competing with each other if they are two different XML element declarations or two different XML attribute declarations whose production rules have the same terminal on the right-hand side.

Definition 3.5. A local tree grammar is a regular tree grammar without competing non-terminals. A tree language is a local tree language if it is generated by a local tree grammar.

Definition 3.6. A single-type tree grammar is a regular tree grammar such that for each production rule, non-terminals in its content model do not compete with each other, and start symbols do not compete with each other. A tree language is a single-type tree language if it is generated by a single-type tree grammar.

Having introduced the syntax of regular tree grammars, we now focus on their semantics. In other words, we focus on the language (i.e. the set of XML trees) generated by a regular tree grammar. First, we describe the semantics of regular expressions. In the theory of regular expressions, a regular expression $re$ specifies a language $L(re)$ of words over an alphabet of the regular expression. In our case the alphabet is the set $N$ of non-terminals of a given regular tree grammar. A word belongs to $L(re)$ if it matches $re$.

Definition 3.7. Let $G = (N, T, S, P)$ be a regular tree grammar, $re$ a regular expression, and $\overline{Z}$ a word over the alphabet $N$ (i.e. $\overline{Z}$ is an ordered sequence of non-terminals from $N$).
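We say that the word $\overline{Z}$ (written with a bar to distinguish it from a single non-terminal $Z$) matches $re$ ($\overline{Z} \in L(re)$) iff
- $re = Z \in N \wedge \overline{Z} = Z$,
- $re = re'^{\,m..n} \wedge \overline{Z} = \overline{Z}_1 \cdots \overline{Z}_k \wedge m \le k \le n \wedge (\forall\, 1 \le i \le k)(\overline{Z}_i \in L(re'))$,
- $re = (re') \wedge \overline{Z} \in L(re')$,
- $re = re_1, re_2 \wedge \overline{Z} = \overline{Z}_1 \overline{Z}_2 \wedge \overline{Z}_1 \in L(re_1) \wedge \overline{Z}_2 \in L(re_2)$,
- $re = re_1 | re_2 \wedge (\overline{Z} \in L(re_1) \vee \overline{Z} \in L(re_2))$, or
- $re = \{re'\} \wedge (\exists$ permutation $\overline{Z}'$ of members of $\overline{Z})(\overline{Z}' \in L(re'))$.

A regular expression which is a non-terminal $Z$ matches only a singleton sequence containing $Z$ itself. Parentheses ( and ) are used to specify operator priority. An expression $re^{m..n}$ specifies that sequences matching $re$ may be repeated from $m$ to $n$ times. An expression $re_1, re_2$ specifies that a sequence matching the expression starts with a sequence matching $re_1$ followed by a sequence matching $re_2$. A sequence matching $re_1 | re_2$ must match $re_1$ or $re_2$. And, finally, a sequence matching $\{re'\}$ is each sequence which may be reordered so that it matches $re'$.

Example 3.3. Let us consider our sample regular tree grammar from Fig. 4. Suppose an ordered sequence of non-terminals $\overline{Z}$ = Login Nm Phone Phone and a regular expression $re$ which is the sequence of Login, Nm and a content model whose members are $Phone^{1..*}$ and a second member with cardinality $0..1$. We show that $\overline{Z}$ matches $re$. The sequence Login Nm matches the regular expression Login, Nm. The sequence Phone Phone matches the content model, because its reordering Phone Phone matches the sequence of the two members of the content model. This is because Phone Phone matches $Phone^{1..*}$ and the empty sequence matches the member with cardinality $0..1$.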
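The cases of Definition 3.7 translate almost one-to-one into a recursive matcher. The sketch below is an illustration under our own assumptions, not the algorithm of the paper: regular expressions are encoded as nested tuples (`"seq"`, `"alt"`, `"rep"`, `"set"`), an upper bound of `None` stands for `*`, and the non-terminal name `Opt` is a hypothetical placeholder for an optional member. The search is deliberately naive (it tries all splits and permutations), so it is suitable only for small examples.

```python
from itertools import permutations


def matches(word, re):
    """Return True iff the word (a sequence of non-terminal names) is in L(re)."""
    word = tuple(word)
    if isinstance(re, str):                  # re = Z: only the singleton word Z matches
        return word == (re,)
    kind = re[0]
    if kind == "seq":                        # re = re1, re2: split the word in two
        _, left, right = re
        return any(matches(word[:i], left) and matches(word[i:], right)
                   for i in range(len(word) + 1))
    if kind == "alt":                        # re = re1 | re2
        _, left, right = re
        return matches(word, left) or matches(word, right)
    if kind == "rep":                        # re = re'^{m..n}
        _, inner, lo, hi = re
        return _matches_rep(word, inner, lo, hi)
    if kind == "set":                        # re = {re'}: some reordering matches re'
        _, inner = re
        return any(matches(p, inner) for p in set(permutations(word)))
    raise ValueError(f"unknown regular expression node: {kind!r}")


def _matches_rep(word, inner, lo, hi):
    """Check that word splits into k pieces, lo <= k <= hi, each matching inner."""
    if not word:
        # Zero pieces suffice when lo == 0; otherwise the missing pieces must be empty.
        return lo == 0 or matches((), inner)
    if hi == 0:                              # no repetitions left for a non-empty rest
        return False
    rest_hi = None if hi is None else hi - 1
    # Peel off a non-empty first piece and recurse on the remainder.
    return any(matches(word[:i], inner)
               and _matches_rep(word[i:], inner, max(lo - 1, 0), rest_hi)
               for i in range(1, len(word) + 1))


# Login, Nm, {Phone^{1..*}, Opt^{0..1}} with the hypothetical optional member Opt.
RE = ("seq", "Login",
      ("seq", "Nm",
       ("set", ("seq", ("rep", "Phone", 1, None), ("rep", "Opt", 0, 1)))))

print(matches(["Login", "Nm", "Phone", "Phone"], RE))
print(matches(["Login", "Nm", "Opt", "Phone"], RE))
print(matches(["Login", "Nm"], RE))
```

The second call illustrates the content model case: the word lists the optional member before Phone, and it is accepted only because some permutation of that part matches the inner sequence. The third call is rejected because the lower bound 1 of the repetition cannot be met by an empty word.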

We may now proceed to the semantics of regular tree grammars. A regular tree grammar $G$ specifies a language $L(G)$ whose words are XML trees satisfying the restrictions given by $G$. In [61], the authors introduced the notion of interpretation of an XML tree against $G$. It is a mapping of XML elements and attributes of the XML tree to the non-terminals of $G$. If such a mapping exists, the XML tree belongs to $L(G)$. We extend this definition in Definition 3.8.

Definition 3.8. An interpretation $I_{rtg}$ of an XML tree $\tau$ against a regular tree grammar $G = (N, T, S, P)$ is a function from each XML element and attribute $x$ of $\tau$ to a non-terminal from $N$, denoted $I_{rtg}(x)$, such that:
- If $x$ is the root element, then $I_{rtg}(x) \in S$.
- If $x = l[v]$, then $(I_{rtg}(x) \to l[D]) \in P$ s.t. $D \in \mathcal{D} \wedge v \in D$.
- If $x = @l[v]$, then $(I_{rtg}(x) \to @l[D]) \in P$ s.t. $D \in \mathcal{D} \wedge v \in D$.
- If $x = l[\{x_1, \ldots, x_{m_1}\}, (x_{m_1+1}, \ldots, x_{m_1+m_2})]$, then $(I_{rtg}(x) \to l[re]) \in P$ s.t. $I_{rtg}(x_{\pi(1)}) \cdots I_{rtg}(x_{\pi(m_1+m_2)}) \in L(re)$, where $\pi$ is a permutation which preserves the order of child XML elements in the content of $x$.
We will call $I_{rtg}(x)$ the interpretation of $x$ against $G$. We will say that $x$ is valid against $I_{rtg}(x)$.

The first rule requires that the interpretation of a root XML element is an initial non-terminal. The second rule requires that the interpretation of an XML element with a simple content is an XML element declaration with a simple content. The corresponding production rule must have a terminal equal to the XML element name and its data type must contain the data value of the XML element. The third rule is similar to the second one but restricts interpretations of XML attributes. And, finally, the fourth rule requires that the interpretation of an XML element with a complex content is an XML element declaration with a complex content. The corresponding production rule must have a terminal equal to the XML element name.
Moreover, it must be possible to construct a sequence composed of interpretations of child XML elements of the XML element (in the original order) mixed with interpretations of its XML attributes (in any order) which matches the regular expression of the production rule.

Example 3.4. Our sample XML tree depicted in Fig. 3(b) has an interpretation $I_{rtg}$ against the sample regular tree grammar depicted in Fig. 4. $I_{rtg}$ is intuitive and we do not list it here.

Finally, we borrow the definition of validity of an XML tree against a regular tree grammar from [61].

Definition 3.9. Let $G = (N, T, S, P)$ be a regular tree grammar. The XML language generated by $G$ is the set $L(G) = \{\tau : \tau$ is an XML tree with an interpretation against $G\}$. We say that an XML tree $\tau$ is valid against a regular tree grammar $G$ iff $\tau \in L(G)$.

4. Conceptual model for XML

A regular tree grammar describes an XML schema at the syntactical level. In this section, we show how it can be described at the conceptual level using a conceptual schema. As mentioned before, we follow the MDD principle which is based on modeling the application domain at several levels of abstraction. The most abstract level we adopt in this paper is the platform-independent level. It contains the conceptual schema of the problem domain. The language applied to express the conceptual schema is called platform-independent model (PIM) and the conceptual schema is then called schema in the platform-independent model (PIM schema). We introduce PIM in Section 4.1. The level below is the platform-specific level which specifies how the whole or part of the PIM schema is represented in a particular platform. In our case, the platform is the formalism of XML trees introduced in Section 3. The language applied at this level is called platform-specific model (PSM) and a schema expressed in this language is called schema in the platform-specific model (PSM schema).
We introduce PSM in Section 4.2.

4.1. Platform-independent model (PIM)

There are two possible approaches to expressing a PIM schema: ontological and software-engineering. The ontological approach uses an ontological language such as OWL [86] and allows for describing the complete semantics of the application domain in a machine-processable way. On the other hand, the software-engineering approach uses a software-engineering structural modeling language such as the Unified Modeling Language (UML) [2,3], the Entity-Relationship Model (ERM) [22] or Object-Role Modeling (ORM) [38]. The expressiveness of the resulting conceptual schema is generally lower in the case of the software-engineering approach in comparison to the ontological one. Therefore, a knowledge engineer would probably use the ontological approach to capture all the semantic details of the problem domain. On the other hand, a software or data engineer would prefer the software-engineering approach since OWL may be unnecessarily complex. Moreover, UML, ERM and ORM are currently widely supported by software-engineering CASE tools, which is not the case for OWL.

In this work, we follow the software-engineering approach. In particular, we employ the language of UML class diagrams. The reason for this is its widespread usage by software engineers and its support in existing tools. Let us note that there is also the Object Constraint Language (OCL) [1]. OCL extends UML with possibilities for expressing general integrity constraints. Therefore, it significantly increases the expressive power of UML. However, due to a lack of space we will not concern ourselves with OCL for purposes of conceptual modeling for XML in this paper. Still, OCL is worthy of study and we mention possible directions in Section 9.

We do not consider all kinds of constructs offered by UML class diagrams, but only those concerned with classes, class attributes and binary associations. We do not restrict classes connected by an association, so it is possible to model recursive binary associations. Classes, attributes and associations have several properties which are listed in the definition. We do not consider class operations as they are not important for data modeling. Also, for simplicity, we do not consider special kinds of associations such as aggregation or composition. We omit inheritance and sub-typing of primitive data types (e.g. sub-typing a general data type String to a more specific data type Email) in this paper. This is an important and non-trivial open problem which we are currently working on and it deserves a separate paper, especially in our environment of multi-level modeling.

Since two different associations can connect the same pair of classes and the name of an association is optional, we introduce the notion of association end to distinguish between them. Therefore, we formally define an association as a set of two association ends (and, consequently, associations in our PIM are unordered). Each association end is then related to a class, so an association connects its two classes via its two association ends. Two associations connecting the same classes are distinguishable because they are formed by different association ends.

Definition 4.1. A schema in the platform-independent model (PIM schema) is a 9-tuple $S = (S_c, S_a, S_r, S_e, name, type, class, participant, card)$, where:
• $S_c$ and $S_a$ are sets of classes and attributes, respectively.
• $S_r$ is a set of binary associations. $S_e$ is a set of association ends. A binary association is a set $R = \{E_1, E_2\}$, where $E_1, E_2 \in S_e$. It must hold that $(\forall E \in S_e)(\exists! R \in S_r)(E \in R)$. In other words, every association end is a member of exactly one association.
• $name : S_c \cup S_a \to \mathcal{L}$ resp. $name : S_r \to \mathcal{L} \cup \{\lambda\}$ assigns a name to each class, attribute and association. $name(R) = \lambda$ means that $R \in S_r$ does not have a name.
• $type : S_a \to \mathcal{D}$ assigns a data type to each attribute.
• $class : S_a \to S_c$ assigns a class to each attribute. For $A \in S_a$, we will say that $A$ is an attribute of $class(A)$.
• $participant : S_e \to S_c$ assigns a class to each association end. For $R = \{E_1, E_2\} \in S_r$, we will say that $participant(E_1)$ and $participant(E_2)$ are participants of $R$ or that they are connected by $R$.
• $card : (S_a \cup S_e) \to \mathcal{C}$ assigns a cardinality to each attribute and association end.
The members of $S_c$, $S_a$, and $S_r$ are called components of $S$.

A class is displayed as a box with its name at the top and attributes at the bottom. An attribute is displayed as a pair comprising the attribute name and cardinality. The data type is omitted to make the diagram easy to read. An association is displayed as a line connecting the participating classes, with the association name (optional by the definition) and cardinalities. If the cardinality is 1..1, it is not displayed (for attributes as well as associations).

Example 4.1. A sample PIM schema is depicted in Fig. 5. It models the domain of purchasing and supplying goods. For example, there is a class $C_{Customer}$ which models customers. A customer has a login, name, phone and one or more emails. These characteristics are modeled by the attributes $A_{login}$, $A_{name1}$, $A_{phone1}$ and $A_{email1}$, respectively. The former three have cardinalities 1..1. The latter one has a cardinality 1..*. Similarly, a class $C_{Purchase}$ models purchase orders. Each purchase is characterized by its code, creation date and status modeled by the attributes $A_{code1}$, $A_{date1}$, and $A_{status}$, respectively (we use numbers to distinguish attributes in different classes having the same name, as in the case of $A_{name1}$ and $A_{name2}$ belonging to classes $C_{Customer}$ and $C_{Supplier}$, respectively).
An association $R_{ordered}$ models a relationship between a customer and a purchase which says that the customer made the purchase. The cardinalities of its ends specify that a particular customer has made one or more purchase orders. Conversely, a given purchase order was made by one and only one customer.

[Fig. 5. Sample PIM schema. UML class diagram with the classes Address, Purchase, Customer, Supplier, Item, Product and Supply, connected by the associations bill-to, ship-to, ordered, contains, provides and delivers.]
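Definition 4.1 can be sketched as a plain data structure together with a check of its well-formedness rules. This is a minimal sketch under stated assumptions rather than part of the original formalism: the class `PIMSchema`, its field names and the method `problems` are hypothetical, `None` plays the role of $\lambda$ and of the upper bound `*`, and the assignment of the two ends of the ordered association to Customer and Purchase is assumed from the textual description of the cardinalities.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Card = Tuple[int, Optional[int]]  # (lower, upper); upper None stands for *


@dataclass
class PIMSchema:
    classes: Set[str] = field(default_factory=set)            # S_c
    attributes: Set[str] = field(default_factory=set)         # S_a
    ends: Set[str] = field(default_factory=set)               # S_e
    # S_r: association id -> set of its two association ends
    associations: Dict[str, FrozenSet[str]] = field(default_factory=dict)
    name: Dict[str, Optional[str]] = field(default_factory=dict)
    type_of: Dict[str, str] = field(default_factory=dict)     # type
    class_of: Dict[str, str] = field(default_factory=dict)    # class
    participant: Dict[str, str] = field(default_factory=dict)
    card: Dict[str, Card] = field(default_factory=dict)

    def problems(self) -> List[str]:
        """List violations of the structural rules of Definition 4.1."""
        found = []
        for r, pair in self.associations.items():
            if len(pair) != 2 or not pair <= self.ends:
                found.append(f"{r} is not a set of two association ends")
        for e in sorted(self.ends):
            owners = [r for r, pair in self.associations.items() if e in pair]
            if len(owners) != 1:  # every end is in exactly one association
                found.append(f"{e} belongs to {len(owners)} associations")
            if self.participant.get(e) not in self.classes:
                found.append(f"{e} has no participant class")
        for a in sorted(self.attributes):
            if self.class_of.get(a) not in self.classes:
                found.append(f"{a} has no class")
        return found


# Fragment of the sample PIM schema: Customer --ordered-- Purchase.
pim = PIMSchema(
    classes={"Customer", "Purchase"},
    attributes={"login", "email1"},
    ends={"ordered_1", "ordered_2"},
    associations={"ordered": frozenset({"ordered_1", "ordered_2"})},
    name={"ordered": "ordered", "login": "login", "email1": "email"},
    type_of={"login": "string", "email1": "string"},
    class_of={"login": "Customer", "email1": "Customer"},
    participant={"ordered_1": "Customer", "ordered_2": "Purchase"},
    card={"login": (1, 1), "email1": (1, None),
          "ordered_1": (1, 1), "ordered_2": (1, None)},
)
print(pim.problems())
```

Because an association is stored as a set of ends rather than a pair of classes, two associations between the same classes (such as bill-to and ship-to) remain distinguishable, which is the point of introducing association ends.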

According to Definition 4.1, the components of our sample PIM schema are as follows:

$S_c = \{C_{Address}, C_{Purchase}, C_{Customer}, C_{Supplier}, C_{Item}, C_{Product}, C_{Supply}\}$
$S_a = \{A_{street}, A_{city}, A_{country}, A_{gps}, A_{code1}, A_{date1}, A_{status}, A_{login}, A_{name1}, A_{phone1}, A_{email1}, A_{name2}, A_{phone2}, A_{email2}, A_{amount1}, A_{price1}, A_{tester}, A_{code2}, A_{price2}, A_{color}, A_{date2}, A_{price2}, A_{amount2}\}$
$S_r = \{R_{bill\text{-}to}, R_{ship\text{-}to}, R_{ordered}, R_{contains}, R_1, R_{delivers}, R_{provides}\}$

Formally, each class is given a name by the function $name$, e.g. $name(C_{Purchase}) = Purchase$, $name(C_{Supplier}) = Supplier$. Each attribute is associated with its class by the function $class$, e.g. $class(A_{login}) = class(A_{name1}) = class(A_{phone1}) = class(A_{email1}) = C_{Customer}$. Moreover, it is given a name, data type, and cardinality by the functions $name$, $type$, and $card$, respectively, e.g. $name(A_{email1}) = email$, $type(A_{email1}) = string$, $card(A_{email1}) = 1..*$. Each association is a set of two association ends, e.g. $R_{bill\text{-}to} = \{E^1_{bill\text{-}to}, E^2_{bill\text{-}to}\}$, $R_{ship\text{-}to} = \{E^1_{ship\text{-}to}, E^2_{ship\text{-}to}\}$, $R_{ordered} = \{E^1_{ordered}, E^2_{ordered}\}$. An association is given an optional name by the function $name$, e.g. $name(R_{bill\text{-}to}) = bill\text{-}to$, $name(R_{ship\text{-}to}) = ship\text{-}to$, $name(R_1) = \lambda$, where $\lambda$ means that the name has not been given. Each association end is associated with a class by the function $participant$, e.g. $participant(E^1_{bill\text{-}to}) = C_{Address}$, $participant(E^2_{bill\text{-}to}) = C_{Purchase}$, $participant(E^1_{ship\text{-}to}) = C_{Address}$, $participant(E^2_{ship\text{-}to}) = C_{Purchase}$. In other words, both $R_{bill\text{-}to}$ and $R_{ship\text{-}to}$ have participants $C_{Address}$ and $C_{Purchase}$. Next, each association end is given a cardinality by the function $card$, e.g. $card(E^1_{ordered}) = 1..1$, $card(E^2_{ordered}) = 1..*$.

4.2. Platform-specific model (PSM)

In the previous text, we showed that data may be modeled from two different points of view. Section 3.3 showed how XML schemas may be modeled at the syntactical level.
Section 4.1 showed how the data represented by the XML schemas may be modeled at the conceptual level by a PIM schema independently of the XML schemas. In this section, we introduce PSM schemas which serve as the glue between both levels. A PSM schema is a UML class diagram which models a particular XML schema. In other words, it specifies what XML elements and XML attributes, which form the XML schema, are used to represent a part of the PIM schema. In general, the key constructs of PSM schemas have the following characteristics:
• An attribute models an XML attribute declaration or an XML element declaration with a simple content. This depends on a new characteristic of attributes called XML form. The attribute name, type and cardinality specify the XML attribute or XML element name, type and cardinality, respectively.
• A class with a named parent association models an XML element declaration with a complex content. The association name specifies the XML element name. The cardinality of the class in the association specifies the XML element cardinality. The regular expression of the XML element declaration is a sequence comprising XML attribute and XML element declarations modeled by the class attributes and the associations going from it.
• A class with a non-named parent association models only a sequence of XML attribute and XML element declarations modeled by the class attributes and the associations going from it. The sequence becomes a subexpression of the regular expression modeled by the parent.

We, moreover, introduce a new construct called content model. It allows for modeling a particular sequence, choice or set content model in regular expressions. There is also another new construct called structural representative. It enables one to reuse a regular expression already modeled by another class. We distinguish a specific root class called schema class. It says that its child classes model initial non-terminals (i.e. root XML elements).

Definition 4.2.
A schema in a platform-specific model (PSM schema) is a 15-tuple $S' = (S'_c, S'_a, S'_r, S'_e, S'_m, C'_S, name', type', class', xform', participant', card', cmtype', pos', repr')$, where
• $S'_c$, $S'_a$, and $S'_m$ are sets of classes, attributes, and content models, respectively.
• $S'_r$ is a set of directed binary associations. $S'_e$ is a set of association ends. A directed binary association is a pair $R' = (E'_1, E'_2)$, where $E'_1, E'_2 \in S'_e$ and $E'_1 \ne E'_2$ (the order in the pair indicates the direction of the association, i.e. $E'_1$ is the source, $E'_2$ the target). It must hold that $(\forall E' \in S'_e)(\exists! R' \in S'_r)(E' \in R')$. In other words, every association end is a member of exactly one association.
• $C'_S \in S'_c$ is a class called schema class.
• $name' : S'_c \cup S'_a \to \mathcal{L}$ resp. $name' : S'_r \to \mathcal{L} \cup \{\lambda\}$ assigns a name to each class, attribute and association.
• $type' : S'_a \to \mathcal{D}$ assigns a data type to each attribute.
• $class' : S'_a \to S'_c$ assigns a class to each attribute. For $A' \in S'_a$, we will say that $A'$ is an attribute of $class'(A')$ or $A'$ belongs to $class'(A')$.
• $participant' : S'_e \to S'_c \cup S'_m$ assigns a class or content model to each association end. For $R' = (E'_1, E'_2)$, where $X'_1 = participant'(E'_1)$ and $X'_2 = participant'(E'_2)$, we call $X'_1$ and $X'_2$ parent and child of $R'$, denoted $parent'(R')$ and $child'(R')$, respectively. We will also sometimes call both $X'_1$ and $X'_2$ participants of $R'$ and say that $X'_1$ is the parent of $X'_2$ and $X'_2$ is a child of $X'_1$.
• $xform' : S'_a \to \{e, a\}$ assigns an XML form to each attribute. It specifies the XML representation of an attribute using an XML element declaration with a simple content or an XML attribute declaration, respectively.
• $card' : S'_a \cup S'_e \to \mathcal{C}$ assigns a cardinality to each attribute and association end.
• $cmtype' : S'_m \to \{sequence, choice, set\}$ assigns a content model type to each content model. We distinguish three types: sequence, choice and set.
• $pos' : S'_a \cup S'_r \to \mathbb{N}$ assigns a position to each attribute or association. For an attribute $A' \in S'_a$, $pos'(A')$ specifies the position of $A'$ in the ordered list of all attributes of $class'(A')$. For an association $R' \in S'_r$, $pos'(R')$ specifies the position of $R'$ in the ordered list of all associations having $parent'(R')$ as their parent.
• $repr' : S'_c \setminus \{C'_S\} \to S'_c \setminus \{C'_S\}$ assigns a class $\tilde{C}'$ to another class $C'$. $C'$ is called a structural representative of $\tilde{C}'$. It must hold that $C' \notin repr'^{*}(C')$, where $repr'^{*}(C')$ denotes the set $\{repr'(C'), repr'(repr'(C')), \dots\}$. Neither $C'$ nor $\tilde{C}'$ can be the schema class.
The graph $(S'_c \cup S'_m, S'_r)$ with classes and content models as nodes and associations as directed edges must be a directed forest with one of its trees rooted in the schema class $C'_S$. Members of $S'_c$, $S'_a$, $S'_r$, and $S'_m$ are called components of $S'$.

To distinguish PIM and PSM constructs, we supplement PSM constructs with the symbol ′ (e.g. PSM class $C'_{Customer}$). PIM constructs are strictly referred to without that symbol (e.g. PIM class $C_{Customer}$).

A PSM schema is displayed as a UML class diagram in a tree layout. Classes and attributes are displayed in a classical way with several extensions. The name of an attribute $A'$ is prefixed with @ if $xform'(A') = a$. Moreover, for a structural representative we display the name of the represented class above the name of the representative. Associations are displayed directed from their parent to child. Content models are displayed as small rounded rectangles.
To distinguish sequence, choice and set, we display a distinct symbol for each type in the rectangle ({} in the case of a set).

We will also use two auxiliary functions:
• $attributes' : S'_c \to 2^{(S'_a)}$ associates each class with an ordered list of all its attributes ordered by the $pos'$ function.
• $content' : S'_c \cup S'_m \to 2^{(S'_r)}$ associates each class or content model with an ordered list of all associations having it as their parent ordered by the $pos'$ function. We call $content'(X')$ the content of $X'$.

For a given PIM schema there may be several different PSM schemas, each related to a part of the PIM schema and specifying a separate XML schema. Therefore, a PSM schema may be viewed from two perspectives. From a conceptual perspective, it is a part of a PIM schema and says what part of the reality is represented by the XML schema. From a grammatical perspective, it models a regular tree grammar, or, in general, an XML schema. Therefore, it has a hierarchical structure of a rooted tree with associations directed from their parent to their child. In Section 5 we will formalize both perspectives and study the properties of PSM schemas and modeled XML schemas formally.

Example 4.2. For illustration, we refer to three sample PSM schemas depicted in Fig. 6. From the conceptual perspective, they represent various parts of the PIM schema depicted in Fig. 5. From the grammatical perspective, they model three different XML schemas. In particular, Fig. 6(a) models the XML schema from Fig. 4. Let us illustrate the definition and the perspectives using Fig. 6(a) in more detail. From the conceptual perspective, it represents a part of the PIM schema in Fig. 5 comprising classes $C_{Address}$, $C_{Purchase}$, $C_{Customer}$, $C_{Item}$, and $C_{Product}$. From the grammatical perspective, it models the regular tree grammar depicted in Fig. 4 which describes an XML schema for purchase requests. In other words, it shows how data modeled by the part of the PIM schema is represented in the XML schema modeled by the regular tree grammar.

[Fig. 6. Sample PSM schemas: (a) PSM schema of purchase order request, (b) PSM schema of purchase order response, (c) PSM schema of supply list (external).]

According to Definition 4.2, the components of the PSM schema are as follows:

$S'_c = \{C'_{PurchRQSchema}, C'_{Purchase}, C'_{ShipAddr}, C'_{BillAddr}, C'_{Customer}, C'_{Items}, C'_{Item}, C'_{Product}, C'_{ItemAmount}, C'_{ItemTester}, C'_{Address}\}$
$S'_a = \{A'_{code1}, A'_{date}, A'_{version}, A'_{gps}, A'_{login}, A'_{name}, A'_{email}, A'_{code2}, A'_{amount}, A'_{price}, A'_{tester}, A'_{street}, A'_{city}, A'_{country}\}$
$S'_r = \{R'_{purchaseRQ}, R'_{shipto}, R'_{billto}, R'_{cdetail}, R'_{items}, R'_{item}, R'_{Item\text{-}Product}, R'_{Item\text{-}M}, R'_{M\text{-}ItemAmount}, R'_{M\text{-}ItemTester}\}$
$S'_m = \{M'\}$

with the schema class $C'_{PurchRQSchema}$. Classes may be structural representatives of other classes, e.g. $repr'(C'_{ShipAddr}) = C'_{Address}$, $repr'(C'_{BillAddr}) = C'_{Address}$, i.e. $C'_{ShipAddr}$ and $C'_{BillAddr}$ are structural representatives of $C'_{Address}$. Each association is an ordered pair of association ends, e.g. $R'_{purchaseRQ} = (E'^{1}_{PurchRQSchema}, E'^{1}_{Purchase})$. Similar to PIM, each PSM class has a name, e.g. $name'(C'_{PurchRQSchema}) = PurchRQSchema$, $name'(C'_{Purchase}) = Purchase$. Moreover, it also has an ordered list of its attributes, e.g. $attributes'(C'_{Purchase}) = (A'_{code1}, A'_{date}, A'_{version})$, and its content, e.g. $content'(C'_{Purchase}) = (R'_{shipto}, R'_{billto}, R'_{cdetail}, R'_{items})$. Each attribute is associated with its class, e.g. $class'(A'_{code1}) = class'(A'_{date}) = class'(A'_{version}) = C'_{Purchase}$, and has a name, data type and cardinality. Moreover, it has an XML form, e.g. $name'(A'_{date}) = date$, $type'(A'_{date}) = string$, $card'(A'_{date}) = 1..1$, $xform'(A'_{date}) = a$ and $name'(A'_{email}) = email$, $type'(A'_{email}) = string$, $card'(A'_{email}) = 0..1$, $xform'(A'_{email}) = e$. Each association has an (optional) name and cardinality, e.g.
$name'(R'_{item}) = item$, $card'(R'_{item}) = 1..*$ and $name'(R'_{Item\text{-}M}) = \lambda$, $card'(R'_{Item\text{-}M}) = 1..1$. Association ends have participants, e.g. $participant'(E'^{1}_{Purchase}) = \dots = participant'(E'^{5}_{Purchase}) = C'_{Purchase}$ and $participant'(E'^{1}_{M}) = participant'(E'^{2}_{M}) = participant'(E'^{3}_{M}) = M'$. Each content model has a type, e.g. $cmtype'(M') = choice$, and its content (the ordered list of associations with the content model as their parent), e.g. $content'(M') = (R'_{M\text{-}ItemAmount}, R'_{M\text{-}ItemTester})$.

4.3. Working with PIM and PSM schemas

A designer may work with PIM and PSM schemas in two directions. Firstly, (s)he may design XML schemas in the top-down direction. We also call this direction forward-engineering. The designer performs the following steps:
1. The application domain is studied and described in the form of a PIM schema.
2. XML schemas which will be required in the system are identified.
3. For each identified XML schema:
(a) a corresponding part of the PIM schema, where the relevant concepts and relationships are modeled, is identified,
(b) the selected part of the PIM schema is shaped into a PSM schema which models the XML schema, and
(c) the resulting PSM schema is automatically converted to a corresponding regular tree grammar (see Section 6.3) or an expression in a selected XML schema language.

The advantage is clear: the designer works at the friendly level of UML class diagrams, abstracted from the details of regular tree grammars (or XML schema languages in practice). It also saves time, because shaping the part of the PIM schema into the PSM schema is much easier than manual creation of the regular tree grammar (even when expressed in an XML schema language). It also avoids errors, since the overall description of the problem domain is given and, therefore, the designer does not miss important real-world concepts or relationships. Additionally, (s)he does not add new ones without seeing the impact on the overall PIM schema.
Secondly, the designer can integrate existing XML schemas into his/her solution in the bottom-up direction, which we also call reverse-engineering. Such an XML schema might be a legacy XML schema or an XML schema given by a standardization organization that we want to use in our system. In that case, the designer:
1. converts the regular tree grammar of the XML schema automatically to a corresponding PSM schema (see Section 6.4),
2. maps the obtained PSM schema to the existing PIM schema.

Again, the advantage is that (s)he works with UML class diagrams: it is much easier to map the PSM schema than the regular tree grammar. This might influence not only the mapping but it may also lead to extending or modifying the PIM schema if necessary (for example, when the XML schema introduces new real-world concepts or relationships).

Note that in this paper we do not introduce a complete mapping language which would allow for integration of XML schemas in general. We work only with a basic one-to-one mapping. Therefore, it is not possible to map an arbitrary PSM schema to the PIM schema. We mention this as an open problem in Section 9.

5. Formalization of conceptual and grammatical perspective

In this section, we provide a formal model of the conceptual and grammatical perspectives: the conceptual perspective is formalized in Section 5.1, the grammatical perspective in Section 5.2. Consistent with Definition 3.8, we formalize them as an interpretation. It is a mapping from a PSM schema to a PIM schema or from a regular tree grammar to a PSM schema, respectively. In Theorem 5.1 at the end of Section 5.2, we prove that the interpretations ensure that each PSM schema models only a single XML schema. Later, in Section 6.3, we show that each PSM schema models an XML schema. Together, this shows that our PSM is defined correctly (each PSM schema models one and only one XML schema). We also show that the expressive power of our PSM is the same as the expressive power of general regular tree grammars.

5.1. Conceptual perspective

We first introduce the interpretation of a PSM schema against the PIM schema. It maps classes, attributes or associations in the PSM schema to the PIM schema. The mapping specifies the semantics of the PSM schema in terms of the PIM schema.

Example 5.1. First, we demonstrate the notion of interpretation of a PSM schema against a PIM schema informally on our sample PSM and PIM schemas. Consider the PSM schema depicted in Fig. 7(a). From the conceptual perspective, its semantics is modeled by a part of our sample PIM schema depicted in Fig. 7(b). The semantics is expressed as an interpretation. For simplicity, we name PIM and PSM components so that the interpretation may be deduced intuitively by a human reader. For example, the PSM class $C'_{Purchase}$ is mapped to the PIM class $C_{Purchase}$. In other words, the semantics of $C'_{Purchase}$ in the sense of the PIM schema is $C_{Purchase}$. Similarly, the PSM attribute $A'_{date}$ of $C'_{Purchase}$ is mapped to the PIM attribute $A_{date}$ of $C_{Purchase}$ and the PSM associations $R'_{shipto}$ and $R'_{billto}$ are mapped to the PIM associations $R_{ship\text{-}to}$ and $R_{bill\text{-}to}$. There are also components which are not mapped. These are depicted in gray. E.g., the schema class $C'_{PurchRQSchema}$ or the PSM class $C'_{Items}$ are not mapped. Also, the PSM attribute $A'_{version}$ and the PSM association $R'_{items}$ are not mapped.
In other words, these components have no semantics in the sense of the PIM schema.

In general, an interpretation of a PSM class or attribute is a PIM class or attribute, respectively. An interpretation of a PSM association is not a PIM association directly. It is an ordered PIM association which we call an ordered image of the PIM association.

Definition 5.1. Let $R = \{E_1, E_2\} \in S_r$ be an association. An ordered image of $R$ is an ordered pair $R^{E_1} = (E_1, E_2)$ (or $R^{E_2} = (E_2, E_1)$). We will use $\overrightarrow{S_r}$ to denote the set of all ordered images of associations of $S$, i.e. $\overrightarrow{S_r} = \bigcup_{R \in S_r} \{R^{E_1}, R^{E_2}\}$.

Definition 5.2 introduces interpretations formally. It ensures consistency between the PIM schema and the semantics of the PSM schema specified by the interpretation. For example, consider the class $C'_{Product}$ and its attribute $A'_{code2}$ from our sample PSM schema. The PSM class is correctly mapped by the interpretation to the PIM class $C_{Product}$. Therefore, $A'_{code2}$, from the conceptual perspective, belongs to $C_{Product}$. On the other hand, suppose that $A'_{code2}$ is mapped to the PIM attribute $A_{code1}$ of the PIM class $C_{Purchase}$. From this, $A'_{code2}$ belongs to $C_{Purchase}$, which contradicts the previous conclusion. This shows that the interpretation cannot be arbitrary.

Definition 5.2. An interpretation of a PSM schema $S'$ against a PIM schema $S$ is a partial function $I : (S'_c \cup S'_a \cup S'_r) \to (S_c \cup S_a \cup \overrightarrow{S_r})$ which maps a class, attribute or association from $S'$ to a class, attribute or ordered image of an association from $S$, respectively. For $X' \in (S'_c \cup S'_a \cup S'_r)$, we call $I(X')$ the interpretation of $X'$. $I(X') = \lambda$ denotes that $I$ is not defined for $X'$. In that case, we will also say that $X'$ does not have an interpretation.

[Fig. 7. Conceptual perspective of sample PSM schema: (a) PSM schema components with interpretation against 7(b), (b) part of sample PIM schema.]

Let a function $classcontext_I : S'_c \cup S'_a \cup S'_r \cup S'_m \to S'_c$ return for a given component $X'$ of $S'$ the closest ancestor class to $X'$ so that $I(classcontext_I(X')) \ne \lambda$. The following conditions must be satisfied:

(1) $I(C'_S) = \lambda$
(2) $(\forall C' \in S'_c$ s.t. $repr'(C') \ne \lambda)\ (I(C') = I(repr'(C')))$
(3) $(\forall A' \in S'_a$ s.t. $I(A') \ne \lambda)\ (class(I(A')) = I(classcontext_I(A')))$
(4) $(\forall R' \in S'_r$ s.t. $I(child'(R')) = \lambda)\ (I(R') = \lambda)$
(5) $(\forall R' \in S'_r$ s.t. $I(child'(R')) \ne \lambda)\ (I(R') = (I(classcontext_I(R')), I(child'(R'))))$

The interpretation $I(X')$ of a component $X'$ in the PSM schema determines the semantics of $X'$ in terms of the PIM schema. Recall from the definition that $I$ is a function. Therefore, the semantics of a PSM component is determined unambiguously. If $I(X') = \lambda$, $X'$ has no semantics. The definition shows that only classes (except the schema class), attributes and associations are important from the conceptual perspective. The other components are not considered.

As we have already noted, the conditions (1)-(5) ensure consistency between the PIM schema and the semantics of the PSM schema determined by the interpretation. Consider the interpretation $I$ of the PSM schema depicted in Fig. 7(a) against the part of the PIM schema depicted in Fig. 7(b). The interpretation can be intuitively deduced from the names of PIM and PSM components. Let us discuss the consequences of the conditions (1)-(5) given by the definition on this sample PSM schema:

• Condition (1) requires that the schema class does not have an interpretation. The schema class is a starting point of the whole PSM schema and does not model any data. Therefore, it is also not reflected in the PIM schema. In our example, $C'_{PurchRQSchema}$ is the schema class and, therefore, $I(C'_{PurchRQSchema}) = \lambda$.
• Condition (2) requires that a structural representative $C'$ of a class $repr'(C')$ has the same interpretation as $repr'(C')$. In other words, it requires that both $C'$ and $repr'(C')$ have the same semantics. This is because $C'$ acquires the attributes and content of $repr'(C')$.
To ensure consistency, the attributes and associations in the content must semantically remain with the interpretation of $repr'(C')$; $I(C') \ne I(repr'(C'))$ would break this rule. In our example, $C'_{Address}$ and its two structural representatives $C'_{ShipAddr}$ and $C'_{BillAddr}$ have the same interpretation $C_{Address}$. Both representatives acquire the attributes $A'_{street}$, $A'_{city}$ and $A'_{country}$.
• Condition (3) requires that the interpretation of an attribute $A'$ of a class $C'$ is an attribute $A$ of a class $C$, where $C$ is the interpretation of $C'$. In our example, the interpretation of $A'_{date}$ of $C'_{Purchase}$ is the attribute $A_{date}$ of $C_{Purchase}$. The condition requires that $I(C'_{Purchase}) = C_{Purchase}$. $I(C'_{Purchase}) \ne C_{Purchase}$ would mean that $A'_{date}$ is semantically an attribute of $C_{Purchase}$ (since $I(A'_{date}) = A_{date}$, which is an attribute of $C_{Purchase}$). On the other hand, it would also mean that it is not an attribute of $C_{Purchase}$, which contradicts the previous conclusion. This explanation does not work when a class $C'$ does not have an interpretation. In that case, $A'$ cannot, from the conceptual perspective, belong to $C'$. Instead, it belongs to the closest ancestor class of $C'$ which does have an interpretation. We call this class the interpreted class context of $C'$. If the context does not exist, $A'$ cannot have an interpretation at all. Otherwise, the interpretation must be an attribute $A$ of the interpretation of the context. In our example, this is the case of, e.g., the attribute $A'_{amount}$ of class $C'_{ItemAmount}$. The class does not have an interpretation. Therefore, we need the interpreted class context of $C'_{ItemAmount}$, which is class $C'_{Item}$. Its interpretation is $C_{Item}$. Therefore, the interpretation of $A'_{amount}$ must be an attribute of $C_{Item}$. We have $I(A'_{amount}) = A_{amount}$, which is correct.
• Condition (4) requires that only an association with an interpreted child may have an interpretation. This is because the semantics of an association specifies how instances of the child are connected to the parent.
If the child has no semantics (i.e. does not have an interpretation), we are not able to specify the semantics of the connection to its parent. For example, R items has no interpretation because its child, C Items, has no interpretation. Since C Items has no semantics, there is also no association in the PIM schema connecting C Purchase with a semantic equivalent of C Items. Therefore, it is not possible to specify the semantics of its connection to its parent C Purchase. Condition (5) is similar to (3) but ensures consistency of associations. It is applied when a child C of an association R has an interpretation. In that case it is necessary to specify the semantics of the connection of C to its parent. From the conceptual perspective C is connected to its closest ancestor class which has an interpretation (i.e. interpreted class context of R ). If the context does not exist, R does not have an interpretation, because its semantics cannot be expressed in the terms of the PIM schema. Otherwise, the interpretation of R must be an association in the PIM schema which connects the interpretation of the context of R with the interpretation of C. In our example, the interpreted class context of the association R shipto is the class C Purchase. Its child is C ShipAddr.Condition (4) requires that the interpretation of R shipto is an association connecting the interpretations of C Purchase and C ShipAddr.Theinterpretations of C Purchase and C ShipAddr are, respectively, the classes C Purchase and C ShipAddr and I(R shipto )=R ship to. This satisfies the condition. A case where I(R shipto ) does not connect C Purchase and C Address is semantically inconsistent, because the interpretation of R shipto specifies the semantics of connection between C Purchase and C ShipAddr which can clearly be only an association connecting C Purchase and C Address. Another interesting example is the association R item. Its interpreted class context is C Purchase and child is C Item. 
The interpretation of R item must, therefore, be an association in the PIM schema connecting C Purchase and C Item. In our example, it is R contains which is also correct. In general, the interpretation does not necessarily need to be so intuitive.
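The notion of an interpreted class context used in conditions (3) and (5) can be illustrated by a small Python sketch. It is only an illustration under simplifying assumptions, not part of the formal model: the names `PsmClass`, `interpreted_context` and `attribute_owner` are hypothetical, content models between classes are skipped (each class points directly to its closest ancestor class), and interpretations are represented just by the name of the PIM class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PsmClass:
    name: str
    parent: Optional["PsmClass"] = None   # closest ancestor class (content models skipped)
    interpretation: Optional[str] = None  # name of the PIM class, None if uninterpreted


def interpreted_context(cls):
    """Return the closest proper ancestor class that has an interpretation, or None."""
    node = cls.parent
    while node is not None:
        if node.interpretation is not None:
            return node
        node = node.parent
    return None


def attribute_owner(cls):
    """PIM class that must own the interpretation of an attribute of `cls`.

    If `cls` is interpreted, its own interpretation is used (condition (3));
    otherwise the interpretation of its interpreted class context is used.
    None means that the attribute cannot have an interpretation at all.
    """
    if cls.interpretation is not None:
        return cls.interpretation
    context = interpreted_context(cls)
    return context.interpretation if context is not None else None


# A fragment loosely following the running purchase example (assumed shape).
purchase = PsmClass("Purchase'", interpretation="Purchase")
items = PsmClass("Items'", parent=purchase)                # no interpretation
item = PsmClass("Item'", parent=items, interpretation="Item")
item_amount = PsmClass("ItemAmount'", parent=item)         # no interpretation

print(attribute_owner(item_amount))    # PIM class owning the interpretation of amount
print(interpreted_context(item).name)  # context used for the association going to Item'
```

In this sketch the uninterpreted class `Items'` is simply skipped when looking for the context, which mirrors why the association going to `Item'` is interpreted as an association between the PIM classes of `Purchase'` and `Item'`.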

M. Nečaský et al. / Data & Knowledge Engineering 72 (2012) 1-30

Example 5.2. As an example, suppose the PSM schema depicted in Fig. 8(a). It models an XML schema for product deliveries and its semantics is described by the PIM schema depicted in Fig. 8(b), which is a part of our sample PIM schema depicted in Fig. 5. However, the interpretation is not so intuitive here. If it is not present, it must be created by a human domain expert. The interpretation $I$ is as follows:

$$
\begin{aligned}
& I(C'_{Producer}) = C_{Supplier};\quad I(C'_{Delivery}) = C_{Supply};\quad I(C'_{Item}) = C_{Product} \\
& I(A'_{name}) = A_{name};\quad I(A'_{\text{delivery day}}) = A_{date};\quad I(A'_{\text{unit price}}) = A_{price1};\quad I(A'_{\text{num of units}}) = A_{amount};\quad I(A'_{\text{item no}}) = A_{code1} \\
& \quad (\text{where } class(A_{price1}) = C_{Supply},\ class(A_{code1}) = C_{Product}) \\
& I(R'_{delivery}) = R_{provides};\quad I(R'_{\text{Delivery Item}}) = R_{delivers}
\end{aligned}
$$

Fig. 8. Conceptual perspective of sample PSM schema. (a) PSM schema components with interpretation against 8(b). (b) Part of sample PIM schema.

5.2. Grammatical perspective

In this section we introduce the interpretation of a regular tree grammar against a PSM schema. It maps grammar non-terminals to components of the PSM schema.

Definition 5.3. Let $G = (N, T, S, P)$ be a regular tree grammar and $S' = (S'_c, S'_a, S'_r, S'_e, S'_m, C'_{S'}, name', type', class', xform', participant', card', cmtype', pos', repr')$ be a PSM schema. Let $S'^{names}_c = \{C' \in S'_c : (\exists R' \in S'_r)(child'(R') = C' \wedge name'(R') \neq \lambda)\}$, i.e. $S'^{names}_c$ contains each class which is a child of an association with a given name. An interpretation of $G$ against $S'$ is a total surjective mapping $I \subseteq N \times (S'^{names}_c \cup S'_a)$ which satisfies the following conditions:

$$
\begin{aligned}
& (\forall Z \in S)(\forall X' \text{ s.t. } (Z, X') \in I)(\exists R' \in S'_r)(parent'(R') = C'_{S'} \wedge child'(R') = X') && (1) \\
& (\forall (Z \rightarrow t[D]) \in P)(\forall X' \text{ s.t. } (Z, X') \in I)(X' \in S'_a \wedge type'(X') = D \wedge name'(X') = t \wedge xform'(X') = e) && (2) \\
& (\forall (Z \rightarrow @t[D]) \in P)(\forall X' \text{ s.t. } (Z, X') \in I)(X' \in S'_a \wedge type'(X') = D \wedge name'(X') = t \wedge xform'(X') = a) && (3) \\
& (\forall (Z \rightarrow t[re]) \in P)(\forall X' \text{ s.t. } (Z, X') \in I)(X' \in S'_c \wedge (\exists R' \in S'_r)(child'(R') = X' \wedge name'(R') = t) \wedge re \cong X') && (4)
\end{aligned}
$$

We say that a regular expression $re$ corresponds to a component $X' \in S'_c \cup S'_a \cup S'_r \cup S'_m$, denoted $re \cong X'$, when the following conditions are satisfied:

For a class $X' \in S'_c$ s.t. $attributes'(X') = (A'_1, \ldots, A'_{n_1})$ and $content'(X') = (R'_1, \ldots, R'_{n_2})$:
- if $repr'(X') = \lambda$ then $re = (re_1, \ldots, re_{n_1+n_2}) \wedge (\forall 1 \le i \le n_1)(re_i \cong A'_i) \wedge (\forall 1 \le i \le n_2)(re_{n_1+i} \cong R'_i)$
- if $repr'(X') \neq \lambda$ then $re = (re_0, re_1, \ldots, re_{n_1+n_2}) \wedge re_0 \cong repr'(X') \wedge (\forall 1 \le i \le n_1)(re_i \cong A'_i) \wedge (\forall 1 \le i \le n_2)(re_{n_1+i} \cong R'_i)$

For an attribute $X' \in S'_a$: $re = Z^{card'(X')} \wedge (Z, X') \in I$

For an association $X' \in S'_r$ s.t. $child'(X') \in S'_c$ and $name'(X') \neq \lambda$: $re = Z^{card'(X')} \wedge (Z, child'(X')) \in I$

For an association $X' \in S'_r$ s.t. $child'(X') \in S'_c$ with $name'(X') = \lambda$, or $child'(X') \in S'_m$: $re = (re_r)^{card'(X')} \wedge re_r \cong child'(X')$

For a content model $X' \in S'_m$ s.t. $content'(X') = (R'_1, \ldots, R'_n)$:
- if $X'$ is a sequence then $re = (re_1, \ldots, re_n)^{card'(X')} \wedge (\forall 1 \le i \le n)(re_i \cong child'(R'_i))$
- if $X'$ is a choice then $re = (re_1 \mid \ldots \mid re_n)^{card'(X')} \wedge (\forall 1 \le i \le n)(re_i \cong child'(R'_i))$
- if $X'$ is a set then $re = \{re_1, \ldots, re_n\}^{card'(X')} \wedge (\forall 1 \le i \le n)(re_i \cong child'(R'_i))$

Note that, in comparison to the previous notions of interpretations, in this case an interpretation is a mapping (not a function). Therefore, it allows a non-terminal to be mapped to several different components of the PSM schema. Briefly, an XML attribute declaration or an XML element declaration with a simple content must be mapped to a class attribute. An XML element declaration with a complex content must be mapped to a class which is a child of an association with a name. The regular expression on the right-hand side of the production rule must correspond to the class. This requires the non-terminals of the regular expression to be mapped to the attributes of the class and to the child classes of this class. Moreover, the content models in the regular expression must have the same structure as the content models in the content of the class.

Having a regular tree grammar $G$, a PSM schema $S'$ and an interpretation $I$ of $G$ against $S'$, we will say that $S'$ models $G$ or that $G$ is modeled by $S'$. In particular, having a non-terminal $Z$ from $G$ and a component $X'$ from $S'$ s.t. $(Z, X') \in I$, we will say that $X'$ models $Z$ or that $Z$ is modeled by $X'$.

The interpretation is a total surjective mapping. Total means that each non-terminal is modeled by at least one PSM component. Surjective means that each PSM component from $S'^{names}_c \cup S'_a$ models at least one non-terminal. On the other hand, the fact that the interpretation is a mapping and not a function means that a non-terminal may be modeled by several different PSM components. Also, a PSM component may model several different non-terminals. The rules given by Definition 5.3 further restrict possible interpretations.
First, only a child class of the schema class may model an initial non-terminal (rule (1)). Second, an attribute may model only an XML attribute declaration or an XML element declaration with a simple content (rules (2) and (3)). Third, a class may model an XML element declaration with a complex content only if it is a child of an association with a specified name. The regular expression of the XML element declaration must correspond to the class (rule (4)). If the class is not a child of an association with a specified name, it does not model any non-terminal. It only represents a part of the complex content of an XML element declaration.

The relationship "corresponds to" is defined in the second part of the definition. In particular, $re$ corresponds to a class $C'$ ($re \cong C'$) if it is a sequence of sub-expressions which correspond to the attributes of the class and to the associations in the content of $C'$. Moreover, if $C'$ is a structural representative of another class $repr'(C')$, $re$ must begin with a sub-expression which corresponds to $repr'(C')$. Next, $re$ corresponds to an attribute $A'$ if it is a non-terminal with the interpretation $A'$ enriched with the cardinality of $A'$. Similarly, it corresponds to an association $R'$ with a specified name if it is a non-terminal whose interpretation is the child of $R'$, enriched with the cardinality of $R'$. On the other hand, if $R'$ does not have a name, $re$ must be a regular expression corresponding to the child of $R'$ and enriched with the cardinality of $R'$. And, finally, $re$ corresponds to a content model $M'$ if it is a sequence, choice or set of regular expressions which correspond to the associations in the content of $M'$, respectively.

Example 5.3. Production rules of two sample regular tree grammars which have an interpretation against the PSM schema depicted in Fig. 6(a) are depicted in Fig. 9. The left-hand side grammar is more complex. The right-hand side grammar is an optimized version of the former.
For example, the non-terminals $Code_1$ and $Code_2$ were merged into a single non-terminal $Code$. The interpretations are intuitive and we do not list them in full detail. We only demonstrate the conditions of Definition 5.3 on selected production rules of the right-hand side grammar. $Pur$ is an initial non-terminal. Its interpretation is the class $C'_{Purchase}$. It satisfies condition (1) of Definition 5.3, which requires the interpretation of an initial non-terminal to be a child of the schema class. $GPS$ is an XML element declaration with a simple content. Its interpretation is the attribute $A'_{GPS}$. It satisfies condition (2) of Definition 5.3. $Code$ has two interpretations, $A'_{Code1}$ of $C'_{Purchase}$ and $A'_{Code2}$ of $C'_{Product}$. It satisfies condition (3) of Definition 5.3. The interpretations specify that $Code$ is used in two semantically different contexts: in the first it stands for purchase codes, in the second for product codes. However, both have the same syntax, so they are expressed in the same way in the grammar. Finally, $Pur$ is an XML element declaration with a complex content. Its interpretation is the class $C'_{Purchase}$. It satisfies condition (4) of Definition 5.3. It can be easily shown that $(Code, Date, Ver, SAd, BAd, Cust, Items) \cong C'_{Purchase}$.

As the example showed, Definition 5.3 allows for two different regular tree grammars modeled by the same PSM schema. Theorem 5.1 shows that all these grammars specify the same XML schema. We present a complete proof of the theorem. In the proof, we exploit Lemma 5.1, which is introduced after the theorem.

Theorem 5.1. Let $S'$ be a PSM schema and $G_1$, $G_2$ be regular tree grammars modeled by $S'$. Then, $L(G_1) = L(G_2)$.

Proof. We have $G_1 = (N_1, T_1, S_1, P_1)$ and $G_2 = (N_2, T_2, S_2, P_2)$. We also have interpretations $I_1$ and $I_2$ of $G_1$ and $G_2$ against $S'$, respectively. We prove that an arbitrary XML tree $T$ valid against $G_1$ is also valid against $G_2$ and vice versa. In other words, $L(G_1) = L(G_2)$. The basic idea of the proof is to inductively construct an interpretation of $T$ against $G_2$ on the basis of its existing interpretation against $G_1$. The proof exploits the fact that $I_1$ and $I_2$ are total and surjective mappings. In the rest of the proof, we discuss all technical details of this idea. We suppose an interpretation $I_{1,rtg}$ of $T$ against $G_1$ and inductively construct an interpretation $I_{2,rtg}$ of $T$ against $G_2$.

Fig. 9. Production rules of two regular tree grammars with interpretation against PSM schema depicted in Fig. 6(a).

First, let $x = l[v]$ be an XML element with a simple content. Then $I_{1,rtg}(x) = Z_1 \in N_1$, where $(Z_1 \rightarrow l[D]) \in P_1$ s.t. $v \in D$, from Definition 3.8. From Definition 5.3 and the fact that $I_1$ is total, there must be an attribute $A' \in S'_a$ s.t. $(Z_1, A') \in I_1$, where $type'(A') = D \wedge name'(A') = l \wedge xform'(A') = e$. Again from Definition 5.3 and because $I_2$ is surjective, we have that $(\exists Z_2 \in N_2)((Z_2, A') \in I_2 \wedge (Z_2 \rightarrow l[D]) \in P_2)$. Therefore, we can put $I_{2,rtg}(x) = Z_2$. It satisfies the conditions of Definition 3.8. Similarly, we construct an interpretation of each XML attribute in $T$.

Second, let $x = l[\{x_1, \ldots, x_{m_1}\}, (x_{m_1+1}, \ldots, x_{m_1+m_2})]$ be an XML element with a complex content. Then $I_{1,rtg}(x) = Z_1 \in N_1$, where $(Z_1 \rightarrow l[re_1]) \in P_1$ s.t. $I_{1,rtg}(x_{\pi(1)}) \ldots I_{1,rtg}(x_{\pi(m_1+m_2)}) \in L(re_1)$ and $\pi$ is a permutation which preserves the order of the child XML elements. Let $I_{2,rtg}(x_1), \ldots, I_{2,rtg}(x_{m_1+m_2})$ be interpretations constructed recursively. Similar to the previous case, there must be a class $C' \in S'_c$ s.t. $(Z_1, C') \in I_1$, where $(\exists R' \in S'_r)(child'(R') = C' \wedge name'(R') = l) \wedge re_1 \cong C'$ (from Definition 5.3). From Definition 5.3 and because $I_2$ is surjective, we have that $(\exists Z_2 \in N_2)((Z_2, C') \in I_2 \wedge (Z_2 \rightarrow l[re_2]) \in P_2 \wedge re_2 \cong C')$. Lemma 5.1 shows that in that case $I_{2,rtg}(x_{\pi(1)}) \ldots I_{2,rtg}(x_{\pi(m_1+m_2)}) \in L(re_2)$. Therefore, we can put $I_{2,rtg}(x) = Z_2$, which satisfies the conditions of Definition 3.8.

Finally, we need to show that if $x$ is the root XML element of $T$, then $I_{2,rtg}(x)$ is an initial non-terminal of $G_2$. Let $Z_1 = I_{1,rtg}(x)$ and $Z_2 = I_{2,rtg}(x)$. From Definition 3.8, $Z_1 \in S_1$. From Definition 5.3, $I_1(Z_1)$ is a child of the schema class of $S'$. We have $I_1(Z_1) = I_2(Z_2)$ and $I_2(Z_2)$ is, therefore, a child of the schema class of $S'$. If $Z_2 \notin S_2$, then there is an XML element declaration $Y_2$ with $Z_2$ in its regular expression. According to Definition 5.3, $I_2(Y_2)$ is a class $C'$ and $I_2(Z_2)$ is an attribute of $C'$ or a child of an association in the content of $C'$. In both cases, $I_2(Z_2)$ is not a child of the schema class, which contradicts the assumption. Therefore, $Z_2 = I_{2,rtg}(x) \in S_2$. $\square$

In the proof, we refer to Lemma 5.1. It is an auxiliary lemma which we prove separately from the theorem to make the proofs more readable. The recursive construction of the interpretation $I_{2,rtg}$ of $T$ against $G_2$ on the basis of the interpretation $I_{1,rtg}$ of $T$ against $G_1$ ensures that any two non-terminals with the same interpretation against $S'$, one from $G_1$ and the other from $G_2$, have the same set of valid XML elements and XML attributes from $T$. The lemma shows that in that case, any two regular expressions over $N_1$ and $N_2$, respectively, which correspond to the same component of the PSM schema generate the same sequences of XML elements and XML attributes.

Lemma 5.1. Let $S'$ be a PSM schema and $G_1$, $G_2$ be two regular tree grammars with interpretations $I_1$ and $I_2$ against $S'$, respectively. Let $T$ be an XML tree with interpretations $I_{1,rtg}$ and $I_{2,rtg}$ of $T$ against $G_1$ and $G_2$, respectively. Let $I_{1,rtg}$ and $I_{2,rtg}$ satisfy the following assumption:

$$
(\forall Z_1 \in N_1, Z_2 \in N_2 \text{ s.t. } (\exists X' \in S')((Z_1, X') \in I_1 \wedge (Z_2, X') \in I_2))\,(\forall x \in T)\,(I_{1,rtg}(x) = Z_1 \Leftrightarrow I_{2,rtg}(x) = Z_2)
$$

In words, the assumption requires that any two non-terminals of the two grammars with the same interpretation against $S'$ have the same set of valid XML elements and XML attributes. Then, for each component $X'$ of $S'$ and each two regular expressions $re_1$ and $re_2$ over $N_1$ and $N_2$, respectively, s.t. $re_1 \cong X'$ and $re_2 \cong X'$:

$$
I_{1,rtg}(x_1) \ldots I_{1,rtg}(x_m) \in L(re_1) \Leftrightarrow I_{2,rtg}(x_1) \ldots I_{2,rtg}(x_m) \in L(re_2)
$$

for any sequence of XML elements and attributes $x_1, \ldots, x_m$ in $T$.

Proof. We will use $\overline{Z}_1$ and $\overline{Z}_2$ to denote $I_{1,rtg}(x_1) \ldots I_{1,rtg}(x_m)$ and $I_{2,rtg}(x_1) \ldots I_{2,rtg}(x_m)$, respectively. We show the left-to-right direction; the opposite direction is symmetrical. We distinguish the following situations:

Let $X'$ be an attribute. From Definition 5.3, $re_1 = Z_1^{card'(X')}$ and $re_2 = Z_2^{card'(X')}$, where $(Z_1, X') \in I_1$ and $(Z_2, X') \in I_2$. Let $\overline{Z}_1 \in L(re_1)$. We get $I_{1,rtg}(x_k) = Z_1$, $1 \le k \le m$. From the assumption, we also have $I_{2,rtg}(x_k) = Z_2$. Therefore, $\overline{Z}_2 \in L(re_2)$.

Let $X'$ be an association. Suppose firstly that $name'(X') \neq \lambda$. This case is similar to the previous one. From Definition 5.3, $re_1 = Z_1^{card'(X')}$ and $re_2 = Z_2^{card'(X')}$, where $(Z_1, child'(X')) \in I_1$ and $(Z_2, child'(X')) \in I_2$. Since $\overline{Z}_1 \in L(re_1)$, we have $I_{1,rtg}(x_k) = Z_1$, $1 \le k \le m$. Since $I_1(Z_1) = I_2(Z_2)$, we also have $I_{2,rtg}(x_k) = Z_2$. Therefore, $\overline{Z}_2 \in L(re_2)$. If $name'(X') = \lambda$, we have $re_1 = (re_{1,r})^{card'(X')}$ and $re_2 = (re_{2,r})^{card'(X')}$, where $re_{1,r} \cong child'(X')$ and $re_{2,r} \cong child'(X')$. By induction, we get that $\overline{Z}_2$ is composed of subsequences each from $L(re_{2,r})$. Since $\overline{Z}_1 \in L(re_1)$, the number of subsequences fits into $card'(X')$. Therefore, $\overline{Z}_2 \in L(re_2)$.

Let $X'$ be a class with attributes $A'_1, \ldots, A'_{n_1}$ and content $R'_1, \ldots, R'_{n_2}$. From Definition 5.3, $re_1 = (re_{1,1}, \ldots, re_{1,n_1+n_2})$, where $re_{1,i} \cong A'_i$, $1 \le i \le n_1$, and $re_{1,n_1+i} \cong R'_i$, $1 \le i \le n_2$. Similarly, $re_2 = (re_{2,1}, \ldots, re_{2,n_1+n_2})$, where $re_{2,i} \cong A'_i$, $1 \le i \le n_1$, and $re_{2,n_1+i} \cong R'_i$, $1 \le i \le n_2$. Therefore, $\overline{Z}_1 = \overline{Z}_{1,1}, \ldots, \overline{Z}_{1,n_1+n_2}$, where $\overline{Z}_{1,i} \in L(re_{1,i})$, $1 \le i \le n_1+n_2$. If we split $\overline{Z}_2$ into the respective $\overline{Z}_{2,1}, \ldots, \overline{Z}_{2,n_1+n_2}$, we get by induction that $\overline{Z}_{2,i} \in L(re_{2,i})$. Therefore, $\overline{Z}_2 \in L(re_2)$. Similarly, we can prove the case when $X'$ is a structural representative of another class.

Let $X'$ be a sequence, choice or set content model. In this case, we can prove that $\overline{Z}_2 \in L(re_2)$ in the same way as in the case of $X'$ being a class. $\square$
Theorem 5.1 allows for the following definition of an XML language modeled by a PSM schema.

Definition 5.4. A PSM schema $S'$ unambiguously determines an XML language denoted $\mathcal{L}(S')$, which is defined as $\mathcal{L}(S') = L(G)$, where $G$ is a regular tree grammar with an interpretation against $S'$.

Expressive Power of PSM Schemas. As we have already discussed in Section 3.3, [61] introduced local tree grammars, single-type tree grammars and general regular tree grammars. As we will show in Section 6, for each PSM schema there exists a general regular tree grammar with an interpretation against the PSM schema. And, conversely, for each general regular tree grammar (with minor technical exceptions discussed in Section 9), there exists a PSM schema and an interpretation of the grammar against the schema. Therefore, the expressive power of our PSM is that of the class of general regular tree grammars.

6. Translations

The purpose of this section is to introduce algorithms which translate a PSM schema to a regular tree grammar and vice versa. We start this section with PSM schema normalization and optimization in Sections 6.1 and 6.2, respectively. Following this, we introduce two algorithms which translate a PSM schema to a regular tree grammar and vice versa. We prove that both algorithms create interpretations according to Definition 5.3. The existence of interpretations ensures that the translation algorithms are defined correctly and are mutually consistent (i.e. their composition does not influence the specified XML schema).

6.1. PSM schema normalization

A PSM schema may contain redundant components. A component is redundant if, when removed, the XML schema modeled by the reduced PSM schema is the same as the XML schema modeled by the original one. For example, any root class which does not have any structural representative is redundant. It may, therefore, be removed. We call the process of removing redundancies PSM schema normalization. A PSM schema which does not contain any redundancies is called a normalized PSM schema. A PSM schema which contains redundancies is called a relaxed PSM schema. It is important to point out that we do not consider redundancies to be design errors. There may be plenty of reasons for having redundancies in a PSM schema (e.g. when the PSM schema is in progress and will be further edited by the designer). However, we remove them when translating the PSM schema to a regular tree grammar.

Definition 6.1. Let $S' = (S'_c, S'_a, S'_r, S'_e, S'_m, C'_{S'}, name', type', class', xform', participant', card', cmtype', pos', repr')$ be a PSM schema. We call $S'$ a normalized PSM schema when the following conditions are satisfied:

$$
\begin{aligned}
& (\forall R' \in content'(C'_{S'}))(name'(R') \neq \lambda \wedge card'(R') = 1..1) && (1) \\
& (\forall R' \in content'(C'_{S'}))(child'(R') \in S'_c) && (2) \\
& (\forall R' \in S'_r)(child'(R') \in S'_m \rightarrow name'(R') = \lambda) && (3) \\
& (\forall M' \in S'_m)(\exists R' \in S'_r)(child'(R') = M') && (4) \\
& (\forall C' \in S'_c \setminus \{C'_{S'}\})((\exists R' \in S'_r)(child'(R') = C') \vee (\exists C'' \in S'_c)(repr'(C'') = C')) && (5)
\end{aligned}
$$

If $S'$ violates at least one of the conditions, it is called a relaxed PSM schema.
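The five conditions of Definition 6.1 can be read as a simple check over the associations of a PSM schema. The following Python sketch is only an illustration under stated assumptions: the types `Assoc` and `Psm`, the function `violations`, and the toy schema are hypothetical and do not reproduce any figure of the paper; components are identified by plain strings, `None` plays the role of the empty name, and cardinalities are plain strings.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Assoc:
    parent: str
    child: str
    name: Optional[str] = None  # None stands for the empty name (lambda)
    card: str = "1..1"


@dataclass
class Psm:
    schema_class: str
    classes: set
    content_models: set
    assocs: list
    repr_of: dict = field(default_factory=dict)  # representative -> represented class


def violations(s):
    """Return the sorted numbers of the conditions of Definition 6.1 broken by `s`."""
    broken = set()
    children = {r.child for r in s.assocs}
    for r in s.assocs:
        if r.parent == s.schema_class:
            if r.name is None or r.card != "1..1":
                broken.add(1)   # unnamed or non-1..1 association of the schema class
            if r.child not in s.classes:
                broken.add(2)   # schema class association whose child is not a class
        if r.child in s.content_models and r.name is not None:
            broken.add(3)       # named association going to a content model
    if any(m not in children for m in s.content_models):
        broken.add(4)           # root content model
    represented = set(s.repr_of.values())
    for c in s.classes - {s.schema_class}:
        if c not in children and c not in represented:
            broken.add(5)       # root class without a structural representative
    return sorted(broken)


# Hypothetical relaxed schema, loosely inspired by the kinds of redundancies above.
relaxed = Psm(
    schema_class="Schema",
    classes={"Schema", "Purchase", "Address1", "Address2"},
    content_models={"choice1", "set1"},
    assocs=[
        Assoc("Schema", "Address1"),                    # no name
        Assoc("Schema", "Purchase", "purchase", "1..*"),  # cardinality not 1..1
        Assoc("Schema", "choice1"),                     # child is a content model
        Assoc("Purchase", "set1", "items"),             # named, child is a content model
    ],
)
normalized = Psm("Schema", {"Schema", "Purchase"}, set(),
                 [Assoc("Schema", "Purchase", "purchase")])

print(violations(relaxed))
print(violations(normalized))
```

A schema for which `violations` returns an empty list is normalized in the sense of this sketch; any other result lists the conditions that make it relaxed.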

Condition (1) excludes associations in the content of the schema class without a name or with a cardinality different from 1..1. A child of such an association without a name does not model an initial non-terminal and, therefore, it does not model any part of XML documents. A cardinality of this association different from 1..1 cannot be reflected in a modeled regular tree grammar because it cannot be added to a regular expression of any production rule. Condition (2) excludes associations in the content of the schema class which have a content model as their child. Again, such an association is not reflected in any production rule and, therefore, the content model is not reflected anywhere either. Condition (3) excludes associations which have a content model as their child and also a name. The name of such an association is never reflected in a modeled regular tree grammar, because the child content model cannot model a non-terminal but only a regular expression used in a production rule of another non-terminal. Condition (4) requires that no content model is a root of the PSM schema. A root content model cannot be reflected in a modeled regular tree grammar because there is no association which models where the regular expression modeled by the content model is placed. And, condition (5) requires each class which is a root of the PSM schema (except for the schema class) to have at least one structural representative. Such a class models only a regular expression which is placed as specified by the structural representatives. If there is no representative, the regular expression is not used anywhere.

Example 6.1. We explain the definition on a sample PSM schema depicted in Fig. 10(a). Fig. 10(b) shows a regular tree grammar modeled by the PSM schema and its interpretation. The PSM schema is relaxed for various reasons. The association going from the schema class $C'_{PurchaseSchema}$ to the class $C'_{Address1}$ does not have a name. The association going from the schema class $C'_{PurchaseSchema}$ to the class $C'_{Purchase}$ has a cardinality different from 1..1. Both break condition (1). The association going to the choice content model breaks condition (2). The association going to the set content model breaks condition (3). And, finally, the root class $C'_{Address2}$ does not have a structural representative and, therefore, breaks condition (5).

Fig. 10. PSM schema with redundancies, modeled regular tree grammar and PSM schema normalization. (a) PSM schema with redundancies. (b) Regular tree grammar modeled by 10(a). (c) Normalized PSM schema. (d) Regular tree grammar modeled by 10(c).

Any relaxed PSM schema can be normalized. We do not specify the normalization procedurally by a pseudo-code in this paper. Instead, we provide a set of declarative normalization rules depicted in Fig. 11. A rule is specified as a pair of preconditions (above the line) and post-conditions (below the line). If the preconditions are satisfied, the rule requires the PSM schema to be modified so that it meets the post-conditions. The rule in Fig. 11(a) normalizes cardinalities different from 1..1 of associations going from the schema class. This cardinality is removed (and becomes implicitly 1..1). The rule in Fig. 11(b) normalizes names of associations going to content models. The name of each such association is removed. The rule in Fig. 11(c) removes each association going from the schema class which does not have a name. The rule in Fig. 11(d) removes each association going from the schema class to a content model. The rule in Fig. 11(e) removes each root content model. And, finally, the rule in Fig. 11(f) removes each root class which does not have any structural representative. Normalization of a PSM schema means iterative application of the normalization rules until the PSM schema is normalized.

Fig. 11. PSM schema normalization rules. (a) Cardinality normalization rule. (b) Name normalization rule. (c) Empty name association normalization rule. (d) Content model association normalization rule. (e) Root content model normalization rule. (f) Root class normalization rule.

Example 6.2. The result of normalization of the PSM schema depicted in Fig. 10(a) is shown in Fig. 10(c). Fig. 10(d) shows a regular tree grammar modeled by the normalized PSM schema and its interpretation.

It is easy to show that the rules cover all kinds of redundancies introduced by Definition 6.1 and that their proper application leads to a normalized PSM schema. We omit a formal proof in this paper. Theorem 6.1 shows that normalization performed according to the normalization rules is correct. In other words, it shows that the normalized PSM schema models the same XML schema as the original relaxed PSM schema.

Theorem 6.1. Let $S'$ be a PSM schema and $\overline{S'}$ be a PSM schema which is the result of normalization of $S'$ according to the normalization rules. Then $\mathcal{L}(S') = \mathcal{L}(\overline{S'})$.

Proof. Let $G$ be a regular tree grammar with an interpretation $I$ against $S'$. We show that when a normalization rule is applied on $S'$, we can construct a regular tree grammar $\overline{G}$ with an interpretation $\overline{I}$ against $\overline{S'}$ such that $L(G) = L(\overline{G})$. We then get $\mathcal{L}(S') = \mathcal{L}(\overline{S'})$ by Definition 5.4. The cardinality normalization rule (see Fig. 11(a)) changes the cardinality of an association in the content of the schema class.
This cardinality is not reflected in $G$. We, therefore, put $\overline{G} = G$ and $\overline{I} = I$. From here, we have that $\overline{I}$ is an interpretation of $\overline{G}$ against $\overline{S'}$. The proof is similar for the name normalization rule (see Fig. 11(b)).

The empty name association normalization rule (see Fig. 11(c)) removes an association $R'$ if $R' \in content'(C'_{S'}) \wedge name'(R') = \lambda$. Let $Z \in N$ s.t. $I(Z) = child'(R') \wedge (Z \rightarrow t[re]) \in P$. By Definition 5.3, $\lambda \neq t = name'(R') = \lambda$, which is not possible. Therefore, there is no non-terminal with the interpretation $child'(R')$. In other words, we can remove $R'$ without affecting $G$ and its interpretation $I$. We can, therefore, put $\overline{G} = G$ and $\overline{I} = I$ and we get that $\overline{I}$ is an interpretation of $\overline{G}$ against $\overline{S'}$.

The content model association normalization rule (see Fig. 11(d)) removes an association $R'$ if $R' \in content'(C'_{S'}) \wedge child'(R') \in S'_m$. Here, $child'(R')$ is a content model and, therefore, cannot be an interpretation of any non-terminal. In other words, we can remove $R'$ without affecting $G$ and its interpretation $I$. Therefore, we can put $\overline{G} = G$ and $\overline{I} = I$ and we get that $\overline{I}$ is an interpretation of $\overline{G}$ against $\overline{S'}$.

The root content model normalization rule (see Fig. 11(e)) removes a root content model $M'$ and all associations in its content. Again, $M'$ cannot be an interpretation of any non-terminal and its removal, therefore, does not affect $I$. However, removing the associations may affect $I$. Suppose an association $R' \in content'(M')$. Let $Z \in N$ s.t. $I(Z) = child'(R') \wedge (Z \rightarrow t[re]) \in P$. By Definition 5.3, $child'(R') \in S'_c \wedge name'(R') = t \neq \lambda$. In that case, after removing $R'$, $I(Z)$ is no longer a correct interpretation of $Z$ against the new PSM schema. Suppose $Z_0 \in N$ with $Z_0 \rightarrow t_0[re_0]$ and $Z$ being a subexpression of $re_0$. By Definition 5.3, $I(Z_0) \in S'_c$ is an ancestor of $child'(R')$. However, this is not possible since $child'(R')$ is a child of $M'$, which is a root. Therefore, $Z$ is not used within any regular expression of the production rules of $G$. In other words, if we remove $Z$ and its production rule from $G$, we get a new regular tree grammar which specifies the same XML language as $G$. Therefore, we put $\overline{G}$ as $G$ without the non-terminal $Z$ and the production rule $Z \rightarrow t[re]$. We get $L(G) = L(\overline{G})$. Next, we put $\overline{I}$ as $I$ without the mapping of $Z$ and we get that $\overline{I}$ is an interpretation of $\overline{G}$ against $\overline{S'}$.

The proof of correctness of the root class normalization rule (see Fig. 11(f)) is similar to the previous rule. $\square$

Fig. 12. Formal semantics of optimization function. (a) Class merge optimization rule. (b) Sequence reduction optimization rule.

Fig. 13. PSM schema optimization. (a) PSM schema. (b) Optimized PSM schema.

6.2. PSM schema optimization

A PSM schema, even if normalized, may also contain parts which are unnecessarily complex. They model a part of the XML schema but can be replaced (note that, unlike the redundancies handled by normalization, they cannot simply be removed) by simpler parts which are equivalent to the original parts grammatically as well as conceptually. This makes the PSM schema more readable and understandable for the designers. We call the process of replacing parts of a PSM schema with simpler ones PSM schema optimization. A PSM schema where no other parts may be replaced is called an optimized PSM schema.

It is possible to introduce various optimization rules. The only requirement is that the PSM schema and its optimized version model the same XML schema and that the optimized one is simpler and more readable for the designers. Simplicity and readability of PSM schemas are subjective. Any CASE tool implementing our conceptual model may introduce its own optimization rules which suit its users. We show two optimization rules in Fig. 12. The former merges two sibling classes with attributes and empty content into a single class. The latter removes a sequence content model which is the single child of its parent. In both cases, the optimization does not change the modeled XML schema but the result is simpler and, therefore, more readable for the designers. This will be more obvious in Section 6.4, where we introduce an algorithm which translates a regular tree grammar to a PSM schema. The resulting PSM schema may be hard to read for designers. We will use the two optimization rules to make the result more readable.

Example 6.3. Fig. 13(a) shows a non-optimal PSM schema. First, there are two PSM classes which both represent the same PIM class $C_{Customer}$. The associations going to them have empty names. The PSM classes both have an empty content. Therefore, they can be merged by the rule in Fig. 12(a) as shown in Fig. 13(b). Second, $C'_{Purchase}$ in Fig. 13(a) contains a single association in its content, which goes to a sequence content model. Its cardinality is 1..1. Therefore, the content model can be removed by the rule in Fig. 12(b) as shown in Fig. 13(b).

6.3. Translating a PSM schema to a regular tree grammar

First, we show how a normalized PSM schema $S'$ may be translated to a regular tree grammar $G$ modeled by $S'$. We introduce the function psm-2-rtg which takes $S'$ as an input and outputs a regular tree grammar $G$ modeled by $S'$. We do not specify the function procedurally in the form of pseudo-code. Instead, we specify its formal semantics by the rules in Fig. 14. Similar to the normalization and optimization rules, each translation rule has two parts. The upper part specifies preconditions on $S'$. When the preconditions are satisfied for the required component(s), a part of $G$ is output as specified by the lower part of the rule.

Fig. 14. Formal semantics of psm-2-rtg translation function. (a) XML element declaration with simple content generating rule. (b) XML attribute declaration generating rule. (c) XML element declaration with complex content generating rule. (d) Initial non-terminal generating rule.

The rule in Fig. 14(a) generates an XML element declaration with simple content for each attribute $A'$ if $xform'(A') = e$. Similarly, the rule in Fig. 14(b) generates an XML attribute declaration for $A'$ if $xform'(A') = a$. The rule in Fig. 14(c) generates an XML element declaration with complex content for each class $C'$ which is a child of a named association. All three rules also generate the corresponding non-terminals and terminals. And, finally, the rule in Fig. 14(d) generates an initial non-terminal for each child class of the schema class.

The formal semantics of psm-2-rtg exploits two functions: non-terminal and rewrite-down. The function non-terminal takes a PSM component as an input and generates a non-terminal symbol. We do not specify its formal semantics. The only requirement is that it assigns each PSM component in $S'$ a unique non-terminal. The function rewrite-down rewrites a PSM component to a regular expression $re$ which corresponds to the PSM component according to Definition 5.3. Its formal semantics is given by the rules in Fig. 15. Again, the upper part of each rule contains a precondition. If it is satisfied for a component of $S'$, the regular expression specified in the lower part is returned. The rules specifying the formal semantics of the function psm-2-rtg also generate a mapping $I$ of non-terminals of the resulting $G$ to $S'$.
The following theorem shows that $I$ is an interpretation of $G$ against $S'$. In other words, the translation algorithm is defined correctly.

Theorem 6.2. Let $S'$ be a PSM schema. Let $G$ be a regular tree grammar and $I$ be a mapping of non-terminals of $G$ to $S'$ which are the result of applying the function psm-2-rtg on $S'$. Then $I$ is an interpretation of $G$ against $S'$.

Proof. We show that for each non-terminal $Z$ of $G$, $I(Z)$ meets the conditions of Definition 5.3. In other words, we show that $I$ is an interpretation of $G$ against $S'$. In particular:

Let $Z \in S$. $Z$ was put to $S$ by the rule in Fig. 14(d) on the basis of an association $R'$ with parent $C'_{S'}$. Since $S'$ is normalized, $child'(R') \in S'_c$ and $name'(R') \neq \lambda$. From the rule in Fig. 14(c), $(Z, child'(R')) \in I$.

Let $(Z \rightarrow t[D]) \in P$. It was generated by the rule in Fig. 14(a) on the basis of an attribute $A'$, where $name'(A') = t$, $type'(A') = D$, and $xform'(A') = e$. From the same rule, $(Z, A') \in I$.

Let $(Z \rightarrow @t[D]) \in P$. The case is similar to the previous one.

Let $(Z \rightarrow t[re]) \in P$. It was generated by the rule in Fig. 14(c) on the basis of an association $R'$ with $name'(R') \neq \lambda$. From the same rule, $(Z, child'(R')) \in I$. The regular expression $re$ was generated by the rules specifying the formal semantics of the rewrite-down function. The semantics of the function rewrite-down is designed directly according to the formal semantics of the relation $\cong$. Therefore, $re \cong child'(R')$.

Fig. 15. Formal semantics of rewrite-down function. (a) Named association rewriting rule. (b) Unnamed association rewriting rule. (c) Attribute rewriting rule. (d) Sequence content model rewriting rule. (e) Choice content model rewriting rule. (f) Set content model rewriting rule. (g) Class rewriting rule.
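The interplay of psm-2-rtg and rewrite-down can be sketched procedurally. The Python code below is a simplified illustration written for this text, not the rule set of Figs. 14 and 15: all type and function names are hypothetical, structural representatives are omitted, cardinalities are kept only on attributes and associations, a cardinality other than 1..1 is rendered as `re^m..n`, XML attribute declarations are marked with `@`, and the input schema is assumed to be normalized.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attr:
    name: str
    type: str
    xform: str = "e"   # "e" = XML element, "a" = XML attribute
    card: str = "1..1"


@dataclass
class Assoc:
    child: object      # a Class or a Model
    name: Optional[str] = None
    card: str = "1..1"


@dataclass
class Class:
    label: str
    attrs: List[Attr] = field(default_factory=list)
    content: List[Assoc] = field(default_factory=list)


@dataclass
class Model:
    kind: str          # "sequence", "choice" or "set"
    content: List[Assoc] = field(default_factory=list)


def with_card(re, card):
    return re if card == "1..1" else f"{re}^{card}"


def nt_attr(owner, attr):
    return f"{owner.label}.{attr.name}"   # unique non-terminal of an attribute


def rewrite_down(x, owner=None):
    """Rewrite a PSM component to a regular expression corresponding to it."""
    if isinstance(x, Attr):
        return with_card(nt_attr(owner, x), x.card)
    if isinstance(x, Assoc):
        if x.name is not None and isinstance(x.child, Class):
            return with_card(x.child.label, x.card)        # named association
        return with_card(rewrite_down(x.child), x.card)    # unnamed association
    parts = [rewrite_down(r) for r in x.content]
    if isinstance(x, Class):
        parts = [rewrite_down(a, x) for a in x.attrs] + parts
        return "(" + ", ".join(parts) + ")"
    if x.kind == "choice":
        return "(" + " | ".join(parts) + ")"
    if x.kind == "set":
        return "{" + ", ".join(parts) + "}"
    return "(" + ", ".join(parts) + ")"                     # sequence


def psm_2_rtg(schema_class):
    """Return (initial non-terminals, production rules) for a normalized schema."""
    rules = {}

    def visit(node):
        if isinstance(node, Class):
            for a in node.attrs:
                prefix = "@" if a.xform == "a" else ""
                rules[nt_attr(node, a)] = f"{prefix}{a.name}[{a.type}]"
        for r in node.content:
            if r.name is not None and isinstance(r.child, Class):
                rules[r.child.label] = f"{r.name}[{rewrite_down(r.child)}]"
            visit(r.child)

    initial = [r.child.label for r in schema_class.content]
    visit(schema_class)
    return initial, rules


# Hypothetical schema: a purchase with a customer and a sequence of items.
customer = Class("Customer", attrs=[Attr("name", "string"), Attr("id", "string", "a")])
item = Class("Item", attrs=[Attr("amount", "integer")])
items = Model("sequence", [Assoc(item, "item", "1..*")])
purchase = Class("Purchase", attrs=[Attr("date", "date")],
                 content=[Assoc(customer, "customer"), Assoc(items)])
schema = Class("Schema", content=[Assoc(purchase, "purchase")])

initial, rules = psm_2_rtg(schema)
for nonterminal, rhs in rules.items():
    print(nonterminal, "->", rhs)
```

The design mirrors the text in that only attributes and classes reached through named associations receive their own production rules, while unnamed associations and content models contribute only sub-expressions to the production rule of the enclosing class.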

66 22 M. Nečaský et al. / Data & Knowledge Engineering 72 (2012) 1 30 (a) PSM schema (b) Translation into regular tree grammar (c) Interpretation Fig. 16. PSM schema and its translation to regular tree grammar. I is surjective for the range S c names S a because a non-terminal is generated and mapped for each attribute and named association going to a class. No other terminals in addition to the mapped ones were generated and, therefore, I is total. Example 6.4. Translation of a PSM schema to a regular tree grammar is demonstrated in Fig. 16. Fig. 16(a) shows a PSM schema and Fig. 16(b) shows the result of translation by the function psm-2-rtg. The translation also generates an interpretation depicted in Fig. 16(c). Note that the translation function is straightforward. In the previous example, a more advanced transformation function could unify the non-terminals Name1 and Name2 or Phone1 and Phone2 since they have the same production rules. This is a kind of optimization which could be expressed as a set of optimization rules similar to Section 6.2. However, this is beyond the scope of this paper Translating a regular tree grammar to a PSM schema Second, we show how regular tree grammar G may be translated to a normalized PSM schema S modeling G. We introduce a function rtg-2-psm which takes G as an input and outputs S. Its formal semantics is specified by a single rule in Fig. 17. The rule specifies that each initial non-terminal Z is rewritten to an association with the schema class as a parent. To rewrite Z,a function rewrite-up is applied. The formal semantics of rewrite-up is specified by rules depicted in Fig. 18. The upper part of each rule specifies a part of the grammar while the lower part determines constructed components of the resulting PSM schema. The function always returns an association. In particular, the rule in Fig. 18(a) rewrites an XML element declaration with simple content Z to an attribute A. 
A is put into a new auxiliary class C and an association R going to C is returned. The rule in Fig. 18(b) works similarly but for XML attribute declarations. The rule in Fig. 18(c) rewrites an XML element declaration with complex content Z to a class C. C has no attributes and its content consists of a single association $R_1$, which is the result of rewriting the regular expression of the production rule of Z. The rule returns a new association R with C as its child. The rule in Fig. 18(d) rewrites a regular expression $re_1$ with a cardinality constraint m..n to an association R which is the result of rewriting $re_1$, and augments it with the cardinality m..n. The rules in Fig. 18(e), (f) and (g) rewrite a sequence, choice and set model re to a new sequence, choice or set content model M, respectively. The content of M consists of the associations that result from rewriting the parts of re. Each of these rules returns a new association R with M as its child.

The rewriting rules specifying the formal semantics of rewrite-up also generate a mapping I of non-terminals in G to components of S. In particular, the rules in Fig. 18(a) and (b) generate interpretations of attributes, while the rule in Fig. 18(c) generates interpretations of classes. The following theorem shows that I is an interpretation of G against S. In other words, the translation algorithm is correctly defined.

Theorem 6.3. Let G be a regular tree grammar. Let S be a PSM schema and I a mapping of non-terminals of G to S that result from applying the function rtg-2-psm to G. Then I is an interpretation of G against S.

Proof. We show that for each non-terminal Z of G, I(Z) meets the conditions of Definition 5.3. In particular:

Let Z be an initial non-terminal of G. Then from the rules in Figs. 17 and 18(c) we get an association R whose parent is the schema class and whose child is a class C such that $C \in S_c$ and $(Z, C) \in I$.

Let $(Z \to t[D]) \in P$. Then from the rule in Fig. 18(a) we get that if $(Z, A) \in I$, then $A \in S_a$, $name(A) = t$, $type(A) = D$ and $xform(A) = e$.

Fig. 17. Formal semantics of rtg-2-psm translation function.
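As an aside, the bottom-up rewriting performed by rtg-2-psm can be sketched in a few lines of Python. The sketch is a simplified illustration, not the rules of Figs. 17 and 18: it assumes a non-recursive grammar, handles only sequences, and omits the schema class. The grammar encoding (tuples tagged "simple", "attr" and "complex") and the names Assoc, PsmClass, SequenceModel, rtg_to_psm and count_classes are hypothetical.

```python
from dataclasses import dataclass, field, replace


@dataclass
class Assoc:
    name: str      # "" stands for an unnamed association
    card: str      # cardinality, e.g. "1..1" or "1..*"
    child: object  # PsmClass or SequenceModel


@dataclass
class PsmClass:
    attributes: list = field(default_factory=list)  # (name, type, xform)
    content: list = field(default_factory=list)     # associations


@dataclass
class SequenceModel:
    content: list = field(default_factory=list)     # associations


def rtg_to_psm(grammar, start):
    """Rewrite a non-recursive grammar bottom-up; return (association, mapping I)."""
    interpretation = {}

    def rewrite_up(nt):
        kind, label, body = grammar[nt]
        if kind in ("simple", "attr"):
            # Element with simple content or XML attribute: one PSM attribute
            # placed in its own auxiliary class.
            attribute = (label, body, "e" if kind == "simple" else "a")
            interpretation[nt] = attribute
            return Assoc("", "1..1", PsmClass(attributes=[attribute]))
        # Element with complex content: a class whose only content is an
        # association to a sequence model built from the rewritten parts.
        parts = [replace(rewrite_up(child), card=card) for child, card in body]
        cls = PsmClass(content=[Assoc("", "1..1", SequenceModel(parts))])
        interpretation[nt] = cls
        return Assoc(label, "1..1", cls)

    return rewrite_up(start), interpretation


def count_classes(assoc):
    """Count the classes reachable from an association."""
    own = 1 if isinstance(assoc.child, PsmClass) else 0
    return own + sum(count_classes(a) for a in assoc.child.content)


grammar = {
    "Z1": ("complex", "purchase", [("Z2", "1..1"), ("Z3", "1..1"), ("Z4", "1..*")]),
    "Z2": ("simple", "code", "string"),
    "Z3": ("attr", "version", "string"),
    "Z4": ("complex", "item", [("Z5", "1..1")]),
    "Z5": ("simple", "name", "string"),
}
root, interpretation = rtg_to_psm(grammar, "Z1")
print(count_classes(root))
```

The toy grammar has five non-terminals and the sketch produces five classes, three of them auxiliary classes holding a single attribute, which is the kind of redundancy that optimization rules are meant to remove.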

Let Z be defined by a production rule for an XML attribute declaration. The case is similar to the previous one.

Let $(Z \to t[re]) \in P$. Then from the rule in Fig. 18(c) we find that if $(Z, C) \in I$, then $C \in S_c$. The rule also created $R \in S_r$ such that $child(R) = C$ and $name(R) = t$. The required relation also holds between re and C. This can be proved easily since the rewriting rules are constructed directly according to the definition of this relation.

I is total because an attribute or a class with a named parent association is generated for each XML element or XML attribute declaration and the respective mapping is added to I. No attributes or classes other than those with mapped non-terminals were generated and, therefore, I is surjective onto the classes with a named parent association and the attributes of S.

Fig. 18. Formal semantics of rewrite-up function: (a) XML element declaration with simple content rewriting rule, (b) XML attribute declaration rewriting rule, (c) XML element declaration with complex content rewriting rule, (d) cardinality rewriting rule, (e) sequence model rewriting rule, (f) choice model rewriting rule, (g) set model rewriting rule.

The function rtg-2-psm generates a normalized PSM schema. However, the schema is far from optimal. The content of each class contains only a single association, which goes to a sequence content model. Also, a separate class is created for each attribute. Therefore, we consider applying the optimization rules introduced in Section 6.2 to the generated PSM schema.

Example 6.5. Consider the regular tree grammar depicted in Fig. 19(a). The PSM schema generated by the function rtg-2-psm is depicted in Fig. 19(b). It can be seen that the resulting PSM schema is not very transparent. To optimize its structure, we apply the optimization rules depicted in Fig. 12(a) and (b). The result of the optimization is depicted in Fig. 19(c). Fig. 19(d) shows the created interpretation of the grammar against both PSM schemas.

7. Implementation

The results presented in this paper are implemented in a new tool, exolutio. It is available both as a desktop and a browser-based application. It fully implements the formal model presented in Definitions 4.1 and 4.2. It currently supports working with PIM and PSM schemas in the top-down direction, as described in detail in Section 4.3. When a designer creates a PSM schema on the basis of a PIM schema, its interpretation against the PIM schema is created and further maintained. The tool prevents changes that would break the conditions given by Definition 5.2. exolutio also supports normalization and optimization of PSM schemas as described in Sections 6.1 and 6.2. It also implements the algorithms for translating PSM schemas to regular tree grammars as described in Section 6.3. In addition, the tool implements other algorithms which were not presented in this paper. These mainly concern maintaining and augmenting PIM and PSM schemas and interpretations when a designer performs various changes. We call this process adaptation of PIM and PSM schemas. However, this is beyond the scope of this paper.

We tested the implementation on two different sets of XML schemas. The former are the XML schemas of the National Register for Public Procurement (NRPP) (see Section 2). We kept its PIM schema simple for basic testing purposes. It has 4 classes interconnected by 9 associations. The vocabulary comprises 17 XML schemas, each modeled by a separate PSM schema. The latter are XML

schemas used for communication among healthcare information systems in the Czech Republic, e.g., hospitals, laboratories, insurance companies, etc. It has a much more complicated PIM schema comprising 54 classes and around 100 associations. The vocabulary comprises more than 20 XML schemas. A screenshot of our tool is depicted in Fig. 20. It depicts the health-care PIM schema and two of its PSM schemas.

Fig. 19. PSM schema and its translation to regular tree grammar: (a) regular tree grammar, (b) PSM schema, (c) optimized PSM schema, (d) interpretation.

We tested the forward-engineering direction of designing XML schemas on both scenarios. The experiments showed that the tool significantly accelerates the design of XML schemas and prevents design errors caused by modeling the application domain differently in different XML schemas. In each designer's step at the PSM level (e.g. creating or removing a class or an attribute), the tool checks that the interpretation conditions (Definition 5.2) are not violated. In the early steps, it is necessary to create the PIM schema. This requires additional work. However, it is compensated in the later steps when the particular PSM schemas of the XML schemas are derived from the PIM schema. In both cases, our approach significantly improved the readability of both sets of XML schemas. Now, a designer can easily see which part of the application domain (i.e. PIM schema) corresponds to a selected part of an XML schema. Conversely, (s)he may easily see where in the XML schemas a selected component of the PIM schema is represented.

8. Related work

As we have already discussed, there exist various XML schema languages, e.g. DTD, XSD or RELAX NG, for the specification of XML schemas. There are also formal approaches based on regular tree grammars [61]. However, these approaches are not very user-friendly.
Therefore, various approaches for designing XML schemas at a more abstract, conceptual level were introduced. In comparison to our approach, none of these other approaches considers a formal binding between XML schemas and conceptual schemas. They only show how a conceptual schema is translated to an XML schema or vice versa, but they do not prove that the translation is correct and that the conceptual schema specifies the XML schema unambiguously. They also do not consider the case when several XML schemas need to be designed, which was the motivation for our work.


Fig. 20. exolutio screenshot.

Note that our work can be naturally extended with other recent approaches. For example, [4] proposes an approach to designing distributed XML documents, i.e. XML documents which span multiple machines. Various methods from the literature can be used to ensure the quality of the created PIM and PSM schemas [16,12]. There are also works which study the specifics of the quality of designed XML schemas, e.g. naming and design rules (NDR) [17].

Top-down approaches

The top-down approaches are based on first designing a conceptual schema and then translating it to one or more XML schemas. Current approaches, however, consider only automatic translation to a single XML schema. This is different from our approach, which is based on interaction with a designer to derive several different XML schemas from the conceptual schema according to the needs of the system. We can divide the top-down approaches into three groups.

First, there are approaches based on the ER model (e.g. [5,11,31,34,36,50,51,54,55,58,72,76]). For their detailed overview we refer to our survey [63] or other surveys [21,42,88]. These approaches propose extensions of entity and relationship types and translate ER diagrams to corresponding XML schemas accordingly. Their common characteristic is that they force a designer to model an XML schema directly in the ER model. This can be a disadvantage, because the designer must concentrate on XML-specific implementation details at the conceptual level. Another problem is that these approaches consider the design of only a single separate XML schema, but not of a set of XML schemas that describe XML representations of the same data in different types of XML documents. If we need to design such a set, we must create a separate conceptual diagram for each XML schema in the set. The result is a set of conceptual diagrams that are not explicitly interrelated.
Second, there are approaches based on the model of UML class diagrams (e.g. [15,26,30,48,62,75,77]). Their detailed surveys may be found in [14,29,42]. [14] concentrates on representing constructions of XML Schema in UML class diagrams, whereas [29] concentrates on translating UML class diagrams to XML schemas. [42] concentrates on a broad range of techniques for achieving metadata interoperability. The UML-based approaches usually follow the MDD concept. First, a conceptual diagram in a PIM is designed. It provides an abstract description of the data independently of its representations in target data models such as relational or XML. Second, a diagram in a PSM is designed. It specifies how the components of the PIM diagram are represented in XML. From the PSM diagram, an XML schema is derived. As PIM, UML class diagrams are used; as PSM, UML class diagrams extended with a suitable UML profile are used. The UML profile is composed of stereotypes that allow us to model XML

representations of the data. The approaches propose stereotypes intended for a certain XML schema language, mainly XML Schema (e.g. [15,30,62,75]). However, the problem is that they consider an automatic translation of a PIM diagram to a PSM diagram. The goal is to derive an optimal XML representation of the data, but these approaches do not specify what the word optimal means. Moreover, there are only very restricted possibilities to influence the resulting XML representation. They do not consider deriving multiple XML representations of the data modeled by the PIM diagram, which was required in our motivating example. Therefore, their modeling power is similar to that of the ER-based approaches.

Third, there are approaches which introduce their own conceptual model, e.g., semantic networks for XML [74], hypergraphs [35], or ORA-SS [28]. Similar to other approaches, an ORA-SS conceptual schema only allows for designing one particular XML schema. However, an important contribution of ORA-SS is the possibility to represent n-ary relationships (i.e. relationships with more than 2 participants). We extended this approach in our previous work [64,70]. In [83], a visual language XCDL based on Colored Petri Nets is proposed. The language is not meant for conceptual modeling but for composing XML data by non-experts. Recently, approaches which use ontologies for XML modeling have also appeared. For example, [90] studies reasoning methods for XML using ontologies and [13,85] study the conversion of XML schemas to ontologies. Approaches based on Object-Role Modeling (ORM) [38] also exist [56].

Bottom-up approaches

The bottom-up approaches consider existing XML schemas and use them to derive a conceptual schema. Once again there are approaches considering the ER model [34,71] or UML class diagrams [44,87]; a recent survey of them can be found in [89].
These approaches, however, do not consider the fact that there may exist multiple different XML schemas for which a single common conceptual schema is necessary. Moreover, they do not consider that a conceptual schema may already exist and need to be augmented according to a new set of XML schemas. Therefore, if we apply these approaches to a set of XML schemas, we get a set of separate conceptual schemas, each being the result of an automated reverse engineering of the respective XML schema. These conceptual schemas are not interrelated, neither mutually nor with an existing conceptual schema. [71] extends these approaches with the possibility to define mapping rules. A designer may, e.g., specify that all elements named address from a given set of XML schemas are mapped to the same concept in the conceptual schema. However, the approach does not provide any automation of this process. In our previous work [47,66] we have introduced a method which allows for semi-automatic derivation of a conceptual schema from a given set of XML schemas. The method only suggests derivations; the final decision must be made by a human expert.

General schema mapping approaches

If we look at our target problem from a general point of view, we can find many similarities to general schema mapping approaches. In particular, we have two heterogeneous schemas and we need to find a mapping between them. Also, our motivations (see Section 4.2) are very close to the motivations of general schema mapping. The number of approaches in this area is enormous [32,33], even if we restrict ourselves only to XML schema-related mapping [6,7,9,23,80]. Consequently, there exist theoretical studies [79], taxonomies [73], as well as efficiency evaluations [27] of the approaches. [52] discusses not only the problem of matching but also query rewriting between an integrated global schema and the original local schemas. In general, these approaches can be classified in various ways.
From our point of view we may distinguish between approaches that map the input schemas to each other directly and those that exploit a common general schema. Our approach is closer to the latter group, where our PIM model can serve as the common schema. All the approaches are based on exploiting a similarity measure which is used to find the related parts of the schemas to be mapped. From one point of view our approach can be viewed as a special use case since we consider only two specific types of schemas. However, our aim is much more complex, since we are not looking for similar concepts, but for those that represent each other directly. Moreover, since our key aim is the subsequent correct management of the evolution of the system, we need to define the mapping rigorously, so that change propagation can be done correctly. Hence, we need the precise formal background described above. On the other hand, the matching approaches can be exploited in the process of reverse-engineering as an aid for the user in finding the correct mapping [47].

Other approaches focus on the mapping and distribution of data and documents at the extensional level. An example is [45], which deals with bidirectional mapping and data update propagation in an environment with multiple nodes with possibly different schemas. This is, however, beyond the scope of this paper.

9. Open problems

Despite the detailed descriptions given in this paper, a number of open problems and issues remain to be solved. Some of them are closely related to this work, and some come from slightly different but still related areas that we consider important for our current and further research. The key issues are as follows:

Advanced XML Schema Constructs: As mentioned in Section 3, we did not cover all the constructs of XML schemas.
Among the less important aspects let us mention mixed-content elements, elements with simple content and attributes, or a root element with simple content. A more important aspect is the wide support of the derivation of complex types in

XML Schema using restriction or extension, i.e. type inheritance [37]. From the syntactic point of view it can be considered syntactic sugar. However, from the point of view of conceptual modeling it brings a new and strong concept that bears important information to be modeled. Hand in hand with inheritance at the PSM level goes inheritance support at the PIM level, where we can adopt the standard inheritance constructs present in the model of UML class diagrams.

Evolution of XML Schemas: An important though often omitted aspect of XML applications is the fact that in most cases user requirements sooner or later change. From our point of view of data modeling this means that any of the XML schemas may change and such a change needs to be propagated to all the affected parts of the system. Most of the current approaches [20,30,46,78,82] focus on the propagation of changes from the XML schema (or its model) to the respective XML documents. In this area we have recently proposed an approach [53] for automatic generation of the respective XSLT scripts. But, as we have indicated in our motivating scenarios, a typical XML application involves a number of related XML schemas, so such a change needs to be propagated to all of them as well. Our preliminary results in this area can be found in [68,69], where we propose a five-level evolution framework, the respective operations and the propagation of the changes they trigger while considering a set of related XML schemas.

Versioning of XML Schemas: An even more complex issue than the evolution of XML schemas, i.e. the situation when we need to adapt the system to the required changes, is versioning. Here we consider a situation when a current version of a schema is modified, but we need to preserve both (or all) the versions, since different versions are used by different users.
Consequently, the required modifications and transformations must be kept in the system and activated as needed. Currently, there exist several approaches [49,59,60] trying to solve this issue; however, they consider only a single schema having a number of versions, not a system of related schemas as we consider in this paper.

Reverse-Engineering: A situation when we are provided with a new incoming XML schema can be considered a special case of evolution. The schema can be automatically transformed to the respective PSM schema; however, the problem is how to find and construct its interpretation against the PIM schema. In [47,66] we have proposed a partial solution. However, the issue has not yet been fully solved, since the interpretation cannot be found automatically and efficiently. We showed that the efficiency of the search algorithm is strongly influenced by the size of the respective PIM schema and that in complex cases the whole space containing solutions cannot be searched in an acceptable time. In addition, to find a reasonable and natural interpretation, we need to exploit user interaction, adapt the search process to the given information, etc.

Integrity Constraints: In this paper we have considered just the structural aspect of XML data modeling. However, apart from the structure of the XML tree, we can also express various integrity constraints (ICs). The most common types of ICs are keys and foreign keys. In DTD they are expressed using the simple data types ID and IDREF(S) and are valid in the context of the whole document. In XML Schema we are provided with the constructs unique, key, and keyref which enable one to specify the context/scope of the constraint. Apart from that, XML Schema 1.1 adds the construct assert, which is similar to the assert and report elements of Schematron rules and enables us to express more complex conditions.
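The difference between a document-wide ID/IDREF constraint and a key or keyref declared within a scope can be illustrated with a small check. The following Python sketch does not validate against XML Schema; it only mimics, with the standard xml.etree.ElementTree module, what a key and a keyref declared in the scope of one element require. The sample document, its element and attribute names, and the helper check_key_and_keyref are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<orders>
  <customers>
    <customer id="c1"/>
    <customer id="c2"/>
  </customers>
  <order customer="c1"/>
  <order customer="c3"/>
</orders>"""


def check_key_and_keyref(scope, key_path, key_attr, ref_path, ref_attr):
    """Return (duplicate key values, dangling references) within one scope element."""
    keys = [e.get(key_attr) for e in scope.findall(key_path)]
    # A key requires its values to be unique inside the scope.
    duplicates = sorted({k for k in keys if keys.count(k) > 1})
    # A keyref requires every referencing value to match some key value.
    references = {e.get(ref_attr) for e in scope.findall(ref_path)}
    dangling = sorted(references - set(keys))
    return duplicates, dangling


root = ET.fromstring(DOC)
duplicates, dangling = check_key_and_keyref(
    root, "customers/customer", "id", "order", "customer")
print(duplicates, dangling)
```

Because the check is evaluated relative to the scope element passed in, the same key values may legitimately repeat under a different scope element, which is what the document-wide ID type cannot express.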
Basic foundations and classifications of XML keys and a discussion of the related decision problems can be found in [19]; the satisfiability of key specifications is studied in [10,39]. Other kinds of XML integrity constraints are studied in [40,41]. For integrity constraints we could focus on the same aspects, with similar advantages, as for the structural constraints discussed in this paper. We use UML at the PIM level, and UML uses the Object Constraint Language (OCL) [1] to define general constraints. In a similar manner as we extended UML for the PSM level, we will extend OCL for modeling constraints at the PSM level too. We can then propose a transformation algorithm and, hence, provide a formal basis for change propagation of ICs.

Modeling of Standards: An important use case of our approach is the modeling of XML standards for XML data exchange. There are several problems for which we can help to find solutions. First, XML standards tend to be enormous, resulting in very large and messy XML schemas which are not easy to understand. Using our conceptual model, it is possible to break down large schemas into separate PSM schemas while maintaining connections between them, which makes them easier to navigate and understand. The second problem is the change management of standards. Today, when a standard changes, the changes are described in a text file and every user needs to adopt the changes manually. Currently, we aim at creating a language for the description of changes in XML standards so that a user can adopt the changes at least semi-automatically.

10. Conclusions

In this paper, we have identified several problems present in the current approaches to conceptual modeling for XML. From the practical point of view, current approaches do not consider the fact that various XML schemas are usually applied in a single information system. This leads to designing a separate conceptual schema for each XML schema. This is, however, not very practical.
From the theoretical point of view, current approaches lack a strong formal basis. This does not enable their authors to prove the correctness of the proposed conceptual models. To deal with these problems, we applied the strategy of MDD. We showed that a conceptual schema in a platform-independent model (PIM schema) may be designed and the XML schemas may then be derived from the PIM schema in the form of conceptual schemas in a platform-specific model (PSM schemas). We showed various advantages of such an approach. Naturally, approaches based on MDD principles have already appeared. However, none of them has considered the possibility of designing several PSM schemas on the basis of a single common PIM schema. The main contribution of this paper is the proposed formalism built on top of regular tree grammars. We first bound the PIM schema with the PSM schemas and a PSM schema with a regular tree grammar modeled by the PSM schema. We called these bindings interpretations. On this basis we proved that the introduced conceptual model is correct, i.e. that a PSM schema in this model specifies an XML schema unambiguously. We also introduced algorithms for translating PSM schemas to regular tree

grammars and vice versa and we showed that they are correct, i.e. that an interpretation always exists. For the purposes of the translation we introduced the notion of PSM schema normalization, which hides parts of the PSM schema that cannot be reflected in the modeled regular tree grammar. We also discussed the problem of PSM schema optimization, i.e. how a PSM schema may be adjusted so that it is more transparent for a human designer but still models the same regular tree grammar.

Despite the level of detail of this paper, we did not solve all the related problems; the key open areas are discussed in Section 9. In our current work we focus mainly on a full implementation of the model and the related approaches that we have proposed so far, mainly in the area of XML schema evolution. In our next paper we will focus primarily on the indicated open issues, in particular an extension of the model and related areas with inheritance and integrity constraints, which form a key aspect that is currently omitted due to its complexity.

Acknowledgment

This work was supported by the Czech Science Foundation (GAČR), grant numbers P202/11/P455, 201/09/P364 and P202/10/0573.

References

[1] Object Constraint Language Specification, version 2.2, Object Management Group, URL
[2] UML Infrastructure Specification 2.4, Object Management Group, URL
[3] UML Superstructure Specification 2.4, Object Management Group, URL
[4] S. Abiteboul, G. Gottlob, M. Manna, Distributed XML Design, The Computing Research Repository (CoRR) abs/ ,
[5] R. Al-Kamha, D.W. Embley, S.W. Liddle, Augmenting traditional conceptual models to accommodate XML structural constructs, Proceedings of 26th International Conference on Conceptual Modeling, Springer, Auckland, New Zealand, 2007, pp
[6] A. Algergawy, R. Nayak, G. Saake, Element similarity measures in XML schema matching, Information Sciences 180 (24) (2010)
[7] A. Algergawy, E. Schallehn, G. Saake, Improving XML schema matching performance using Prüfer sequences, Data & Knowledge Engineering 68 (8) (2009)
[8] Altova Inc., XML Spy URL
[9] S. Amano, L. Libkin, F. Murlak, XML schema mappings, 28th Symposium on Principles of Database Systems (PODS), ACM, 2009, pp
[10] M. Arenas, W. Fan, L. Libkin, On the complexity of verifying consistency of XML specifications, SIAM Journal on Computing 38 (2008)
[11] A. Badia, Conceptual modeling for semistructured data, Proceedings of the 3rd International Conference on Web Information Systems Engineering Workshops, IEEE, Singapore, 2002, pp
[12] C. Batini, D. Barone, C. Federico, S. Grega, A data quality methodology for heterogeneous data, International Journal of Database Management Systems 3 (1) (2011)
[13] I. Bedini, G. Gardarin, B. Nguyen, Deriving ontologies from XML schema, The Computing Research Rep. (CoRR) abs/ ,
[14] M. Bernauer, G. Kappel, G. Kramler, Representing XML Schema in UML A Comparison of Approaches, in: Web Engineering, Lecture Notes in Computer Science,
[15] M. Bernauer, G. Kappel, G. Kramler, Representing XML Schema in UML An UML Profile for XML Schema, Tech. Rep. November 2003, Department of Computer Science, National University of Singapore,
[16] N. Bolloju, F.S.K. Leung, Assisting novice analysts in developing quality conceptual models with UML, Communications of the ACM 49 (7) (2006)
[17] N. Bolloju, F.S.K. Leung, A framework for XML schema naming and design rules development tools, Computer Standards & Interfaces 32 (4) (2010)
[18] T. Bray, J. Paoli, C.M. Sperberg-McQueen, E. Maler, F. Yergeau, Extensible Markup Language (XML) 1.0 (Fifth Edition), W3C URL, TR/xml/.
[19] P. Buneman, S. Davidson, W. Fan, C. Hara, W.-C. Tan, Reasoning about keys for XML, Information Systems 28 (2003)
[20] F. Cavalieri, G. Guerrini, M. Mesiti, Updating XML schemas and associated documents through EXUP, 27th International Conference on Data Engineering, IEEE Computer Society, 2011, pp
[21] H. Chen, H. Liao, A survey to conceptual modeling for XML, 3rd IEEE International Conference on Computer Science and Information Technology (ICCSIT), IEEE Computer Society, Washington, DC, USA, 2010, pp
[22] P. Chen, The entity-relationship model toward a unified view of data, ACM Transactions on Database Systems 1 (1) (1976)
[23] R. Cheng, J. Gong, D.W. Cheung, Managing uncertainty of XML schema matching, 26th International Conference on Data Engineering (ICDE), IEEE, 2010, pp
[24] J. Clark, M. Makoto, RELAX NG Specification, Oasis URL, December
[25] C. Combi, B. Oliboni, Conceptual modeling of XML data, Proceedings of the 2006 ACM symposium on Applied computing, SAC'06, ACM, New York, NY, USA, 2006, pp
[26] R. Conrad, D. Scheffner, J. Christoph Freytag, XML Conceptual Modeling Using UML, in: Conceptual Modeling ER 2000, Lecture Notes in Computer Science,
[27] H. Do, S. Melnik, E. Rahm, Comparison of schema matching evaluations, Revised Papers from the NODe 2002 Web and Database-Related Workshops on Web, Web-Services, and Database Systems, Springer-Verlag, London, UK, 2003, pp
[28] G. Dobbie, T.W. Ling, Normal form ORA-SS schema diagrams, Encyclopedia of Database Systems, Springer, US, 2009, pp
[29] E. Dominguez, J. Lloret, B. Perez, A. Rodriguez, A. Rubio, M. Zapata, A Survey of UML Models to XML Schemas Transformations, in: Web Information Systems Engineering WISE 2007, Lecture Notes in Computer Science,
[30] E. Dominguez, J. Lloret, B. Perez, A. Rodriguez, A.L. Rubio, M.A. Zapata, Evolution of XML schemas and documents from stereotyped UML class models: a traceable approach, Information and Software Technology 53 (1) (2011)
[31] R. Elmasri, Q. Li, J. Fu, Y.-C. Wu, B. Hojabri, S. Ande, Conceptual modeling for customized XML schemas, Data & Knowledge Engineering 54 (1) (2005) st International Conference on Conceptual Modeling.
[32] J. Euzenat, et al., Ontology alignment evaluation initiative: six years of experience, Journal on Data Semantics 15 (2011)
[33] J. Euzenat, P. Shvaiko, Ontology Matching, Springer, Berlin, Heidelberg,
[34] J. Fong, S.K. Cheung, H. Shiu, The XML tree model toward an XML conceptual schema reversed from XML schema definition, Data & Knowledge Engineering 64 (3) (2008)
[35] J. Fong, W. Mok, H. Li, Design Non-recursive and Redundant-Free XML Conceptual Schema with Hypergraph (Extended Abstract), in: Database Systems for Advanced Applications, Lecture Notes in Computer Science,
[36] M. Franceschet, D. Gubiani, A. Montanari, C. Piazza, From entity relationship to XML schema: a graph-theoretic approach, Eighteenth Italian Symposium on Advanced Database Systems SEBD, Esculapio Editore, 2010, pp
[37] S. Gao, C.M. Sperberg-McQueen, H.S. Thompson, W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures, W3C, July URL org/tr/xmlschema11-1/.
[38] T. Halpin, T. Morgan, Information Modeling and Relational Databases, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
[39] S. Hartman, S. Link, Expressive, yet tractable XML keys, 12th International Conference on Extending Database Technology, vol. 360 of ACM International Conference Proceedings Series, ACM, 2009, pp
[40] S. Hartman, S. Link, Numerical constraints on XML data, Information Computing 208 (5) (2010)
[41] S. Hartman, S. Link, T. Trinh, Solving the implication problem for XML functional dependencies with properties, 17th International Workshop on Logic, Language, Information and Computation (WoLLIC), vol of Lecture Notes in Computer Science, Springer, 2010, pp
[42] B. Haslhofer, W. Klas, A survey of techniques for achieving metadata interoperability, ACM Computing Surveys 42 (2) (2010)
[43] ISO, Information Technology Document Schema Definition Languages (DSDL) Part 3: Rule-based Validation Schematron, ISO/IEC , Feb
[44] M.R. Jensen, T.H. Møller, T.B. Pedersen, Converting XML DTDs to UML diagrams for conceptual data integration, Data & Knowledge Engineering 44 (3) (2003)
[45] G. Karvounarakis, Z.G. Ives, Bidirectional mappings for data and update exchange, WebDB,
[46] M. Klettke, Conceptual XML Schema Evolution The CoDEX approach for design and redesign, BTW'07, Aachen, Germany, 2007, pp
[47] J. Klímek, M. Nečaský, Semi-automatic integration of web service interfaces, ICWS, IEEE Computer Society, 2010, pp
[48] T. Krumbein, T. Kudrass, Rule-based generation of XML schemas from UML class diagrams, Berliner XML Tage, 2003, pp
[49] P. Leitner, A. Michlmayr, F. Rosenberg, S. Dustdar, End-to-end versioning support for web services, 2008 IEEE International Conference on Services Computing, vol. 1, pp
[50] B. Loscio, A. Salgado, L. Galvao, Conceptual modeling of XML schemas, Proceedings of the Fifth ACM CIKM International Workshop on Web Information and Data Management, New Orleans, USA, 2003, pp
[51] Z. Ma, L. Yan, Fuzzy XML data modeling with the UML and relational data models, Data & Knowledge Engineering 63 (3) (2007)
[52] S.K. Madria, K. Passi, S.S. Bhowmick, An XML Schema integration and query mechanism system, Data & Knowledge Engineering 65 (2) (2008)
[53] J. Malý, J. Klímek, I. Mlýnková, M. Nečaský, XML document versioning and revalidation, DATESO, 2011, pp
[54] M. Mani, Semantic data modeling using XML schemas, Proceedings of the 20th International Conference on Conceptual Modeling, Springer, Yokohama, Japan, 2001, pp
[55] M. Mani, EReX: a conceptual model for XML, Proceedings of the Second International XML Database Symposium, Toronto, Canada, 2004, pp
[56] I. Melleri, Using Object Role Modeling in a Service-Oriented Data Integration Project, in: On the Move to Meaningful Internet Systems: OTM 2010 Workshops, Lecture Notes in Computer Science,
[57] J. Miller, J. Mukerji, MDA Guide Version 1.0.1, Object Management Group (2003). URL
[58] W.Y. Mok, J. Fong, D.W. Embley, Extracting a largest redundancy-free XML storage structure from an acyclic hypergraph in polynomial time, Information Systems 35 (2010)
[59] H.J. Moon, C.A. Curino, A. Deutsch, C.-Y. Hou, C. Zaniolo, Managing and querying transaction-time databases under schema evolution, The Proceedings of the VLDB Endowment 1 (1) (2008)
[60] H.J. Moon, C.A. Curino, M. Ham, C. Zaniolo, PRIMA: archiving and querying historical data with evolving schemas, SIGMOD'09: Proceedings of the 35th SIGMOD international conference on Management of data, ACM, New York, NY, USA, 2009, pp
[61] M. Murata, D. Lee, M. Mani, K. Kawaguchi, Taxonomy of XML schema languages using formal language theory, ACM Transactions on Internet Technology 5 (4) (2005)
[62] K. Narayanan, S. Ramaswamy, Specifications for mapping UML models to XML, Proceedings of the 4th Workshop in Software Model Engineering, Montego Bay, Jamaica,
[63] M. Nečaský, Conceptual modeling for XML: a survey, Dateso'08, CEUR-WS, Vol. 176, 2006, pp
[64] M. Nečaský, XSEM a conceptual model for XML, APCCM'07, ACS, CRPIT 67, Ballarat, Australia, 2007, pp
[65] M. Nečaský, Conceptual Modeling for XML, vol. 99 of Dissertations in Database and Information Systems Series, IOS Press/AKA Verlag,
[66] M. Nečaský, Reverse engineering of XML schemas to conceptual diagrams, APCCM'09, Australian Computer Society, Wellington, New Zealand, 2009, pp
[67] M. Nečaský, I. Mlýnková, When conceptual model meets grammar: a formal approach to semi-structured data modeling, WISE, 2010, pp
[68] M. Nečaský, I. Mlýnková, Five-level multi-application schema evolution, Proceedings of the Databases, Texts, Specifications, and Objects, DATESO'09, MatfyzPress, April 2009, pp
[69] M. Nečaský, I. Mlýnková, A framework for efficient design, maintaining, and evolution of a system of XML applications, Proceedings of the Databases, Texts, Specifications, and Objects, DATESO'10, MatfyzPress, April 2010, pp
[70] M. Nečaský, J. Pokorný, Conceptual modeling of IS-A hierarchies for XML, EJC'08, IOS Press, Tsukuba, Japan, 2008, pp
[71] G.D. Penna, A.D. Marco, B. Intrigila, I. Melatti, A. Pierantonio, Interoperability mapping from XML schemas to ER diagrams, Data & Knowledge Engineering 59 (1) (2006)
[72] G.
Psaila, ERX: A conceptual model for XML documents, Proceedings of the 2000 ACM Symposium on Applied Computing, ACM, Como, Italy, 2000, pp [73] E. Rahm, P.A. Bernstein, A survey of approaches to automatic schema matching, The VLDB Journal 10 (4) (2001) [74] R. Rajugan, E. Chang, L. Feng, T.S. Dillon, Modeling dynamic properties in the layered view model for XML Using XSemantic Nets, Advanced Web and Network Technologies, and Applications Workshops (APWeb), vol of Lecture Notes in Computer Science, Springer, 2006, pp [75] N. Routledge, L. Bird, A. Goodchild, UML and XML schema, Proceedings of 13th Australasian Database Conference (ADC 2002), ACS, [76] R. Schroeder, R. Mello, Designing XML documents from conceptual schemas and workload information, Multimedia Tools and Applications 43 (2009) [77] A. Sengupta, S. Mohan, R. Doshi, XER extensible entity relationship modeling, Proceedings of the XML 2003 Conference, Philadelphia, USA, 2003, pp [78] M. Shoaran, A. Thomo, Evolving schemas for streaming XML, Theoretical Computer Science 412 (35) (2011) [79] M. Smiljanic, M. van Keulen, W. Jonker, Formalizing the XML Schema Matching Problem as a Constraint Optimization Problem, DEXA, vol of lecture notes in computer science, Springer, 2005, pp [80] S. Sorrentino, S. Bergamaschi, M. Gawinecki, L. Po, Schema normalization for improving schema matching, ER'09, Springer, 2009, pp [81] Sparx Systems, Enterprise Architect v.8. URL [82] H. Su, D. Kramer, L. Chen, K. Claypool, E.A. Rundensteiner, XEM: managing the evolution of XML documents, Proceedings of the 11th International Workshop on research Issues in Data Engineering, IEEE Computer Society, Washington, DC, USA, 2001, pp [83] J. Tekli, R. Chbeir, J. Fayolle, XCDL: an XML-oriented visual composition definition language, 12th International Conference on Information Integration and Web-based Applications and Services, ACM, 2010, pp [84] J. Tekli, R. Chbeir, K. 
Yetongnon, Extensible user-based XML grammar matching, 28th International Conference on Conceptual Modeling, vol of Lecture Notes in Computer Science, Springer, 2009, pp [85] P.T.T. Thuy, Y.-K. Lee, S. Lee, DTD2OWL: automatic transforming XML documents into OWL ontology, ICIS'09, ACM, New York, NY, USA, 2009, pp [86] W3C OWL Working Group, OWL 2 Web Ontology Language Document Overview, W3C, October URL [87] W. Yang, N. Gu, B. Shi, Reverse engineering XML, IMSCCS, 2, 2006, pp [88] A. Yu, An overview of research on reverse engineering XML schemas into UML diagrams, ICITA'05 Volume 2 Volume 02, ICITA'05, IEEE Computer Society, Washington, DC, USA, 2005, pp [89] A. Yu, R. Steele, An overview of research on reverse engineering XML schemas into UML diagrams, ICITA, 2, 2005, pp [90] F. Zhang, L. Yan, Z.M. Ma, Knowledge representation and reasoning of XML with ontology, Proceedings of the 2011 ACM Symposium on Applied computing, ACM, 2011, pp


M. Nečaský et al. / Data & Knowledge Engineering 72 (2012) 1-30

Martin Nečaský received his Ph.D. degree in Computer Science in 2008 from the Charles University in Prague, Czech Republic, where he currently works at the Department of Software Engineering as an assistant professor. He is an external member of the Department of Computer Science and Engineering of the Faculty of Electrical Engineering, Czech Technical University in Prague. His research areas involve XML data design, integration and evolution. He is an organizer/PC chair/member of more than 10 international events. He has published more than 30 papers (two received the Best Paper Award). He has published 3 book chapters and a book.

Irena Mlýnková received her Ph.D. degree in Computer Science in 2007 from the Charles University in Prague, Czech Republic. She is an assistant professor there and an external member of the Department of Computer Science and Engineering of the Czech Technical University. She has published more than 40 publications (16 recorded in WoS), 4 of which gained Best Paper Awards. Her research areas involve management of XML data, structural similarity, analysis of real-world data, synthesis of XML data, XML benchmarking, XML schema inference and application evolution. She is a PC member/reviewer of 14 international events and co-organizer of 3 international workshops.

Jakub Klímek received his Master's degree in Computer Science in September 2009 from the Charles University in Prague, Czech Republic, where he currently is a Ph.D. student at the Department of Software Engineering. His research areas involve XML data design, integration and evolution. He has published 7 refereed conference papers. He is a co-organizer of 1 local workshop.

Jakub Malý received his Master's degree in Computer Science in June 2010 from the Charles University in Prague, Czech Republic, where he currently is a Ph.D. student at the Department of Software Engineering. His research areas involve conceptual modeling of XML data and evolution of XML applications. He has published 2 refereed conference papers.

Chapter 3

Translation of Structural Constraints from Conceptual Model for XML to Schematron

Soběslav Benda, Jakub Klímek, Martin Nečaský

Extended version of a paper which was published in the Proceedings of the Ninth Asia-Pacific Conference on Conceptual Modelling (APCCM 2013), Adelaide, South Australia, January-February 2013. CRPIT 143, Australian Computer Society 2013, ISBN 978-1-921770-28-9. Pages 51-60.

The extended version has been submitted to the Journal of Universal Computer Science and is currently in the major revision status.


Translation of Structural Constraints from Conceptual Model for XML to Schematron

Jakub Klímek (Charles University in Prague, Faculty of Mathematics and Physics, Czech Republic)
Soběslav Benda (Charles University in Prague, Faculty of Mathematics and Physics, Czech Republic)
Martin Nečaský (Charles University in Prague, Faculty of Mathematics and Physics, Czech Republic)

Abstract: Today, XML (eXtensible Markup Language) is a standard for exchange inside and among IT infrastructures. For the exchange to work, an XML format must be negotiated between the communicating parties. The format is often expressed as an XML schema. In our previous work, we introduced a conceptual model for XML, which supports modeling, evolution and maintenance of a set of XML schemas and allows schema designers to export modeled formats into grammar-based XML schema languages like DTD and XML Schema. However, there is another type of XML schema languages called rule-based languages, with Schematron as their main representative. In our preceding conference paper [Benda et al.(2013)] we briefly introduced the process of translation from our conceptual model to Schematron. Expressing XML schemas in Schematron has advantages over grammar-based languages and in this paper, we describe the previously introduced translation in more detail, with focus on structural constraints and how they are represented in Schematron. We also discuss the possibilities and limitations of translation from our grammar-based conceptual model to the rule-based Schematron.

Key Words: XML schema, conceptual modeling, Schematron, translation
Category: D.2.2, H

1 Introduction

XML has many applications in various IT infrastructures. When using XML, communication partners must agree on the used XML formats, i.e. which elements and attributes may be present, in which structure, etc. A specification of an XML format is an XML schema - a collection of rules which XML documents must satisfy. Programs that can automatically verify document validity - adherence to its schema - are called validators. There are a number of declarative languages, called XML schema languages, used for description of schemas. The standardized schema languages are DTD, XML Schema and RELAX NG. These languages differ in some features (e.g. expressive power, syntax complexity, object-oriented design, etc.). A common feature of these languages is their formal background, where each of these languages represents a certain subset of Regular Tree Grammar (RTG), see [Murata et al.(2005)]. Commonly, we call these languages grammar-based schema languages or grammars for short. However, it is possible to express XML schemas in other languages that are not based on RTG. An example of such a language is Schematron [Jelliffe(2001)]. Briefly, Schematron allows designers to describe schemas using XPath conditions that are evaluated over a given XML document during validation. This brings interesting possibilities for the validation of XML documents.

This paper is an extended version of [Benda et al.(2013)]. The extension is in the level of detail of the description of the translation process and in more extensive unified examples.

1.1 Motivation

In our previous work [Nečaský et al.(2012b), Nečaský et al.(2012a)], we developed a methodology for modeling, evolution and maintenance of XML schemas using a multilevel conceptual model based on Model Driven Architecture (MDA) (see [Miller and Mukerji(2003)]). So far, we have only supported grammar-based XML schema languages, because of their popularity due to understandable declarations and efficient validation. While it is true that for relatively simple schemas DTD will do and for more complex structures XML Schema will provide the necessary constructs, there are also drawbacks to these widely used languages. For example, when we validate documents using DTD or XML Schema, we usually get a simple valid/invalid statement as a result. In the more interesting case of invalidity, the validators usually return a built-in error message, which is often incomprehensible or misleading and does not provide means for quality diagnostics [Nálevka(2010)]. In addition, it is often not possible (or user-friendly) to pass such messages directly to the user interface.

Schematron schemas can help with this diagnostic problem. Schematron is often described as a language for description of integrity constraints [Murata et al.(2005)], but it is more than that. Using Schematron, it is possible to describe most constraints that can be expressed by grammars. Moreover, it is possible to describe many additional details and even structural constraints that we cannot express using grammar-based languages like XML Schema. In [Jelliffe(2007)], the authors identify the demand for Schematron-based solutions for XML schema management, which is another motivation for adding support for Schematron to our conceptual model. Finally, when combined with the approach to express integrity constraints in the conceptual model [Malý and Nečaský(2012)], Schematron becomes a unified schema language for description of the structure and integrity constraints of XML documents and a framework for detailed diagnostics and error reporting. These advantages of using Schematron outweigh its main disadvantage, its verbosity and complexity, because that disadvantage is eased by the use of our conceptual model for schema management.

1.2 Contributions

In this paper, we describe our approach to using Schematron as an XML schema language in conceptual modeling for XML. The main contribution of our approach is that with Schematron, we are able to provide better and finer-grained diagnostic outputs during validation of XML documents when compared to validation using XML Schema. Also, certain constructs that are not possible to represent in XML Schema can be represented using Schematron. This paper is an extended version of our conference paper [Benda et al.(2013)]. The main contribution of this extension is the detailed description of the translation of structural constraints present in XML schemas, which we had to omit in the conference paper.

1.3 Outline

The paper is organized as follows: In [Section 2], we introduce our conceptual model for XML. In [Section 3], we introduce the Schematron language. [Section 4] contains the main contribution of this paper, the translation from the conceptual model to Schematron schemas with emphasis on details of the translation of structural constraints. In [Section 5] we discuss related work, [Section 6] contains a brief overview of the implementation and evaluation, and we conclude in [Section 7].

2 Conceptual modeling of XML schemas

In this section, we briefly introduce our conceptual model for XML. For its full description and comprehensive related work see [Nečaský et al.(2012b)]. It is based on two levels of abstraction. The Platform-Independent Model (PIM) models the problem domain independently of any target platform such as XML or relational databases. The Platform-Specific Model (PSM) then provides a description of how a part of the problem domain is represented in the target platform, in our case XML. A PSM schema is therefore a description of an XML format. From a PSM schema, we can automatically create a representation of the format in a chosen XML schema language such as XML Schema, or, in the case of this paper, Schematron. The main feature of the conceptual model is a mapping, which specifies for each component in each PSM schema to which component in the PIM schema it corresponds. We exploit this mapping for automatic propagation of changes between the two levels, which simplifies the management of multiple XML schemas [Nečaský et al.(2012a)].

2.1 Platform-Independent Model

A PIM schema $S$ is based on UML class diagrams and models real-world concepts and relationships between them independently of the target platform (implementation). It contains three types of components: classes, attributes and associations with the usual semantics [OMG(2007a), OMG(2007b)]. A sample PIM schema is in [Figure 1]. Formally, we define its simplification in [Definition 1].

Definition 1. A platform-independent (PIM) schema is a triple $S = (S_c, S_a, S_r)$ of disjoint sets of classes, attributes, and associations, respectively.

Class $C \in S_c$ has a name assigned by function $\mathit{name}$.

Attribute $A \in S_a$ has a name, data type and cardinality assigned by functions $\mathit{name}$, $\mathit{type}$, and $\mathit{card}$, respectively. Moreover, $A$ is associated with a class from $S_c$ by function $\mathit{class}$.

Association $R \in S_r$ is a set $R = \{E_1, E_2\}$, where $E_1$ and $E_2$ are called association ends of $R$. $R$ has a name assigned by function $\mathit{name}$. Both $E_1$ and $E_2$ have a cardinality assigned by function $\mathit{card}$ and are associated with a class from $S_c$ by function $\mathit{participant}$. We will call $\mathit{participant}(E_1)$ and $\mathit{participant}(E_2)$ participants of $R$. $\mathit{name}(R)$ may be undefined, denoted by $\mathit{name}(R) = \lambda$.

For a class $C \in S_c$, we will use $\mathit{attributes}(C)$ to denote the set of all attributes of $C$. Similarly, $\mathit{associations}(C)$ will denote the set of all associations with $C$ as a participant.

Figure 1: Example of a PIM schema

2.2 Platform-Specific Model

The platform-specific model (PSM) specifies how a part of the reality modeled on the PIM level is represented in the target platform, XML in our case, which makes the PSM schemas views of the PIM schema. Its advantage is that the designer works in a UML-style way, which is more comfortable than editing the XML schema itself and also enables the maintenance of mappings to the PIM level. The individual constructs on the PSM level are, however, slightly modified to reflect the structure of XML documents. A PSM schema represents an XML format and can be automatically translated to the XML Schema language. Very briefly, classes represent complex types, attributes represent XML attributes or XML elements that have simple types, and associations represent the nesting relation. For the full translation description see our previous work [Nečaský et al.(2012b)]. Formally, a PSM schema is defined by Definition 2. An example is in Figure 2.

Definition 2. A platform-specific (PSM) schema $S' = (S'_c, S'_a, S'_r, S'_m, C'_{S'})$ is a tuple of disjoint sets of classes, attributes, associations, and content models, respectively, and one specific class $C'_{S'} \in S'_c$ called schema class.

Class $C' \in S'_c$ has a name assigned by function $\mathit{name}$.

Attribute $A' \in S'_a$ has a name, data type, cardinality and XML form assigned by functions $\mathit{name}$, $\mathit{type}$, $\mathit{card}$ and $\mathit{xform}$, respectively. $\mathit{xform}(A') \in \{e, a\}$ (element or attribute). Moreover, it is associated with a class from $S'_c$ by function $\mathit{class}$ and has a position assigned by function $\mathit{position}$ within the attributes associated with $\mathit{class}(A')$.

Association $R' \in S'_r$ is a pair $R' = (E'_1, E'_2)$, where $E'_1$ and $E'_2$ are called association ends of $R'$. Both $E'_1$ and $E'_2$ have a cardinality assigned by function $\mathit{card}$ and each is associated with a class or content model from $S'_c \cup S'_m$ assigned by function $\mathit{participant}$. We will call $\mathit{participant}(E'_1)$ and $\mathit{participant}(E'_2)$ parent and child and will denote them by $\mathit{parent}(R')$ and $\mathit{child}(R')$, respectively. Moreover, $R'$ has a name assigned by function $\mathit{name}$ and has a position assigned by function $\mathit{position}$ within the associations with the same $\mathit{parent}(R')$. $\mathit{name}(R')$ may be undefined, denoted by $\mathit{name}(R') = \lambda$.

Content model $M' \in S'_m$ has a content model type assigned by function $\mathit{cmtype}$. $\mathit{cmtype}(M') \in \{\mathit{sequence}, \mathit{choice}, \mathit{set}\}$. Sequence and set have their usual semantics; choice means that only one of the modeled variants is actually present in the document.

The graph $(S'_c \cup S'_m, S'_r)$ must be a forest of rooted trees with one of its trees rooted in $C'_{S'}$. For $C' \in S'_c$, $\mathit{attributes}(C')$ will denote the sequence of all attributes of $C'$ ordered by position, i.e. $\mathit{attributes}(C') = (A'_i \in S'_a : \mathit{class}(A'_i) = C' \wedge i = \mathit{position}(A'_i))$. Similarly, $\mathit{content}(C')$ will denote the sequence of all associations with $C'$ as a parent ordered by position, i.e. $\mathit{content}(C') = (R'_i \in S'_r : \mathit{parent}(R'_i) = C' \wedge i = \mathit{position}(R'_i))$. We will call $\mathit{content}(C')$ the content of $C'$.

Figure 2: Example of a PSM schema

Note that in the full conceptual model we also consider inheritance. One type of inheritance is content reuse. We use this construct (blue class) in our examples, but do not define it formally, as it would unnecessarily complicate the formalism. It is sufficient to say that the blue class points at another PSM class and reuses that class's content as its own, e.g., ShipAddr and BillAddr reuse Address.

2.3 Interpretation of PSM schema against PIM schema

A PSM schema represents a part of a PIM schema. A class, attribute or association in the PSM schema may be mapped to a class, attribute or association in the PIM schema. The mapping specifies the semantics of classes, attributes and associations of the PSM schema in terms of the PIM schema. The mapping must meet certain conditions to ensure consistency between PIM schemas and the specified semantics of the PSM schema. This mapping is then utilized in various use cases for the conceptual model, like XML schema evolution and integration [Nečaský et al.(2012a)]. See [Nečaský et al.(2012b)] for the precise conditions of the mapping. In this paper, we focus on the translation from a PSM schema to Schematron and therefore, the precise definition of interpretation is not necessary here. Note that in [Figure 2] the gray classes and attributes do not have an interpretation.

2.4 Conceptual model summary

In summary, the usefulness of our conceptual model for XML can be clearly seen when we, for example, ask questions like "In which of our hundred XML schemas used in our system is the concept of a customer represented?" and "What impact on my XML schemas would this particular change on the conceptual level have?". Even better, with our extensions for evolution of XML schemas [Klímek and Nečaský(2010), Nečaský et al.(2012a)] we can make changes to the PIM schema (e.g. change the representation of a customer's name from one string to firstname and lastname) and those changes can be automatically propagated to all the affected PSM schemas. The question of the effect of those changes on the actual data is discussed in [Malý et al.(2011)], where a method for generating XSLT scripts for data updates is proposed. Thanks to automated translations from PSM schemas to e.g. XML Schema and back [Nečaský et al.(2012b)], we can easily manage a whole system of XML schemas from the conceptual level, all thanks to the interpretations. These extensions are not in the scope of this paper; for details see our previous work. Also, it would be possible to generate a clickable HTML documentation of a system modeled using our conceptual model. With the model, it is also much easier and faster to grasp a system of multiple XML schemas when, for example, negotiating interfaces between two information systems. We already have our model implemented in our tool called eXolutio [Klímek et al.(2012)].

3 Schematron

Simply put, Schematron is a language which represents the rule-based XML schema languages. These languages are not based on construction of a grammatical infrastructure. Instead, they use rules resembling if-then-else statements to describe constraints. These languages offer the finest granularity of control over the format of the documents [Vlist(June 2002)]. We can even view constructs of other schema languages as syntactic sugar used instead of sets of rule-based conditions. Schematron was designed in 1999 by Rick Jelliffe. The language was standardized in 2005 as ISO Schematron [Jelliffe(2001)]. However, it is not standalone. It is a general framework which allows schema designers to organize conditions which are evaluated over the given documents. These conditions are described using an underlying XML query language such as the default XPath [Clark and DeRose(1999)]. A result of a validation is a report which contains information about the evaluation of these conditions. Schematron is an XML-based language and uses only a few elements and attributes for schema description.

3.1 Core constructs

Now we describe the core Schematron constructs. The root element of every schema is a schema element introducing the required XML namespace. A pattern element is a basic building block for expressing an ordered collection of Schematron conditions, which are ordered in XML document order. A rule is a Schematron condition which allows a designer to specify a selection of nodes from a given document and the evaluation of predicates in the context of these nodes. The rule element has a required context attribute used for an expression in the underlying query language. The value of the context attribute is commonly called a path. Predicates are specified using a collection of assertions. An assertion is a predicate which can be positive or negative. An assertion is represented using the assert and report elements. Both elements have a required test attribute for specification of a predicate using the underlying query language. Both elements also have a text content called natural-assertion. A natural-assertion is a message in a natural language, which a validator can return in the validation report. A positive predicate is represented using an assert element and if it is evaluated as false, we say that the assert is violated and the document is invalid. A negative predicate is represented using a report element and if it is evaluated as true, we say that the report is active and a natural-assertion will be reported. Schematron is not only a validation language. It is a more general XML reporting language [Ogbuji(2004)] where one type of report is an error message.

Example 1. The pattern in [Figure 3] selects all triangle elements from a document. If the given triangle has, for example, four child vertex elements, then the predicate will be false and the specified message will be reported.

    <pattern>
      <rule context="triangle">
        <assert test="count(vertex)=3">
          The element triangle should have 3 vertex elements.
        </assert>
      </rule>
    </pattern>

Figure 3: Schematron pattern

3.2 Additional constructs

In addition to the core Schematron constructs we mention these additional ones, which are relevant for our approach:

A diagnostic is a natural-language message giving details about a failed assertion, such as found versus expected values and repair hints. It is represented using a diagnostic element with a required id attribute and text content with a message. Diagnostics are referenced by assertions using a diagnostics attribute.

Phases allow organization of patterns into identified parts. Every Schematron schema has one default phase which includes all patterns. Before validation, it can be determined which phase is used and which patterns are activated. This selected phase is called the active-phase. A phase is represented using a phase element with an id attribute. One phase can have multiple active elements which refer to patterns using a pattern attribute.

4 PSM to Schematron translation

A PSM schema models a grammar-based XML format specification and its concepts are interpreted against PIM concepts. There are several problems that we must consider when we want to describe the translation of a PSM schema to a Schematron schema. In particular, we need to identify groups of Schematron rules and associated XPath expressions that impose constraints on the documents equivalent to those that constructs of grammar-based languages would impose. We would also like to provide a design of Schematron schemas that allows quality validation diagnostics to be specified.

4.1 Overall view of the translation

The translation algorithm (see [Algorithm 1]) has a PSM schema on the input and it gradually builds the resulting Schematron schema. The generated schema is composed of multiple patterns which cover the grammatical structural constraints represented in the PSM schema. This also allows us to distribute various patterns into phases. The validator can then run through selected phases and validate various aspects of the XML document, e.g., only attributes, only elements etc., resulting in variable performance and diagnostic properties.
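The phase and diagnostic constructs of Section 3.2 and the distribution of patterns into phases described above can be sketched as follows. This is only an illustrative sketch built around the triangle example of Figure 3, not the schema produced by the translation: the namespace is assumed to be the ISO Schematron one, and all id values, the second pattern, and the x and y attributes of vertex are hypothetical.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <!-- Hypothetical phases: each one activates a subset of the patterns below. -->
  <phase id="structure-only">
    <active pattern="triangle-structure"/>
  </phase>
  <phase id="full">
    <active pattern="triangle-structure"/>
    <active pattern="vertex-attributes"/>
  </phase>

  <!-- The pattern of Figure 3, extended with a reference to a diagnostic. -->
  <pattern id="triangle-structure">
    <rule context="triangle">
      <assert test="count(vertex) = 3" diagnostics="vertex-count">
        The element triangle should have 3 vertex elements.
      </assert>
    </rule>
  </pattern>

  <!-- Hypothetical second pattern, active only in the "full" phase. -->
  <pattern id="vertex-attributes">
    <rule context="vertex">
      <assert test="@x and @y">
        The element vertex should have the attributes x and y.
      </assert>
    </rule>
  </pattern>

  <!-- Diagnostics add found-versus-expected detail to a failed assertion. -->
  <diagnostics>
    <diagnostic id="vertex-count">
      Expected 3 vertex elements, found <value-of select="count(vertex)"/>.
    </diagnostic>
  </diagnostics>
</schema>
```

Selecting the structure-only phase before validation would evaluate only the first pattern, which is the kind of trade-off between performance and diagnostic detail mentioned above.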

Algorithm 1 Overall view of the translation algorithm

    1: <schema xmlns="
    2: Generate allowed root element names [Section 4.2];
    3: Generate allowed names [Section 4.3];
    4: Generate allowed contexts [Section 4.4];
    5: Generate required structural constraints [Section 4.5];
    6: Generate required sibling relationships [Section 4.6];
    7: Generate required text restrictions [Section 4.7];
    8: </schema>

In the first step ([line 2]), we generate a Schematron pattern for XML element names that are allowed inside valid XML documents as roots. Similarly, on [line 3], we generate Schematron patterns for element and attribute names that are allowed inside valid documents. On [line 4], we produce patterns for allowed contexts, i.e., paths where certain names of elements and attributes may occur. The patterns for validation of required complex element structures are produced in the steps on [line 5] and [line 6]. These patterns are more complex, because we must generate an equivalent of regular expressions to obtain the semantics of regular grammars. In the last step on [line 7], the patterns for text restrictions, i.e., validation of attribute values and simple element contents, are produced.

4.2 Allowed root element names

We need a tool for reporting names of elements which are not allowed in the schema, but are present in the document.

Definition 3. An absorbing pattern for a set of paths $P = \{p_1, p_2, \ldots, p_n : p_i \text{ is an XPath query}\}$ is a Schematron pattern, where the first rules select XML nodes specified by $p_i$ and the last rule (called global) selects all other nodes in the XML document.

    <pattern>
      <rule context="/purchase">
        <assert test="true()"/>
      </rule>
      <rule context="/*">
        <assert test="false()">
          The element <name/> is not allowed as root.
        </assert>
      </rule>
    </pattern>

Figure 4: Absorbing Schematron pattern for root elements

Example 2. See the PSM schema in Figure 2 and the absorbing Schematron pattern in Figure 4. $P = \{/purchase\}$.
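Definition 3 does not limit the number of paths in P. As a minimal sketch, the absorbing pattern of Figure 4 is repeated here for a hypothetical P = {/purchase, /invoice}; the invoice root, the pattern id and the explicit namespace declaration (assumed to be the ISO Schematron namespace) are illustrative assumptions and not part of the example in Figure 2.

```xml
<pattern xmlns="http://purl.oclc.org/dsdl/schematron" id="allowed-roots">
  <!-- One accepting rule per path in P. Within a pattern, a node is handled
       only by the first rule whose context matches it. -->
  <rule context="/purchase">
    <assert test="true()"/>
  </rule>
  <rule context="/invoice">
    <assert test="true()"/>
  </rule>
  <!-- Global rule: reached only by a root element that no earlier rule matched. -->
  <rule context="/*">
    <assert test="false()">
      The element <name/> is not allowed as root.
    </assert>
  </rule>
</pattern>
```

For a document whose root element is, say, order, neither accepting rule matches, so the global rule applies and its always-false assertion is violated; the name element in the message is replaced by the name of the offending element.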

We defined a special kind of Schematron pattern, which allows a validator to absorb the elements (or attributes) specified by the paths P. The pattern resembles a sieve: it first checks for each of the allowed elements (or attributes) specified by the paths, and whatever none of these rules selects is matched by a wildcard rule, which absorbs it. An element or attribute absorbed by the wildcard is interpreted as a violation of the expected format and is reported. In comparison to XML Schema validation of root elements, Schematron is equally powerful but in addition allows better diagnostics. Instead of saying that an element is missing, Schematron can report a human-readable message, which can, for example, be translated if necessary.

4.3 Allowed XML element and attribute names

Production of patterns for checking allowed XML element and attribute names inside validated documents follows an algorithm similar to the one for root elements. We produce a set P, where the $p_i$ are all XML element names specified by the PSM schema. These are either named associations that have classes as children (complex XML elements) or names of PSM attributes $A$ with the XML form set to $xform(A) = e$ (simple XML elements). From P, an absorbing pattern is generated in the same way as before. XML attributes come from PSM attributes with $xform(A) = a$, and the pattern is the same except for the @ prefix before the attribute name.

4.4 Allowed contexts

Now we introduce stricter patterns for checking allowed contexts, i.e., paths inside documents. We again generate absorbing patterns, but we need more sophisticated paths: only element and attribute names in their declared contexts are absorbed, so the same names in other contexts break validity.

4.4.1 Paths overview

A path is described using an XPath expression that selects nodes from the validated XML document. Once nodes are selected, we can evaluate assertions, i.e., XPath predicates, in the context of these nodes. In general, there are two ways to describe paths: absolute paths, for example /purchase/shipto, and relative paths, for example shipto. If we want to design schemas more powerful than DTD, i.e., than local regular tree grammars (see [Murata et al.(2005)]), we need absolute paths to select nodes from documents. However, relative paths are also important, for example to design recursive declarations. It is also possible to use predicates in paths. We do not use predicates for node selection, because we aim to keep the Schematron schema as simple as possible.

Every complex XML element modeled in a PSM schema can be specified as a regular expression. On these expressions we impose the Single Occurrence Regular Expression (SORE) precondition of [Definition 4]. Every SORE is deterministic, as required by the XML specification, and more than 99% of the regular expressions in practical schemas are SOREs [Bex et al.(2006)], so the assumption is not very restrictive in practice while it considerably simplifies the translation.

Definition 4. Let S be a PSM schema. The SORE precondition is an assumption on S that every complex element has its content described by a Single Occurrence Regular Expression, i.e., every element (or attribute) name occurs at most once in the regular expression. For instance, (a | b) 0..* is a SORE, while a(a | b) 0..* is not, because a occurs twice.

4.4.2 Paths construction

Here we describe the construction of paths for a PSM schema. For each XML element and XML attribute declaration present in a PSM schema, we produce all possible paths (contexts) where it can occur. First, we create all paths for a given XML element or XML attribute declaration. Let the declaration be the PSM component $X \in (S_a \cup S_r)$. We build an ancestor tree for X, which represents all ancestor PSM components of X reachable in the PSM schema. Then we translate all its paths from the leaf nodes to the root node into Schematron paths, i.e., XPath expressions. Every created path is associated with a PSM component, i.e., a complex element, a simple element or an attribute declaration, and the pairs are placed into the global set of paths $G_p$. For each $(X, p) \in G_p$, p must be unique, which corresponds to the SORE precondition in [Definition 4]. In the next step, we sort the members of $G_p$. The resulting list $G_p = \{(X, p) : X \in (S_r \cup S_a) \text{ and } p \text{ is a path}\}$ is used, in this order, for the generation of Schematron rules in the rest of the translation.
We sort the members of $G_p$ using the following ordering:
(1) Absolute paths without recursion go first, in descending order of length.
(2) Absolute paths with recursion follow, in descending order of length.
(3) Relative paths go last, again in descending order of length.

Example 3. $G_p$ for the PSM schema in [Figure 2]:
1. ($R_{purchase}$, /purchase)
2. ($A_{code}$, /purchase/@code)
3. ($A_{date}$, /purchase/@date)
...
20. ($A_{tester}$, /purchase/items/item/@tester)
21. ($A_{price}$, /purchase/items/item/price)
22. ($A_{amount}$, /purchase/items/item/amount)
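The three ordering rules can be expressed as a single sort key. The following Python sketch is only an illustration of the rules as stated: the class `PathEntry`, the explicit `recursive` flag, the measure of length as the number of location steps, and the entries labelled hypothetical are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathEntry:
    component: str           # label of the PSM component X (illustrative)
    path: str                # XPath expression p
    recursive: bool = False  # assumption: recursion is flagged explicitly


def path_length(path):
    # Number of non-empty location steps in the path.
    return len([step for step in path.split("/") if step])


def sort_key(entry):
    absolute = entry.path.startswith("/")
    if absolute and not entry.recursive:
        group = 0            # (1) absolute paths without recursion
    elif absolute:
        group = 1            # (2) absolute paths with recursion
    else:
        group = 2            # (3) relative paths
    # Negative length gives descending order inside each group.
    return (group, -path_length(entry.path))


def sort_paths(entries):
    return sorted(entries, key=sort_key)


entries = [
    PathEntry("R_purchase", "/purchase"),
    PathEntry("R_shipto", "shipto"),                             # hypothetical relative path
    PathEntry("A_price", "/purchase/items/item/price"),
    PathEntry("R_part", "/purchase/part/part", recursive=True),  # hypothetical
    PathEntry("A_code", "/purchase/@code"),
]
ordered = sort_paths(entries)
for entry in ordered:
    print(entry.component, entry.path)
```

Because Python's sort is stable, entries with the same group and length keep their original relative order.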

4.4.3 Pattern for allowed element contexts

Now we can produce patterns for allowed contexts. We iterate through all members of the ordered set $G_p$ and produce a set of paths P only for complex element names and simple element names (PSM attributes with the XML form set to element). In the last step we produce an absorbing pattern for P with *. Similarly, we produce a pattern for allowed attribute contexts.

4.5 Required structural constraints

So far we have absorbing patterns for weak validation of XML documents. These patterns say what is allowed inside the documents. Now we deal with the restrictions that a given document must satisfy.

Definition 5. A conditional pattern is a Schematron pattern for a list of rules $E = \{(p, A) : p \text{ is a path and } A \text{ is a set of predicates}\}$, in which each rule is a pair of a path and its predicates. The rules are validated one by one, and the document passes validation by this pattern if and only if all the rules of the pattern are satisfied.

For the production of conditional patterns, we need to analyze the specifications of complex element contents. A complex element declared in a PSM schema is precisely specified by a regular expression, so we need to analyze such regular expressions and translate them into Schematron predicates. The main idea is to translate a regular expression, and the respective parts of its semantics, into several conditional patterns. Evaluated together, these patterns cover the same semantics as the regular expression.

4.5.1 Boolean expressions overview

In this section we deal with the part of the regular expression semantics that covers the required parent-child and parent-attribute relationships. It also covers choices among attributes and choices among attributes and elements. The main idea is to translate a given regular expression into a Boolean expression, which can be evaluated in the context of a selected complex element. The expression specifies which child elements and attributes the element must have.

Example 4. Consider the regular expression which specifies the complex element item in [Figure 2]: (@code,(@tester | price),amount). We translate it into an XPath predicate that we can use in a Schematron assertion (see [Figure 5]). This representation is quite straightforward and corresponds well with grammar-based languages like XML Schema. However, it also comes with a disadvantage in the form of poor diagnostics. As with XML Schema validation, when we validate a document using such a Schematron rule, we only get a valid or invalid statement without further details.

<rule context="/purchase/items/item">
  <assert test="@code and ((@tester and count(price)=0) or (count(@tester)=0 and price)) and amount" />
</rule>
Figure 5: Boolean expression

To obtain better diagnostics, it is more advantageous to go into more detail and write the same rule as multiple simpler rules. In our case we transform the regular expression to a logic formula and the logic formula to a conjunctive normal form, as seen in [Figure 6]. The rules have equivalent semantics, because each assert element in the rule represents one clause, i.e., a disjunction of literals, and the rule is a conjunction of its assert elements. We can then attach an error report to each of the assert elements, which makes the diagnostics finer grained and therefore more user friendly. Note that this is also an example of a choice between the element price and the attribute @tester, which is not possible in XML Schema but can be done using Schematron.

<rule context="/purchase/items/item">
  <assert test="@code" />
  <assert test="count(price)=0 or count(@tester)=0" />
  <assert test="price or @tester" />
  <assert test="amount" />
</rule>
Figure 6: Boolean expression in CNF

As seen in [Example 4], we need to provide a translation of regular expressions into Boolean expressions and then a translation of these Boolean expressions into CNF.

4.5.2 From complex content to Boolean expression

In this section we translate a specification of a complex element content in PSM into a Boolean expression. More generally, we translate a regular expression modeled by a complex element declaration, i.e., an association $R \in S_r$ with a name and with a class as a child, into a Boolean expression that can be placed into Schematron as an XPath expression. First of all, we define an additional function on the PSM level, descendants.

Definition 6. $descendants : R \rightarrow (V_c, V_s, V_a)$ is a function that takes on the input an association R that corresponds to an XML element. It returns a 3-tuple $(V_c, V_s, V_a)$ of sets of XML complex element declarations, simple element declarations and attribute declarations, respectively, corresponding to the XML content model of the XML element modeled by R.

In the following semantic rewrite rules we also use a version of descendants where the resulting triple $(V_c, V_s, V_a)$ is used as an argument to the XPath function count. In this case $V_c = (R_1, \ldots, R_m)$, $V_s = (X_1, \ldots, X_n)$, $V_a = (A_1, \ldots, A_k)$ is translated into $name(R_1) \mid \ldots \mid name(R_m) \mid \ldots \mid name(X_1) \mid \ldots \mid name(X_k)$, where $\mid$ is the union operator of XPath.

Example 5. The following examples are valid in [Figure 2]:
descendants($R_{cust}$) = ($\emptyset$, {$A_{name}$, $A_{phone}$, A }, {$A_{login}$})
descendants($R_{items}$) = ({$R_{item}$}, {$A_{price}$, $A_{amount}$}, {$A_{code}$, $A_{tester}$})
Note that some attributes A are in $V_s$ here, because they model a simple XML element, not an XML attribute.

Definition 7. In the following semantic rewrite rules we use the following functions from our conceptual model (see [Nečaský et al.(2012b)] for the full definition).
name(Z) = string returns the name of a component $Z \in S_c \cup S_r \cup S_a$
parent(X) = R returns the parent association $R \in S_r$ of a component $X \in S_c \cup S_m$
child(R) = X returns the child component $X \in S_c \cup S_m$ of an association $R \in S_r$; parent(child(R)) = R
content(X) = $(R_1, \ldots, R_n)$ returns the associations leading from a PSM component $X \in S_c \cup S_m$
lower(Y) $\in \mathbb{Z}_0$ returns the lower cardinality bound of an association or attribute $Y \in S_r \cup S_a$
card(Y) returns the cardinality of an association or attribute $Y \in S_r \cup S_a$, for example 0..1 or 1..*
cmtype(M) $\in$ {set, choice, sequence} returns the type of a content model $M \in S_m$
xform(A) $\in$ {e, a} returns the form of a PSM attribute, i.e., whether it models an XML simple element or an XML attribute

Now we introduce the function be, which is used for the translation of a PSM schema into Boolean expressions. It takes a named association $R \in S_r$ with a class as its child, $child(R) \in S_c$, on the input and outputs a Boolean expression. Rather than specifying the function procedurally in the form of pseudo-code, we specify its semantics by rewriting rules. In the description of the semantics of be we use additional functions. The functions rw, rwatt and rwchoice, where be(R) = rw(child(R)), take a general PSM component, a PSM attribute and a PSM choice content model, respectively, and rewrite it into a Boolean expression. When function rw has a class $C \in S_c$ on the input (see [Figure 7(a)]), its content is rewritten into logical conjunctions.
When function rw has an optional attribute $A \in S_a$, $lower(A) = 0$, on the input (see [Figure 7(b)]), it is rewritten into a logical disjunction, e.g., (@a or count(@a)=0). The rule uses the function rwatt for rewriting a PSM attribute into an XML attribute or XML element representation (see [Figure 8]). When function rw has a required attribute $A \in S_a$, $lower(A) = 1$, on the input (see [Figure 7(c)]), it is rewritten using the function rwatt directly.

$$\frac{C \in S_c,\quad (A_1, \ldots, A_n) = attributes(C),\quad (R_1, \ldots, R_m) = content(C)}{(rw(A_1) \land \ldots \land rw(A_n) \land rw(R_1) \land \ldots \land rw(R_m))}$$
(a) Class rewrite rule of rw(C)

$$\frac{A \in S_a,\quad lower(A) = 0}{(rwatt(A) \lor count(rwatt(A)) = 0)}$$
(b) Optional attribute rewrite rule of rw(A)

$$\frac{A \in S_a,\quad lower(A) \geq 1}{(rwatt(A))}$$
(c) Required attribute rewrite rule of rw(A)

Figure 7: Class and attribute rewrite rules of rw

$$\frac{A \in S_a,\quad xform(A) = a}{(@name(A))}$$
(a) rwatt(A): Attribute rewrite

$$\frac{A \in S_a,\quad xform(A) = e}{(name(A))}$$
(b) rwatt(A): Simple element rewrite

Figure 8: Semantic rewrite rules of rwatt

Example 6. Consider $C_{Purchase} \in S_c$ and $C_{ShipAddr} \in S_c$ in [Figure 2]. We translate the first class into (@code and ...) and the second into (street and city and gps).

$$\frac{R \in S_r,\quad lower(R) = 0,\quad (name(R) = \lambda \lor child(R) \notin S_c)}{(rw(child(R)) \lor count(descendants(R)) = 0)}$$
(a) Optional association rewrite rule of rw(R)

$$\frac{R \in S_r,\quad lower(R) \geq 1,\quad (name(R) = \lambda \lor child(R) \notin S_c)}{(rw(child(R)))}$$
(b) Required association rewrite rule of rw(R)

Figure 9: Unnamed association rewrite rules of rw

When rw has an optional association $R \in S_r$, $lower(R) = 0$, $name(R) = \lambda$, on the input, which is not named and thus does not form a complex element declaration (see [Figure 9(a)]), it is rewritten into a logical disjunction. When function rw has a required association $R \in S_r$, $lower(R) \geq 1$, $name(R) = \lambda$, on the input, which does not form a complex element declaration (see [Figure 9(b)]), the child of the association, which always exists in PSM, is rewritten. When rw has an optional association $R \in S_r$, $lower(R) = 0$, $name(R) \neq \lambda$, on the input, which is named and thus is a complex element declaration (see [Figure 10(a)]), it is rewritten into a logical disjunction of XML element names. When function rw has a required association $R \in S_r$, $lower(R) \geq 1$, $name(R) \neq \lambda$, on the input, which is a complex element declaration (see [Figure 10(b)]), it is rewritten into an XML element name.
$$\frac{R \in S_r,\quad lower(R) = 0,\quad name(R) \neq \lambda,\quad child(R) \in S_c}{(name(R) \lor count(name(R)) = 0)}$$
(a) Optional named association rewrite rule of rw(R)

$$\frac{R \in S_r,\quad lower(R) \geq 1,\quad name(R) \neq \lambda,\quad child(R) \in S_c}{(name(R))}$$
(b) Required named association rewrite rule of rw(R)

Figure 10: Named association rewrite rules of rw

$$\frac{M \in S_m,\quad cmtype(M) = sequence,\quad (R_1, \ldots, R_n) = content(M)}{(rw(R_1) \land \ldots \land rw(R_n))}$$
(a) Sequence rewrite rule rw(M)

$$\frac{M \in S_m,\quad cmtype(M) = set,\quad (R_1, \ldots, R_n) = content(M)}{(rw(R_1) \land \ldots \land rw(R_n))}$$
(b) Set rewrite rule rw(M)

$$\frac{M \in S_m,\quad cmtype(M) = choice}{(rwchoice(M))}$$
(c) Choice rewrite rule rw(M)

Figure 11: Content model rewrite rules of rw

When function rw has either the sequence or the set content model $M \in S_m$ on the input (see [Figure 11(a)] and [Figure 11(b)]), its content is rewritten into logical conjunctions. Sequence and set content models have the same semantics from the point of view of Boolean expressions, because the sequence (a,b) is equivalent to the set {a,b}, i.e., (a and b). When rw has the choice content model on the input (see [Figure 11(c)]), it is rewritten using a special function rwchoice. rwchoice rewrites a content model $M \in S_m$ without XML attribute declarations in its context, i.e., $descendants(parent(M)) = (V_c, V_s, V_a)$, $|V_a| = 0$. The content of M is rewritten into disjunctions (see [Figure 12(a)]). We cannot presume exclusive disjunction between elements in Boolean expressions, because it is not possible to check choices among elements using Boolean expressions.

Example 7. Consider the regular expression ((a | b)+). We cannot translate this expression into the Boolean expression ((a and count(b)=0) or (count(a)=0 and b)), because a document such as aababba would then be reported as invalid even though it matches the regular expression. However, we can translate it into (a or b), which is a weaker expression but works even in this case.

We check choices among attributes and choices among attributes and elements using Boolean expressions (see [Figure 12(b)]). We generate exclusive disjunctions for the choice content model.
Note that we use another function, rwchoicenegation, which translates declarations into the argument of count.
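The rewrite rules of Figures 7, 8, 10 and 11 and the choice rule without attributes can be sketched as a small recursive function. This Python sketch is only an illustration under simplifying assumptions: a PSM schema is represented by nested tuples (a representation invented for this example), unnamed associations are not modeled, and the exclusive choice with attributes is left out.

```python
def rwatt(name, xform):
    # xform "a" models an XML attribute, "e" a simple XML element (Figure 8).
    return "@" + name if xform == "a" else name


def optional(expr):
    # Optional items may be present or absent; count(...)=0 acts as negation.
    return f"({expr} or count({expr})=0)"


def rw(node):
    """Rewrite a simplified PSM node into a Boolean XPath expression.

    Node shapes (assumed for this sketch):
      ("attr", name, lower, xform)   PSM attribute
      ("elem", name, lower)          named association with a class as child
      ("class" | "sequence" | "set" | "choice", [children])
    """
    kind = node[0]
    if kind == "attr":
        _, name, lower, xform = node
        expr = rwatt(name, xform)
        return optional(expr) if lower == 0 else expr
    if kind == "elem":
        _, name, lower = node
        return optional(name) if lower == 0 else name
    if kind in ("class", "sequence", "set"):
        # Classes, sequences and sets all become conjunctions.
        return "(" + " and ".join(rw(child) for child in node[1]) + ")"
    if kind == "choice":
        # Choice among elements only: plain (non-exclusive) disjunction.
        return "(" + " or ".join(rw(child) for child in node[1]) + ")"
    raise ValueError(f"unknown node kind: {kind}")


ship_addr = ("class", [
    ("attr", "street", 1, "e"),
    ("attr", "city", 1, "e"),
    ("attr", "gps", 1, "e"),
])
print(rw(ship_addr))                                          # class from Example 6
print(rw(("choice", [("elem", "a", 1), ("elem", "b", 1)])))   # Example 7
```

For the class with the three required simple elements the sketch produces the conjunction given in Example 6, and for the choice between a and b it produces the weaker disjunction discussed in Example 7.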

$$\frac{M \in S_m,\quad (R_1, \ldots, R_n) = content(M),\quad (V_c, V_s, V_a) = descendants(parent(M)),\quad |V_a| = 0}{(rw(R_1) \lor \ldots \lor rw(R_n))}$$
(a) Choice without attributes rewrite rule of rwchoice(M)

$$\frac{M \in S_m,\quad (R_1, \ldots, R_n) = content(M),\quad (V_c, V_s, V_a) = descendants(parent(M)),\quad |V_a| \geq 1}{\left(\bigvee_{i=1}^{n} \left(rw(R_i) \land count\left(\mid_{j \neq i}\, rwchoicenegation(R_j)\right) = 0\right)\right)}$$
(b) Choice with attributes rewrite rule of rwchoice(M). $\mid$ is the union operator from XPath

$$\frac{R \in S_r,\quad name(R) \neq \lambda,\quad child(R) \in S_c}{name(R)}$$
(c) Named association rewrite rule of rwchoicenegation(R)

$$\frac{R \in S_r,\quad (name(R) = \lambda \lor child(R) \notin S_c)}{descendants(R)}$$
(d) Association rewrite rule of rwchoicenegation(R)

Figure 12: Content model rewrite rules of rw

Example 8. Consider the regular expression (a | @b | @c). We translate it using the rule in [Figure 12(b)] into the Boolean expression ((a and count(@b | @c)=0) or (@b and count(a | @c)=0) or (@c and count(a | @b)=0)).

The rule in [Figure 12(b)] is correct when we accept another precondition for PSM attributes (see [Definition 8]).

Definition 8. Let $S = (S_c, S_a, S_r, S_m)$ be a PSM schema. The attribute cardinalities precondition is an assumption on S saying that for every $A \in S_a$ with $xform(A) = a$ it must hold that $card(A) = 0..1$ or $card(A) = 1..1$. In addition, A can only be a descendant of unnamed associations $R \in S_r$, $(name(R) = \lambda \lor child(R) \notin S_c)$, where $card(R) = 0..1$ or $card(R) = 1..1$.

In other words, we simplify our approach by presuming that attributes are either optional or required, and we ensure that this condition is not circumvented by the cardinalities of unnamed parent associations.

Example 9. The following regular expressions satisfy the attribute cardinalities precondition:
(@a,@b,(@c | (d,e,f)))
(@a 0..1,@b,(@c 0..1 | (d,e,f)) 0..1)
The following regular expression does not satisfy the attribute cardinalities precondition:
(@a 0..1,@b,(@c | (d,e,f)) 0..3)
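The exclusive disjunction of [Figure 12(b)] can be illustrated on already rewritten alternatives. In the following Python sketch the function name `rwchoice_with_attributes` is an assumption, and each alternative is given directly as the XPath name that both rw and rwchoicenegation would produce for it, which holds only for simple named alternatives such as a or @b.

```python
def rwchoice_with_attributes(alternatives):
    """Build the exclusive choice: each alternative excludes all the others.

    alternatives -- XPath names of the choice branches, e.g. ["a", "@b", "@c"]
    """
    if len(alternatives) < 2:
        raise ValueError("a choice needs at least two alternatives")
    clauses = []
    for i, chosen in enumerate(alternatives):
        # Union (XPath operator |) of all other branches, used as count() argument.
        others = " | ".join(
            other for j, other in enumerate(alternatives) if j != i
        )
        clauses.append(f"({chosen} and count({others})=0)")
    return "(" + " or ".join(clauses) + ")"


item_choice = rwchoice_with_attributes(["@tester", "price"])
print(item_choice)
print(rwchoice_with_attributes(["a", "@b", "@c"]))
```

For the choice between @tester and price of the element item, the result is logically the same as the choice part of the assertion in Figure 5.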

4.5.3 From Boolean expression to CNF

Now we have a Boolean expression derived from a complex element declaration. This expression is composed only of brackets, conjunctions, disjunctions and literals (a name, or count(...) = 0 used as a negation). We can translate such expressions into their equivalent conjunctive normal forms (CNF) using the following rule:

$$(A \land B) \lor C \equiv (A \lor C) \land (B \lor C)$$

From the CNF we can translate clauses with disjunctions of literals into Schematron predicates. The CNF may contain dead clauses such as @a or count(@a)=0, and it may be optimized by removing them. We call the function which translates a Boolean expression into a collection of clauses with disjunctions of literals cnf, e.g., cnf((a and b) or c) = {(a or c), (b or c)}.

4.5.4 Producing patterns for structural constraints

For the actual production of conditional Schematron patterns from our Boolean expressions in CNF we use a simple algorithm which creates two conditional patterns. One contains all rules for elements and the other all rules for attributes. The exception is a rule that contains elements and attributes at once; such a rule goes into the second pattern. We split the rules into two patterns to support the possible distribution of patterns into phases. For the pseudo-code and its description see our conference paper [Benda et al.(2013)]. In [Figure 13] we show an example of the two resulting conditional patterns which represent the XML element purchase from [Figure 2].

<rule context="/purchase">
  <assert test="shipto" />
  <assert test="billto" />
  <assert test="cust" />
  <assert test="items" />
</rule>

<rule context="/purchase">
  <assert test="@code" />
  <assert test="@date" />
  <assert test="@version" />
</rule>
Figure 13: Conditional Schematron patterns for the purchase XML element

4.6 Required sibling relationships

In the previous section we generated structural constraints using Boolean expressions, which allow us to validate parent-child relationships. So far we did not deal with the order of child elements inside a parent element. Here we describe our approach based on the theory of regular expressions. We build a finite state automaton for a given regular expression. We deal only with SOREs, so we can build a deterministic SORE automaton, where every name of an XML element is assigned to at most one inner state and the automaton has one initial and one final state. Then we translate the information obtained from this structure into Schematron conditions. We represent the transition function of the automaton using a conditional pattern, and we cover, for example, the order of XML elements (sequences, choices among elements) and also the cardinalities zero or one (0..1, or ?), just one (1..1), zero or more (0..*, or Kleene star *) and one or more (1..*, or Kleene plus +). We can also provide clear natural-language assertions and diagnostics. There are also some problems and exceptions. First, we cannot cover arbitrary numeric intervals of regular expressions using this approach (it is possible to create an automaton with numeric intervals, but it is not possible to represent it in Schematron). We need another approach for numeric constraints in general, which is our future work. For example, (a 0..3, b 1..4) is easy to describe by the count function. However, (a 0..3, b 1..4) 2..8 is more difficult, and we have not been able to devise a universal approach so far. Second, the PSM content model set complicates the construction of the algorithm. The restriction in [Definition 9] for the PSM content model set is similar to the restrictions applied to the XML Schema construct all.

Definition 9. Let S be a PSM schema.
The SET precondition is an assumption on S that each content model set has named associations with classes as children in its content and that the content model is a descendant of associations $R \in S_r$, $(name(R) = \lambda \lor child(R) \notin S_c)$, in the complex content, where $card(R) = 0..1$ or $card(R) = 1..1$.

Now we can presume that we can build the automaton for each complex element declaration, i.e., each named association with a class as a child. We also need to translate the obtained information into Schematron rules. For each complex element and for the elements in its content we produce a set of predicates. These predicates use the following-sibling XPath axis. For each of the obtained predicates we generate a conditional pattern in the step on [line 6] in [Algorithm 1].

Example 10. We will transform the content of element cust in [Figure 2]. It is specified by name, phone 1..*, We transform it to a subexpression without attributes; we do not need them for sibling relations because XML attributes are not ordered. Then we build a deterministic finite SORE automaton corresponding to the subexpression and from it we get the Schematron rules in [Figure 14].

<rule context="/purchase/cust">
  <assert test="*[1][self::name]" />
</rule>
<rule context="/purchase/cust/name">
  <assert test="following-sibling::*[1][self::phone]" />
</rule>
<rule context="/purchase/cust/phone">
  <assert test="following-sibling::*[1][self::phone or self:: ] or not(following-sibling::*)" />
</rule>
<rule context="/purchase/cust/ ">
  <assert test="not(following-sibling::*)" />
</rule>
Figure 14: SORE automaton in Schematron

The first rule says that the first child element is name. The second rule says that the immediate following sibling is phone. The third rule says that after phone there is either phone or or no element. The last rule says that has no following sibling.

4.7 Required text restrictions

In the step on [line 7] of [Algorithm 1] we generate patterns for data type validation as extension rules of our predefined data type rules. For details, see [Benda et al.(2013)]. In comparison to XML Schema, our approach is a bit weaker, as we did not manage to support some datatypes such as xsd:dateTime using just XPath 1.0. If we used XPath 2.0 in Schematron, we could support all XML Schema simple datatypes.

4.8 Translation summary

In this section, we introduced the problem of automatic construction of Schematron schemas from PSM schemas. The translation is not simple, because the models differ: the PSM schema is grammar-based (as are XML Schema, DTD, etc.), whereas Schematron is rule-based. However, we showed that Schematron is a very powerful language that can express many structural constraints of the grammar-based languages and more. We started with the production of absorbing patterns, which validate the allowed occurrences of XML elements and XML attributes inside validated XML documents. Then we produced conditional patterns for the validation of required grammatical structural constraints. We analyzed the most used parts of regular expressions which can be represented in Schematron.
Then we generated patterns for validation of data types for simple element contents and attribute values.
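To make the structural-constraint step summarized above more concrete, the following Python sketch converts a Boolean expression into CNF with the distributive rule, drops dead clauses, and emits one assert element per remaining clause. It is a minimal sketch under assumptions: expressions are nested tuples, literals are plain XPath strings, and the function names `cnf`, `is_dead` and `to_asserts` are introduced only for this example.

```python
from itertools import product


def cnf(expr):
    """Return a list of clauses; each clause is a tuple of literals joined by 'or'.

    expr is either a literal (str) or a pair (op, operands) with op "and"/"or".
    """
    if isinstance(expr, str):
        return [(expr,)]
    op, operands = expr
    parts = [cnf(operand) for operand in operands]
    if op == "and":
        # Conjunction: the clause lists are simply concatenated.
        return [clause for part in parts for clause in part]
    if op == "or":
        # Disjunction: distribute, (A and B) or C == (A or C) and (B or C).
        result = []
        for combination in product(*parts):
            merged = []
            for clause in combination:
                for literal in clause:
                    if literal not in merged:
                        merged.append(literal)
            result.append(tuple(merged))
        return result
    raise ValueError(f"unknown operator: {op}")


def is_dead(clause):
    # A clause containing both x and count(x)=0 is always true.
    return any(f"count({literal})=0" in clause for literal in clause)


def to_asserts(expr):
    asserts = []
    for clause in cnf(expr):
        if not is_dead(clause):
            test = " or ".join(clause)
            asserts.append(f'<assert test="{test}" />')
    return asserts


# Boolean expression of the element item (Figure 5).
item = ("and", [
    "@code",
    ("or", [
        ("and", ["@tester", "count(price)=0"]),
        ("and", ["count(@tester)=0", "price"]),
    ]),
    "amount",
])
for line in to_asserts(item):
    print(line)
```

For the expression of Figure 5 the sketch yields four assertions that match those of Figure 6 up to the order of literals inside a clause. Repeated distribution can grow exponentially for deeply nested expressions, which this sketch does not try to avoid.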

There are some limitations to our approach that, however, do not seem critical at the moment. The most visible one is the lack of support for arbitrary numeric intervals in cardinalities. We only support the usual 0..*, 0..1, 1..* and 1..1. This is because support for arbitrary intervals would necessarily lead to an explosion of Schematron code, which would only complicate and slow down the validation process.

5 Related work

In parallel to the research on the translation of PSM schemas to Schematron, other improvements of PSM schemas are being researched, in particular the support for the Object Constraint Language (OCL) [OMG(2012)] and its translation to Schematron for the specification of integrity constraints, where Schematron is used as a complement to grammar-based schemas. The patterns for integrity constraints generated from OCL may potentially be merged with our Schematron schemas. To the best of our knowledge, little work has been done in the area of translations between Schematron and other XML schema languages. There are sources not based on academic research which provide some basic ideas and techniques for the translation of grammar-based schemas to Schematron schemas and vice versa. Most work in this area has been done by Rick Jelliffe and his company Topologi. They have implemented an XSD to Schematron converter, because their customers preferred Schematron diagnostics over XSD validation. The generated schemas are called Schematron-ish grammars. In [Nečaský et al.(2012b)], we provide a formal description of the mutual translation between PSM schemas and regular tree grammars. Let us now compare the translation of a PSM schema directly to Schematron with the translation of a PSM schema to XSD followed by the approach mentioned above to translate XSD to Schematron. In our previous work [Nečaský et al.(2012b)] we showed that each PSM schema of our conceptual model can be translated to a regular tree grammar (RTG) and vice versa.
In [Murata et al.(2005)] the authors show that each XSD schema can be translated to an RTG. However, it is not possible to translate each RTG to an XSD schema; e.g., a choice between two attributes cannot be translated to XML Schema. Therefore, we cannot translate an arbitrary PSM schema (or RTG) to XSD; we must accept some constraints on the PSM schema. Also, so far, we cannot translate arbitrary XSD to Schematron. For example, we do not support numeric intervals for element occurrence, e.g., 1..3, because in our experience these do not occur very often and pose a non-trivial problem to be solved. If we accept constraints ensuring that our PSM schema can be translated to XML Schema and we do not use constructs that we cannot translate to Schematron so far, then the two ways of translation, i.e., PSM → XSD → Schematron and PSM → Schematron, will produce the same result from the structural point of view. However, one of the main advantages of the direct approach is, besides the slightly higher expressivity of the resulting schema, the possibility to provide better diagnostic output of a schema validator. We support custom error messages in PSM, and these can only be preserved when translating directly from PSM to Schematron, because XML Schema has no support for them and they would be lost in that step of the translation.

6 Evaluation and Implementation

With our proposed method we generated several Schematron schemas in various data domains using our conceptual model and verified that we can successfully validate the corresponding XML document instances. The schemas are verbose and cannot be shown here in full due to space limitations. The Schematron schema and the XSD schema (for comparison) generated from our conceptual model in [Figure 2] are available online. Their structure is, however, shown in our examples throughout the paper. During our experiments we found Schematron-based validation as easy to use from a domain expert's perspective as validation using XML Schema, given that both can be generated from our conceptual model for XML. The downside of Schematron mentioned in our motivation, its verbosity, is not a problem in the end, because the user does not need to read the generated Schematron; it only needs to be given as an input to a Schematron validator. From the validation performance point of view, rule-based validation (e.g., Schematron) is computationally more expensive than the linear validation using grammar-based languages (such as XML Schema) [Nálevka(2010)]. This could be a problem in an environment that requires high-performance validation, such as routing of XML messages.
Nevertheless, when performance is not an issue or when validating against complex XML formats, the benefits in the form of better diagnostics are more important. The reward for using our approach is much clearer diagnostics of a possible problem in the validated XML document, because Schematron supports user-friendly and descriptive error messages. Also, its expressive power is greater than that of XML Schema [Murata et al.(2005)]. This can be seen in [Figure 2], where we use a choice between attributes, which cannot be expressed in XML Schema. More importantly, the Schematron schema can also contain integrity constraints which cannot be represented in XML Schema. Another advantage of Schematron, mentioned earlier, is that the resulting schema can be split into phases, which can be selectively used for the validation of various aspects of XML documents. In addition, Schematron validation can be done solely by using an XSLT transformer, which is widespread and available, for example, in web browsers. This is in contrast to XML Schema validators, which are standalone components. Our experiments were done using our implementation of the conceptual model for XML, exolutio. exolutio is an application developed in our research group. It is based on the formalism of our conceptual model for XML described in [Nečaský et al.(2012b)] and on a complex system of operations and their propagation between the levels of abstraction described in [Nečaský et al.(2012a)]. In addition, it is a platform where novel extensions to XML schema modeling and evolution are implemented. One of them is the approach described in this paper.

7 Conclusions and Future Work

In this paper, we briefly introduced our conceptual model for XML as a basis for modeling and maintenance of XML schemas independent of the target schema language. Then we introduced Schematron, a rule-based language that can be used for XML schema description, and its constructs. Next, we described in detail how a schema from our conceptual model can be translated to Schematron and described the advantages over grammar-based languages such as XML Schema. We briefly described the implementation of the presented approach in our tool, exolutio. We compared the direct translation from our conceptual model to Schematron with the indirect translation from our conceptual model to XML Schema and then to Schematron. We concluded that the major advantage of the direct approach is the possibility of user-friendly validator messages: where XML Schema is limited to a valid or non-valid statement, Schematron can provide useful, human-readable diagnostics. We showed that with the direct translation, we can cover constructs that are impossible to model using XML Schema.
We still need to investigate the translation of numeric-interval element occurrences, such as 1..3, which we have omitted so far because they are rarely used in XML schemas; usually, such an occurrence would be modeled as 1..*. We are also investigating the possibility of a rule-based PSM in our conceptual model that could suit Schematron better than the current grammar-based one.

Acknowledgment

This work was supported in part by the Czech Science Foundation (GAČR), grant number P202/11/P455 and in part by grant SVV

References

[Benda et al.(2013)] Benda, S., Klímek, J., Nečaský, M.: Using Schematron as Schema Language in Conceptual Modeling for XML; F. Ferrarotti, G. Grossmann, eds.,

Conceptual Modelling 2013 (APCCM 2013); volume 143 of CRPIT; 31-40; ACS, Adelaide, Australia,
[Bex et al.(2006)] Bex, G. J., Neven, F., Schwentick, T., Tuyls, K.: Inference of concise DTDs from XML data; Proceedings of the 32nd international conference on Very large data bases; VLDB '06; VLDB Endowment,
[Clark and DeRose(1999)] Clark, J., DeRose, S.: XML Path Language (XPath) Version 1.0; W3C,
[Jelliffe(2001)] Jelliffe, R.: The Schematron: An XML Structure Validation Language using Patterns in Trees; ISO/IEC 19757,
[Jelliffe(2007)] Jelliffe, R.: Converting XML Schemas to Schematron; (2007).
[Klímek and Nečaský(2010)] Klímek, J., Nečaský, M.: Integration and Evolution of XML Data via Common Data Model; Proceedings of the 2010 EDBT/ICDT Workshops, Lausanne, Switzerland, March 22-26, 2010; ACM, New York, NY, USA,
[Klímek et al.(2012)] Klímek, J., Malý, J., Mlýnková, I., Nečaský, M.: exolutio: Tool for XML Schema and Data Management; Dateso 2012 Annual International Workshop on DAtabases, TExts, Specifications and Objects; 69-80; CEUR Workshop Proceedings,
[Malý et al.(2011)] Malý, J., Mlýnková, I., Nečaský, M.: XML Data Transformations as Schema Evolves; J. Eder, M. Bielikova, A. Tjoa, eds., Advances in Databases and Information Systems; volume 6909 of Lecture Notes in Computer Science; Springer Berlin Heidelberg,
[Malý and Nečaský(2012)] Malý, J., Nečaský, M.: Utilizing new capabilities of XML languages to verify integrity constraints; Proceedings of Balisage: The Markup Conference 2012; volume 8;
[Miller and Mukerji(2003)] Miller, J., Mukerji, J.: MDA Guide Version 1.0.1; Object Management Group (2003).
[Murata et al.(2005)] Murata, M., Lee, D., Mani, M., Kawaguchi, K.: Taxonomy of XML Schema Languages Using Formal Language Theory; (2005); cobase.cs.ucla.edu/tech-docs/dongwon/mura0619.pdf.
[Nálevka(2010)] Nálevka, P.: Grammar vs. Rules; (2010); blogspot.com/2010/05/grammar-vs-rules.html.
[Nečaský et al.(2012a)] Nečaský, M., Klímek, J., Malý, J., Mlýnková, I.: Evolution and Change Management of XML-based Systems; Journal of Systems and Software; 85 (2012a), 3,
[Nečaský et al.(2012b)] Nečaský, M., Mlýnková, I., Klímek, J., Malý, J.: When conceptual model meets grammar: A dual approach to XML data modeling; Data & Knowledge Engineering; 72 (2012b), 0,
[Ogbuji(2004)] Ogbuji, U.: A hands-on introduction to Schematron; IBM,
[OMG(2007a)] OMG: UML Infrastructure Specification 2.1.2; Object Management Group (2007a).
[OMG(2007b)] OMG: UML Superstructure Specification 2.1.2; Object Management Group (2007b).
[OMG(2012)] OMG: Object constraint language specification, version ; (2012);
[Vlist(June 2002)] Vlist, E.: XML Schema: The W3C's Object-Oriented Descriptions for XML; O'Reilly Media, June 2002.


Chapter 4

Semi-automatic Integration of Web Service Interfaces

Jakub Klímek, Martin Nečaský

Published in the Proceedings of 2010 IEEE 17th International Conference of Web Services, Miami, FL, July IEEE, ISBN Pages



2010 IEEE International Conference on Web Services

Semi-automatic Integration of Web Service Interfaces

Jakub Klímek and Martin Nečaský
Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague, Prague, Czech Republic
{klimek,

Abstract. Modern information systems may exploit numerous web services for communication. Each web service may use its own XML format for data representation, which causes problems with their integration and evolution. Manual integration and management of the evolution of the XML formats may be very hard. In this paper, we present a novel method which exploits a conceptual diagram. We introduce an algorithm which helps a domain expert to map the XML formats to the conceptual diagram. It measures similarities between the XML formats and the diagram and adjusts them on the basis of the input from the expert. The result is a precise mapping. The diagram then integrates the XML formats and facilitates their evolution: a change can be made only once in the diagram and propagated to the XML formats.

Keywords: XML schema, conceptual modeling, reverse-engineering, integration

I. INTRODUCTION

Today, XML is a standard for communication among web services. A web service provides an interface composed of several operations. The structure of incoming and outgoing messages is described in the form of XML schemas. If the XML schemas of communicating web services differ, the problem of their integration arises. We address the problem of integrating XML schemas by lifting them to a common conceptual schema. In our previous work [1], [2], we introduced a framework for XML schema integration and evolution. It assumes a set of XML schemas that are conceptually related to the same problem domain. As a problem domain, we can consider, e.g., purchasing products. Sample XML schemas may be XML schemas for purchase orders, a product catalogue, customer details, etc.
The central part of the framework is a conceptual schema of the problem domain. Each XML schema is then mapped to the conceptual schema. In other words, the conceptual diagram integrates the XML schemas. We then exploit the mappings to evolve the XML schemas when a change occurs. Simply speaking, the change is made only once at the conceptual level and then propagated to the affected XML schemas. It is also possible to exploit the mappings to derive interfaces of semantic web services described in SAWSDL [3], as we show in [4].

Contributions. In practice, a conceptual diagram and XML schemas exist separately, i.e. there are no mappings between the two levels. This makes it impossible to exploit the integration and evolution capabilities of our framework. In our work [5], we introduced a method for deriving the required XML schemas from the conceptual diagram. However, it does not consider an existing XML schema that needs to be mapped to the conceptual diagram. In this work, we introduce a reverse method which makes it possible to (1) correctly map a supplied XML schema to the conceptual diagram, and (2) adapt the conceptual diagram if a particular part of the XML schema cannot be mapped. Our aim is not to develop new methods for measuring schema similarities. These methods have already been intensively studied in the literature. Instead, we exploit the existing ones and combine them. For this, we provide an algorithm skeleton that can be supplemented by various similarity methods. An important contribution of the method, not considered by existing similarity methods, is the active participation of a domain expert. This is necessary, since we need to achieve an exact mapping.

The paper is organized as follows. In Section 2, we present related work. In Section 3, we briefly present a simplified version of our conceptual model for XML. Section 4 introduces an algorithm which assists a domain expert during mapping discovery.
In Section 5, we introduce two kinds of precision measures and present some experimental results. We conclude in Section 6.

II. RELATED WORK

Recent literature has focused on the discovery of mappings of XML formats to a common model. We can identify several motivations. Firstly, XML schemas are hard to read and a friendly graphical notation is necessary. This motivation has appeared in [6], [7] and [8]. A survey of these approaches can be found in [9]. They introduce algorithms for the automatic conversion of a given XML schema to a UML class diagram. The result exactly corresponds to the given XML schema. However, these approaches cannot be applied in our case, since we need to map an XML schema to an existing conceptual diagram. Secondly, there are approaches aimed at the integration of a set of XML formats into a common XML format. These works include, e.g. the DIXSE framework [10] or

the Xyleme project [11]. Both share the idea of deriving a common abstract XML format from the existing XML formats. The mappings of the XML formats to the abstract format are discovered automatically. The mappings can then be checked by a domain expert who, in the case of the DIXSE framework, can also specify additional mappings manually. These approaches are closer to our work. However, they do not consider mapping XML formats to a more abstract conceptual diagram which needs to be adapted when necessary. Moreover, they do not consider a domain expert participating directly in the mapping discovery process. Thirdly, there are approaches that convert or map XML formats to ontologies. DTD2OWL [12] presents a simple method for the automatic translation of an XML format with an XML schema expressed in DTD into an ontology. More advanced methods are presented in [13] and [14]. Both introduce an algorithm that automatically maps an XML format to an ontology. This is close to our approach since a conceptual diagram can be understood as an ontology. In both cases, the domain expert can edit the discovered mappings but is not directly involved in the discovery process.

Many of these approaches widely exploit research results of the schema matching community. Many methods and systems have been introduced for the automatic discovery of mappings between two given schemas, based on measuring the syntactic and semantic similarity of strings as well as structural similarities, e.g. [15], [16], [17]. Good surveys of these approaches can be found in [18], [19] and [20]. The purpose of our work is not to introduce new similarity methods. We exploit and adapt existing ones so that they are applicable when mapping XML formats to the conceptual diagram. We also extend these methods with the active participation of the domain expert. We use only basic similarity methods and show how they can be substituted with advanced methods introduced in the recent literature.
In our previous work [21], we introduced an algorithm which discovers mappings of XML formats to a conceptual diagram. It was a theoretical approach without an implementation. Its time complexity was too high because its search space was very large. It was also complicated for the domain expert to participate in the mapping process.

III. CONCEPTUAL MODEL FOR XML

In this section, we introduce a conceptual model for XML. It is a simplified version of our previously introduced conceptual model called XSem [5]; the simplified version does not cover all features of modern XML schema languages, such as choices. However, these features are not necessary for this paper. The conceptual model is based on the Model-Driven Architecture (MDA) [22], which is an approach to modeling software systems. It proposes to model a system on several layers of abstraction. In our work, we employ two layers. The first layer contains a platform-independent model (PIM) which focuses on the structure and processing of the system but hides details necessary for a particular platform. The layer below contains a platform-specific model (PSM) which combines the PIM with an additional focus on the details of how the system uses a specific platform. In this work, we understand the XML data model as a platform. Each PSM models a particular XML format. We introduce the notions of PIM and PSM formally in this section.

We first introduce several symbols. $\mathcal{L}$ denotes the set of all string labels. $\mathcal{L}_a$ and $\mathcal{L}_e$ are two sets s.t. $\mathcal{L}_a \cup \mathcal{L}_e = \mathcal{L}$, $\mathcal{L}_a \cap \mathcal{L}_e = \emptyset$ and each label in $\mathcal{L}_a$ starts with the symbol @. $\mathcal{D}$ denotes the set of all basic data types such as string, integer, etc. $\mathbb{C} \subseteq \mathbb{N} \times (\mathbb{N} \cup \{*\})$ is a set of cardinality constraints where $\mathbb{N}$ denotes the set of natural numbers and $(\forall (x, y) \in \mathbb{C})(x \le y \lor y = *)$. Finally, $\mathcal{P}(S)$ and $\mathcal{O}(S)$ denote the power set of a set $S$ and the set of all ordered sequences of elements of $S$, respectively.

The PIM meta-model is, de facto, the model of UML class diagrams.
It introduces three modeling constructs: PIM class, PIM attribute and PIM binary association. We provide its formalization in Definition 1.

Definition 1. A PIM is a 9-tuple $M = (\mathcal{C}, \mathcal{A}, \mathcal{R}, name, type, attrs, ends, acard, rcard)$ where
• $\mathcal{C}$, $\mathcal{A}$ and $\mathcal{R}$ are sets of PIM classes, PIM attributes and PIM associations, respectively,
• $name : \mathcal{C} \cup \mathcal{A} \cup \mathcal{R} \to \mathcal{L}$ is a function which assigns a label to each PIM class, attribute or association; the label is called name,
• $type : \mathcal{A} \to \mathcal{D}$ is a function which assigns a data type to each PIM attribute,
• $attrs : \mathcal{C} \to \mathcal{P}(\mathcal{A})$ is a function which assigns a set of PIM attributes to each PIM class s.t. $(\forall A \in \mathcal{A})(\exists C \in \mathcal{C})(A \in attrs(C))$, and $(\forall C_1, C_2 \in \mathcal{C})(C_1 \neq C_2 \Rightarrow attrs(C_1) \cap attrs(C_2) = \emptyset)$,
• $ends : \mathcal{R} \to \binom{\mathcal{C}}{2}$ is a function which assigns a set of two PIM classes to each PIM association; for $R \in \mathcal{R}$ with $ends(R) = \{C_1, C_2\}$ we say that $C_1$ and $C_2$ participate in $R$,
• $acard : \mathcal{A} \to \mathbb{C}$ is a function which assigns a cardinality (an element of the set of cardinality constraints $\mathbb{C}$) to each PIM attribute,
• $rcard : \mathcal{R} \times \mathcal{C} \to \mathbb{C}$ is a function which assigns a cardinality to each PIM class $C \in \mathcal{C}$ and PIM association $R$ s.t. $C$ participates in $R$.

For a given PIM $M$, we will use an auxiliary function $class : \mathcal{A} \to \mathcal{C}$ defined as $(\forall A \in \mathcal{A}, C \in \mathcal{C})(class(A) = C \Leftrightarrow A \in attrs(C))$. A sample PIM is depicted in Figure 1. It models a simple jobs and education domain. We will further need a construct called a PIM path, which is defined by Definition 2.

Definition 2. Let $M = (\mathcal{C}, \mathcal{A}, \mathcal{R}, name, type, attrs, ends, acard, rcard)$ be a PIM. A PIM path $P$ is an


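ordered sequence $\langle R_1, \ldots, R_n \rangle \in \mathcal{O}(\mathcal{R})$ where $(\forall 1 \le i \le n)(ends(R_i) = \{C_{i-1}, C_i\})$. $C_0$ and $C_n$ are called start and end of $P$, respectively. We will use auxiliary functions $start, end : \mathcal{O}(\mathcal{R}) \to \mathcal{C}$ defined for each PIM path $P$ which return the start and end of $P$, respectively.

Figure 1. Employment PIM

Figure 2. Sample PSM

The PSM meta-model consists of three modeling constructs which reflect the PIM constructs: PSM class, PSM attribute, and PSM binary association. Its formalization is in Definition 3.

Definition 3. A PSM is a 10-tuple $M' = (\mathcal{C}', \mathcal{A}', \mathcal{R}', name', type', xml', attrs', content', acard', rcard')$ where
• $\mathcal{C}'$, $\mathcal{A}'$ are sets of PSM classes and attributes, resp.,
• $\mathcal{R}' \subseteq \mathcal{C}' \times \mathcal{C}'$ is a set of oriented PSM associations; for $R' = (C'_1, C'_2) \in \mathcal{R}'$ we call $C'_1$ parent and $C'_2$ child of $R'$,
• $name' : \mathcal{C}' \cup \mathcal{A}' \to \mathcal{L}$ is a function which assigns a label to each PSM class or attribute; the label is called name,
• $type' : \mathcal{A}' \to \mathcal{D}$ is a function which assigns a data type to each PSM attribute,
• $xml' : \mathcal{C}' \cup \mathcal{A}' \to \mathcal{L}$ is a function which assigns a label to each PSM class or attribute where $(\forall C' \in \mathcal{C}')(xml'(C') \in \mathcal{L}_e)$; the label is called xml label,
• $attrs' : \mathcal{C}' \to \mathcal{O}(\mathcal{A}')$ is a function which assigns a sequence of PSM attributes to each PSM class s.t. $(\forall A' \in \mathcal{A}')(\exists C' \in \mathcal{C}')(A' \in attrs'(C'))$, and $(\forall C'_1, C'_2 \in \mathcal{C}')(C'_1 \neq C'_2 \Rightarrow attrs'(C'_1) \cap attrs'(C'_2) = \emptyset)$,
• $content' : \mathcal{C}' \to \mathcal{O}(\mathcal{R}')$ is a function which assigns a sequence of PSM associations to each PSM class s.t. $(\forall C' \in \mathcal{C}')(\forall R' \in \mathcal{R}')(R' \in content'(C') \Leftrightarrow parent'(R') = C')$,
• $acard' : \mathcal{A}' \to \mathbb{C}$ is a function which assigns a cardinality (an element of the set of cardinality constraints $\mathbb{C}$) to each PSM attribute,
• $rcard' : \mathcal{R}' \to \mathbb{C}$ is a function which assigns a cardinality to each PSM association.

Moreover, there must be a PSM class $C'_r \in \mathcal{C}'$ which is not a child of any PSM association in $\mathcal{R}'$. $C'_r$ is called the root PSM class of $M'$. Any other PSM class in $\mathcal{C}' \setminus \{C'_r\}$ must be a child of exactly one PSM association in $\mathcal{R}'$. We use a function $class' : \mathcal{A}' \to \mathcal{C}'$ defined as $(\forall A' \in \mathcal{A}', C' \in \mathcal{C}')(class'(A') = C' \Leftrightarrow A' \in attrs'(C'))$ and functions $parent', child' : \mathcal{R}' \to \mathcal{C}'$ assigning the parent and child to each $R' \in \mathcal{R}'$, respectively.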
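The tree condition at the end of Definition 3 can be checked mechanically. The following is a minimal sketch, assuming PSM classes are represented by plain strings and PSM associations by (parent, child) pairs; the function name `find_psm_root` and the sample class names are illustrative and not part of the paper. The final reachability check is an extra safeguard added here, since in-degree conditions alone would not rule out a cycle detached from the root.

```python
from collections import Counter


def find_psm_root(classes, associations):
    """Check the tree condition of Definition 3 and return the root PSM class.

    classes      -- iterable of hashable PSM class identifiers
    associations -- iterable of (parent, child) pairs (oriented PSM associations)
    Raises ValueError if the structure is not an oriented rooted tree.
    """
    classes = set(classes)
    associations = list(associations)

    # How many associations have each class as their child?
    child_count = Counter(child for _, child in associations)

    roots = [c for c in classes if child_count[c] == 0]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root class, found {len(roots)}")
    root = roots[0]

    # Every non-root class must be the child of exactly one association.
    for c in classes - {root}:
        if child_count[c] != 1:
            raise ValueError(f"class {c!r} is a child of {child_count[c]} associations")

    # Extra safeguard (assumption, not spelled out in the definition):
    # every class must be reachable from the root.
    children = {}
    for parent, child in associations:
        children.setdefault(parent, []).append(child)
    seen, stack = {root}, [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    if seen != classes:
        raise ValueError("some classes are not reachable from the root")
    return root


# Illustrative PSM fragment; the class names are only an example.
psm_classes = ["Employer", "Country", "WorkExperience", "BusinessSector"]
psm_associations = [
    ("Employer", "Country"),
    ("Employer", "WorkExperience"),
    ("Employer", "BusinessSector"),
]
print(find_psm_root(psm_classes, psm_associations))
```

Representing associations as pairs mirrors the definition of PSM associations as a subset of the Cartesian product of PSM classes; attributes and cardinalities are left out because they play no role in the tree condition.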
The definition ensures that the graph with the set of nodes $\mathcal{C}'$ and the set of oriented edges $\mathcal{R}'$ is an oriented rooted tree. A sample PSM is depicted in Figure 2. PSM components are visualized similarly to PIM components. In addition, for a PSM class $C'$, $xml'(C')$ is depicted above the rectangle of $C'$. For a PSM attribute $A'$, we depict $name'(A')$ and $xml'(A')$ separated by the word "as". If $name'(A') = xml'(A')$, we show only $name'(A')$.

We view PSM components from a grammatical and a conceptual perspective. From the grammatical perspective, a PSM class $C'$ models XML elements with the name specified by $xml'(C')$ and content specified by $attrs'(C')$ and $content'(C')$. A PSM attribute $A' \in attrs'(C')$ models XML elements with a simple content (if $xml'(A') \in \mathcal{L}_e$) or XML attributes (if $xml'(A') \in \mathcal{L}_a$). In both cases, the name is specified by $xml'(A')$. A PSM association $R' \in content'(C')$ models hierarchical parent-child relationships between XML elements modeled by $parent'(R')$ and $child'(R')$. Each PSM can be automatically translated to an XML schema in a particular XML schema language. Conversely, an XML schema can be translated to a PSM. See [5] and [21] for details. A partial translation of our sample PSM expressed in DTD is depicted in Figure 2.

From the conceptual perspective, the PSM maps each XML element and attribute to a PIM class or attribute and each XML parent-child relationship to a PIM association. This mapping is formally expressed as an interpretation introduced by Definition 4.

Definition 4. An interpretation $I$ of a PSM $M' = (\mathcal{C}', \mathcal{A}', \mathcal{R}', name', type', xml', attrs', content', acard', rcard')$ against a PIM $M = (\mathcal{C}, \mathcal{A}, \mathcal{R}, name, type, attrs, ends, acard, rcard)$ is a total function defined on the PSM components such that for each PSM component $P'$: if $P' \in \mathcal{C}'$ then $I(P') \in \mathcal{C}$; if $P' \in \mathcal{A}'$ then $I(P') \in \mathcal{A}$; if $P' \in \mathcal{R}'$ then $I(P') \in \mathcal{R}$. The following conditions must be satisfied:
(1) $(\forall A' \in \mathcal{A}')(class(I(A')) = I(class'(A')))$
(2) $(\forall R' \in \mathcal{R}')(ends(I(R')) = \{I(parent'(R')), I(child'(R'))\})$

A part of an interpretation of our sample PSM against the sample PIM is as follows:
$\mathcal{C}'$: I(Employer) = Company; I(Country) = Country; I(WorkExperience) = Job; ...
$\mathcal{A}'$: I(Employer.name) = Company.name; ...
$\mathcal{R}'$: I((Employer, Country)) = ⟨{Company, Address}, {Address, Country}⟩; I((Employer, WorkExperience)) = ⟨{Company, Job}⟩; ...
It can be easily verified that it satisfies the conditions given by Definition 4.

IV. ALGORITHM

We introduce an algorithm which builds an interpretation $I$ of a PSM against a PIM. $I$ must be correct in the formal sense, i.e. it must fulfil Definition 4. Moreover, it must be correct in the conceptual sense, i.e. a PSM component and its PIM interpretation must conceptually correspond to the same real-world concept. The algorithm ensures the formal correctness; the conceptual correctness is ensured by a domain expert. The algorithm works in two phases. Firstly, it measures initial similarities between PSM and PIM attributes and classes. Secondly, it builds the interpretation with the assistance of a domain expert. Formally, we suppose a PSM $M' = (\mathcal{C}', \mathcal{A}', \mathcal{R}', name', type', xml', attrs', content', acard', rcard')$ and a PIM $M = (\mathcal{C}, \mathcal{A}, \mathcal{R}, name, type, attrs, ends, acard, rcard)$ on the input. The output of the algorithm is an interpretation $I$ of $M'$ against $M$. We will need to measure the similarity $S_{str}(s_1, s_2)$ of two strings. There are various known string similarity methods, e.g. edit distance, N-grams, etc. We use a simple method measuring the length of their common substring.

A. Measuring Initial Similarity

Attributes. Let $(A', A) \in \mathcal{A}' \times \mathcal{A}$.
The similarity of $A'$ and $A$ is a weighted sum

$S_{\text{init-attr}}(A', A) = w_{\text{init-attr}} \cdot S_{type}(A', A) + (1 - w_{\text{init-attr}}) \cdot \max\{S_{str}(name'(A'), name(A)),\ S_{str}(xml'(A'), name(A))\}$

The first component, $S_{type}(A', A)$, measures the type similarity of both attributes. For simplicity, we use the identity function in this paper. It is possible to employ more advanced techniques reflecting, e.g. sub-typing or attribute cardinalities. The second component is the maximum of two string similarities. $w_{\text{init-attr}} \in (0, 1)$ is a weighting factor which is set by the expert.

Classes. Let $(C', C) \in \mathcal{C}' \times \mathcal{C}$. The similarity between $C'$ and $C$ is a weighted sum

$S_{\text{init-class}}(C', C) = w_{\text{init-class}} \cdot S_{\text{init-attrs}}(C', C) + (1 - w_{\text{init-class}}) \cdot \max\{S_{str}(name'(C'), name(C)),\ S_{str}(xml'(C'), name(C))\}$

$w_{\text{init-class}} \in (0, 1)$ is a weighting factor. $S_{\text{init-attrs}}(C', C)$ measures the similarity between $attrs'(C')$ and $attrs(C)$. It is defined as

$S_{\text{init-attrs}}(C', C) = \sum_{A' \in attrs'(C')} \max_{A \in attrs(C)} S_{\text{init-attr}}(A', A)$,

i.e. it finds for each PSM attribute $A' \in attrs'(C')$ the most similar PIM attribute $A$ of $C$ and sums these similarities.

B. Building Interpretation

The second part of the algorithm iteratively traverses the PSM classes in $M'$ in post-order, so that the children of a PSM class are processed before the class itself, and helps the domain expert to build the interpretation. Individual steps are shown in Algorithm 1. For a current PSM class $C' \in \mathcal{C}'$, the algorithm firstly constructs $I(C')$ (lines 2-6). Secondly, it constructs $I(A')$ for each $A' \in attrs'(C')$ (lines 7-19). Finally, it constructs $I(R')$ for each $R' \in content'(C')$ (lines 20-22). It can be shown that this algorithm runs in $O(N^3)$ where $N$ is the number of PSM classes and in $O(n \log n)$ where $n$ is the number of PIM classes.

Algorithm 1 Interpretation Construction Algorithm
1: for all $C' \in \mathcal{C}'$ in post-order do
2:   for all $C \in \mathcal{C}$ do
3:     $S_{class}(C', C) \leftarrow w_{class} \cdot S_{\text{init-class}}(C', C) + (1 - w_{class}) \cdot \frac{1}{S_{\text{adj-class}}(C', C)}$
4:   end for
5:   Offer the list of PIM classes sorted by $S_{class}$ to the domain expert.
6:   $I(C') \leftarrow C$ where $C \in \mathcal{C}$ is the PIM class selected by the domain expert.
7:   for all $A' \in attrs'(C')$ do
8:     for all $A \in \mathcal{A}$ do
9:       $S_{attr}(A', A) \leftarrow w_{attr} \cdot S_{\text{init-attr}}(A', A) + (1 - w_{attr}) \cdot \frac{1}{\mu(I(C'), class(A)) + 1}$
10:     end for
11:     Offer the list of PIM attributes sorted by $S_{attr}$ to the domain expert.
12:     $I(A') \leftarrow A$ where $A \in \mathcal{A}$ is the PIM attribute selected by the domain expert.
13:     if $I(class'(A')) \neq class(A)$ then
14:       Create PSM class $D' \in \mathcal{C}'$; $I(D') \leftarrow class(A)$
15:       Put $A'$ to $attrs'(D')$
16:       Create PSM association $R' = (C', D') \in \mathcal{R}'$
17:       Put $R'$ at the beginning of $content'(C')$.
18:     end if
19:   end for
20:   for all $R' \in content'(C')$ do
21:     $I(R') \leftarrow P$ where $P$ is the PIM path connecting $I(C')$ and $I(child'(R'))$ s.t. $\mu(I(C'), I(child'(R')))$ is minimal.
22:   end for
23: end for
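The initial similarities of Section A can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the XCase implementation: the paper only says that the string similarity measures the length of the common substring, so the normalization by the longer string and the case-insensitive comparison in `s_str` are assumptions made here, as are all class and function names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PimAttr:
    name: str
    type: str


@dataclass(frozen=True)
class PsmAttr:
    name: str
    xml: str   # xml label of the PSM attribute
    type: str


def s_str(a: str, b: str) -> float:
    """Length of the longest common substring, divided by the longer length.

    The normalization and the lower-casing are assumptions of this sketch.
    """
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    best = 0
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0] * (len(b) + 1)
        for j, bj in enumerate(b, 1):
            if ch == bj:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best / max(len(a), len(b))


def s_init_attr(psm: PsmAttr, pim: PimAttr, w_init_attr: float) -> float:
    """Weighted sum of type similarity (identity) and name similarity."""
    s_type = 1.0 if psm.type == pim.type else 0.0
    s_name = max(s_str(psm.name, pim.name), s_str(psm.xml, pim.name))
    return w_init_attr * s_type + (1 - w_init_attr) * s_name


def s_init_class(psm_name, psm_xml, psm_attrs, pim_name, pim_attrs,
                 w_init_attr: float, w_init_class: float) -> float:
    """Weighted sum of attribute-set similarity and class-name similarity."""
    # For each PSM attribute take its best-matching PIM attribute and sum up.
    s_attrs = sum(
        max((s_init_attr(a, b, w_init_attr) for b in pim_attrs), default=0.0)
        for a in psm_attrs
    )
    s_name = max(s_str(psm_name, pim_name), s_str(psm_xml, pim_name))
    return w_init_class * s_attrs + (1 - w_init_class) * s_name


employer_attrs = [PsmAttr("name", "name", "string")]
company_attrs = [PimAttr("name", "string")]
print(s_init_class("Employer", "Employer", employer_attrs,
                   "Company", company_attrs, 0.5, 0.5))
```

Because the attribute-set similarity is a plain sum, as in the formula, it is not bounded by 1 for classes with several attributes; a normalized variant would be a different design choice.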

Figure 3. Employment PSM

1) Building Class Interpretation: To construct $I(C')$, the algorithm firstly computes $S_{class}(C', C)$ for each $C \in \mathcal{C}$ at line 3. It is a weighted sum of two similarities. The former is the initial similarity $S_{\text{init-class}}(C', C)$. The other is the reversed (reciprocal) class similarity adjustment $S_{\text{adj-class}}(C', C)$, which is discussed below. The algorithm then sorts the PIM classes by their similarity with $C'$ and offers the sorted list to the domain expert at line 5. The expert selects a PIM class from the list and the algorithm sets $I(C')$ to this selected class at line 6.

Figure 4. Initial Class Similarity Adjustment: (a) PSM, (b) PIM

The class similarity adjustment $S_{\text{adj-class}}(C', C)$ is computed on the basis of the completed part of $I$. We have already set $I(D')$ for each PSM class $D'$ preceding $C'$. The situation is depicted in Figure 4 (a) with the predecessors in the part with grey background. $S_{\text{adj-class}}(C', C)$ is a combination of the distances between $C$ and the PIM classes which are interpretations of the predecessors of $C'$. We use $\mu(C, D)$ to denote the distance between PIM classes $C$ and $D$. The idea is depicted in Figure 4 (b). Algorithm 1 is only a skeleton which needs to be supplemented with particular methods for (1) measuring distances between PIM classes (i.e. a suitable metric), (2) combining distances, and (3) selecting predecessors of $C'$. In this paper, we supplement the skeleton with basic methods to show that the general idea works. For measuring the distance between two PIM classes $C$ and $D$, we use the length of the shortest PIM path connecting $C$ and $D$. For the distance combination method, which yields $S_{\text{adj-class}}(C', C)$, we can also choose from various possibilities. In this paper, we use

$S_{\text{adj-class}}(C', C) = \frac{\left(\sum_{i=1}^{n} \mu(C, I(D'_i))\right) + 1}{n}$

where $D'_1, \ldots, D'_n$ are the selected predecessors of $C'$. $S_{\text{adj-class}}(C', C)$ is the average of the lengths of the shortest PIM paths between $C$ and each $I(D'_i)$.
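The basic choices described above (shortest-path distance, averaged over the selected predecessors) can be sketched as follows. The sketch assumes that the adjustment is read as the sum of the shortest-path lengths plus one, divided by the number of predecessors, and that the PIM is given as an undirected edge list. The edges involving Applicant and Field are hypothetical; they are chosen only so that the distances match the Employer example discussed in the text (2, 1, 1 for Company and 2, 1, 2 for Applicant). Returning the initial similarity when there are no predecessors corresponds to setting the class weight to 1 for leaves.

```python
import math
from collections import deque


def shortest_path_length(edges, start, goal):
    """Length of the shortest PIM path between two classes (BFS); inf if none."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    if start == goal:
        return 0
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return math.inf


def s_adj_class(edges, candidate, predecessor_images):
    """(sum of distances + 1) / n over the interpretations of the predecessors.

    Returns None when there are no predecessors (e.g. for leaf PSM classes).
    """
    if not predecessor_images:
        return None
    dists = [shortest_path_length(edges, candidate, d) for d in predecessor_images]
    return (sum(dists) + 1) / len(dists)


def s_class(s_init, s_adj, w_class):
    """Line 3 of Algorithm 1: initial similarity combined with the reciprocal adjustment."""
    if s_adj is None:
        return s_init
    return w_class * s_init + (1 - w_class) * (1 / s_adj)


# PIM fragment: the first three edges appear in the paper's sample interpretation,
# the remaining ones are hypothetical.
pim_edges = [
    ("Company", "Address"), ("Address", "Country"), ("Company", "Job"),
    ("Company", "Field"), ("Applicant", "Job"), ("Applicant", "Address"),
    ("Job", "Field"),
]
images = ["Country", "Job", "Field"]  # I(Country), I(WorkExperience), I(BusinessSector)
for candidate in ("Company", "Applicant"):
    adj = s_adj_class(pim_edges, candidate, images)
    print(candidate, adj, s_class(0.25, adj, 0.2))
```

With these inputs Company receives the smaller adjustment value and therefore the larger reciprocal, which is what moves it above Applicant in the list offered to the expert.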
Finally, we need to decide which predecessors of $C'$ will be selected to compute $S_{\text{adj-class}}(C', C)$. In general, we can select all predecessors. However, this would result in a high time complexity. To restrict the search space, we use the heuristic that the impact of a predecessor $D'$ of $C'$ on the final similarity decreases with the growing distance of $D'$ from $C'$. Therefore, we consider only the children of $C'$, which are the closest predecessors of $C'$. This basic selection method can be extended by considering other close predecessors such as previous siblings of $C'$. Another possibility is to consider the leaves of the sub-tree of $C'$. These possibilities have been recently discussed in [20]. We demonstrate the class similarity adjustment on the PSM class Employer from our sample PSM in Figure 2. From the previous iterations of the algorithm we already have I(Country) = Country, I(WorkExperience) = Job, and I(BusinessSector) = Field. Consider the PIM class Company. Its distances from the PIM classes Country, Job, and Field are 2, 1, and 1, respectively. On the other hand, the distances of the PIM class Applicant are 2, 1, and 2, respectively. Therefore, the reversed adjustment is higher for the PIM class Company than for Applicant. In this case, the adjustment helps. On the other hand, if the PSM class

BusinessSector is not present in the PSM, the adjustment cannot distinguish between the two PIM classes.

2) Building Attribute Interpretation: The interpretation of the PSM attributes of $C'$ is computed at lines 7-19. For a PSM attribute $A' \in attrs'(C')$, $I(A')$ can be any PIM attribute from $\mathcal{A}$. Therefore, the algorithm measures the similarity between $A'$ and each PIM attribute $A$ at line 9. Again, the similarity is a weighted sum of the initial similarity $S_{\text{init-attr}}(A', A)$ and the reversed value of an adjustment to the initial similarity. The adjustment in this case is simply the distance between $I(C')$ and the class of $A$ increased by 1. The algorithm then offers the list of PIM attributes sorted by the computed similarity to the expert, who selects a correct interpretation of $A'$, at lines 11 and 12. If $I(C') \neq class(A)$ then we have $class(I(A')) = class(A) \neq I(C') = I(class'(A'))$, which is inconsistent with condition (1) of Definition 4. Therefore, we need to create a new PSM class $D' \in \mathcal{C}'$ such that $I(D') = class(A)$, move $A'$ to $attrs'(D')$, and put the new PSM association $(C', D')$ at the beginning of $content'(C')$ (the if block at lines 13-18).

3) Building Association Interpretation: The interpretation of the PSM associations in $content'(C')$ is computed at lines 20-22. Suppose a PSM association $R' = (C'_1, C'_2) \in \mathcal{R}'$. The algorithm simply puts $I(R') = P$ where $P$ is the PIM path connecting $I(C'_1)$ and $I(C'_2)$ with the minimal distance with respect to the chosen metric. This may of course be inaccurate, and therefore a domain expert should check the constructed interpretation and, where necessary, change the represented PIM path.

4) Evolution of PIM: It may happen that there is no component in $M$ suitable as an interpretation of a given component in $M'$. A special case is when $M$ is empty. There are two possible solutions to this situation. The expert can (1) leave the interpretation of the PSM component unspecified or (2) create a new component of $M$ which will be the interpretation of the PSM component.
The former would result in an inconsistency with Definition 4, which requires the interpretation to be a total function. Our full PSM meta-model introduces special constructs that also model XML elements and attributes but have no interpretation and can therefore be used in this case. The other solution does not violate the consistency. When the PSM component is a PSM class $C'$, the algorithm creates a new PIM class $C \in \mathcal{C}$ with $name(C) = name'(C')$. When it is a PSM attribute $A'$, it creates a new PIM attribute $A \in \mathcal{A}$ with $name(A) = name'(A')$, $class(A) = I(class'(A'))$, and $type(A) = type'(A')$. When the PSM component is a PSM association $R'$, the algorithm creates a new PIM association $R \in \mathcal{R}$ with $ends(R) = \{I(parent'(R')), I(child'(R'))\}$.

V. EXPERIMENTS

In this section, we briefly present some experimental results on building interpretations of PSM classes. We have implemented the introduced method in our tool XCase, which was primarily intended for designing XML schemas from a created PIM. XCase can be downloaded with some experimental XML schemas. These also include the experimental XML schema used in this section.

Figure 5. $P_G$ (global precision) and $P_L$ (local precision) for Leaf PSM Classes; horizontal axis: $w_{\text{init-class}}$

Figure 6. $P_G$ for Inner PSM Classes; axes: $w_{class}$, $w_{\text{init-class}}$

Let us suppose a current PSM class $C'$. Let the domain expert set $I(C')$ to a PIM class $C$. We measure the precision of the algorithm from two points of view. Firstly, we measure the position of $C$ in the list of PIM classes offered to the expert sorted by their $S_{class}$. We call this precision the global precision $P_G$:

$P_G = \left( \frac{\sum_{C' \in \mathcal{C}'} \left( 1 - \frac{order(C) - 1}{n} \right)}{n'} \right) \cdot 100$

where $n$ denotes the size of $\mathcal{C}$, $n'$ denotes the size of $\mathcal{C}'$, and $order(C)$ denotes the order of $C$ in the list. If there are more PIM classes with the same similarity to $C'$, $order(C)$ is the order of the last one. $P_G$ is minimal (resp. maximal) if for each PSM class $C'$, the selected PIM class was the last (resp. first) in the list. The global precision is not sufficient. Even when $C$ is the first class in the list, there can be other PIM classes which have


their similarity to $C'$ close to $S_{class}(C', C)$. We therefore propose another metric called local precision, which measures the number of PIM classes whose similarity to $C'$ is close to $S_{class}(C', I(C'))$. It is defined as

$P_L = \left( \frac{\sum_{C' \in \mathcal{C}'} \left( 1 - \frac{close(C) - 1}{n} \right)}{n'} \right) \cdot 100$

where $close(C)$ denotes the number of PIM classes with their similarity to $C'$ close to $S_{class}(C', C)$. The term close similarity can be defined in various ways. In this paper, we say that $y$ is close to $x$ if $y \in (x - 0.1, x + 0.1)$.

Figure 7. $P_L$ for Inner PSM Classes; axes: $w_{class}$, $w_{\text{init-class}}$

We used our algorithm to build an interpretation of the PSM depicted in Figure 3 against the PIM depicted in Figure 1. The PSM was directly constructed from the EuropassSchema XML schema (footnote 2: V2.0.xsd), which is an official EU XML standard for the employment domain. We tested various settings of the weighting factors. We show only the results related to PSM classes. We distinguish leaf PSM classes, i.e. PSM classes with empty content, from inner PSM classes, i.e. PSM classes with a non-empty content. Figure 5 shows the global and local precision for leaf PSM classes when $w_{class} = 1$ (it makes no sense to consider the similarity adjustment as leaves have no children). Various settings of the weighting factor $w_{\text{init-class}}$ are displayed on the horizontal axis. The highest global precision is a little above 0.5 (50%), i.e. the average position of $I(C')$ for a PSM class $C'$ is somewhere in the middle of the offered list. Moreover, the local precision shows that there were many PIM classes with their similarity to $C'$ close to $S_{class}(C', I(C'))$. The local precision grows with $w_{\text{init-class}}$, i.e. measuring the similarity of the attributes of $C'$ with the attributes of PIM classes helped. The precision of the algorithm is not very good for leaf PSM classes. This is also because we only used primitive similarity methods.
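The two precision measures can be illustrated with a short sketch. It assumes that both are read as averages, over the PSM classes, of 1 - (k - 1)/n scaled to a percentage, where k is order(C) or close(C) and n is the number of PIM classes; that the selected class itself counts as close; and that ties are resolved by taking the position of the last tied class, as stated in the text. The function names and the similarity values are hypothetical.

```python
def order_and_close(similarities, chosen, tolerance=0.1):
    """Return (order, close) of the chosen PIM class for one PSM class.

    similarities -- dict mapping each PIM class to its S_class value
    order        -- position of `chosen` in the descending list; for ties,
                    the position of the last class with the same similarity
    close        -- number of PIM classes (including `chosen`) whose
                    similarity lies in the open interval (s - tol, s + tol)
    """
    s = similarities[chosen]
    order = sum(1 for v in similarities.values() if v >= s)
    close = sum(1 for v in similarities.values() if abs(v - s) < tolerance)
    return order, close


def precision(values, n_pim):
    """Average of 1 - (k - 1)/n over all PSM classes, as a percentage."""
    return 100 * sum(1 - (k - 1) / n_pim for k in values) / len(values)


# Hypothetical S_class values offered to the expert for one PSM class.
sims = {"Company": 0.53, "Country": 0.53, "Applicant": 0.45, "Job": 0.20}
order, close = order_and_close(sims, "Company")
print(order, close)                       # tie with Country; Applicant is close
print(precision([order], len(sims)))      # global precision for this single class
print(precision([close], len(sims)))      # local precision for this single class
```

In an experiment, `order_and_close` would be called once per PSM class with the class selected by the expert, and the collected orders and close counts would be passed to `precision` to obtain the global and local precision.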
Figures 6 and 7 show the global and local precision for inner PSM classes, respectively. They show that we can reach a much higher precision for inner PSM classes than for leaves. This is because we can exploit the similarity adjustments for the inner PSM classes. The former figure shows that we reached the highest global precision for $w_{class} \in [0.1, 0.3]$ and $w_{\text{init-class}} \in [0.1, 1]$. In other words, it shows that the similarity adjustment is important in our sample. This is because the PSM classes are not very similar to their PIM counterparts in names and attributes. The other figure shows that there is only a small range of weighting factors where we can reach a good local precision. This range is around $w_{class} \in [0.2, 0.3]$ and $w_{\text{init-class}} \in [0.5, 1]$.

VI. CONCLUSION

In this paper, we studied the mapping of XML formats to a conceptual diagram. We introduced a basic algorithm which makes it possible to exploit various similarity measurement methods. The algorithm also incorporates a domain expert into the mapping process. In each iteration of the algorithm, the domain expert confirms the discovered mappings. The algorithm exploits this decision to adjust the similarities in the following iterations. We have demonstrated on a simple experiment with a real XML format that this idea works. We have shown that the algorithm can be adapted by choosing the similarity measurement techniques. We can also choose which part of the mapping already approved by the expert is used to compute the similarity adjustments. In our future work, we will experiment with various XML schemas and possible adaptations of the algorithm. We will also consider 1:n mappings and investigate the possibility of incorporating behavioral aspects of web services. We expect that these adaptations will help in certain cases but will fail in others. Therefore, we will also study if and how it is possible to adapt the algorithm at runtime on the basis of its history.
ACKNOWLEDGMENT

This work was supported in part by the Czech Science Foundation (GAČR), grant numbers P202/10/0573 and P201/09/0990 and in part by the grant SVV

Chapter 5

Evolution and Change Management of XML-Based Systems

Martin Nečaský, Jakub Klímek, Jakub Malý, Irena Mlýnková

Published in the Journal of Systems and Software, volume 85, issue 3, pages 683–707. Elsevier, February 2012. DOI /j.jss ISSN Impact Factor: Year Impact Factor:


The Journal of Systems and Software 85 (2012) 683–707


Evolution and change management of XML-based systems

Martin Nečaský, Jakub Klímek, Jakub Malý, Irena Mlýnková

Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague, Malostranské nám. 25, Praha 1, Czech Republic

Article info

Article history: Received 25 February 2011; received in revised form 20 July 2011; accepted 18 September 2011; available online 29 September 2011

Keywords: XML data modeling; Model driven architecture; XML schema evolution; Propagation of changes

Abstract

XML is a de facto standard language for data exchange. The structure of XML documents exchanged among different components of a system (e.g. services in a Service-Oriented Architecture) is usually described with XML schemas. It is common practice that there is not only one but a whole family of XML schemas, each applied in a particular logical execution part of the system. In such systems, the design and later maintenance of the XML schemas is not a simple task. In this paper we aim at a part of this problem: evolution of the family of XML schemas. A single change in user requirements or the surrounding environment of the system may influence several XML schemas in the family. A designer needs to identify the XML schemas affected by a change and ensure that they are evolved coherently with each other to meet the new requirement. Doing this manually is very time-consuming and error-prone. In this paper we show that much of the manual work can be automated. For this, we introduce a technique based on the principles of Model-Driven Development. A designer is required to make a change only once in a conceptual schema of the problem domain and our technique ensures semi-automatic coherent propagation to all affected XML schemas (and vice versa).
We provide a formal model of possible evolution changes and their propagation mechanism. We also evaluate the approach on a real-world evolution scenario. © 2011 Elsevier Inc. All rights reserved.

1. Introduction

The eXtensible Markup Language (XML) (Bray et al., 2008) is used by many software systems today to represent and exchange data in the form of XML documents. XML schemas, which describe the structure of the XML documents, are one of the crucial parts of such systems. Usually, a system does not use only a single XML schema, but a set of different XML schemas, each in a particular logical execution part. We can, therefore, speak about a family of XML schemas.

Having a system which exploits a family of XML schemas, we face the problem of XML schema evolution. The XML schemas may need to be evolved whenever user requirements or the surrounding environment change. A single change may influence zero or more XML schemas. Without a proper technique, we have to identify the XML schemas affected by the change manually and ensure that they are evolved coherently with each other. When the XML schemas have already been deployed, there are also XML documents in the run-time environment which might become invalid and therefore need to be modified appropriately.

In this paper we focus only on a part of the problem described: coherent evolution of XML schemas according to changing requirements (see our recent work (Malý et al., 2011), where we discuss the other part of the problem: adaptation of underlying XML documents when their XML schemas evolve). We propose a technique based on the Model-Driven Development (MDD) (Miller and Mukerji, 2003) methodology. We consider modeling the XML schemas at two MDD levels: platform-independent and platform-specific. First, the whole application data domain is modeled independently of the XML schemas in the form of a platform-independent schema.
Then, each XML schema in the family is designed in the form of a platform-specific schema which is mapped to the platform-independent schema. It may then be automatically translated to an expression in a selected XML schema language, e.g. XSD (XML Schema Definition) (Thompson et al., 2004) or RELAX NG (Murata, 2002). The mappings of platform-specific schemas to the platform-independent schema naturally support evolution management. A change is explicitly expressed as a change to the platform-independent schema or one of the platform-specific schemas. The mappings allow us to propagate the change between platform-independent and platform-specific levels semi-automatically and evolve the whole family of XML schemas coherently.

Corresponding author. E-mail addresses: necasky@ksi.mff.cuni.cz (M. Nečaský), klimek@ksi.mff.cuni.cz (J. Klímek), maly@ksi.mff.cuni.cz (J. Malý), mlynkova@ksi.mff.cuni.cz (I. Mlýnková).

Contributions. The key contributions of this paper are as follows: formal models for designing XML schemas at platform-independent and platform-specific MDD levels and a set of atomic operations for their evolution,

proof of minimality and correctness of the set, mechanism for propagating changes invoked by the atomic operations between the MDD levels, specification of operations composed of the atomic ones and their propagation between the MDD levels, implementation of the proposed framework called eXolutio (Klímek et al., 2011), and experimental demonstration of the completeness of the set of atomic operations and correctness of the propagation mechanism by applying eXolutio in a real-world case study.

Although there is other existing work in the area of schema evolution (as we will show in Section 8), the evolution problem has not yet been adequately solved (Hartung et al., 2011). We will show that the current approaches omit some important kinds of operations and provide an insufficient solution to the problem of the propagation of changes. And, last but not least, they do not introduce operations as a formal set of simple atomic operations which would allow authors of tools for schema evolution to build various more user-friendly operations as compositions of the atomic operations. In this work, we introduce such a formalism. Its main advantage is that it enables one to specify a new operation as a sequence of the atomic operations without the details of how the new operation is propagated to the other parts of the system. Our propagation mechanism ensures its correct propagation automatically.

In this paper we combine and, in particular, extend our previous work in this area. Our technique for designing XML schemas at platform-independent and platform-specific levels was first proposed in Nečaský (2009) and later generalized in Nečaský and Mlýnková (2010). A basic implementation of a modeling tool based on the models was introduced in Nečaský et al. (2008). In this text we describe it in detail, including its evolution extension, and show its usage in real-world use cases.
In Nečaský and Mlýnková (2009a,b) we proposed a five-level XML evolution framework which presents a general overview of the problem of XML schema evolution in the context of a whole software system consisting of various parts. In this paper we lay the theoretical basis of our approach. We provide a formal and detailed description of evolution operations and their propagation, prove minimality and correctness of our approach and extend it with explanatory examples. In general, this paper expands on the results of our recent research with an emphasis on formal specification.

In Yu and Popa (2005) the authors discussed two kinds of evolution approaches: incremental and change-based. An incremental approach enables a clear formal basis which ensures correctness and allows for simple evolutionary steps made by a designer. A change-based approach is suitable for cases when we are provided with two versions of the schema without the incremental evolutionary steps and we need to manage evolution of the data efficiently. In this work, we introduce an incremental approach based on a set of atomic operations. A designer incrementally performs particular atomic operations or operations composed of the atomic ones. Our technique continuously propagates the changes to the affected schemas.

Outline. The rest of the paper is structured as follows: In Section 2 we provide a motivating and running example. In Section 3 we describe the problem of XML schema evolution in the context of a whole software system and specify the selected part of the problem solved in this paper. In Section 4 we provide a formal specification of platform-independent and platform-specific levels for XML schema modeling. In Section 5 we extend the levels with a set of atomic operations.
In Section 6 we describe the propagation mechanism of the atomic operations between the levels and show that the atomic operations together with the propagation mechanism form a minimal and correct evolution formalism. In Section 7 we show how the atomic operations form realistic composite operations. In Section 8 we compare our proposal with current related work. In Section 9 we introduce the implementation of the introduced evolution formalism called eXolutio and its application in a real-world case study. We also evaluate our approach on the basis of this case study. Finally, in Section 10 we conclude and outline possible future work.

2. Motivating and running example

As a demonstration of the problem of evolution management of XML schemas, let us consider a company that receives purchase orders and let us focus on a part of the system that processes purchases. Let the messages used in the system be XML messages formatted according to a family of different XML schemas. Consider the two sample XML documents in Fig. 1. The former one is formatted according to an XML schema specifying a list of customers. The latter one is formatted according to a different XML schema specifying purchase requests. There are also other XML schemas in the family (e.g. customer details, purchase responses, purchase transport details, etc.). All the XML schemas share the same data domain (purchasing goods). On the other hand, the same part of the domain may be represented in different XML schemas in different ways. For example, the concept of customer is represented in each of our sample XML schemas in a different way. On the right-hand side, elements name and email are present for a customer. On the left-hand side, kinds of customers are distinguished (private and corporate customers). For private customers, elements name, address and phone are present. For corporate customers, elements name, different addresses (headquarters, storage and secretary), and phone are present.

Left-hand document (list of customers):

<custlist version="1.3">
  <cust>
    <name>Martin Necasky</name>
    <address>Vaclavske nam. 123, Prague</address>
    <phone> </phone>
  </cust>
  <cust>
    <name>Department of Software Engineering, Charles University</name>
    <hq>Malostranske nam. 25, Prague</hq>
    <storage>Ke Karlovu 3, Prague</storage>
    <secretary>Ke Karlovu 5, Prague</secretary>
    <phone> </phone>
  </cust>
</custlist>

Right-hand document (purchase request):

<purchaserq version="1.0">
  <bill-to>Malostranske nam. 25, Prague</bill-to>
  <ship-to>Ke Karlovu 3, Prague</ship-to>
  <cust>
    <name>Department of Software Engineering, Charles University</name>
    <email>ksi@mff.cuni.cz</email>
  </cust>
  <items>
    <item> <code>p045</code> </item>
    <item> <code>p332</code> </item>
  </items>
</purchaserq>

Fig. 1. Sample XML documents represented in a single XML system.

Let us consider a new user requirement that an address should no longer be represented as a simple string. Instead, it should be divided into elements street, city, zip, etc. Such a situation would require a skilled domain expert to identify all the schemas in the system which involve an address and correct them accordingly. Apparently, in a complex system comprising tens or even hundreds of schemas, this is a difficult and error-prone task. Even identifying the affected parts of the schema is not an easy and straightforward process. For example, we may need to make the modification only for addresses that represent a place to ship the goods (which are the elements address and storage in the XML schema instantiated on the left-hand side of the figure and element ship-to on the right-hand side). We do not want to modify addresses that represent headquarters, etc.

In the following text we show in detail that evolution management is a complex process that can be solved semi-automatically and, hence, efficiently and precisely if we provide a rigorous theoretical background and preserve nontrivial relations and meta-data.

3. XML evolution framework

In our previous work (Nečaský and Mlýnková, 2009a), we introduced a framework for managing evolution of a software system which exploits XML technologies at different levels. An extended version of the framework is depicted in Fig. 2. As we can see, the framework can be partitioned both horizontally and vertically; in both cases its components are closely related and interconnected. The relations form the key concept of evolution management, since they create the need for change propagation.
If we consider the vertical partitioning, we can identify multiple views of the system. In the framework we have depicted the three most common and representative views. The blue (leftmost) part covers an XML view of the data processed and exchanged in the system. The green (middle) part represents the storage view of the system, e.g. a relational view of the processed data which need to be persistently stored. Finally, the yellow (rightmost) part represents a processing view of the data, e.g. processing by sequences of Web Services described using BPEL scripts (WSBPEL, 2007) or various proprietary formats (e.g. Park and Park, 2008).

If we consider the horizontal partitioning, we can identify five levels, each representing a different view of an XML system and its evolution. The lowest level, called extensional level, represents the particular instances that form the implemented system such as, e.g., XML documents, relational tables or Web Services that are components of particular business processes. Its parent level, called operational level, represents operations over the instances, e.g. XML queries over the XML data expressed in XQuery (Boag et al., 2007) or SQL/XML (ISO/IEC, 2006) queries over relations. The level above, called schema level, represents schemas that describe the structure of the instances, e.g. XML schemas or SQL/XML Data Definition Language (DDL).

Even these three levels indicate problems related to XML evolution. For instance, when the structure of an XML schema changes, its instances, i.e. XML documents, and related queries must be adapted accordingly so that their validity and correctness, respectively, are preserved. In addition, if we want to preserve optimal query evaluation over the stored data, the storage model also needs to be adapted accordingly. What is more, as we have mentioned, in practice there are usually multiple XML schemas (families of XML schemas) applied in a single system, e.g. XML schemas for purchases, invoices, product catalogues, etc., i.e. multiple views of the common problem domain. Hence, such a change can influence multiple XML schemas, XML documents and queries. In general, a change at one level can trigger a cascade of changes at other levels. We call such sequences of adaptations change propagation.

Considering only the three levels leads to evolution of each affected schema separately. However, this is a highly time-consuming and error-prone solution since we need a domain expert who is able to identify all the affected schemas and propagate the changes. Therefore, we introduce two additional levels, which follow the MDD (Miller and Mukerji, 2003) principle, i.e. modeling of a problem domain at different levels of abstraction. As we have mentioned, the topmost one is the platform-independent level which comprises a schema in a platform-independent model (PIM schema). The PIM schema is a conceptual schema of the problem domain. It is independent of any particular data (e.g. XML or relational) or business process (e.g. Web Services) model. The level below, called platform-specific level, represents mappings of the selected parts of the PIM schema to particular data or business process models. For each model it comprises schemas in a platform-specific model (PSM schemas) such as, e.g., XSEM schemas (Nečaský, 2009) which model XML schemas, ER (Chen, 2002) schemas which model relational schemas, etc. Each PSM schema can then be automatically translated to a particular language used at the schema level and vice versa. Note that the latter direction allows for integration of incoming formats/applications into the given evolution framework.

As we can see in Fig. 2, there are not only vertical relations between the levels, but the components of the system can also be horizontally related across the vertical partitions. A few examples are denoted by the red dashed arrows.
For instance, there is a relation between an XML schema and its respective storage in a relational database. Similarly, an XML query can be evaluated by translation into an SQL query. And, last but not least, a BPEL script can specify how an input SOAP message, i.e. an XML schema, is processed. In all three cases a change at one end of the relation influences the other. Since we are considering completely different formats involving different constructs that need not correspond to each other one-to-one, change propagation becomes a complex problem. But, having a hierarchy of models which interconnect all the applications and views of the data domain using the common PIM level, it can be done semi-automatically and much more easily. We do not need to provide a mapping from every PSM to all other PSMs, but only from every PSM to the PIM which is, in addition, quite natural. Hence, the vertical change propagation is realized using this common point. For instance, if a change occurs in a selected XML document, it is first propagated to the respective XML schema, PSM and, finally, PIM. We speak about an upwards propagation, represented in Fig. 2 by white arrows. It enables one to identify the part of the problem domain that was affected. Then, we can invoke the downwards propagation. It enables one to propagate the change of the problem domain to all the related parts of the system. In Fig. 2 it is denoted by grey arrows.

Fig. 2. Five-level XML evolution architecture.

3.1. Selected part of the problem

Apparently, the change propagation problem is not an easy task and cannot be covered in a single paper. In this paper we aim at one particular problem: XML schema evolution. As we have shown in the motivating example, there is usually a whole family of XML schemas which are conceptually related to the problem domain of the system. When a designer needs to make a change in one of the XML schemas, the other XML schemas may be affected as well. We introduce an approach which is based on modeling the changes at PIM and PSM XML levels as highlighted in Fig. 3.
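The idea of upwards and downwards propagation through the common PIM level can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the propagation mechanism formalized later in the paper: each PSM schema is reduced to a dictionary that maps its components to PIM components, and the PIM component names (`ShippingAddress`, `Headquarters`, `BillingAddress`) as well as the function names are hypothetical.

```python
def propagate_upwards(mappings, schema, component):
    """Return the PIM component a PSM component is mapped to, or None."""
    return mappings.get(schema, {}).get(component)


def propagate_downwards(mappings, pim_component):
    """Return all (schema, PSM component) pairs mapped to a PIM component."""
    return sorted(
        (schema, component)
        for schema, mapping in mappings.items()
        for component, target in mapping.items()
        if target == pim_component
    )


def affected_by(mappings, schema, component):
    """PSM components in the family that share the changed PIM component."""
    pim_component = propagate_upwards(mappings, schema, component)
    if pim_component is None:
        return []  # unmapped components have no conceptual counterpart
    return [
        hit
        for hit in propagate_downwards(mappings, pim_component)
        if hit != (schema, component)
    ]


# Hypothetical mappings for the two documents of the running example.
mappings = {
    "custlist": {
        "address": "ShippingAddress",
        "storage": "ShippingAddress",
        "hq": "Headquarters",
    },
    "purchaserq": {
        "ship-to": "ShippingAddress",
        "bill-to": "BillingAddress",
    },
}

affected = affected_by(mappings, "purchaserq", "ship-to")
```

With these assumed mappings, a change to `ship-to` in the purchase request reaches `address` and `storage` in the customer list but leaves `hq` untouched, which mirrors the requirement in the motivating example. Only one mapping per PSM schema is needed, rather than one mapping per pair of schemas.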
Our approach ensures that whenever a change is performed in a PIM schema, it is correctly propagated to the PSM schemas and vice versa. It thus ensures consistency between the schemas when they are changed. In practice, this problem appears in two scenarios.

The first scenario is when a designer creates new XML schemas which have not been deployed in a run-time environment yet. There are neither XML documents formatted according to the XML schemas, nor other developers or applications which would somehow use the XML schemas. In other words, there is no extensional level and no operational level. Because of the complexity of the task, the designer does not create XML schemas in a single linear process. Instead, (s)he iterates in several cycles before an acceptable version of the XML schemas is prepared to be deployed. (S)he starts each iteration with a selected part of the requirements and incorporates them into the XML schemas. For that, (s)he needs a mechanism which shows the impact of the next change on the unfinished XML schemas and which helps to adapt the XML schemas according to this change. No propagation to extensional or operational levels is necessary at this stage. The most frequent modifications to the XML schemas in this scenario will be, intuitively, creating new parts of the XML schemas. However, updating existing parts with their more detailed and elaborate variants will be frequent as well. This is because the designer will cover some of the requirements only briefly in the XML schemas in early iterations and will return to them in later iterations to finish them. In simpler cases, updating means changing properties of existing XML schema components (e.g. data type). In more complex cases, updating means removing old parts and replacing them with new but semantically equivalent and more elaborate parts. No backward compatibility of the new version needs to be preserved since there is neither an extensional nor an operational level.
The second scenario is adapting existing XML schemas which have already been deployed in a run-time environment. In this scenario it is necessary to consider the extensional and operational levels as well, because there exist XML documents and applications which use the XML schemas. Such a scenario usually occurs when new or changed requirements need to be implemented in the system (e.g. a legislative change). For the sake of backward compatibility the designer will probably not remove the existing parts of the XML schemas. If some part needs to be detailed (or, conversely, simplified), it will be extended with a new version, not replaced.

The approach we introduce in the following sections is fully sufficient for the first scenario and partly also for the second scenario. For the second scenario, propagation to the extensional and operational levels is also necessary. We described the propagation to the extensional level in Malý et al. (2011). The technique introduced there generates an XSLT script which transforms XML documents from the old version of each affected XML schema to the new version. The propagation to the operational level is a matter of our future work.

A careful reader might notice that we omitted the schema level in the above paragraphs. Our approach allows the designer to work only at the PIM and PSM levels and not to consider the schema level. This is because our PSM level is equivalent to the schema level from the syntactical point of view. The PSM level has two purposes in addition to the schema level: it provides a more user-friendly presentation of the XML schemas to the designer and extends the XML schemas with mappings to the conceptual schema at the PIM level. In Nečaský and Mlýnková (2010), we proved the equivalence formally. We also showed how a PSM schema may be automatically translated to an XML schema expressed in some XML schema language and vice versa via the formalism of regular tree grammars. However, this is beyond the scope of this paper. We will just keep this fact in mind. We exploit this theoretical result in Malý et al. (2011), where we generate XSLT scripts on the basis of differences between PSM schemas rather than between XML schemas. We will similarly use it in this paper as well. We will present a set of operations for changing PSM schemas. Because of the equivalence, a change operation at the PSM level unambiguously and correspondingly describes a change in the modeled XML schema and it is not, therefore, necessary to explicitly convert it to a change specific to an XML schema expressed in some XML schema language.

Fig. 3. Five-level XML evolution architecture: data representation.

4. Modeling for XML evolution

As we have outlined, our framework enables one to manage evolution of a family of XML schemas by introducing platform-independent and platform-specific levels. In this section, we introduce both levels formally.

4.1. Platform-independent model

A schema in the platform-independent model (PIM) models real-world concepts and the relationships between them without any details of their representation in a specific data model (XML in our case). As a PIM, we use the classical model of UML class diagrams (Object Management Group, 2007a,b). For simplicity, we use only its basic constructs: classes, attributes and binary associations. UML is widely supported by the majority of tools for data engineering and the XMI (XMI, 2009) standard is used for exchanging diagrams between them; it is, therefore, natural to use UML in our approach as well.

Definition 4.1. A platform-independent schema (PIM schema) is a triple $S = (S_c, S_a, S_r)$ of disjoint sets of classes, attributes, and associations, respectively.
Class $C \in S_c$ has a name assigned by function name.
Attribute $A \in S_a$ has a name, data type and cardinality assigned by functions name, type, and card, respectively. Moreover, $A$ is associated with a class from $S_c$ by function class.
Association $R \in S_r$ is a set $R = \{E_1, E_2\}$, where $E_1$ and $E_2$ are called association ends of $R$. $R$ has a name assigned by function name. Both $E_1$ and $E_2$ have a cardinality assigned by function card and are associated with a class from $S_c$ by function participant. We will call $\mathit{participant}(E_1)$ and $\mathit{participant}(E_2)$ participants of $R$. $\mathit{name}(R)$ may be undefined, denoted by $\mathit{name}(R) = \lambda$.
For a class $C \in S_c$, we will use $\mathit{attributes}(C)$ to denote the set of all attributes of $C$, i.e. $\mathit{attributes}(C) = \{A \in S_a : \mathit{class}(A) = C\}$. Similarly, $\mathit{associations}(C)$ will denote the set of all associations with $C$ as a participant, i.e. $\mathit{associations}(C) = \{R \in S_r : (\exists E \in R)(\mathit{participant}(E) = C)\}$.

PIM schema components have the usual semantics: a class models a real-world concept, an attribute of that class models a property of the concept, and an association models a kind of relationship between two concepts modeled by the connected classes.

A sample PIM schema modeling our sample domain of products being sold is depicted in Fig. 4. We display PIM schemas as UML class diagrams. We omit displaying data types of class attributes. When a cardinality of a class attribute or association end is not displayed, it is 1..1 by default.

4.2. Platform-specific model

A schema in the platform-specific model (PSM) describes how a part of the reality modeled by the PIM schema is represented with a particular XML schema. For each target XML schema a separate PSM schema is created. As a PSM we use UML class diagrams extended for the purposes of XML modeling. The extension is necessary because of several specifics of XML (such as hierarchical structure or distinction between XML elements and attributes) which cannot be modeled by standard UML constructs.

Definition 4.2.
A platform-specific schema (PSM schema) is a 5-tuple $S' = (S'_c, S'_a, S'_r, S'_m, C'_{S'})$ of disjoint sets of classes, attributes, associations, and content models, respectively, and one specific class $C'_{S'} \in S'_c$ called schema class.
Class $C' \in S'_c$ has a name assigned by function name.
Attribute $A' \in S'_a$ has a name, data type, cardinality and XML form assigned by functions name, type, card and xform, respectively. $\mathit{xform}(A') \in \{e, a\}$. Moreover, it is associated with a class from $S'_c$ by function class and has a position assigned by function position within all the attributes associated with $\mathit{class}(A')$.
Association $R' \in S'_r$ is a pair $R' = (E'_1, E'_2)$, where $E'_1$ and $E'_2$ are called association ends of $R'$. Both $E'_1$ and $E'_2$ have a cardinality assigned by function card and each is associated with a class from $S'_c$ or content model from $S'_m$ assigned by function participant, respectively. We will call $\mathit{participant}(E'_1)$ and $\mathit{participant}(E'_2)$ parent and child and will denote them by $\mathit{parent}(R')$ and $\mathit{child}(R')$, respectively. Moreover, $R'$ has a name assigned by function name and has a position assigned by function position within all the associations with the same $\mathit{parent}(R')$. $\mathit{name}(R')$ may be undefined, denoted by $\mathit{name}(R') = \lambda$.
Content model $M' \in S'_m$ has a content model type assigned by function cmtype. $\mathit{cmtype}(M') \in \{\mathit{sequence}, \mathit{choice}, \mathit{set}\}$.
The graph $(S'_c \cup S'_m, S'_r)$ must be a forest (footnote 1) of rooted trees with one of its trees rooted in $C'_{S'}$.
For $C' \in S'_c$, $\mathit{attributes}(C')$ will denote the sequence of all attributes of $C'$ ordered by position, i.e. $\mathit{attributes}(C') = (A'_i \in S'_a : \mathit{class}(A'_i) = C' \wedge i = \mathit{position}(A'_i))$. Similarly, $\mathit{content}(C')$ will denote the sequence of all associations with $C'$ as a parent ordered by position, i.e. $\mathit{content}(C') = (R'_i \in S'_r : \mathit{parent}(R'_i) = C' \wedge i = \mathit{position}(R'_i))$. We will call $\mathit{content}(C')$ the content of $C'$. With $\mathit{anc}(X')$ we will denote the set of all ancestor classes of a component $X'$ in $S'$.

Fig. 4. PIM schema modeling the domain of selling products.

Fig. 5. PSM schema modeling (a) XML format for purchase requests received from customers, (b) XML format for purchase responses sent to customers, (c) components shared by other PSM schemas.

Table 1. XML attributes and XML elements modeled by PSM constructs.

PSM construct | Modeled XML construct
$C' \in S'_c$ | Complex content which is a sequence of XML attributes and XML elements modeled by attributes in $\mathit{attributes}(C')$ followed by XML attributes and XML elements modeled by associations in $\mathit{content}(C')$
$A' \in S'_a$, where $\mathit{xform}(A') = a$ | XML attribute with name $\mathit{name}(A')$, data type $\mathit{type}(A')$ and cardinality $\mathit{card}(A')$
$A' \in S'_a$, where $\mathit{xform}(A') = e$ | XML element with name $\mathit{name}(A')$, simple content with data type $\mathit{type}(A')$ and cardinality $\mathit{card}(A')$
$R' \in S'_r$, where $\mathit{name}(R') \neq \lambda$ | XML element with name $\mathit{name}(R')$, complex content modeled by $\mathit{child}(R')$ and cardinality $\mathit{card}(R')$. If $R' \in \mathit{content}(C'_{S'})$, $R'$ models a root XML element
$R' \in S'_r$, where $\mathit{name}(R') = \lambda$ | Complex content modeled by $\mathit{child}(R')$
$M' \in S'_m$ and $\mathit{cmtype}(M') = \mathit{sequence}$ (or choice or set) | Complex content which is a sequence (or choice or set, respectively) of XML attributes and XML elements modeled by associations in $\mathit{content}(C')$
To distinguish PIM components from PSM components, we strictly use notation without the prime symbol ′ for PIM components (e.g. class Purchase) and notation with the prime symbol for PSM components (e.g. class Purchase′). Before showing sample PSM schemas, we explain the semantics of the PSM constructs. We view a PSM schema $S'$ from two perspectives: grammatical and conceptual. The constructs have different semantics from each perspective.

1 Note that since $S'$ is a forest, we could model $R'$ directly as a pair of connected components. However, we use association ends to unify the formalism of PSM with the formalism of PIM.

From the conceptual perspective, $S'$ is mapped to a PIM schema $S$ and models the same part of reality as $S$. More precisely, some classes, attributes and associations of $S'$ are mapped to some classes, attributes and associations of $S$, respectively. These mapped components of $S'$ model exactly the same part of reality as their counterparts in $S$. The rest of $S'$ has no semantics from the conceptual perspective.

From the grammatical perspective, $S'$ models an XML schema. Its components model XML attributes and XML elements, and their structure. We summarize XML constructs modeled by PSM constructs in Table 1. Formally, $S'$ unambiguously models a regular tree language which can be specified by a regular tree grammar (Murata et al., 2005). However, this formalism is not a part of this paper. For details on the modeled regular tree language and formal proofs of unambiguity we refer to our previous work (Nečaský and Mlýnková, 2010), where we proved that our PSM is equivalent to regular tree grammars. In other words, it can be used equivalently as an XML schema language. We showed how a PSM schema can be unambiguously translated to an expression in a selected XML schema language and vice versa.
The important consequence of our previous results for this paper is that we can abstract our evolution mechanism from particular XML schema languages and work only at the PSM level. If we put both perspectives together, the PSM schema S specifies how the corresponding part of the PIM schema S is represented in the XML schema. In other words, it specifies how a part of the

real world modeled by $S$ is represented in XML documents valid against the XML schema. Conversely, it specifies the semantics of the XML schema in terms of $S$, i.e. the semantics of a particular XML document in terms of the PIM schema.

<purchaserq version="1.0">
  <cust partner-code="pa1">
    <name>martin Necasky</name>
    < >necasky@...</ >
    <address>malostranske nam. 25, Praha, Czech Republic</address>
  </cust>
  <items>
    <item tester="true">
      <code>p001</code>
      <title>sample for testing</title>
    </item>
    <item>
      <code>p002</code>
      <title>umbrella</title>
      <price>100</price>
      <amount>2</amount>
    </item>
  </items>
</purchaserq>

Fig. 6. Sample purchase request represented in the XML format modeled by the PSM schema depicted in Fig. 5(a).

Three sample PSM schemas are depicted in Fig. 5. We display PSM schemas as UML class diagrams with some extended notation. First, they are displayed in a tree layout; attributes and associations are sorted in the order given by position. Second, attributes with XML form a are displayed with a dedicated symbol. Third, sequence, choice and set content models are displayed as rounded boxes with an inner symbol ..., or {}, respectively.

From the conceptual perspective, our sample PSM schemas are mapped to a part of the PIM schema in Fig. 4. We display the components mapped to the PIM schema in the sea shell color. The mapping is intuitive 2 and we do not display it explicitly. The components which are not mapped are displayed in grey. For example, the PSM class Purchase′ in Fig. 5(a) is mapped to the PIM class Purchase. In other words, the semantics of Purchase′ is the same as the semantics of Purchase, which models purchases. Similarly, attribute name′ of class Customer′ is mapped to name of class Customer. Association cust′ connecting classes Purchase′ and Customer′ is mapped to association makes connecting classes Purchase and Customer. On the other hand, PSM class Contact′ is not mapped to the PIM schema.
In other words, it has no semantics from the conceptual point of view. Similarly, the association with Contact′ as child is not mapped, and neither is attribute version′ of Purchase′. These non-mapped components have no semantic meaning from the conceptual point of view.

From the grammatical perspective, the PSM schema depicted in Fig. 5(a) models an XML schema for purchase requests sent by customers to our system. A sample XML document formatted according to this XML schema is depicted in Fig. 6. The hierarchical structure of the XML schema is modeled by the associations of the PSM schema. As can be seen from the example, each named association models an XML element whose cardinality is given by the child cardinality of the association. For example, association items′ which connects classes Purchase′ and Items′ models XML element items, and association item′ which connects classes Items′ and Item′ models XML element item with cardinality 1..*. Moreover, when such an association is in the content of the schema class, it models root XML elements. In our case, association purchaserq′ models XML elements purchaserq which are root XML elements of the modeled XML format. An association without a name models only the nesting of XML content. For example, the association which connects classes Item′ and Product′ does not model any XML element. It specifies that the XML content modeled by its child is a part of the XML content modeled by its parent.

2 The reader may deduce it from their names which intuitively suggest the mapping.

An attribute models an XML element or XML attribute depending on its XML form. An attribute with XML form a models an XML attribute and is depicted with an additional symbol. An attribute with XML form e models an XML element and is depicted without any additional symbol. Again, cardinality is given by the attribute cardinality. For example, attribute version′ of class Purchase′ models a mandatory XML attribute version.
Attribute name of class Customer models a mandatory XML element name which can be repeated. Sometimes, classes in one or more PSM schemas may share the same attributes and/or part of their content. Instead of repeating them at several places, we introduce structural representatives which allow for attribute and content reuse. If a class C in a PSM schema is a structural representative of another class D from the same or another PSM schema, C inherits the attributes and content of D. From the grammatical perspective, C models the same XML attributes as D followed by its own modeled XML attributes followed by XML elements modeled by D and, finally, followed by its own modeled XML elements. Definition 4.3. Let S = (S c, S a, S r, S m, C S ) be a PSM schema and C be a class from S c. C may be a structural representative of another class D in S c which is assigned to C by function repr (repr(c ) = D ). If repr(c ) is undefined, denoted by repr(c ) =, we say that C is not a structural representative of any class. Let repr * () = {} and repr * (C ) = {repr(c )} repr * (repr(c )) where C /=. It must hold that C /= repr * (C ). A structural representative C of repr(c ) is displayed as a class with a blue background and the name of repr(c ) above its own name. For example, class Product from Fig. 5(a) and class Product from Fig. 5(b) are both structural representatives of class ProductBase from the PSM schema depicted in Fig. 5(c). From the grammatical perspective they both model the same XML fragment as the latter one. Note that the PSM schema in Fig. 5(c) does not model any XML documents (because it does not have any named association going from the schema class and, therefore, does not model any root XML elements). It acts as an auxiliary PSM schema which contains components shared by other PSM schemas via the mechanism of structural representatives. In the rest of this section we further formalize the conceptual perspective. 
A formal model of the grammatical perspective is provided in Nečaský and Mlýnková (2010) and we omit it in this paper.

Formal model of conceptual perspective

Formally, the conceptual perspective of a PSM schema is expressed as a mapping of the PSM schema to the PIM schema. Before we introduce the mapping, we introduce an auxiliary notion of a directed image of an association from a PIM schema which we use in the following definitions.

Definition 4.4. Let $R = \{E_1, E_2\}$ be an association in a PIM schema $S$. The directed images of $R$ are $R^{E_1} = (E_1, E_2)$ and $R^{E_2} = (E_2, E_1)$. We will denote the set of all directed images of $S$ as $\overrightarrow{S_r}$, i.e. $\overrightarrow{S_r} = \{R^{E_1}, R^{E_2} : R = \{E_1, E_2\} \in S_r\}$.
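The notion of directed images can be illustrated with a short Python sketch: every unordered two-ended association yields exactly two ordered pairs. Representing an association as a frozenset of end labels, and the label strings themselves, are assumptions made for this example.

```python
def directed_images(association):
    """Return both ordered pairs (E1, E2) and (E2, E1) of a two-ended association."""
    e1, e2 = tuple(association)
    return {(e1, e2), (e2, e1)}


def all_directed_images(associations):
    """Collect the directed images of every association of a PIM schema."""
    images = set()
    for r in associations:
        images |= directed_images(r)
    return images


# Hypothetical end labels for two associations of the sample PIM schema.
responsibility = frozenset({"responsibility.partner", "responsibility.customer"})
makes = frozenset({"makes.customer", "makes.purchase"})
images = all_directed_images([responsibility, makes])
```

A PSM association is ordered (parent, child), so interpreting it against one of the two directed images fixes which PIM association end corresponds to the parent side.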

122 690 M. Nečaský et al. / The Journal of Systems and Software 85 (2012) Now, we are ready to introduce the formalism of mappings. We call the mapping of the PSM schema to the PIM schema interpretation of the PSM schema against the PIM schema. Definition 4.5. An interpretation of a PSM schema S against a PIM schema S is a partial function I : (S c S a S r ) (S c S a S r ) which maps a class, attribute or association from S to a class, attribute or directed image of an association from S, respectively. For X (S c S a S r ), we call I(X ) interpretation of X. I(X ) = denotes that X does not have an interpretation. In that case we will also say that X has an empty interpretation. An arbitrary interpretation of a PSM component would lead to inconsistencies between the semantics of the PIM schema and the semantics of the PSM schema given by the interpretation. This would result in ambiguities in the semantics of PSM schemas. For example, suppose the class Product and its attribute code from our sample PSM schema depicted in Fig. 5(a). Let the interpretation of Product be the PIM class Product. Therefore, code, from the conceptual perspective, belongs to Product. On the other hand, suppose that code is mapped to the PIM attribute code of PIM class Purchase. From this, code belongs to Purchase which is in contradiction with the previous conclusion. We, therefore, need the interpretation to meet certain rules which prevent these ambiguities. Before we introduce the rules, let us define the notion of interpreted context of a PSM component. Definition 4.6. Let X be a component of a PSM schema S. Let I be an interpretation of S against a PIM schema S. The interpreted context of X with respect to I is denoted intcontext(x ) and intcontext(x ) = X when X S c and I(X ) /= intcontext(x ) = C when X / S c or I(X ) =, where C is the closest ancestor class to X s.t. I(C ) /=. 
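The interpreted context of Definition 4.6 can be sketched as a walk up the PSM tree. In the following minimal Python sketch, the Node class, its parent pointer and the dictionary-based interpretation are assumptions chosen for illustration; the interpretation of an association is simplified to a plain label instead of a directed image.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)  # identity-based hashing, so nodes can be dictionary keys
class Node:
    name: str
    kind: str                         # "class", "attribute" or "association"
    parent: Optional["Node"] = None   # enclosing component in the PSM tree


def intcontext(x, interpretation):
    """Return x itself if it is an interpreted class, otherwise the closest
    interpreted ancestor class, or None if no such class exists."""
    if x.kind == "class" and x in interpretation:
        return x
    node = x.parent
    while node is not None:
        if node.kind == "class" and node in interpretation:
            return node
        node = node.parent
    return None


# A fragment of the PSM schema of Fig. 5(a), rebuilt by hand for this example.
schema_class = Node("schema class", "class")
purchaserq = Node("purchaserq'", "association", schema_class)
purchase = Node("Purchase'", "class", purchaserq)
cust = Node("cust'", "association", purchase)
customer = Node("Customer'", "class", cust)
name = Node("name'", "attribute", customer)
to_contact = Node("(unnamed)", "association", customer)
contact = Node("Contact'", "class", to_contact)

# Partial function I: only interpreted components appear as keys.
interpretation = {purchase: "Purchase", customer: "Customer",
                  name: "name", cust: "makes"}
```

With this data, the uninterpreted class Contact′ and the attribute name′ both resolve to Customer′, while the root association purchaserq′ has no interpreted context at all.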
As the definition shows, the interpreted context of each PSM component X is X itself if it is a class with an interpretation. In other cases, it is the closest ancestor class to X. Let us demonstrate the notion of interpreted context on our sample PSM schema depicted in Fig. 5(a). The interpreted context of class Customer is class Customer itself (intcontext(customer ) = Customer ), because I(Customer ) /=. The interpreted context of attribute name of class Customer is class Customer as well (intcontext(name ) = Customer ), because Customer is the closest ancestor class to name which has an interpretation. And, for the same reason, the interpreted context of association connecting classes Customer and Partner is class Customer. On the other hand, class Contact does not have an interpretation (I(Contact ) = ). The closest ancestor class with an interpretation is class Customer. Therefore, intcontext(contact ) = Customer. Similarly, intcontext(itemtester ) = intcontext(itempricing ) = Item. And the same is for attributes, for example intcontext(tester ) = Item. Note that intcontext(x ) may be empty, i.e. intcontext(x ) =. In that case we will say that X does not have an interpreted context. Thus, having the notion of interpreted context, we are ready to introduce the rules. We now define the notion of consistent interpretation of a PSM schema against a PIM schema. Consistency ensures that the semantics of the PSM schema determined by the interpretation is consistent with the semantics modeled by the PIM schema. Definition 4.7. Let I be an interpretation of a PSM schema S against a PIM schema S. We say that I is consistent if the following rules are satisifed: ( C S c s.t. repr(c ) /= I(C ) /= )(I(C ) = I(repr(C ))) (1) ( A S a s.t. I(A ) /= )(intcontext(a ) /= I(A ) attributes(i(intcontext(a )))) (2) ( R S r s.t. I(child(R )) = I(intcontext(R )) = )(I(R ) = ) (3) ( R S r s.t. I(child(R )) /= intcontext(r ) /= ) (4) (I(R ) = (E 1, E 2 ) s.t. 
participant(e 1 ) = I(intcontext(R )) participant(e 2 ) = I(child(R ))) Condition (1) requires that a structural representative C of a class repr(c ) has the same interpretation as repr(c ). This is because C acquires the attributes and content of repr (C ). To ensure consistency, the attributes and associations in the content must semantically remain with C. Condition (2) requires that when an interpreted attribute A has an interpreted context C, then I(A ) must be an attribute of I(C ). In other words, A must semantically belong to the interpretation of its interpreted context. Conditions (3) and (4) ensure consistency of associations. Condition (3) requires that only an association with an interpreted child and interpreted context may have an interpretation. This is because the semantics of an association specifies how instances of the child of the association are connected to their interpreted context. For associations with interpretation, condition (4) is applied. It is similar to (2). If an association R has an interpreted context with interpretation C and its child has an interpretation D, the interpretation of R must be an ordered image of an association connecting C and D. Let us demonstrate conditions (2) (4) on the PSM schema depicted in Fig. 5(a). First, suppose attribute tester. Its interpreted context is class Item with I(Item ) = Item. Condition (2) requires that I(tester ) attributes (Item). This is satisfied in our case because I(tester ) = tester. Second, suppose the association connecting classes Customer and Contact. Since I(Contact ) =, condition (3) requires that the association does not have an interpretation. This is natural, because both Contact represents a part of class Customer from the conceptual perspective and, therefore, it is meaningless to specify the semantics of the association. 
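Conditions (2) to (4) can be checked mechanically. The sketch below is one possible reading of them in Python, not the authors' implementation: condition (3) is read as "the child or the interpreted context is uninterpreted", following the prose explanation; interpreted contexts are assumed to be precomputed; and a directed image is simplified to the pair of its participant classes. All identifiers and the "Class.attribute" naming of PIM attributes are assumptions.

```python
def violations(attrs, assocs, interp, ctx, pim_owner):
    """Return a list of (condition number, PSM component) pairs that break
    conditions (2), (3) or (4) of a consistent interpretation.

    attrs     -- PSM attribute names
    assocs    -- PSM association name -> name of its child
    interp    -- partial interpretation I (missing key = no interpretation)
    ctx       -- PSM component -> its interpreted context (or None)
    pim_owner -- PIM attribute -> PIM class owning it
    """
    found = []
    for a in attrs:
        if a in interp:                       # condition (2)
            c = ctx.get(a)
            if c is None or pim_owner.get(interp[a]) != interp.get(c):
                found.append((2, a))
    for r, child in assocs.items():
        c = ctx.get(r)
        both_interpreted = child in interp and c is not None and c in interp
        if not both_interpreted:              # condition (3)
            if r in interp:
                found.append((3, r))
        elif interp.get(r) != (interp[c], interp[child]):   # condition (4)
            found.append((4, r))
    return found


pim_owner = {"Item.tester": "Item", "Purchase.code": "Purchase"}

# The consistent situation discussed in the text.
ok = violations(
    attrs=["tester'"],
    assocs={"Customer'-Contact'": "Contact'", "Customer'-Partner'": "Partner'"},
    interp={"Item'": "Item", "Customer'": "Customer", "Partner'": "Partner",
            "tester'": "Item.tester",
            "Customer'-Partner'": ("Customer", "Partner")},
    ctx={"tester'": "Item'", "Customer'-Contact'": "Customer'",
         "Customer'-Partner'": "Customer'"},
    pim_owner=pim_owner,
)

# The inconsistent mapping of code' to an attribute of Purchase.
bad = violations(
    attrs=["code'"],
    assocs={},
    interp={"Product'": "Product", "code'": "Purchase.code"},
    ctx={"code'": "Product'"},
    pim_owner=pim_owner,
)
```

Condition (1), which concerns structural representatives, is left out to keep the sketch short.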
On the other hand, the association connecting classes Customer and Partner must have an interpretation, because both classes have an interpretation and it is necessary to specify the semantics of the connection between them. The interpretation must be an association connecting Customer and Partner according to condition (3). In our case it is the association responsibility which is correct. The following lemma shows that Definition 4.7 is correct. Lemma 4.1. Let I be a consistent interpretation of a PSM schema S against a PIM schema S. The semantics of each component of S specified by I is unambiguous. Proof. We will show that the semantics of each PSM class, attribute or association in S is specified unambiguously by I. Without loss of generality, we will consider components of S which are semantically related to a PIM class C S c. First, let C S c s.t. I(C ) = C. The semantics of C is specified by I unambiguously from Definition 4.5. There is no way to use I to deduce that the semantics of C is a class C 0 /= C. Second, let A S a s.t. I(A ) /=. Let C a S c s.t. C a = class(a ) or repr(c a ) = class(a ). Let I(intcontext(C a )) = C. If I(A ) / attributes (C), the semantics of A is ambiguous. From the conceptual perspective, A semantically belongs to C on one hand and it does not on the other. However, I(A ) / attributes (C) cannot happen because of conditions (1) and (2). Third, let R S r s.t. I(R ) /= is a directed image of an association R S r. Let C r S c s.t. C r = parent(r ) or repr(c r ) = parent(r )

123 M. Nečaský et al. / The Journal of Systems and Software 85 (2012) or C r = child(r ) (in this last case, condition (4) ensures that I(C r ) /= and, therefore, intcontext(c r ) = C r ). Let I(intcontext(C r )) = C. If R / associations (C), the semantics of R is ambiguous. From the conceptual perspective, R is an association connected to C on one hand and it is not on the other. However, R / associations (C) cannot happen because of conditions (1) and (3). In the rest of this paper, each interpretation considered will be consistent; we do not consider inconsistent interpretations. In the following section, we introduce atomic operations which allow for modifications of PIM and PSM schemas. It is clear that the consistency of an interpretation may be corrupted when the PIM or PSM schema is changed. For example, removing an attribute in the PIM schema may break condition (2) of Definition 4.7 and reconnecting an association in the PSM schema may break conditions (3) and (4). Our aim is to introduce a mechanism which propagates performed changes from the PIM schema to the PSM schema and vice versa so that the consistency of the interpretation is ensured. 5. Atomic operations In this section, we introduce atomic operations for editing PIM and PSM schemas. They are not intended to be used directly by the designer, because they are too primitive and using them would be too laborious and clumsy for the designer. However, they will serve us as a formal basis for describing more user-friendly operations composed of these atomic operations. In Section 7 we will describe composite operations. In Section 6 we will describe how operations are propagated between PIM and PSM levels to ensure the consistency of corrupted interpretations. Formally, we suppose a PIM schema S = (S c, S a, S r ) and a set of PSM schemas PSM = {S 1,..., S n }, where each S has an interpretation I i against S. 
We also consider one specific PSM schema S = i (S c, S a, S r, S m, C S ) from this set with an interpretation I against S. For each atomic operation, we specify its input parameters together with a precondition and postcondition. If a precondition is not satisfied, the operation cannot be performed. The postcondition describes the effect of the operation. When an operation is executed on S or S, we say that the schema evolved to a new version. This is denoted S + or S +, respectively. The new version of the interpretation will be denoted I +. Initially, we suppose a single empty PIM schema and empty PSM. The PIM schema cannot be removed. On the other hand, a new PSM schema with an interpretation against the PIM schema may be created and later removed. We classify atomic operations into 4 categories: creation (denoted by the Greek letter ), update (denoted by the Greek letter ), removal (denoted by the Greek letter ı) and synchronization (denoted by the Greek letter ). While the creation, update and removal operations are common in the literature, the synchronization operations have not been considered and are novel in our approach. They are crucial for the evolution. A synchronization operation allows for the specification that two sets of components are semantically equivalent. Consider a simple scenario with a class Customer which models a concept of customer. Customer s address is modeled with attribute address. Later, users require a more precise specification of address including street, street number and city. Therefore, the designer needs to replace address with new attributes street, streetno and city. According to existing approaches, this means creating the new attributes and removing the old one. However, this leads to loosing the information that the old attribute is semantically equivalent to the new set. Without this information, the performed change cannot be correctly propagated as we will show later. 
This is the reason why we propose synchronization operations. We use them to specify that address is semantically equivalent to the set {street, streetno, city}. Definition 5.1. Let X 1 and X 2 be two sets of components from the same PIM or PSM schema. We use predicate equiv(x 1, X 2 ) to denote that X 1 and X 2 are semantically equivalent. It means that X 1 models the same information as X Atomic operations for PIM schema evolution We start with atomic operations for evolution of PIM schemas. The operations for creating new components are summarized in Table 2: their semantics is clear and so we provide no further description. Let us just note that the name, data type, and cardinality of created components are set to default values configured by the schema designer. The operations for updating components are summarized in Table 3. There are two update operations which merit a more detailed explanation moving an attribute A from its current class to another class ( class a ) and reconnecting an association end E from its current class to another class ( class r ). The preconditions of both Table 2 Atomic operations for creating new PIM components. Notation Description Precondition Postcondition C 1, C 2 S c C = c() Create class C with default name l c true C (S + c \ S c) name + (C) = l c A = a(c) + A (Sa Create attribute A with default name, C S S a) class + (A) = C name + (A) = l a c type + (A) = t a card + (A) = c a type and cardinality l a, l t, and l c R = r(c 1, C 2) Create association R with default name R = {E 1, E 2} (S + r \ S r) name + (R) = l r participant + (E 1) = C 1 participant + (E 2) = C 2 and cardinalities l r and c r card + (E 1) = c r card + (E 2) = c r Table 3 Atomic operations for updating PIM components. 
Notation Description Precondition Postcondition name c (C, v) Update name of class to v C S c name + (C) = v (A, v) name type card a name + (A) = v, type + (A) = v Update name, type, or A S a or card + (A) = v cardinality of attribute to v class A S a C S c a (A, C) Move attribute to class C associations(class(a), C) /= class + (A) = C name r (R, v) Update name of association to v R S r name + (R) = v class r (E, C) C S c ( R S r)(e R) Reconnect association end to associations(participant(e), C) /= class C participant + (E) = C card r (E, v) Update cardinality of association end to v ( R S r)(e R) card + (E) = v


operations require that the current and new class are connected by an association ($associations(C_1, C_2)$ denotes all associations connecting classes $C_1$ and $C_2$). Therefore, it is not possible to move an attribute or reconnect an association end between classes which are only connected by a path of associations or are not connected at all. However, it is possible to create a composite operation from the atomic operations which allows for moving attributes and reconnecting association ends freely. It can perform a series of atomic moves or reconnections when there is a connecting path, and it can create a temporary association connecting the classes when there is no connection at all.

Fig. 7. Evolution of a sample PIM schema demonstrating the introduced creation, update, removal and synchronization atomic operations.

The operations for removing components are summarized in Table 4; however, the class removal operation ($\delta_c$) requires the removed class to have no attributes and connected associations, so all attributes and associations connected to the class must be removed before removing the class itself.

Table 4 Atomic operations for removing PIM components.
Notation | Description | Precondition | Postcondition
$\delta_c(C)$ | Remove class C | $C \in S_c \wedge attributes(C) = \emptyset \wedge associations(C) = \emptyset$ | $C \notin S_c^+$
$\delta_a(A)$ | Remove attribute A | $A \in S_a$ | $A \notin S_a^+$
$\delta_r(R)$ | Remove association R | $R \in S_r$ | $R \notin S_r^+$

And, finally, the operations for synchronizing components are summarized in Table 5. We introduce two operations: synchronization of two sets of attributes and synchronization of two sets of associations. The precondition of the former synchronization operation requires the attributes from both sets to belong to the same class. It is not restrictive. It is possible to have two synchronized sets of attributes, where each attribute is in a different class.
However, this requires a sequence of atomic operations: moving the attributes to the same class, synchronizing them, and moving them back to their original classes. Similarly, the precondition of the other operation requires the associations to connect the same two classes. Again, it is not restrictive, because other cases may be achieved by performing a sequence of atomic operations.

The reader might notice that we do not provide an operation for synchronizing classes. An operation for synchronizing a mixture of classes, attributes and associations is missing as well. Our preliminary case studies (one of them is provided in Section 9) show that class synchronization is not necessary, as classes do not model data but only encapsulate them. Synchronization of a mixture of components would be, theoretically, necessary, but too complex and unnatural for common designers. Therefore, in the current version of our technique we try to manage the evolution without these advanced synchronization operations. We leave this scientifically interesting issue to our future work.

A sample evolution is depicted in Fig. 7. Fig. 7(a) shows a starting PIM schema. It is a fragment of the PIM schema depicted in Fig. 4. It contains two classes, Customer and Partner, which model customers and partners, respectively. Partners are responsible for customers, which is modeled by the association responsibility. First, there is a requirement to no longer consider partners. Therefore, class Partner needs to be deleted by operation $\delta_c$(Partner). Prior to $\delta_c$(Partner), it is necessary to perform $\delta_a$(code) and $\delta_r$(responsibility), which delete the attribute code and the association responsibility. The result is depicted in Fig. 7(b).

Table 5 Atomic operations for synchronization of PIM components.
Notation Description Precondition Postcondition a(x 1, X 2) Synchronize set of attributes X 2 with set of attributes X 2 X 1 S a X 2 S a ( C S c)(x 1, X 2 attributes(c)) r(x 1, X 2) Synchronize set of associations X 2 with set of associations X 2 X 1 S r X 2 S r ( C 1, C 2 S c)(x 1, X 2 associations(c 1, C 2)) equiv + (X 1, X 2) equiv + (X 1, X 2)

125 M. Nečaský et al. / The Journal of Systems and Software 85 (2012) Table 6 Atomic operations for creating PSM schemas and their components. Notation Description Precondition Postcondition (S, I) = s (S) Create new PSM schema S with interpretation I against S true C = c () Create new class C with true default name l c A = a (C ) Create new attribute A with default name, type, XML form and cardinality l a, t a, x a, and c a R = r (X, X ) Create new association R with 1 2 default name and cardinalities l r and c r C S c M = m () Creates new sequence content model M true X 1 (S c S m ) X 2 (S c S m ) \ {C S } ( R 0 )(child(r 0 ) = X 2 ) S = ({C S },,,, C S ) S (PSM + \ PSM) I = {(C S, )} C (S c + \ S c ) I + (C ) = name + (C ) = l c A (S a + \ S a ) class + (A ) = C I + (A ) = name + (A ) = l a type + (A ) = t a xform + (A ) = x a card + (A ) = c a R (S r + \ S r ) parent(r ) = X 1 child(r ) = X I + (R ) = 2 name + (R ) = l r card + (R ) = c r position(r ) + = content (C ) M (S m+ \ S m ) cmtype(m ) = sequence Second, there is a requirement to consider customer s addresses in more detail. Currently, it is modeled by attribute address of class Customer. The aim is to model addresses as depicted in Fig. 7(h). The evolution is iterative. Particular iterations are depicted in Fig. 7(c) (g). The designer starts with modeling addresses with three separate attributes street, city and country instead of the original attribute address. For this, (s)he creates the attributes (street = a(customer),...), changes the default values of their names and data types when necessary ( name a (street, street ),...), synchronizes them with the original attribute ( a ({address}, {street, city, country})) and, finally, removes the original attribute (ı a (address)). The synchronization is important. It specifies that the new attributes are semantically equivalent with the old one. 
The whole sequence of performed atomic operations can be viewed as splitting the original attribute into the three new ones. Note that the precondition for synchronization is satisfied (all attributes are in the same class). The result is depicted in Fig. 7(c). Later, the designer notices that (s)he forgot to include GPS information. (S)he needs to extend the three attributes street, city and country with a new attribute gps. For this, (s)he creates the new attribute and synchronizes the original set of attributes modeling address with the new set which is the original set extended with gps ( a ({street, city, country}, {street, city, country, gps})). The result is depicted in Fig. 7(d). Now, class Customer contains too much information and the designer wants to make it more transparent. Therefore, (s)he decides to move attributes street, city and country to a separate class Address. The class is not present and (s)he, therefore, needs to create it and update its name (Address = c(), name c (Address, Address )). (S)he also needs to connect it with Customer by creating a new association address (address = r(customer, Address),...). Then, (s)he can move the attributes to the new class ( class a (street, Address),...). The old and new class are connected by an association and, therefore, the precondition for moving the attributes is satisfied. The result is depicted in Fig. 7(e). (S)he also needs to detail gps to latitude and longitude and move them to a separate class GPS. Therefore, she performs a similar sequence of operations as for the former address attribute. And, finally, (s)he needs to extend customers to have one or two addresses instead of one. (S)he, therefore, changes the cardinality of association address to 1..2 ( card r (address2, 1..2)), where address2 is the endpoint associated with Address. The result is depicted in Fig. 7(f). 
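The attribute-splitting sequence described above (create, synchronize, remove) can be sketched with a few guarded operations. The Python class below is a simplified illustration under stated assumptions: attributes are identified by their names, creation and renaming are folded into one step, and the method names are invented for this example rather than taken from the paper or its tools.

```python
class PimSchema:
    """Toy PIM schema supporting a few atomic operations with preconditions."""

    def __init__(self):
        self.classes = set()
        self.attr_class = {}   # attribute name -> owning class (function class)
        self.equiv = []        # recorded pairs of semantically equivalent sets

    @staticmethod
    def _require(condition, message):
        # An operation whose precondition fails cannot be performed.
        if not condition:
            raise ValueError("precondition violated: " + message)

    def create_class(self, name):
        self.classes.add(name)
        return name

    def create_attribute(self, name, cls):
        self._require(cls in self.classes, "class must exist")
        self.attr_class[name] = cls
        return name

    def synchronize_attributes(self, x1, x2):
        x1, x2 = frozenset(x1), frozenset(x2)
        both = x1 | x2
        self._require(both <= set(self.attr_class), "attributes must exist")
        owners = {self.attr_class[a] for a in both}
        self._require(len(owners) == 1, "attributes must share one class")
        self.equiv.append((x1, x2))   # postcondition: equiv(X1, X2) holds

    def remove_attribute(self, name):
        self._require(name in self.attr_class, "attribute must exist")
        del self.attr_class[name]


# Splitting address into street, city and country.
schema = PimSchema()
schema.create_class("Customer")
schema.create_attribute("address", "Customer")
for new_name in ("street", "city", "country"):
    schema.create_attribute(new_name, "Customer")
schema.synchronize_attributes({"address"}, {"street", "city", "country"})
schema.remove_attribute("address")
```

The recorded equivalence survives the removal of address, which is the information a later propagation step would need in order to treat the change as a split rather than as an unrelated deletion and creation.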
Table 7. Atomic operations for updating PSM components.

Notation | Description | Precondition | Postcondition
υ′^name_c(C′, v) | Update name of class C′ to v | C′ ∈ S′_c | name⁺(C′) = v
υ′^repr_c(C′, C′_r) | Set class C′ as structural representative of C′_r | C′ ∈ S′_c \ {C′_S′} ∧ (C′_r is empty ∨ (C′_r ∈ S′_c \ {C′_S′} ∧ I(C′) = I(C′_r) ∧ C′ ∉ repr*(C′_r))) | repr⁺(C′) = C′_r
υ′^cmtype_m(M′, t) | Update type of content model M′ | M′ ∈ S′_m ∧ t ∈ {sequence, choice, set} | cmtype⁺(M′) = t
υ′^name_a(A′, v), υ′^type_a(A′, v), υ′^card_a(A′, v), υ′^xform_a(A′, v) | Update name, type, cardinality, or XML form of attribute A′ to v | A′ ∈ S′_a | name⁺(A′) = v, type⁺(A′) = v, card⁺(A′) = v or xform⁺(A′) = v
υ′^pos_a(A′) | Change position of attribute A′ by 1 | position(A′) > 1 |
υ′^class_a(A′, C′) | Move attribute A′ to class C′ | A′ ∈ S′_a ∧ C′ ∈ S′_c ∧ (repr(class(A′)) = C′ ∨ class(A′) = repr(C′) ∨ class(A′) = parentclass(C′) ∨ C′ = parentclass(class(A′))) | class⁺(A′) = C′
υ′^name_r(R′, v), υ′^card_r(R′, v) | Update name or cardinality of association R′ to v | R′ ∈ S′_r | name⁺(R′) = v or card⁺(R′) = v
υ′^pos_r(R′) | Change position of association R′ by 1 | position(R′) > 1 | position⁺(R′) = position(R′) − 1
υ′^class_r(R′, P′) | Reconnect parent association end of association R′ to new parent P′ | R′ ∈ S′_r ∧ P′ ∈ S′_c ∪ S′_m ∧ (repr(parent(R′)) = P′ ∨ parent(R′) = repr(P′) ∨ ∃R′_p ∈ S′_r which connects parent(R′) and P′) | parent⁺(R′) = P′

In the following step, the designer gets a requirement to explicitly distinguish the semantics of the two addresses as a mandatory shipping address and an optional billing address. Therefore, (s)he splits the association address into two new associations shipto and billto. As with the splitting of attributes, this

entails creating two new associations (shipto = α_r(Customer, Address), ...), changing their default names and cardinalities (υ^name_r(shipto, 'shipto'), ...), synchronizing the old association with the new ones (σ_r({address}, {shipto, billto})), and removing the old association (δ_r(address)). Note that the synchronized associations connect the same classes and, therefore, the preconditions for the synchronization are satisfied. The result is depicted in Fig. 7(g).

Finally, there appears a requirement to record GPS information for each address instead of for each customer. For this, the designer reconnects the association gps from class Customer to class Address (υ^class_r(gps1, Address)), where gps1 is the endpoint associated with Customer. Again, the precondition for the reconnection is satisfied, because the classes are connected by an association. The result is depicted in Fig. 7(h).

Table 8. Atomic operations for updating interpretations.

Notation | Description | Precondition | Postcondition
υ′^int_c(C′, C) | Update interpretation of class C′ to class C | C′ ∈ S′_c \ {C′_S′} ∧ (C is empty ∨ C ∈ S_c) ∧ (∀A′ ∈ S′_a s.t. intcontext(A′) = intcontext(C′) ∧ C′ ∈ anc(A′))(I(A′) is empty) ∧ (∀R′ ∈ S′_r s.t. (intcontext(R′) = intcontext(C′) ∧ C′ ∈ anc(R′)) ∨ child(R′) = C′)(I(R′) is empty) ∧ (∀C′_0 ∈ S′_c)(repr(C′_0) ≠ C′) ∧ repr(C′) is empty | I⁺(C′) = C
υ′^int_a(A′, A) | Update interpretation of attribute A′ to attribute A | A′ ∈ S′_a ∧ (A is empty ∨ (A ∈ S_a ∧ class(A) = I(intcontext(A′)))) | I⁺(A′) = A
υ′^int_r(R′, O) | Update interpretation of association R′ to directed image O of association R | R′ ∈ S′_r ∧ child(R′) ∈ S′_c ∧ (O is empty ∨ (O = (E_1, E_2) ∧ participant(E_1) = I(intcontext(R′)) ∧ participant(E_2) = I(child(R′)))) | I⁺(R′) = O

5.2. Atomic operations for PSM schema evolution

In this section, we introduce atomic operations for evolution of PSM schemas. The operations for creating new components are summarized in Table 6; there is also an operation for creating PSM schemas themselves.
Again, names, data types, XML forms and cardinalities of new components are set to default values which are configured by the designer. All components are created with an empty interpretation against the PIM schema.

The operations for updating components are summarized in Table 7. As with the operations for updating PIM components, there are two interesting operations: moving an attribute (υ′^class_a) and reconnecting an association end (υ′^class_r). Both are similar to their PIM equivalents but there are some differences. An attribute can be moved to the nearest ancestor or descendant class of its current class (parentclass(C′) denotes the nearest ancestor class of C′). It can also be moved to a structural representative of its current class or, conversely, to a class which its current class structurally represents. For an association, only its parent association end can be reconnected, namely to the parent or to any child of its current parent. When its current parent is a class, it can also be reconnected to a structural representative of the current parent or, conversely, to a class which the current parent structurally represents.

The operations for updating interpretations are summarized in Table 8. Their preconditions ensure that the consistency of interpretation is not violated. Concretely, the operation for updating a class interpretation (υ′^int_c) could violate any of the conditions necessary for consistency. However, its precondition requires that the interpretation of any attribute or association whose consistency would be corrupted by the update must be empty. This includes all attributes and associations which have the same interpreted context as the class (anc(X′), used in the precondition, denotes all ancestor classes of X′). Also, the class cannot be a structural representative and, conversely, it cannot have a structural representative. Therefore, the conditions cannot be violated.
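The precondition for moving a PSM attribute can be read as a disjunction of four cases. The sketch below encodes that reading in Python; the function name and the two dictionaries are hypothetical, and the mapping `repr_of` assumes that repr(C′) yields the class which C′ structurally represents, as suggested by the postcondition repr⁺(C′) = C′_r in Table 7.

```python
def can_move_attribute(current, target, parentclass, repr_of):
    """Check the move precondition for an attribute of class `current`.

    parentclass: maps a PSM class to its nearest ancestor class.
    repr_of:     maps a PSM class to the class it structurally represents.
    Both mappings are assumptions made for this illustration.
    """
    return (
        repr_of.get(current) == target          # repr(class(A')) = C'
        or repr_of.get(target) == current       # class(A') = repr(C')
        or parentclass.get(target) == current   # class(A') = parentclass(C')
        or parentclass.get(current) == target   # C' = parentclass(class(A'))
    )


parentclass = {"Address": "Contact", "GPS": "Address"}
repr_of = {"ShipAddr": "Address"}
allowed = can_move_attribute("Contact", "Address", parentclass, repr_of)
blocked = can_move_attribute("Contact", "GPS", parentclass, repr_of)
```

A move from Contact to GPS is rejected here because GPS is not the nearest descendant class of Contact; under this reading, such a move would have to be performed as two atomic moves via Address.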
The operation for updating an attribute interpretation (υ′^int_a) could affect condition (2), and the operation for updating an association interpretation (υ′^int_r) could affect conditions (3) and (4). Their preconditions prevent any violations. (They are directly rewritten from the definition.)

The operations for removing components of PSM schemas are listed in Table 9. Their functionality is quite clear. Let us note that we can only remove classes and content models that are empty and are roots of their PSM schema. Also, we can only remove associations whose removal does not violate Definition 4.7. When there are attributes or associations in the subtree of R′ with the same interpreted class context as R′ and with non-empty interpretations, we cannot remove R′. To correct the schema, we would need to set empty interpretations to these attributes and associations, which is not an atomic operation. Note that when we remove an association going to a class or content model, this class or content model becomes a root.

Table 9. Atomic operations for removing PSM schemas and their components.

Notation | Description | Precondition | Postcondition
δ′_s(S′) | Remove existing PSM schema S′ and its interpretation I against S | S′ ∈ PSM ∧ S′ = (S′_c, S′_a, S′_r, S′_m, C′_S′) ∧ S′_a = S′_r = S′_m = ∅ ∧ S′_c = {C′_S′} | S′ ∉ PSM⁺
δ′_c(C′) | Remove class C′ | C′ ∈ S′_c ∧ attributes(C′) = ∅ ∧ content(C′) = ∅ ∧ (∄C′_0 ∈ S′_c)(repr(C′_0) = C′) | C′ ∉ S′_c⁺
δ′_a(A′) | Remove attribute A′ | A′ ∈ S′_a | A′ ∉ S′_a⁺
δ′_r(R′) | Remove association R′ | R′ ∈ S′_r ∧ (∀X′ ∈ (S′_a ∪ S′_r) : intcontext(X′) = intcontext(R′) ∧ R′ ∈ anc(X′))(I(X′) is empty) | R′ ∉ S′_r⁺
δ′_m(M′) | Remove content model M′ | M′ ∈ S′_m ∧ content(M′) = ∅ | M′ ∉ S′_m⁺

And, finally, the operations for synchronizing two sets of PSM components are listed in Table 10. Similar to their PIM equivalents, they allow for synchronization of two sets of attributes and two sets of associations. The operation for attributes corresponds to its PIM equivalent. The operation for associations is also similar. However, it is not possible to require the associations to have the same

Fig. 8. Evolution of a sample PSM schema demonstrating the introduced creation, update, removal and synchronization atomic operations.

participants (because of the tree nature of PSM schemas). Instead, we require that they have one of their participants in common: that is the child of none or one of the associations and the parent of the others. The other participants must be different classes but with the same non-empty interpretation. In other words, these other participants are semantically equivalent (they have the same class in the PIM schema as their interpretation). Therefore, the operation also corresponds to its PIM equivalent. Note that, similar to synchronization in a PIM schema, the expression equiv⁺(X′_1, X′_2) = true in the postconditions of both operations denotes that X′_1 and X′_2 are synchronized in the new version of the PSM schema.

We demonstrate the operations in Fig. 8. Fig. 8(a) shows a starting PSM schema. It has an interpretation against the PIM schema depicted in Fig. 7(a). The PIM schema evolves as we have demonstrated, so the consistency of the interpretation of the PSM schema is broken. In this example, we show how the PSM schema and its interpretation can be adapted using the introduced atomic operations to restore the consistency. Fig. 8(b)–(h) show particular evolutionary steps which result from the changes at the PIM level demonstrated in Fig. 7.

First, the designer needs to remove class Partner (δ′_c(Partner′)) to reflect the first change in the PIM schema (removing class Partner). Prior to this, (s)he removes attribute code (δ′_a(code′)) and both associations partner and customer (δ′_r(partner′), δ′_r(customer′)). Then, (s)he creates a new association customer connecting the schema class and class Customer (customer′ = α′_r(CustomerDetailSchema′, Customer′)) and sets its name (υ′^name_r(customer′, 'customer')). The association has an empty interpretation.
The result is depicted in Fig. 8(b).

Table 10. Atomic operations for synchronization of components of PSM schemas.

Notation | Description | Precondition | Postcondition
σ′_a(X′_1, X′_2) | Synchronize set of attributes X′_2 with set of attributes X′_1 | X′_1 ⊆ S′_a ∧ X′_2 ⊆ S′_a ∧ (∃C′ ∈ S′_c)(X′_1, X′_2 ⊆ attributes(C′)) | equiv⁺(X′_1, X′_2)
σ′_r(X′_1, X′_2) | Synchronize set of associations X′_2 with set of associations X′_1 | X′_1 ⊆ S′_r ∧ X′_2 ⊆ S′_r ∧ (∃C′_1 ∈ S′_c, C_2 ∈ S_c non-empty)(∀R′ ∈ X′_1 ∪ X′_2)((C′_1 = parent(R′) ∧ child(R′) ∈ S′_c ∧ I(child(R′)) = C_2) ∨ (C′_1 = child(R′) ∧ parent(R′) ∈ S′_c ∧ I(parent(R′)) = C_2)) | equiv⁺(X′_1, X′_2)
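The precondition for synchronizing PSM associations can be illustrated with a small check. This is a sketch under stated assumptions, not the paper's formalization: associations are given as hypothetical (parent, child) pairs, `interp` maps a PSM class to the name of its PIM class (or None when the interpretation is empty), and the additional tree constraint that the common participant is the child of at most one association is not checked.

```python
def can_sync_associations(assocs, interp):
    """Check whether the associations of both synchronized sets qualify.

    assocs: list of (parent, child) pairs taken from both sets.
    interp: maps a PSM class to its PIM class, None meaning empty.
    """
    if not assocs:
        return False
    # The shared participant must occur in every association, so it is
    # necessarily one of the two ends of the first association.
    for common in assocs[0]:
        others = []
        for parent, child in assocs:
            if parent == common:
                others.append(child)
            elif child == common:
                others.append(parent)
            else:
                break  # this association does not touch `common`
        else:
            images = {interp.get(o) for o in others}
            distinct = len(set(others)) == len(others)
            # The other ends must be different classes that share one
            # non-empty interpretation.
            if distinct and len(images) == 1 and None not in images:
                return True
    return False


interp = {"Address": "Address", "ShipAddr": "Address", "BillAddr": "Address"}
split = [("Contact", "Address"), ("Contact", "ShipAddr"), ("Contact", "BillAddr")]
ok = can_sync_associations(split, interp)
```

In the example, the three associations share the parent Contact, and their children are different PSM classes interpreted by the same PIM class Address, which is the situation that arises when the association address is split into shipto and billto.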

Second, address was split into three new attributes in the PIM schema. The designer correspondingly needs to split attribute address into three new attributes street, city, and country in the PSM schema. (S)he creates the attributes (street′ = α′_a(Contact′), ...) and sets their names (υ′^name_a(street′, 'street'), ...) and interpretations (υ′^int_a(street′, street), ...). Then, (s)he synchronizes the new attributes with the original one (σ′_a({address′}, {street′, city′, country′})). Note that the preconditions of both setting the interpretations and the synchronization are satisfied: the attributes are in the respective interpreted context and are within the same class. Finally, (s)he removes the old attribute address (δ′_a(address′)). The result is depicted in Fig. 8(c).

The designer then proceeds with extending the three new attributes with a new attribute gps. (S)he creates the attribute (gps′ = α′_a(Contact′), ...) and sets its name (υ′^name_a(gps′, 'gps'), ...) and interpretation (υ′^int_a(gps′, gps), ...). Then, (s)he specifies that the three original attributes are semantically equivalent to the extension (σ′_a({street′, city′, country′}, {street′, city′, country′, gps′})). The result is depicted in Fig. 8(d).

Third, the designer needs to separate attributes street, city, and country into a new class Address. (S)he creates the new class (Address′ = α′_c()) and sets its name (υ′^name_c(Address′, 'Address')) and interpretation (υ′^int_c(Address′, Address)). Then, (s)he connects the new class with Contact by creating a new association (address′ = α′_r(Contact′, Address′)). (S)he sets its name (υ′^name_r(address′, 'address')) and interpretation (υ′^int_r(address′, address)). Now, the preconditions allow for moving the attributes from Contact to Address (υ′^class_a(street′, Address′), ...). The designer moreover needs to specify that a customer has one or two addresses (υ′^card_r(address′, 1..2)). The result is depicted in Fig. 8(e).
Later, (s)he similarly moves the attribute gps to a new class GPS and splits it into two new attributes longitude and latitude as depicted in Fig. 8(f).

Fourth, the PIM schema now distinguishes two different addresses: shipping and billing. The designer needs to reflect this change in the PSM schema by splitting the association address correspondingly. However, the change is more complex than in the case of the PIM schema, because the resulting PSM schema must be a tree. (S)he first needs to create new classes ShipAddr and BillAddr, set their names, and set their interpretation to the PIM class Address. Now (s)he may split address. (S)he creates two new associations shipto and billto connecting Contact with ShipAddr and BillAddr, respectively. (S)he sets their names and interpretations to the PIM associations shipto and billto, respectively. (S)he also sets the cardinality of billto so that the billing address is optional. Then, (s)he synchronizes the original association with the new ones (σ′_r({address′}, {shipto′, billto′})) and removes the original one (δ′_r(address′)). (S)he wants both new addresses to model the same XML fragments as the original one and, therefore, sets ShipAddr and BillAddr as structural representatives of Address (υ′^repr_c(ShipAddr′, Address′), ...).

In the final step, the designer needs to reflect in the PSM schema the reconnection of the association gps in the PIM schema. The impact of this change on the PSM schema is that both the shipping and the billing address have GPS information. Therefore, the designer needs to reconnect the gps association to class Address. This requires two atomic reconnections of gps: first, from class Contact to class ShipAddr (υ′^class_r(gps′, ShipAddr′)) and then to Address (υ′^class_r(gps′, Address′)). Note that both reconnections are allowed by the operation precondition. In the first case, the reconnection is between classes connected by an association. In the other case, the reconnection is between a structural representative and its referenced class.

6. Propagation of atomic operations

According to Section 4.3, an interpretation of a PSM schema S′ against a PIM schema S must be consistent. When S or S′ is modified by an atomic operation, one or more conditions necessary for consistency may be violated and, consequently, the interpretation or the other schema must be adapted accordingly. We call the process which ensures the adaptation propagation of the atomic operation. In the example in the previous section we showed how a designer can solve this issue manually (our designer performed a sequence of operations in the PIM schema and then needed to perform similar steps in the PSM schema). In this section, we show how the propagation can be automated. Since many PSM schemas may be affected, automation is very helpful.

6.1. Propagation from PIM to PSM level

In this section, we describe how the introduced atomic operations executed on the PIM schema S are propagated to each PSM schema S′ ∈ PSM and its interpretation I against S. We will demonstrate the propagation on our sample PIM and PSM schema evolution depicted in Figs. 7 and 8, respectively. We suppose that the designer manually changes the PIM schema in the steps depicted in Fig. 7. In Section 5.2 we showed in Fig. 8 how the designer manually adapts the PSM schema according to the changes in the PIM schema. In this section we show that our propagation mechanism is able to adapt the PSM schema automatically, which reduces the designer's manual work.

Let us start with propagating the creation operations. Creating a new component X in S does not automatically imply the existence of any component in S′. This is because the creation does not violate Definition 4.5. Moreover, X models a new part of reality which has no representation in the PSM schemas to which its creation could be propagated. It is up to the designer whether or not to create new components in the PSM schemas which represent this new part of reality.
Therefore, the creation operations are not propagated. Let us consider the evolution of our sample PIM schema depicted in Fig. 7(e). Here, the designer first created a new class Address and association address. These operations on their own do not automatically result in creating new classes and associations in the PSM schemas. It is up to the designer whether to propagate them or not. For example, (s)he later decides to move some attributes from Customer to Address. In that case it is necessary to create new classes with Address as their interpretation in the PSM schemas, because we need to correspondingly move attributes in the PSM schemas.

An update of a component X of S may have an impact on each component X′ in the PSM schema with I(X′) = X, and its propagation may be necessary. More specifically, an update of the name of X is propagated to an optional update of the name of X′. This is because X and X′ do not necessarily need to share the same name. On the other hand, an update of the type or cardinality of X is propagated to a mandatory update of the type or cardinality of X′. In our sample PIM schema evolution depicted in Fig. 7(f), the cardinality of the endpoint of association address connected to class Address was updated from 1..1 to 1..2. This is automatically propagated by our mechanism to all associations in the PSM schemas with address as an interpretation. For example, it is propagated to association address depicted in Fig. 8(f).

The propagation of the two remaining update operations, i.e. moving an attribute and reconnecting an association end, is more complex. Both operations modify the structure of S, which may break the consistency of the interpretation. The impact on the structure of S′ may be quite extensive and it would be almost impossible for the designer to manage the impact manually. The

idea of propagation is similar for both operations even though reconnecting an association end is technically more complicated. Therefore, we will discuss only the first one. Suppose that υ^class_a(A, D) was performed. In other words, an attribute A in the PIM schema was moved from its current class C to another class D. Consider an attribute A′ in the PSM schema s.t. I(A′) = A. Since the interpretation was consistent, we see that class(A′) = C′ s.t. intcontext(C′) = C = class(A). By executing υ^class_a(A, D) we get class(A) = D. On one hand, A′ is semantically an attribute of C. On the other hand, its semantics is A, which is now an attribute of D ≠ C. Therefore, the move of A must be propagated to a corresponding move of A′. Concretely, we have to move A′ to a class with the interpretation D to make the interpretation consistent.

The move and its propagation are illustrated in Fig. 9. Fig. 9(a) contains a PIM schema fragment before executing the operation (on the left-hand side of the thick arrow) and the fragment after the move (on the right-hand side). Fig. 9(b)–(d) contain three PSM schema fragments before and after the propagation. They illustrate three basic situations which may occur. Suppose a class C′ in the PSM schema with the interpretation C. Let A′ be an attribute of C′ with the interpretation A. The first situation is depicted in Fig. 9(b). Here, C′ contains an association R′ with the interpretation R. Its child is a class D′ with the interpretation D. In this case the propagation means moving A′ to D′, which makes the interpretation consistent. The second situation is depicted in Fig. 9(c). Here, there is an association R′ with the interpretation R which goes to C′. Its parent is a class D′ with the interpretation D. Again, this case means moving A′ to D′. The last situation is depicted in Fig. 9(d). Here, there is no association connected to C′ with the interpretation R.
In this case, propagation means creating a new association R′ with the interpretation R connecting C′ and a new class D′ with the interpretation D. A′ may then again be moved to D′. There are also some other situations which differ from the three demonstrated only in technical details. These include situations with content models or classes without an interpretation on the path between C′ and D′. We have solved these situations in our implementation but do not specify them in this paper.

In the general case, there can be several different associations R_1, ..., R_n connecting C and D. There are associations R′_1, ..., R′_n connected to C′ with directed images of R_1, ..., R_n as interpretations, respectively. If some R′_i is missing, we ask the designer whether it should be created.³ If we apply the previous idea, we get up to n classes C′_{v,1}, ..., C′_{v,n} to which A′ should be moved. However, such a move is not possible. Instead, we make a copy A′_i of A′ for each C′_{v,i} and move the copy to C′_{v,i}. Making a copy means the following sequence of atomic operations: (1) creating A′_i, (2) synchronizing it with A′ (this is important since it specifies that A′ and A′_i model the same information), (3) setting the properties of A′_i to the same values as those of A′, and (4) moving A′_i.

In our sample evolution depicted in Fig. 7(e), the designer moved attributes street, city and country from class Customer to class Address. This makes the interpretation of the PSM schema depicted in Fig. 8(d) inconsistent. There are attributes street, city and country having the moved PIM attributes as their interpretation. Our propagation mechanism automatically ensures that the attributes are moved correspondingly so that the interpretation is consistent again, as depicted in Fig. 8(e). First, the mechanism automatically creates a new class Address, which was not present in the PSM schema, and connects it with class Contact by a new association. Then, it automatically moves the attributes.
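The three basic situations of Fig. 9 can be summarized in a small sketch. It is a simplification made for illustration only: the type `PsmClass` and the function `propagate_attribute_move` are hypothetical names, associations are reduced to direct parent and child links, the interpretation of the connecting association is not checked, and content models as well as classes without an interpretation on the path are ignored.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False, repr=False)
class PsmClass:
    """Hypothetical PSM class node in the schema tree."""
    name: str
    interp: Optional[str]                      # interpreted PIM class, None = empty
    attrs: list = field(default_factory=list)
    parent: Optional["PsmClass"] = None
    children: list = field(default_factory=list)


def propagate_attribute_move(source, attr, target_pim):
    """Move `attr` out of `source` to a class interpreted as `target_pim`."""
    # Situation 1: a child class already has the target interpretation.
    dest = next((c for c in source.children if c.interp == target_pim), None)
    # Situation 2: the parent class has the target interpretation.
    if dest is None and source.parent is not None \
            and source.parent.interp == target_pim:
        dest = source.parent
    # Situation 3: no such class is connected, so a new one is created.
    if dest is None:
        dest = PsmClass(name=target_pim, interp=target_pim, parent=source)
        source.children.append(dest)
    source.attrs.remove(attr)
    dest.attrs.append(attr)
    return dest


contact = PsmClass("Contact", "Customer", attrs=["street", "city", "country", "name"])
for a in ("street", "city", "country"):
    address = propagate_attribute_move(contact, a, "Address")
```

In this usage, the first move falls under the third situation and creates the class Address, while the two following moves fall under the first situation and reuse it, which corresponds to the step from Fig. 8(d) to Fig. 8(e).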
³ If all of them are missing and the designer decides not to create any, no propagation is performed.

Removing components of S must be propagated by removing the corresponding components of S′ or by setting their interpretations to empty in order to keep the interpretation consistent. More specifically, removing an attribute A leads to removing each attribute A′ in S′ s.t. I(A′) = A or to setting I(A′) to empty. Both solutions are correct (i.e. they do not break the consistency of interpretation) and, therefore, the designer has to decide. Removing an association R leads mandatorily to removing each association R′ in S′ s.t. I(R′) is a directed image of R. We cannot set I(R′) to empty. This is because, by condition (3) of Definition 4.7, an association R′ with a non-empty interpretation has a child with a non-empty interpretation and vice versa. Setting I(R′) to empty would break this condition.

And, finally, removing a class C leads to removing each class C′ in S′ s.t. I(C′) = C or to setting I(C′) to empty. Both possibilities are correct. From the precondition of the operation for removing a class, C has no attributes and there are no associations connected to C. Because of conditions (2) and (3) of Definition 4.7, there is no attribute or association in S′ in the interpreted context of C′ with a non-empty interpretation. Therefore, it is possible to set I(C′) to empty. It is also possible to remove C′. However, it may have attributes and there may be associations connected to C′ with empty interpretations. These must be removed first. There are also some technical details we do not discuss further. For example, parent ends of the associations going from C′ may be reconnected to the parent of C′ in certain cases.

In our sample evolution depicted in Fig. 7(b), the designer removed association responsibility and class Partner with its attribute code. The propagation mechanism ensured that the corresponding components in our sample PSM schema were removed after a dialogue with the designer, as depicted in Fig. 8(b).
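The difference between the optional treatment of attributes and the mandatory treatment of associations can be sketched as follows. The two functions are hypothetical; PSM components are represented only by a dictionary from a component name to the name of its interpretation (None standing for the empty interpretation), and the directed image of an association is simplified to the association name.

```python
def propagate_attribute_removal(psm_attrs, pim_attr, remove):
    """Propagate the removal of PIM attribute `pim_attr`.

    `remove` stands for the designer's decision: True removes the affected
    PSM attributes, False keeps them with an empty interpretation.
    """
    result = {}
    for name, interp in psm_attrs.items():
        if interp != pim_attr:
            result[name] = interp      # unaffected attribute
        elif not remove:
            result[name] = None        # kept, interpretation set to empty
    return result


def propagate_association_removal(psm_assocs, pim_assoc):
    """Removal is mandatory: an association cannot keep an empty
    interpretation while its child has a non-empty one."""
    return {n: i for n, i in psm_assocs.items() if i != pim_assoc}


attrs = {"code'": "code", "name'": "name"}
removed = propagate_attribute_removal(attrs, "code", remove=True)
kept = propagate_attribute_removal(attrs, "code", remove=False)
```

Both functions return a new dictionary instead of modifying their input, so the two alternatives offered to the designer can be compared before one of them is applied.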
Synchronizing two sets X_1 and X_2 of components of S means that the existence of both sets must be synchronized at all levels. Whenever there is an equivalent of X_1 in the PSM schema S′, there must also be an equivalent of X_2, and vice versa. The operation for synchronization of attributes, i.e. σ_a(X_1, X_2), only enables one to synchronize two sets of attributes which have a common class C. Let C′ be a class s.t. I(C′) = C which contains attributes whose interpretations are all attributes from X_1. Since X_1 and X_2 are semantically equivalent, our propagation mechanism interprets this as a fact that C′ must be supplemented with new attributes so that it contains attributes whose interpretations are all attributes from X_2 as well (and conversely). There are some technical details we do not discuss in depth. First, we have to consider also all attributes of repr(C′) if repr(C′) is non-empty. Second, we have to consider all attributes with C′ as an interpreted context, not only the attributes of C′.

In our sample evolution depicted in Fig. 7(c), the designer split the original attribute address into three new attributes street, city and country. For this, after the creation of the new attributes (which is not propagated to the PSM level, as we have already discussed), the designer synchronized the original attribute with the new ones. The synchronization is automatically propagated by our mechanism as follows: wherever there is an attribute address with the interpretation address in a PSM schema, create three new attributes street, city and country and synchronize them with address. The result of this automatic propagation is depicted in Fig. 8(c). Note that after the synchronization, the designer removed the original attribute address. This was propagated by our mechanism to removing the attribute address in the PSM schema after the decision of the designer.

The operation for synchronization of associations, i.e. σ_r(X_1, X_2), is very similar to the previous case.
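The propagation of an attribute synchronization to a single PSM class can be sketched as below. The function is hypothetical and deliberately narrow: it looks only at the attributes of one class (it ignores structural representatives and the wider interpreted context mentioned above), it names each created PSM attribute after its PIM attribute with a trailing prime, and it does not record the synchronization of the new PSM attributes itself.

```python
def propagate_attribute_sync(class_attrs, x1, x2):
    """Supplement one PSM class after PIM attribute sets x1 and x2 are synchronized.

    class_attrs: dict PSM attribute name -> interpreted PIM attribute
                 (modified in place).
    Returns the PIM attributes for which a PSM attribute was created.
    """
    present = set(class_attrs.values())
    if x1 <= present:
        missing = x2 - present      # the class covers X1: add what X2 needs
    elif x2 <= present:
        missing = x1 - present      # and conversely
    else:
        return []                   # neither set is fully represented here
    created = sorted(missing)
    for pim_attr in created:
        class_attrs[pim_attr + "'"] = pim_attr
    return created


contact = {"address'": "address", "name'": "name"}
new_attrs = propagate_attribute_sync(contact, {"address"}, {"street", "city", "country"})
```

A class that represents neither of the two sets completely is left untouched, which matches the rule that propagation applies only where an equivalent of one of the sets already exists.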
Again, when their common class C′ contains associations whose interpretations are directed images of all associations from X_1, it must be supplemented so that it contains associations whose interpretations are directed images of all associations from X_2, and vice versa. Again, there are technical details we omit for space limitations, i.e. we must not forget those

Fig. 9. Visualization of the mechanism for propagating the operation for moving PIM attributes.

associations that are implicitly in the content of C′ if it is a structural representative, and we must consider not only the associations which have C′ as a parent but all which have C′ as their interpreted context. Synchronization of associations is demonstrated in Fig. 7(g), where the designer split association address into two new associations shipto and billto. The result of automatic propagation is depicted in Fig. 8(g).

6.2. Propagation from PSM to PIM level

In this section, we describe the opposite direction of propagation, i.e. how operations executed on a PSM schema S′ ∈ PSM are propagated to the PIM schema S. Again, we will demonstrate the propagation on our sample PIM and PSM schema evolution depicted in Figs. 7 and 8, respectively. Now, however, we will suppose that the designer manually changes the PSM schema according to the steps depicted in Fig. 8. We show how our propagation mechanism ensures that the PIM schema is adapted automatically.

Creating a new component in S′ does not directly imply the existence of any component in S and, therefore, creation operations are not propagated from the PSM to the PIM level. For example, creating a new class Address and an association address connecting Contact and Address in Fig. 8(e) does not imply creating a corresponding class and association in the PIM schema. The creation will be performed in the PIM schema only when it is explicitly required by the designer. The propagation mechanism then also ensures that the interpretations of the class and association in the PSM schema are set correctly. The result is depicted in Fig. 7(e).

Updating components in S′ has an effect on the corresponding components in S, with some exceptions; there are updates with no effect on the PIM schema S, because the updated properties have no equivalent in S.
This includes updating a structural representative of a class, updating the position or XML form of an attribute and updating the position of an association. There are also updates which are only optionally propagated to S. This is similar to the other direction; an example is changing the name of an attribute. And there are operations which are propagated mandatorily. The simple case is, for example, updating a cardinality, which is propagated in a straightforward manner. As in the other direction, there are two operations whose propagation is mandatory and more complex: moving an attribute and reconnecting an association end. When the interpretation of a moved attribute or reconnected association is empty, the change is not propagated to S at all. This is because the updated component has no equivalent in S and, therefore, the consistency of interpretation is not broken. Similarly, no propagation is necessary when the interpreted context of the updated component has not changed. In that case, there is no change from the conceptual point of view. For example, consider the PSM schema in Fig. 8(b). Let the designer move attribute address from class Contact to Customer. The move is within the same interpreted context (which is class Customer) and, therefore, the attribute was not moved from the conceptual perspective and its interpretation remains consistent. No propagation is necessary in this case.

In other cases, propagation is necessary. However, except for the technical details, the principles of the propagation are similar to the other direction, so we omit their detailed description. For example, suppose that the designer moves attributes street, city and country from class Contact to class Address as depicted in Fig. 8(e). The interpreted context is changed (from class Customer to class Address). Our propagation mechanism automatically ensures that the interpretations of the three attributes (i.e.
street, city and country) are moved correspondingly in the PIM schema. The resulting PIM schema is depicted in Fig. 7(e).

Similar to the updates, removing a component from S′ is not propagated to S when the removed component has an empty interpretation. Removing a PSM component X′ with an interpretation X may imply removing X when there are no other PSM components with the interpretation X. However, even when there are no other PSM components with the interpretation X, we do not remove X automatically. This is because PSM schemas are only views of the whole domain modeled by the PIM schema. The absence of a given concept modeled by X in the views does not imply the necessity of removing X from the PIM schema. The removal of X is, therefore, only optional. For example, when the designer removes class Partner as depicted in Fig. 8(b), the propagation mechanism asks the designer whether or not the corresponding class Partner in the PIM schema should be removed as well. In our case, the designer decides to remove Partner as depicted in Fig. 7(b).

Synchronization of two sets X′_1 and X′_2 is propagated from S′ to S very similarly to the opposite direction. The only difference is that X′_1 and X′_2 may contain components both with and without an interpretation. If X′_1 (or X′_2) contains only components with an interpretation, its semantic equivalent exists in S, and each component X′ of X′_2 (or X′_1, respectively) which does not have an interpretation is, therefore, propagated to S. Propagation means creating a new component X corresponding to X′ and setting I(X′) = X. Otherwise, the synchronization is not propagated to the

PIM level, because an equivalent of X′_1 (or X′_2, respectively) does not exist in S. Sample synchronization operations are demonstrated in Fig. 8(c) (attribute synchronization) and Fig. 8(g) (association synchronization). They are automatically propagated by our mechanism to the PIM schema as depicted in Fig. 7(c) and (g), respectively.

Minimality and correctness of atomic operations

Important properties of any set of atomic operations are its minimality and correctness. Minimality means that there is no atomic operation which could be expressed as a sequence of other atomic operations. Correctness means, in our specific case, not only that an atomic operation transforms a schema from one consistent state to another consistent state but also that the propagation mechanism preserves the consistency of the interpretations of PSM schemas against the PIM schema.

Theorem 6.1. The set of atomic operations is minimal.

Proof. Consider the operations for evolution of classes in the PIM schema, i.e. α_c, δ_c and υ^name_c. Without α_c we are not able to create any class. Similarly, without δ_c we are not able to remove any class. Finally, without υ^name_c we are not able to change the class name. It cannot be set during the creation, because α_c sets a default name. The proof for the other atomic operations for creating, removing and updating PIM and PSM components is similar. The operations for synchronizing two sets of attributes or associations are clearly atomic as well: no other operation allows for synchronization.

Theorem 6.2. The set of atomic operations together with the propagation mechanism is correct.

Proof. We have already proved the correctness in the previous text. In Section 5.2 we have shown that the preconditions of the operations for updating interpretations of PSM components ensure that the consistency of interpretation cannot be broken.
In Sections 6.1 and 6.2 we have shown that the propagation mechanism repairs the consistency of interpretation when it is broken by moving attributes, reconnecting association ends and removing components. We have also shown that the other operations do not affect the consistency at all. And, finally, we have shown in these sections that whenever the propagation mechanism needs to perform a sequence of atomic operations to repair the consistency of interpretation, the preconditions of these operations are always satisfied, so that the sequence may be performed at any time.

6.4. Completeness of atomic operations

Sometimes completeness is understood as a property which ensures that for any two given schemas there always exists a sequence of atomic operations which transforms one of the schemas to the other. The sequence usually removes all components of the former schema and creates the components of the other. This is not a correct proof of completeness, because it does not consider possible semantic relationships between the components of both schemas. The old components are simply removed and the new ones are created without preserving the semantic relationships, so this only covers the structural part of the schema. What we also aim for is preserving the semantic part of the schemas. This is largely dependent on the user and his/her interpretation of the meaning of the schemas. However, even if semantic relationships are considered (e.g. semantic equivalence in our case), it is not easy to prove general completeness formally. Even though such a proof would be interesting from the theoretical point of view, it is beyond the scope of this paper. Instead, our aim is to demonstrate completeness practically. In this paper we provide a case study of a real-world system where we applied our approach. It shows experimentally that the proposed set of atomic operations is complete, i.e. sufficient for real-world situations.
The case study can be found in Section 9.

7. Composite operations

The atomic operations introduced formally in the previous sections were proposed so that they form a minimal and correct set, as proven above. Naturally, they are not supposed to be used directly by the user in all cases, and they are not the whole set of available operations. In this section we show examples of how the atomic operations can form more user-friendly and realistic composite operations. Formally, a composite operation is a sequence of two or more atomic operations. As we have shown in the previous text, the propagation mechanism ensures that no atomic operation corrupts the consistency of the affected interpretations. Therefore, a composition of atomic operations preserves consistency as well, and it is not necessary to extend the propagation mechanism with specifics of the composition.

Creation with parameters. A simple composite operation necessary in every system is creating a particular component with pre-set values. We show such a case in the operation $createPIMAttr(C, n, t, c)$, which creates a PIM attribute in a class $C$ with name $n$, data type $t$ and cardinality $c$. It consists of the following steps:

$A = \alpha_a(C)$; $\nu_a^{name}(A, n)$; $\nu_a^{type}(A, t)$; $\nu_a^{card}(A, c)$

The propagation mechanism optionally creates corresponding PSM components.

Splitting of a PIM attribute. This operation is a typical example of drill-down modeling, i.e. creating more and more precise data structures. An example of such an operation is shown in Fig. 7(c), where the designer needs to detail a single-valued address of a customer to street, city and country. In general, the composite operation $splitPIMAttr(A, \{n_1, n_2, \ldots, n_k\})$ for splitting a PIM attribute $A$ of a class $C$ to a set of attributes with names $\{n_1, n_2, \ldots, n_k\}$ consists of the following steps:

$A_1 = createPIMAttr(C, n_1, type(A), card(A))$; $\ldots$; $A_k = createPIMAttr(C, n_k, type(A), card(A))$; $\sigma_a(\{A\}, \{A_1, A_2, \ldots, A_k\})$; $\delta_a(A)$

The propagation mechanism, in particular in the case of synchronization, ensures that all the PSM attributes representing $A$ are replaced with PSM attributes representing $A_1, A_2, \ldots, A_k$. In our sample depicted in Fig. 7(c), the designer would execute a single composite operation $splitPIMAttr(address, \{street, city, country\})$.

Removing a PSM tree. In the previous cases we have shown an example of a composite operation which consists of a sequence of atomic operations and a composite operation which consists of atomic and other composite operations. Operation $removePSMTree(C')$ for removing a PSM tree rooted at class $C'$ is an example of a recursive composite operation, i.e. it calls itself if necessary. The operation consists of the following steps:

1. $(\forall R' \in content(C'))\ removePSMTree(child(R'))$;
2. $(\forall A' \in attributes(C'))\ \delta_a(A')$;
3. $(\forall R'_p \in S'_r \text{ s.t. } child(R'_p) = C')\ (\delta_r(R'_p))$;
4. $\delta_c(C')$.

Naturally, we cannot provide the full list of possible composite operations, as the particular set depends on the choice of the vendor of a particular system and the requirements of users. Our aim was

to demonstrate that the proposed mechanism can be used in real-world situations.

8. Related work

The current approaches towards evolution management can be classified according to distinct aspects (Mens and Van Gorp, 2006; Czarnecki and Helsen, 2006). The changes and transformations can be expressed (OMG, 2005; Boronat et al., 2006) as well as divided (Cicchetti et al., 2009) variously too. However, to our knowledge there exists no general framework comparable to our proposal in Section 3; particular cases and views of the problem have previously only been solved separately, superficially and mostly imprecisely without any theoretical or formal basis. In this section we describe the closest and most advanced approaches related to our proposal.

Fig. 10. Current XML evolution approaches.

XML view. We can divide the current approaches to XML schema evolution and change management into several groups as depicted in Fig. 10. Approaches in the first group (a) consider changes at the schema level and differ in the selected XML schema language, i.e. DTD (Al-Jadir and El-Moukaddem, 2003; Coox, 2003) or XML Schema (Tan and Goh, 2005; Cavalieri, 2010). In general, the transformations can be variously classified. For instance, Tan and Goh (2005) propose migratory (e.g. movements of elements/attributes), structural (e.g. adding/removal of elements/attributes) and sedentary (e.g. modifications of simple data types) changes. The changes are expressed variously and more or less formally. For instance, in Cavalieri (2010) a language called XSUpdate is described. The changes are then automatically propagated to the extensional level to ensure validity of XML data. There also exists an opposite approach that enables one to evolve XML documents and propagate the changes to their XML schema (Bouchou et al., 2004).
Approaches in the second (b) and third (c) group are similar, but they consider changes at an abstraction of the logical level, either a visualization (Klettke, 2007) or a kind of UML diagram (Domínguez et al., 2005). Both cases work at the PSM level, since they directly model XML schemas with their abstraction. No PIM schema is considered. All approaches consider only a single separate XML schema being evolved. Another open problem related to schema evolution is adaptation of the respective XML queries, i.e. propagation to the operational level (Fig. 10(d)). Unfortunately, the number of existing works is relatively low. Moro et al. (2007) give recommendations on how to write queries that do not need to be adapted for an evolving schema. On the other hand, in Geneves et al. (2009) the authors consider a subset of XPath 1.0 constructs and study the impact of XML schema changes on them. In all the papers cited the authors consider only a single XML schema. In Passi et al. (2009) multiple local XML schemas are considered and mapped to a global object-oriented schema. Then, the authors discuss possible operations with a local schema and their propagation to the global schema. However, the global schema does not represent a common problem domain, but a common integrated schema; the changes are propagated just upwards and the operations are not defined rigorously. The need for a well-defined set of simple operations and their combination is clearly identified in Section 6 of a recent survey of schema matching and mapping (Bellahsene et al., 2011). Storage view. The idea of evolution and change management in XML storage strategies is currently focused particularly on data updates and is usually joined with XQuery Update Facility (Chamberlin et al., 2007). However, this is not the area we are dealing with, since the updates are mostly considered within the respective XML schema. As depicted in Fig. 
10(e), in the area of evolution of general database schemas we can find approaches that focus on evolution of (object-)relational schemas (Curino et al., 2009, 2008) as well as object-oriented schemas (Banerjee et al., 1987; Lerner, 2000). Similar to the case of XML schema evolution, there are also approaches that deal with propagation from an ER schema, i.e. PSM level, to a relational schema (An et al., 2008b), i.e. schema level (Fig. 10(f)), or propagation to an operational level (Curino et al., 2009) (Fig. 10(g)). In the purely XML-related approaches we need to consider schema-driven storage strategies. As surveyed in Simanovsky (2008), the number of the respective approaches is not high. We can find first attempts at change propagation in the current leading object-relational database management systems Oracle DB, IBM DB2 and Microsoft SQL Server. In this case we can differentiate two types of schema evolution, depending on whether backward compatibility of the changes, i.e. preservation of data validity, is required or not. Both DB2 and SQL Server require backward compatibility. Oracle DB also supports change propagation regardless of backward compatibility; however, it is not done automatically. A data expert must provide an XSLT script which re-validates the stored XML documents. To ease this approach we have recently proposed an algorithm that enables one to provide such a transformation script semi-automatically (Malý et al., 2011).

M. Nečaský et al. / The Journal of Systems and Software 85 (2012) 683–707

Fig. 11. (a) PIM schema modeling the NRPP domain, (b) PIM schema evolved according to new requirements.

Processing view. Since we are considering the area of evolution of XML applications, we cannot omit the most popular application of XML formats, Web Services. Currently we can find several approaches that deal with evolution of Web Services; however, again they solve just part of the issues described (ai Sun et al., 2010). In Sindhgatta and Sengupta (2009) the authors describe a plugin to IBM Rational Software Architect (RSA) which enables semi-automatic propagation of changes from a business process model of Web Services (Fig. 10(h)) to the respective BPEL scripts and thus the respective applications. It is one of the frameworks that are very close to our proposal described in Section 3; however, the authors do not provide any theoretical background on the allowed changes or details on the propagation mechanisms. A different approach (Fig. 10(i)) is used in the system Morpheus (Ravichandar et al., 2008), also based on IBM RSA. At the platform-specific level it considers three UML artifacts (use cases, sequence diagrams and service specifications) and the change propagation among them. The output of the propagation is a set of change suggestions for the respective execution part, which should then be applied manually by an expert. Similarly, in Ryu et al. (2008) (Fig. 10(j)) the authors deal with change propagation of business protocols of Web Services, i.e. a kind of activity diagrams. The output of the system is a set of recommendations detailing when affected parts are replaceable/migrateable and under what circumstances. Again, the migration is expected to be done by a system expert; however, the system advises how to perform it correctly. In Andrikopoulos et al. (2008) the authors solve the problem using a completely different strategy. They provide an abstract service definition model (ASD) which enables us to model all related concepts of a Web Service, i.e. 
data structures, behavior and policies, at a conceptual level using UML class diagrams. Both ASD and the related operations are defined formally, and the completeness and correctness of the operations are proven. On the other hand, change propagation to the respective PSMs is not considered and the ASD itself is relatively unnatural. And, considering even more formal approaches and models, in Aversano et al. (2005) the authors model the Web Services using Formal Concept Analysis and, in particular, lattices, or in Stevens (2010) using lenses and monoids of edits. However, though the approaches are theoretically very interesting, our aim is to provide a less complex and more user-friendly formal background and tools.

9. Case study and evaluation

As has already been mentioned, we have implemented the proposed technique in a tool called exolutio (Klímek et al., 2011). In general, it is a proof-of-concept desktop application for conceptual XML data modeling. It implements the PIM and PSM modeling languages and operations for evolution of the PIM and PSM schemas described in this paper. It provides a designer with a set of operations which are composed of the atomic operations described in Section 5. It implements the propagation mechanism introduced in Section 6. At the highest level, exolutio is based on the well-known Model-View-Controller (MVC) design pattern. Currently, for the purpose of this paper, the atomic operations are implemented in exactly the same way as they are described here. We use the implementation to experimentally demonstrate that the proposed set of atomic operations is complete, i.e. that the atomic operations are sufficient for real-world situations. As we have already discussed in Section 6.4, we do not prove completeness formally in this paper. As to performance and scalability, a single atomic operation on a PIM schema can lead to a large number of operations in each of the affected PSM schemas. 
This number can be reduced by some optimizations, improving both performance and scalability. So far, our implementation is strictly based on our formal model and focuses on the clear demonstration of our ideas. The issues of performance and scalability will be addressed in later stages of development. It is now clear that some complex operations are far more efficient if they are implemented from scratch, rather than by combining the individual atomic operations. This also holds true for some cases of change propagation. Still, it will always be necessary to prove that the optimized version of the operation has the same formal properties as the non-optimized version would have, which is possible again thanks to our formal model. In addition, many operations are in fact interactive. For example, the designer will choose to which PSM schemas a change will be propagated, in which case the actual time spent performing the operation will always be comparatively negligible. In the concluding part of this section we show how the developed technique for designing a family of XML schemas and their evolution was applied to a real-world system. And, finally, we evaluate our technique on the basis of this case study and compare it with other known techniques for XML schema evolution.

9.1. Case study: national register for public procurement

Our case study is the National Register for Public Procurement (NRPP). It is a governmental information system intended for publishing data about public contracts by public authorities in the

Czech Republic. Publishing a contract is only obligatory when the contracted price exceeds a level given by the current legislation; otherwise, it is optional. Authorities send contract information formatted according to one of the 17 XML formats accepted by the NRPP. This includes, e.g., XML formats for contract notifications, supplier selection notifications, etc. Currently, the NRPP only provides textual documentation for the XML formats and a set of sample XML documents. There are neither XML schemas for the XML formats, nor a conceptual schema of the problem domain. Therefore, our first goal was to design not only the XML schemas but also the conceptual schema in the form of a PIM schema, and to derive PSM schemas for the XML formats from the PIM schema. The resulting PIM schema is depicted in Fig. 11(a). Two of the resulting PSM schemas are depicted in Fig. 12(a) and (b). The PSM schemas are mapped to the PIM schema. The mapping is intuitive and we do not describe it here. The PSM schemas were created exactly according to the textual documentation and XML examples. Let us note that the original schemas we created are more extensive. Due to space limitations, we present here only those parts that bear on our work.

Fig. 12. PSM schemas modeling XML formats for (a) sending contract notifications to the NRPP, (b) reporting on contract supplier selection to the NRPP, and (c) representing procurer detail.

The PIM schema contains classes which model public contracts (class Contract) and their procurers and suppliers (class Organization). There are also some additional concepts modeled: prices (class Price) and contact information (class Contact). There are several relationships modeled with associations. A supplier is associated with a contract by the supplied by association. A procurer is associated with a contract by a path of associations has contact and main. 
Each contract has additional contact information where documentation for the contract is provided (association docs) and where bids to the contract are collected (association bids). Finally, there are four different prices: the expected price (association expected), the best offered price (association offered), the price agreed by a selected supplier and procurer (association agreed), and a final real price known after finishing the contract (association final). The PSM schema depicted in Fig. 12(a) models an XML format for notifications about a new public contract. When a public authority issues a new contract, it must send a notification about the contract to the NRPP using this format; it should contain contact information and basic information about the contract. The other PSM schema depicted in Fig. 12(b) models an XML format for notifications about the supplier selected for the contract; it contains the main contract contact, information about the number of offered bids, the selected supplier and the offered and agreed price. The numbers of atomic operations executed to create the PIM and PSM schemas are depicted in Fig. 13(a). It shows that only creation and update operations were used.

Fig. 13. Numbers of atomic operations (α, ν, δ, σ) performed manually by the designer (dark gray) and automatically by the propagation mechanism (light gray), panels (a)–(d).
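To make the tallies of Fig. 13 more concrete, the following minimal Python sketch shows how a composite operation expands into atomic operations that can then be counted by kind. It follows the steps of the composite operations for creating and splitting a PIM attribute described earlier, using the address example. Everything else is an assumption made for illustration: the in-memory model, the names `OperationLog`, `create_pim_attr` and `split_pim_attr`, and the reading of the legend symbols as creation (alpha), update (nu), removal (delta) and synchronization (sigma). It is not the implementation used in exolutio.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality, so removal targets the exact object
class Attribute:
    name: str = "attr"   # default name assigned on creation (assumed value)
    type: str = "string"
    card: str = "1..1"


@dataclass
class PimClass:
    name: str
    attributes: list = field(default_factory=list)


class OperationLog:
    """Counts executed atomic operations by kind (hypothetical helper)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, kind):
        self.counts[kind] += 1


# --- atomic operations (each one is logged exactly once) ---
def create_attr(log, cls):
    attr = Attribute()
    cls.attributes.append(attr)
    log.record("alpha")
    return attr


def update_attr(log, attr, prop, value):
    setattr(attr, prop, value)
    log.record("nu")


def remove_attr(log, cls, attr):
    cls.attributes.remove(attr)
    log.record("delta")


def sync_attrs(log, left, right):
    # The sketch only records that the two sets were declared equivalent.
    log.record("sigma")


# --- composite operations built only from the atomic ones ---
def create_pim_attr(log, cls, name, type_, card):
    attr = create_attr(log, cls)
    update_attr(log, attr, "name", name)
    update_attr(log, attr, "type", type_)
    update_attr(log, attr, "card", card)
    return attr


def split_pim_attr(log, cls, attr, names):
    parts = [create_pim_attr(log, cls, n, attr.type, attr.card) for n in names]
    sync_attrs(log, [attr], parts)
    remove_attr(log, cls, attr)
    return parts


log = OperationLog()
customer = PimClass("Customer")
address = create_pim_attr(log, customer, "address", "string", "1..1")
split_pim_attr(log, customer, address, ["street", "city", "country"])
print(dict(log.counts))
```

In this sketch a single composite call from the designer's point of view produces many atomic operations, which is why counting at the atomic level, as in Fig. 13, gives a finer picture of the work involved than counting composite operations would.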

There were several issues to solve in this case study. First, the NRPP provides only XML formats which are used by public authorities to send data about their contracts to the NRPP. There are no XML formats for providing information back to the public authorities and other users, e.g. procurer or supplier detail. We show how our approach may be used to easily design such XML formats in the form of PSM schemas on the basis of the existing PIM schema. One such PSM schema, which models the XML format for public procurer details, is depicted in Fig. 12(c). The numbers of the atomic operations executed at this step are depicted in Fig. 13(b). Again, only the creation and update operations were performed. Even though the designer needs to design the PSM schemas for the new XML formats manually, the experiment showed that our approach saves him/her a great deal of work and prevents him/her from making unnecessary errors. This is because our technique enables us to create the PSM schemas on the basis of the PIM schema (which is quicker than creating PSM schemas separately) and ensures that the designer creates the PSM schemas coherently with the PIM schema (as it preserves the consistency of the interpretation). The designer need not check whether the PSM schema is semantically correct, or not.

Fig. 14. (a) PSM schema with common components shared between other PSM schemas, (b) evolved PSM schema for reporting on contract supplier selection to the NRPP, (c) evolved PSM schema for representing procurer detail.

Second, as the reader may have noticed, the quality of the XML formats is low. The designers of the XML formats did not follow basic XML design principles (e.g. exploiting the hierarchical nature of XML); for example, contact information is modeled by XML elements with names prefixed with cont, docs, etc. 
It would have been better to remove the prefixes and enclose the semantically related XML elements into separate XML elements (e.g. enclose contact XML elements to XML element contact structured to main, doc, etc. or enclose all information related to the supplier into XML element supplier). We have made these adaptations in the present XML formats. Some PSM schema components also appeared which had the same content and we, therefore, used structural representatives to declare the shared content only once. The numbers of the executed atomic operations are depicted in Fig. 13(c). In this step, synchronization and removal operations were also used, because some of the old parts of the PSM schemas were replaced by new ones. Again, the experiment demonstrated that our approach saves a lot of work as it preserves the consistency of PSM schemas against the PIM schema. If the designer makes a change which affects the PIM schema and, possibly, other PSM schemas, our propagation mechanism will notify him/her. We depict the evolved PSM schema from Fig. 12(b) in Fig. 14(b) (it also includes changes described in the following steps). The other PSM schema was evolved similarly. As the reader may see, contact information is now represented hierarchically. Also, the PSM schema is simplified by using structural representatives referring to shared classes contained in a new separate PSM schema depicted in Fig. 14(a). Third, we implemented various changes which resulted from new requirements on the NRPP functionality and from new legislation. In both cases, changes to the PIM schema needed to be done. The new functionalities required us to model contact persons as a special class instead of attribute contact person. Therefore, we evolved the attribute to a new class Person associated with Contact and with two new attributes first name and surname using our evolution operations. 
The new legislation required reporting not only the number of bids received for each contract, but also the particular bids, including the bidding supplier and the offered price. Therefore, we replaced the attribute number of bids with a new class Bid with several new attributes. We changed the semantics of supplied by and offered

associations by reconnecting them from the Contract class to the new Bid class. Finally, we distinguished the winning bid from the other bids by splitting the association connecting the Bid and Contract classes into two associations offer and win. The evolved PIM schema is depicted in Fig. 11(b). Since the PIM schema changed, the PSM schemas needed to be adapted accordingly. This was ensured by our propagation mechanism. Fig. 14(b) and (c) show how the PSM schemas depicted in Fig. 12(b) and (c), respectively, were automatically adapted by the propagation mechanism. Finally, there was a requirement to update the XML format for contract notifications (Fig. 12(a)) so that it is possible to give notification not only on the expected months and days in which the contract should be finished, but also on the exact date. Therefore, we added a new attribute exp date which can be used equivalently instead of the two present attributes exp months and exp years. This change was correctly propagated to the PIM schema, because it is a conceptual change (see Fig. 11(b)). From here, it was propagated to the other PSM schemas (see Fig. 14(c)). The numbers of the atomic operations executed during the last two steps are depicted in Fig. 13(d). The darker part shows the numbers of manually executed operations. The lighter part shows the numbers of operations executed automatically by the propagation mechanism.

9.2. Evaluation and comparison to other approaches

The following conclusions are drawn from the above case study: All proposed atomic operations are necessary for real-world scenarios, as summarized in Fig. 13. The necessity of the creation and update atomic operations is clear. The case study showed that we also need removal operations, even though we do not want to directly remove parts of data but represent them in more (or less) detailed structures (e.g. splitting attributes). 
For this, we also need synchronization operations. The case study also demonstrates the completeness of the proposed set of atomic operations. Most real-world scenarios we target in our work will be similar to the presented case study (i.e. extending existing schemas with new parts and replacing their existing parts with more (or less) detailed alternatives). For this kind of scenarios our proposed set of operations is complete. On the other hand, there are some limitations. For example, when synchronizing two sets of attributes, we can not exactly specify a function which would transform values between both sets. However, we are not interested in data transformations in this paper but only PIM and PSM schema evolution. The existence of the PIM schema and interpretations of PSM schemas against the PIM schema is beneficial when the designer performs creation and update operations for building new PSM schemas or new parts of existing PSM schemas. Our technique ensures that the designer creates new PSM components consistently with the PIM schema (from the conceptual perspective). This ensures semantical coherence between the modeled family of XML schemas. All XML schemas in the family, even those designed by different developers, are consistent with the PIM schema. The designers need not check this coherence manually which saves them a great deal of work and prevents design errors. Sometimes the designer may want to optimize the structure of an XML schema but avoid changes to the semantics of the XML schema. When the designer works with the PSM schema, our mechanism is able to prevent these changes. It can automatically check whether a change to the PSM schema needs to be propagated to the PIM schema, or not. This also saves the designer a lot of work, because (s)he does not need to check this manually. Finally, when the designer needs to change the PIM schema, our mechanism automatically propagates the changes to the PSM schemas and vice versa. 
Again, this saves work and prevents errors, because the designer does not need to propagate the changes manually. Fig. 13(a)–(d) shows the number of atomic operations performed by the designer in our case study. In comparison to existing approaches to XML schema evolution, our technique saves the designer a great deal of manual work. This is because we consider the PSM schemas interpreted against a single common PIM schema. As we have shown, this saves work and prevents errors when the designer needs to check the semantic consistency of his/her new or evolved part of a PSM schema, and when making changes to the PIM schema or PSM schemas and propagating them to the other schemas. The amount of work saved in comparison to other approaches is demonstrated by Fig. 13. The darker columns show the number of atomic operations performed manually by the designer. These operations are assisted by our technique, which ensures that the consistency between the created XML schemas is preserved. The designer does not need to check consistency manually, which saves a lot of time. This consistency check is not provided by existing approaches, where the designer has to do the check manually. The lighter columns show the number of atomic operations performed automatically by our propagation mechanism. Again, propagation is not supported by existing approaches and these operations would have to be done manually by the designer. We can also see a fundamental problem in the current approaches, because they do not consider synchronization operations or their equivalent. Without this operation a correct propagation between PIM and PSM schemas is not possible. As we have shown, this is necessary in various practical situations when a part of a PIM or PSM schema is split into more detailed parts. It is also useful in extending an existing part with new components, as well as in the reverse process when several parts of a schema are merged together. 
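The role of synchronization in propagation can be illustrated with a minimal Python sketch. It assumes a simplified model in which each PSM attribute carries an optional reference to the PIM attribute it represents (its interpretation), and in which propagating a PIM-level split means replacing every PSM attribute interpreted as the old attribute with PSM attributes interpreted as the new parts. The names `PimAttr`, `PsmAttr`, `PsmSchema` and `propagate_split`, and the two sample schemas, are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class PimAttr:
    name: str


@dataclass(eq=False)
class PsmAttr:
    name: str
    # The PIM attribute this PSM attribute represents;
    # None models an empty interpretation.
    interpretation: Optional[PimAttr] = None


@dataclass
class PsmSchema:
    name: str
    attributes: List[PsmAttr] = field(default_factory=list)


def propagate_split(schemas, old, parts):
    """Replace each PSM attribute interpreted as `old` by attributes
    interpreted as `parts`; return how many PSM attributes were replaced."""
    replaced = 0
    for schema in schemas:
        updated = []
        for attr in schema.attributes:
            if attr.interpretation is old:
                # One new PSM attribute per part, keeping the position.
                updated.extend(PsmAttr(p.name, p) for p in parts)
                replaced += 1
            else:
                # Unrelated attributes, including those with an empty
                # interpretation, are left untouched.
                updated.append(attr)
        schema.attributes = updated
    return replaced


address = PimAttr("address")
full_name = PimAttr("name")
notification = PsmSchema(
    "notification", [PsmAttr("name", full_name), PsmAttr("address", address)]
)
detail = PsmSchema("detail", [PsmAttr("address", address), PsmAttr("note")])

parts = [PimAttr(n) for n in ("street", "city", "country")]
replaced = propagate_split([notification, detail], address, parts)
print(replaced, [a.name for a in detail.attributes])
```

Without the recorded equivalence between the old attribute and its parts, the mechanism would see only an unrelated removal and three unrelated creations, and it could not tell which PSM attributes to replace.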
On the other hand, our approach is more laborious in the initial phases, because the PIM schema and the PSM schemas modeling the XML schemas must be created. This is not the case with the other approaches, which work directly with an XML schema or its direct translation to a conceptual schema. Therefore, the other approaches are more suitable in situations where the designer works only with a single XML schema. When a family of XML schemas needs to be managed, our approach is more beneficial. Finally, let us note that the approach presented deals only with PIM and PSM schemas and propagation of changes between both levels. It does not solve the problem of propagation of changes to the data, i.e. XML documents. As we have shown, this has been solved by other approaches. We have also worked on this problem in our previous work. In Malý et al. (2011) we show how XML documents need to be adapted when a PSM schema which models their XML schema evolves.

10. Conclusions

In this paper we focused on two of the main challenges of Model-Driven Development (France and Rumpe, 2007): evolution and its formal specification. In particular, we were interested in model-driven XML schema evolution and concentrated on the PIM and PSM levels of our previously proposed five-level evolution framework. We defined PIM and PSM schemas for modeling XML schemas formally and extended them with atomic and composite operations for their modification. We then identified a minimal set of atomic operations, proved its correctness and specified the respective mechanism for automatic propagation of changes between the PIM and PSM levels. The formal basis of the operations enables us to ensure that the framework is designed correctly. Next we

introduced an implementation of the framework and demonstrated the advantages of the system in a real-world use case.

Key contributions. If we compare the proposed system with the current approaches, we can identify several key contributions and innovations it brings:

Global view of the evolution problem: As mentioned in Section 8, the existing approaches towards the evolution and change management of XML schemas consider only a single XML schema. Our proposed technique considers a family of XML schemas applied in a system.

Formal basis of the proposal: Similar to the current work being done, we exploit the idea of a platform-independent conceptual schema (PIM schema) of the problem domain which allows for abstraction from technical details of particular XML schemas. We also consider a platform-specific (PSM) schema for each targeted XML schema. Each PSM schema is mapped to the PIM schema and can be automatically translated to an expression in a selected XML schema language such as XSD or RELAX NG. We defined PIM and PSM schemas and mappings between them formally, which enables us to effectively manage the evolution of XML schemas. When a change to the PIM schema is made, we can precisely identify all the parts of XML schemas affected by this modification and, conversely, when a change to the PSM schema is made, we can identify whether the PIM schema is affected and, if so, how.

Hierarchy of operations: Naturally, the idea of change management is based on a set of operations. As we have mentioned, they can be classified variously and current approaches utilize different sets. In our work we defined a set of atomic operations and proved its minimality and correctness. Having this concept, we could restrict ourselves to this set and define the respective change propagation precisely and correctly. 
Last but not least, we showed that using the set of correctly defined atomic operations and the respective change propagation we can define any composite and, hence, more user-friendly operation. The respective change propagation is then defined implicitly and its correctness is ensured as well. A similar system of operations was identified in a recent survey (Bellahsene et al., 2011) (Section 6), but has not yet been researched properly.

Experimental implementation of the proposal: The final contribution of this paper is not only the proposal itself, but also its experimental, open-source implementation exolutio. Even though it currently does not cover all the proposed aspects (it is still under intense development), users may test the key features on their own real-world examples. For instance, it has recently been tested in real-world use cases by the Fraunhofer Institut.

Future work. Since the area of XML evolution is relatively new, the number of current approaches and, consequently, of solved issues is not high; there is a significant number of open problems and future directions we want to focus on. The key areas involve:

Specifics of XML schema languages: In Nečaský and Mlýnková (2010), we show that the introduced PSM is equivalent to the formalism of regular tree grammars (Murata et al., 2005), which are considered a basic formalism of XML schema languages. However, practical XML schema languages such as XML Schema (Thompson et al., 2004) introduce various other concepts (e.g. namespaces, user-defined simple data types, etc.) which we did not consider in this paper. We are continuously extending our implementation with these concepts.

Other conceptual modeling constructs: It is common in practice to use other modeling constructs (e.g. inheritance, n-ary relationships, etc.). In our future work, we will deal with these constructs as well.
Advanced integrity constraints of the XML data: As we have mentioned, we currently do not consider integrity constraints that can be expressed using Schematron (Jelliffe, 2001) or XML Schema assertions (Thompson et al., 2004), and we focus on purely grammar-based languages. Our future step is the incorporation of advanced integrity constraints into the whole framework. In particular, this involves the extension of all levels with advanced constraints, the specification of a respective language for the conceptual model and the extension of the propagation mechanism. This also includes extending the concept of semantic equivalence with the real semantics (i.e. a specification that an address is not only semantically equivalent with street and city but also how). The current approaches (e.g. Xiong et al., 2009) mainly exploit the idea of writing a set of correction rules (mostly using OCL (OCL, 2009) or its extension) that are applied when a particular constraint is violated. Since the language is too general, we will focus on its reasonable subset (Nečaský and Opočenská, 2009; Opočenská and Kopecký, 2008).

Operational and extensional level of the framework: As we have described in Section 3, in this paper we focused on a subpart of the proposed framework, namely data representation. So, naturally, our next work will focus on the parts which were omitted, especially the extensional and operational levels. The extensional level is crucial for the applicability of our solution during system runtime. The operational level is also very important and has been mostly omitted in the current literature.

Modeling of storage strategies: Similar to the previous point, in our future work we plan to focus on other parts of the framework which were omitted. An important aspect that has so far not been much considered is the relation between change propagation and XML storage strategies.
Currently there are approaches that deal with the evolution of database schemas (Curino et al., 2009; Banerjee et al., 1987), but in our case we have to consider a set of applications that form the system, the fact that the XML views of the data can and will overlap, and the exploitation of the relations between components of the framework. At the same time, we want to preserve the optimal storage strategy for a given application (Mlýnková, 2009).

Modeling of business processes: As we have mentioned, in this text we considered only the modeling of XML data processed and exchanged within an XML system. However, not only the data structures, but also the respective business processes need to be designed, maintained and updated within the evolution process. Our other future aim is to extend and combine the conceptual models of XML data with the respective business processes, to preserve their mutual relations and to exploit them during the evolution process (Murzek et al., 2006).

Relation to ontologies: An up-to-date and important aspect of data management is establishing and exploiting the semantics of the data. Undoubtedly, ontologies are currently the most popular tool for this purpose. Since an ontology can be viewed as a particular type of schema which has a strong relationship to a given XML schema, a natural open issue is the development and maintenance of such a relationship under application evolution (Yu and Popa, 2005; An et al., 2008a). However, since ontologies bear a special type of information, their treatment requires specific approaches (Noy and Klein, 2004).

Advanced operations with an XML system: In this paper we described two types of operations that can occur in an XML system: atomic and composite. However, these are not the only operations that can occur within the system. If we consider the area of integration, we need to deal with the problem of a new

incoming application and its integration with the current ones (Nguyen et al., 2011), or even the integration of a whole XML system. This wide area involves issues such as reverse-engineering of conceptual models and schema matching (Wojnar et al., 2010; Klímek and Nečaský, 2010; Tekli et al., 2009), similarity evaluation (Wojnar et al., 2010), etc.

Full implementation of the proposal and efficiency: Finally, we intend to gradually extend the implemented framework with the proposed improvements. Our aim is to provide a tool that is not only an experimental prototype, but can be applied in complex real-world use cases. Naturally, since our aim is fully applicable software, we will need to deal with aspects like benchmarking (Alexe et al., 2008) and optimization (Langlois et al., 2006).

Acknowledgements

This work was partially supported by the Czech Science Foundation (GAČR), grant numbers P202/11/P455, 201/09/P364 and P202/10/0573.

References

ai Sun, C., Rossing, R., Sinnema, M., Bulanov, P., Aiello, M., Modeling and managing the variability of web service-based systems. J. Syst. Software 83 (3), Alexe, B., Tan, W.-C., Velegrakis, Y., STBenchmark: towards a Benchmark for Mapping Systems. Proc. VLDB Endow. 1 (1), Al-Jadir, L., El-Moukaddem, F., Once upon a time a DTD evolved into another DTD.... In: Object-Oriented Information Systems. Springer, Berlin, Heidelberg, pp An, Y., Borgidaa, A., Mylopoulos, J., 2008a. Discovering and maintaining semantic mappings between XML schemas and ontologies. J. Comput. Sci. Eng. 2 (1), An, Y., Hu, X., Song, I.-Y., 2008b. Round-trip engineering for maintaining conceptual relational mappings. In: CAiSE 08: Proc. of the 20th Int. Conf. on Advanced Information Systems Engineering, Springer-Verlag, Berlin, Heidelberg, pp Andrikopoulos, V., Benbernou, S., Papazoglou, M.P., Managing the evolution of service specifications. In: CAiSE 08: Proc. of the 20th Int. Conf.
on Advanced Information Systems Engineering, Springer-Verlag, Berlin, Heidelberg, pp Aversano, L., Bruno, M., Penta, M.D., Falanga, A., Scognamiglio, R., Visualizing the evolution of web services using formal concept analysis. In: IWPSE 05: 8th Int. Workshop on Principles of Software Evolution, pp Banerjee, J., Kim, W., Kim, H.-J., Korth, H.F., Semantics and implementation of schema evolution in object-oriented databases. SIGMOD Rec. 16 (3), Bellahsene, Z., Bonifati, A., Rahm, E., Schema Matching and Mapping, Data-Centric Systems and Applications. Springer, Berlin, Heidelberg, Berlin, Heidelberg. Boag, S., Chamberlin, D., Fernández, M.F., Florescu, D., Robie, J., Siméon, J., XQuery 1.0: An XML Query Language. W3C, URL: Boronat, A., Carsí, J.A., Ramos, I., Algebraic specification of a model transformation engine. In: FASE 06: Proc. of the 9th Int. Conf. Fundamental Approaches to Software Engineering, Vienna, Austria, vol. 3922, LNCS, Springer, pp Bouchou, B., Duarte, D., Alves, M.H.F., Laurent, D., Musicante, M.A., Schema evolution for XML: a consistency-preserving approach. In: Mathematical Foundations of Computer Science, Springer-Verlag, Prague, Czech Republic, pp Bray, T., Paoli, J., Sperberg-McQueen, C.M., Maler, E., Yergeau, F., Extensible Markup Language (XML) 1.0, fifth ed. W3C, URL: Cavalieri, F., EXup: an Engine for the Evolution of XML Schemas and Associated Documents. In: EDBT 10: Proc. of the 2010 EDBT/ICDT Workshops, ACM, New York, NY, USA, pp Chamberlin, D., Florescu, D., Melton, J., Robie, J., Siméon, J., XQuery Update Facility 1.0. W3C, URL: Chen, P., Entity-relationship modeling: historical events future trends, and lessons learned. In: Software Pioneers: Contributions to Software Engineering, Springer, New York, NY, USA, pp Cicchetti, A., Ruscio, D.D., Pierantonio, A., Managing dependent changes in coupled evolution. In: Proc. of the 2nd Int. Conf. on Model Transformations, ICMT 2009, Zurich, Switzerland, vol. 
5563, LNCS, Springer, pp Coox, S.V., Axiomatization of the evolution of XML database schema. Program. Comput. Softw. 29 (3), Curino, C.A., Moon, H.J., Zaniolo, C., Graceful database schema evolution: the PRISM workbench. Proc. VLDB Endow. 1 (1), Curino, C., Moon, H.J., Zaniolo, C., Automating database schema evolution in information system upgrades. In: HotSWUp 09: Proc. of the 2nd Int. Workshop on Hot Topics in Software Upgrades, ACM, New York, NY, USA, pp Czarnecki, K., Helsen, S., Feature-based survey of model transformation approaches. IBM Syst. J. 45 (3), Domínguez, E., Lloret, J., Rubio, A.L., Zapata, M.A., Evolving XML schemas and documents using UML class diagrams. In: DEXA 05: Proc. of the 16th Int. Conf. on Database and Expert Systems Applications, vol. 3588, LNCS, Springer, pp France, R., Rumpe, B., Model-Driven Development of complex software: a research roadmap. In: FOSE 07: 2007 Future of Software Engineering, IEEE Computer Society, Washington, DC, USA, pp Geneves, P., Layaida, N., Quint, V., Identifying query incompatibilities with evolving XML schemas. In: ICFP 09: Proc. of the 14th ACM SIG- PLAN Int. Conf. on Functional Programming, ACM, New York, NY, USA, pp Hartung, M., Terwilliger, J., Rahm, E., Recent advances in schema and ontology evolution. In: Bellahsene, Z., Bonifati, A., Rahm, E. (Eds.), Schema Matching and Mapping, Data-Centric Systems and Applications. Springer, Berlin, Heidelberg, pp (doi: / ). ISO/IEC :2003 Part 14: XML-Related Specifications (SQL/XML), Int. Organization for Standardization, Jelliffe, R., The Schematron An XML Structure Validation Language using Patterns in Trees, ISO/IEC URL: Klímek, J., Nečaský, M., Integrating XML schemas for evolution of web services. In: ICWS 2010: Proc. of The 8th Int. Conf. on Web Services, IEEE Computer Society, Miami, FL, USA, pp Klímek, J., Malý, J., Nečaský, M., exolutio A Tool for XML Data Evolution. 
URL: Klettke, M., Conceptual XML schema evolution the CoDEX approach for design and redesign. In: BTW 07, Aachen, Germany, pp Langlois, B., Exertier, D., Bonnet, S., Performance improvement of MDD tools. In: EDOCW 06: Proc. of the 10th IEEE on Int. Enterprise Distributed Object Computing Conf. Workshops, IEEE Computer Society, Washington, DC, USA, p. 19. Lerner, B.S., A model for compound type changes encountered in schema evolution. ACM Trans. Database Syst. 25 (1), Malý, J., Mlýnková, I., Nečaský, M., XML data transformations as schema evolves. In: ADBIS 11: Proc. of the 15th Advances in Databases and Information Systems, Springer-Verlag, Vienna, Austria. Mens, T., Van Gorp, P., A taxonomy of model transformation. Electron. Notes Theor. Comput. Sci. 152, Miller, J., Mukerji, J., MDA Guide Version 1.0.1, Object Management Group. URL: Mlýnková, I., Adaptive XML-to-relational storage strategies. In: Handbook of Research on Innovations in Database Technologies and Applications: Current and Future Trends. Idea Group Publishing, pp (February). Moro, M.M., Malaika, S., Lim, L., Preserving XML queries during schema evolution. In: WWW 07: Proc. of the 16th Int. Conf. on World Wide Web, ACM, New York, NY, USA, pp Murata, M., Lee, D., Mani, M., Kawaguchi, K., Taxonomy of XML schema languages using formal language theory. ACM Trans. Internet Tech. 5 (4), Murata, M., RELAX (Regular Language Description for XML), ISO/IEC DTR URL: Murzek, M., Kramler, G., Michlmayr, E., Structural patterns for the transformation of business process models. In: EDOCW 06: Proc. of the 10th IEEE on Int. Enterprise Distributed Object Computing Conf. Workshops, IEEE Computer Society, Washington, DC, USA, p. 18. Nečaský, M., Mlýnková, I., 2009a. On different perspectives of XML schema evolution. In: FlexDBIST 09: Proc. of the 5th Int. Workshop on Flexible Database and Information System Technology, IEEE Computer Society, Linz, Austria. Nečaský, M., Mlýnková, I., 2009b. 
Five-level multi-application schema evolution. In: DATESO 09: Proc. of the Databases, Texts, Specifications, and Objects, Matfyz Press, April, pp Nečaský, M., Mlýnková, I., When conceptual model meets grammar: a formal approach to semi-structured data modeling. In: Chen, L., Triantafillou, P., Suel, T. (Eds.), WISE 10: Web Information Systems Engineering, vol. 6488, LNCS. Springer, Berlin, Heidelberg, pp Nečaský, M., Mlýnková, I., A framework for efficient design maintaining, and evolution of a system of XML Applications. In: DATESO 10: Proc. of the Databases, Texts, Specifications, and Objects, MatfyzPress, April, pp Nečaský, M., Opočenská, K., Designing and maintaining XML integrity constraints. In: MoViX 09: Proc. of the 1st Int. Workshop on Modelling and Visualization of XML and Semantic Web Data, IEEE Computer Society, Linz, Austria. Nečaský, M., Klímek, J., Kopenec, L., Kučerová, L., Malý, J., Opočenská, K., XCase A Tool for XML Data Modeling. URL: Nečaský, M., Conceptual modeling for XML. Dissertations in Database and Information Systems, vol. 99. IOS Press, Amsterdam, The Netherlands. Nguyen, H.-Q., Taniar, D., Rahayu, J.W., Nguyen, K., Double-layered schema integration of heterogeneous XML sources. J. Syst. Software 84 (1), Noy, N.F., Klein, M., Ontology evolution: not the same as schema evolution. Knowl. Inf. Syst. 6 (4),

Object Management Group, 2007a. UML Infrastructure Specification 2.1.2 (Nov 2007). URL: http://www.omg.org/spec/uml/2.1.2/infrastructure/pdf/. Object Management Group, 2007b. UML Superstructure Specification (Nov 2007). URL: Object Constraint Language Specification, version 2.0, OMG, 2009. URL: http://www.omg.org/technology/documents/formal/ocl.htm. OMG, 2005. MOF QVT Final Adopted Specification, Object Modeling Group (June 2005). URL: qvt final.pdf. Opočenská, K., Kopecký, M., Incox a Language for XML Integrity Constraints Description. In: DATESO 08: Proc. of the Databases, Texts, Specifications, and Objects, pp. 1-12, http://ceur-ws.org. Park, C.-S., Park, S., 2008. Efficient execution of composite web services exchanging intensional data. Inform. Sci. 178 (2), Passi, K., Morgan, D., Madria, S., Maintaining integrated XML schema. In: IDEAS 09: Proc. of the 2009 Int. Database Engineering, Applications Symp., ACM, New York, NY, USA, pp Ravichandar, R., Narendra, N.C., Ponnalagu, K., Gangopadhyay, D., Morpheus: semantics-based incremental change propagation in SOA-based solutions. IEEE Int. Conf. on Services Computing 1, Ryu, S.H., Casati, F., Skogsrud, H., Benatallah, B., Saint-Paul, R., Supporting the dynamic evolution of web service protocols in Service-Oriented Architectures. ACM Trans. Web 2 (2), Simanovsky, A.A., Data schema evolution support in XML-relational database systems. Program. Comput. Softw. 34 (1), Sindhgatta, R., Sengupta, B., An extensible framework for tracing model evolution in SOA solution design. In: OOPSLA 09: Proc. of the 24th ACM SIGPLAN Conf. Companion on Object Oriented Programming Systems Languages and Applications, ACM, New York, NY, USA, pp Stevens, P., Bidirectional model transformations in QVT: semantic issues and open questions. Soft. Syst. Model. 9 (1), Tan, M., Goh, A., Keeping pace with evolving XML-based specifications. In: EDBT 04 Workshops, Springer, Berlin, Heidelberg, pp Tekli, J., Chbeir, R., Yetongnon, K., Extensible user-based XML grammar matching. In: Proc. of the 28th Int. Conf.
on Conceptual Modeling, ER 09, Springer-Verlag, Berlin, Heidelberg, pp Thompson, H.S., Beech, D., Maloney, M., Mendelsohn, N., XML Schema Part 1: Structures, second ed. W3C, URL: Wojnar, A., Mlýnková, I., Dokulil, J., Structural and semantic aspects of similarity of document type definitions and XML schemas. Inform. Sci. 180 (10), (Special Issue on Intelligent Distributed Information Systems). Web Services Business Process Execution Language (WSBPEL) TC, OASIS, URL: home.php%3fwg abbrev=wsbpel. Xiong, Y., Hu, Z., Zhao, H., Song, H., Takeichi, M., Mei, H., Supporting automatic model inconsistency fixing. In: ESEC/FSE 09: Proc. of the 7th Joint Meeting of the European Software Engineering Conf. and the ACM SIGSOFT Symp, ACM, New York, NY, USA, pp MOF 2.0/XMI Mapping Specification, v2.1.1, OMG, URL: Yu, C., Popa, L., Semantic adaptation of schema mappings when schemas evolve. In: VLDB 05: Proc. of the 31st Int. Conf. on Very Large Data Bases, VLDB Endowment, pp

Martin Nečaský received his Ph.D. degree in Computer Science in 2008 from the Charles University in Prague, Czech Republic, where he currently works at the Department of Software Engineering as an assistant professor. He is an external member of the Department of Computer Science and Engineering of the Faculty of Electrical Engineering, Czech Technical University in Prague. His research areas involve XML data design, integration and evolution. He is an organizer/PC chair/member of more than 10 international events. He has published more than 30 papers (two received a Best Paper Award). He has published 3 book chapters and a book.

Jakub Klímek received his Master's degree in Computer Science in September 2009 from the Charles University in Prague, Czech Republic, where he is currently a Ph.D. student at the Department of Software Engineering. His research areas involve XML data design, integration and evolution. He has published 12 refereed conference papers. He is a co-organizer of 1 local workshop.
Jakub Malý received his Master's degree in Computer Science in June 2010 from the Charles University in Prague, Czech Republic, where he is currently a Ph.D. student at the Department of Software Engineering. His research areas involve conceptual modeling of XML data and evolution of XML applications. He has published 7 refereed conference papers.

Irena Mlýnková received her Ph.D. degree in Computer Science in 2007 from the Charles University in Prague, Czech Republic. She is an assistant professor there and an external member of the Department of Computer Science and Engineering of the Czech Technical University. She has published more than 40 publications (16 recorded in WoS), 4 of which received Best Paper Awards. Her research areas involve management of XML data, structural similarity, analysis of real-world data, synthesis of XML data, XML benchmarking, XML schema inference and application evolution. She is a PC member/reviewer of 14 international events and a co-organizer of 3 international workshops.


Chapter 6

Formal Evolution of XML Schemas with Inheritance

Jakub Klímek, Martin Nečaský

Published in the Proceedings of 2012 IEEE 19th International Conference of Web Services, Honolulu, HI, July IEEE, ISBN Pages


2012 IEEE 19th International Conference on Web Services

Formal Evolution of XML Schemas with Inheritance

Jakub Klímek, Martin Nečaský
XML and Web Engineering Research Group (XRG)
Charles University in Prague
Prague, Czech Republic

Abstract: Today, web services are widely used for data exchange. The format of individual messages exchanged among them is usually described with XML schemas in WSDL documents. It is common practice that there is not just one format but a whole family of formats, each tailored, for example, to a specific consumer. In such environments, the design and maintenance of the web service interfaces and, in particular, of the XML schemas describing the structure of messages is not a simple task. In our previous work we developed a method based on the principles of Model-Driven Development for the evolution of a family of XML schemas. It automated a portion of the design and maintenance tasks to be done when a change in user requirements or in the surrounding environment of the system influences several XML schemas in the family. We provided a formal model of possible evolution changes and their propagation mechanism. In this paper, we extend our method with inheritance modeling. We formally extend our conceptual model, introduce new evolution changes and update the current ones so that they keep the model in a consistent state.

Index Terms: XML schema modeling, model driven architecture, XML schema evolution, propagation of changes, inheritance

I. INTRODUCTION

In the world of web services (and beyond) XML is used to exchange data as XML messages. Their formats are usually described using XML schemas encompassed in WSDL descriptions. In a typical system there is a set of different XML schemas (e.g. one for each service provider or consumer). However, their data domain is often similar. This is what we call a family of XML schemas. With a system which uses a family of XML schemas we face the problem of XML schema evolution.
The XML schemas need to be evolved whenever the user requirements or the surrounding environment change, and a single change may influence multiple schemas. Without a proper technique, we have to identify the XML schemas affected by the change manually and ensure that they are evolved coherently with each other. In our previous work [1] we tackled this problem. We proposed a technique based on Model-Driven Development (MDD). We model XML schemas at two levels: the platform-independent model (PIM) and the platform-specific model (PSM). The application data domain is modeled in the PIM schema and each XML format is designed as a PSM schema which is mapped to the PIM schema on a detailed level (individual classes and attributes). This provides the means for convenient evolution management, as a change is expressed in the PIM schema or in one of the PSM schemas and the mappings allow us to propagate the change between the PIM and PSM levels semi-automatically. In this paper, we extend our approach with modeling and evolving inheritance on both levels and we bridge the gap between the conceptual modeling world (conceptual inheritance) and the schema design world (structural inheritance).

Motivation. Let us briefly discuss the two inheritance types. Structural inheritance means that we want to reuse a part of a schema for different concepts. For example, we can have an address containing street name and country attributes. We want to use these, among others, within a description of a customer and within a description of a letter. A customer is in no conceptual relationship to a letter, except that they both have an address. With conceptual inheritance, the child inherits all characteristics of the parent too, but there is also a conceptual relationship. An example from biology: as a parent, we have a mammal, and as its children we can have a cat and a human. In this type of inheritance, being an instance of the child also implies being an instance of the parent.
This is in contrast with structural inheritance, where being a child does not imply being a parent. In conceptual modeling languages like UML we find support for conceptual inheritance, and in data modeling languages like XML Schema we find structural inheritance. Because our model's goal is to bridge the gap, we support both types of inheritance in an intuitive manner. The reason for bridging the gap is that we use our model as a part of a larger framework [2] incorporating more than just XML as a target platform, and we chose UML class diagrams as a universal platform-independent modeling language. On the platform-specific level, we need to be able to model whatever the target language offers, and in XML and the XML Schema language it is type extension (structural inheritance), in contrast to the conceptual inheritance present in UML class diagrams. For the model to be usable, it needs well-defined operations that ensure the consistency of the model during its evolution over time.

Outline. In Section II we define our conceptual model with inheritance. In Section III we complement the model with a set of operations for inheritance management and in Section IV we describe their propagation. Section V describes our implementation of the approach. Section VI contains evaluation. Section VII contains a brief survey of related work. Section VIII concludes.
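The difference between the two inheritance types discussed in the Motivation can be illustrated with a minimal Python sketch. The concept names follow the examples in the text (address, customer, letter, mammal, cat, human); the field layout and the helper `is_substitutable` are assumptions made purely for illustration and are not part of the formal model.

```python
from dataclasses import dataclass


# Structural inheritance: Customer and Letter both reuse the content of
# Address, but neither of them "is an" Address (hypothetical field layout).
@dataclass
class Address:
    street: str
    country: str


@dataclass
class Customer:
    name: str
    address: Address  # content reuse only


@dataclass
class Letter:
    subject: str
    address: Address  # same content reused by an unrelated concept


# Conceptual inheritance: every Cat and every Human is also a Mammal.
class Mammal:
    pass


class Cat(Mammal):
    pass


class Human(Mammal):
    pass


def is_substitutable(instance: object, expected: type) -> bool:
    """Return True if `instance` may be used where `expected` is required."""
    return isinstance(instance, expected)


home = Address(street="Malostranské náměstí 25", country="Czech Republic")
customer = Customer(name="Jakub", address=home)

conceptual_ok = is_substitutable(Cat(), Mammal)      # child implies parent
structural_ok = is_substitutable(customer, Address)  # reuse does not
```

In this sketch a cat can stand in for a mammal, whereas a customer cannot stand in for an address even though it carries the same street and country content.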

II. MODELING FOR XML EVOLUTION

In this section, we briefly introduce our conceptual model for XML (see [3] for full description) and its inheritance extension. As said before, we use two levels, PIM and PSM.

A. Platform-Independent Model

A PIM schema $\mathcal{S}$ is based on UML class diagrams and models real-world concepts and relationships between them. It contains three types of components: classes, attributes and associations with the usual semantics. A sample PIM schema is in Figure 1(a). Formally, we define it in Definition 1.

Definition 1. A platform-independent (PIM) schema is a triple $\mathcal{S} = (\mathcal{S}_c, \mathcal{S}_a, \mathcal{S}_r)$ of disjoint sets of classes, attributes, and associations, respectively.

Class $C \in \mathcal{S}_c$ has a name assigned by function $\mathit{name}$; function $\mathit{isa}$ assigns a parent class to a child class. Functions $\mathit{abstract}$ and $\mathit{final}$ determine whether the class can have instances in data and whether this class can be inherited from, respectively.

Attribute $A \in \mathcal{S}_a$ has a name, data type and cardinality assigned by functions $\mathit{name}$, $\mathit{type}$, and $\mathit{card}$, respectively. Moreover, $A$ is associated with a class from $\mathcal{S}_c$ by function $\mathit{class}$.

Association $R \in \mathcal{S}_r$ is a set $R = \{E_1, E_2\}$, where $E_1$ and $E_2$ are called association ends of $R$. $R$ has a name assigned by function $\mathit{name}$. Both $E_1$ and $E_2$ have a cardinality assigned by function $\mathit{card}$ and are associated with a class from $\mathcal{S}_c$ by function $\mathit{participant}$. We will call $\mathit{participant}(E_1)$ and $\mathit{participant}(E_2)$ participants of $R$. $\mathit{name}(R)$ may be undefined, denoted by $\mathit{name}(R) = \lambda$.

For a class $C \in \mathcal{S}_c$, we will use $\mathit{attributes}(C)$ to denote the set of all attributes of $C$. Similarly, $\mathit{associations}(C)$ will denote the set of all associations with $C$ as a participant.

Definition 2. Let $C \in \mathcal{S}_c$ and let $\mathit{isa}^*(\lambda) = \{\}$ and $\mathit{isa}^*(C) = \{\mathit{isa}(C)\} \cup \mathit{isa}^*(\mathit{isa}(C))$ where $C \neq \lambda$. It must hold that $C \notin \mathit{isa}^*(C)$ (i.e. the relation does not form a cycle).

B. Platform-Specific Model

The platform-specific model (PSM) specifies how a part of the reality is represented in XML.
Its advantage is that the designer works in a UML-style way, which is more comfortable than editing the XML schema. Formally, a PSM schema is defined by Definition 3. An example is in Figure 1(b). Note that a PSM schema can be automatically translated to a target XML schema language (see [3]). In Figure 1(c) there is a sample XML document with its structure modeled by the PSM schema.

Definition 3. A platform-specific (PSM) schema is a tuple $\mathcal{S}' = (\mathcal{S}'_c, \mathcal{S}'_a, \mathcal{S}'_r, \mathcal{S}'_m, C'_{\mathcal{S}'})$ of disjoint sets of classes, attributes, associations, and content models, respectively, and one specific class $C'_{\mathcal{S}'} \in \mathcal{S}'_c$ called schema class.

Class $C' \in \mathcal{S}'_c$ has a name assigned by function $\mathit{name}$, a conceptual inheritance parent assigned by function $\mathit{isa}$, and can be marked as final or abstract.

Attribute $A' \in \mathcal{S}'_a$ has a name, data type, cardinality and XML form assigned by functions $\mathit{name}$, $\mathit{type}$, $\mathit{card}$ and $\mathit{xform}$, respectively, where $\mathit{xform}(A') \in \{e, a\}$. Moreover, it is associated with a class from $\mathcal{S}'_c$ by function $\mathit{class}$ and has a position assigned by function $\mathit{position}$ within all the attributes associated with $\mathit{class}(A')$.

Association $R' \in \mathcal{S}'_r$ is a pair $R' = (E'_1, E'_2)$, where $E'_1$ and $E'_2$ are called association ends of $R'$. Both $E'_1$ and $E'_2$ have a cardinality assigned by function $\mathit{card}$ and each is associated with a class from $\mathcal{S}'_c$ or a content model from $\mathcal{S}'_m$ assigned by function $\mathit{participant}$. We will call $\mathit{participant}(E'_1)$ and $\mathit{participant}(E'_2)$ parent and child and will denote them by $\mathit{parent}(R')$ and $\mathit{child}(R')$, respectively. Moreover, $R'$ has a name assigned by function $\mathit{name}$ and has a position assigned by function $\mathit{position}$ within all the associations with the same $\mathit{parent}(R')$. $\mathit{name}(R')$ may be undefined, denoted by $\mathit{name}(R') = \lambda$.

Content model $M' \in \mathcal{S}'_m$ has a content model type assigned by function $\mathit{cmtype}$, where $\mathit{cmtype}(M') \in \{\mathit{sequence}, \mathit{choice}, \mathit{set}\}$.

The graph $(\mathcal{S}'_c \cup \mathcal{S}'_m, \mathcal{S}'_r)$ must be a forest of rooted trees with one of its trees rooted in $C'_{\mathcal{S}'}$. For $C' \in \mathcal{S}'_c$, $\mathit{attributes}(C')$ will denote the sequence of all attributes of $C'$ ordered by position, i.e. $\mathit{attributes}(C') = (A'_i \in \mathcal{S}'_a : \mathit{class}(A'_i) = C' \wedge i = \mathit{position}(A'_i))$. Similarly, $\mathit{content}(C')$ will denote the sequence of all associations with $C'$ as a parent ordered by position, i.e. $\mathit{content}(C') = (R'_i \in \mathcal{S}'_r : \mathit{parent}(R'_i) = C' \wedge i = \mathit{position}(R'_i))$. We will call $\mathit{content}(C')$ the content of $C'$.

For structural inheritance, we introduce a structural representative ($\mathit{repr}$) function. This function specifies that a PSM class $C'$ has a reference to another PSM class $C'_r$. It means that the complex content modeled by $C'$ contains all the attributes and content of $C'$ and all the attributes and content of $C'_r$. However, because this is structural inheritance, $C'$ cannot be used where $C'_r$ is used. In our visualization we write the name of the referenced class on top of the name of the referencing class, which is blue.

Because of the nature of structural inheritance, which does not allow the usage of a child where its parent is used, we need another construct for expressing conceptual inheritance in a PSM schema. For that, we introduce an $\mathit{isa}$ function. This function specifies that a PSM class $C'$ is a conceptual inheritance child of another PSM class $C'_p$. This also means that the complex content modeled by $C'$ contains all the attributes and content of $C'$ and all the attributes and content of its parent, $C'_p$, but in addition, it means that wherever the content modeled by $C'_p$ is used, the content modeled by $C'$ can also be used. In our visualization we use the usual inheritance arrow known from UML.

The following definition ensures that neither the structural nor the conceptual inheritance relation forms a cycle (separately and combined). Let us demonstrate this on a short example. Let $C', D' \in \mathcal{S}'_c$ and $\mathit{isa}(C') = D' \wedge \mathit{repr}(D') = C'$. This would mean that $D'$ contains everything that $C'$ does and vice versa infinitely, because both types of inheritance mean content reuse, only with different semantics. We need to avoid that.
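The problematic situation from the example above can be detected mechanically. The following Python sketch is only an illustration under stated assumptions: it stores isa and repr as dictionaries from a class name to its parent or referenced class (a missing key plays the role of λ) and follows both kinds of edges transitively. The names `reachable` and `has_cycle` are hypothetical and do not come from the paper or its implementation.

```python
from typing import Dict, Set

Relation = Dict[str, str]  # class name -> parent / referenced class name


def reachable(start: str, *relations: Relation) -> Set[str]:
    """Collect every class reachable from `start` via the given relations."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        current = stack.pop()
        for relation in relations:
            target = relation.get(current)  # None stands for lambda
            if target is not None and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def has_cycle(cls: str, *relations: Relation) -> bool:
    """True if `cls` can reach itself, i.e. content reuse never terminates."""
    return cls in reachable(cls, *relations)


# The example from the text: isa(C') = D' and repr(D') = C'.
isa = {"C": "D"}
repr_ = {"D": "C"}

separately_fine = not has_cycle("C", isa) and not has_cycle("D", repr_)
combined_cycle = has_cycle("C", isa, repr_)
```

In this sketch each relation is acyclic when checked on its own, while the combined check reports the cycle, which is why the two relations have to be constrained together.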

[Figure 1, panels (a) PIM schema, (b) PSM schema and (c) sample XML document; the class diagrams of panels (a) and (b) are not reproduced here. The XML document of panel (c) reads:]

<purchaserq version="1.1.4.">
  <cust>
    <name>Jakub Klímek</name>
    <addr>
      <street>Malostranské náměstí 25</street>
      <city>Prague</city>
      <country>Czech Republic</country>
      <gps> , </gps>
    </addr>
  </cust>
  <items>
    <item tester="no">
      <code>it1234</code>
      <title>sample 4</title>
      <price>20 </price>
      <amount>4</amount>
    </item>
  </items>
</purchaserq>

Fig. 1. Example of a PIM and a PSM schema with inheritance and a sample corresponding XML document

Definition 4. Let $C', D' \in \mathcal{S}'_c$ and let $\mathit{repr}^*(\lambda) = \{\}$ and $\mathit{repr}^*(C') = \{\mathit{repr}(C')\} \cup \mathit{repr}^*(\mathit{repr}(C'))$ where $C' \neq \lambda$. It must hold that $C' \notin \mathit{repr}^*(C')$. Let $\mathit{isa}^*(\lambda) = \{\}$ and $\mathit{isa}^*(C') = \{\mathit{isa}(C')\} \cup \mathit{isa}^*(\mathit{isa}(C'))$ where $C' \neq \lambda$. It must hold that $C' \notin \mathit{isa}^*(C')$. In addition, $\mathit{repr}$ and $\mathit{isa}$ must not form a cycle when combined (e.g. $\mathit{isa}(C') = D' \wedge \mathit{repr}(D') = C'$ in the simplest case). Formally, let $\mathit{isarepr}^*(\lambda) = \{\}$ and $\mathit{isarepr}^*(C') = \{\mathit{isa}(C')\} \cup \mathit{isa}^*(\mathit{isa}(C')) \cup \{\mathit{repr}(C')\} \cup \mathit{repr}^*(\mathit{repr}(C'))$ where $C' \neq \lambda$. It must hold that $C' \notin \mathit{isarepr}^*(C')$.

C. Interpretation of PSM schema against PIM schema

A PSM schema represents a part of a PIM schema. A class, attribute or association in the PSM schema may be mapped to a class, attribute or association in the PIM schema. The mapping specifies the semantics of classes, attributes and associations of the PSM schema in terms of the PIM schema. The mapping must meet certain conditions to ensure consistency between PIM schemas and the specified semantics of the PSM schema.
This mapping then allows for interesting use cases for the conceptual model, such as XML schema evolution and integration [4], [5], [1]. In addition, we will use $\overrightarrow{S_r}$ to denote the set of all ordered images of associations of $S$. These are normal PIM associations, but for the interpretation we specify the direction in which they are used. An arbitrary interpretation of a PSM component would lead to inconsistencies between the semantics of the PIM schema and the semantics of the PSM schema. Before we introduce the rules that prevent those inconsistencies, let us define the notion of interpreted context of a PSM component, which we need for the rules. The PSM schema can contain some uninterpreted classes, attributes and associations. This means that those have no meaning on the platform-independent level, but they are used in the XML format. Definition 5 says that if a PSM component is a class and has an interpretation, then it is its own interpreted context. The other components exist in the semantic context of the nearest ancestor class which has an interpretation (a semantic equivalent on the PIM level), and that class is their interpreted context.

Definition 5. An interpretation of a PSM schema $S'$ against a PIM schema $S$ is a partial function $I : S'_c \cup S'_a \cup S'_r \rightarrow S_c \cup S_a \cup \overrightarrow{S_r}$. For $X' \in S'_c \cup S'_a \cup S'_r$, we call $I(X')$ the interpretation of $X'$. The interpreted context of $X'$ with respect to $I$ is denoted $intcontext(X')$, and
$intcontext(X') = X'$ when $X' \in S'_c$ and $I(X') \neq \lambda$,
$intcontext(X') = C'$ when $X' \notin S'_c$ or $I(X') = \lambda$, where $C'$ is the closest ancestor class to $X'$ s.t. $I(C') \neq \lambda$.

Definition 6. Let $I$ be an interpretation of a PSM schema $S'$ against a PIM schema $S$. Let $C_1 \preceq C_2$ be a shortcut for $C_1 = C_2$ or $C_1$ is an ancestor of $C_2$ regarding inheritance in a PIM schema. We say that $I$ is consistent if the following rules are satisfied:
(1) $(\forall A' \in S'_a$ s.t. $I(A') \neq \lambda)(intcontext(A') \neq \lambda \wedge class(I(A')) \preceq I(intcontext(A')))$
(2) $(\forall R' \in S'_r$ s.t. $I(child(R')) \neq \lambda \wedge intcontext(R') \neq \lambda)(I(R') = (E_1, E_2)$ s.t. $participant(E_1) \preceq I(intcontext(R')) \wedge participant(E_2) \preceq I(child(R')))$
(3) $(\forall R' \in S'_r$ s.t. $I(child(R')) = \lambda \vee I(intcontext(R')) = \lambda)(I(R') = \lambda)$
(4) $(\forall C' \in S'_c$ s.t. $repr(C') \neq \lambda \wedge I(C') \neq \lambda)(I(C') \preceq I(repr(C')))$
(5) $(\forall C' \in S'_c$ s.t. $isa(C') \neq \lambda)(I(isa(C')) \preceq I(C'))$
(6) $(\forall C' \in S'_c$ s.t. $abstract(C') = true)(abstract(I(C')) = true)$
(7) $(\forall C' \in S'_c$ s.t. $final(C') = true)(final(I(C')) = true)$

Rule 1 says that the interpretation (a PIM class) of the interpreted context (which is a PSM class) of each attribute $A'$ that has an interpretation is the same as, or less general (in the conceptual inheritance sense) than, the class (a PIM class) of the attribute's interpretation (a PIM attribute). The second rule is similar,

but for associations. Rule 3 says that if an association does not have an interpreted context or its child is not interpreted, it is not interpreted either. Note that in the context of mapping a PSM schema to a PIM schema (interpretation construction), the content models present in a PSM schema are irrelevant, as they do not influence the semantics of PSM classes, attributes or associations. In addition, we need to consider the two additional relations that classes can have (structural and conceptual inheritance). Rule 4 states that two classes involved in a structural inheritance relationship must have the same interpretation (as they share the same content). Besides that, we need to take into account the possibility that the class from which we inherit content and attributes may have conceptual inheritance children that may replace it. Finally, Rule 5 states that if two classes are in a conceptual inheritance relationship in the PSM schema, their interpretations must be the same or the parent's interpretation must be an ancestor of the child's interpretation in the PIM schema. This is what we call explicit inheritance, as we see the inheritance function in the PSM schema.

D. Conceptual model summary

In summary, the usefulness of our conceptual model for XML can be clearly seen when we, for example, ask questions like "In which of the hundred XML schemas used in our system is the concept of a customer represented?" and "What impact on my XML schemas would this particular change on the conceptual level have?" Even better, with our extensions for evolution of XML schemas [4], [1] we can make changes to the PIM schema (e.g. change the representation of a customer's name from one string to firstname and lastname) and those changes can be automatically propagated to all the affected PSM schemas.
Thanks to automated translations from PSM schemas to, for example, XML Schema and back [3], we can easily manage a whole system of XML schemas from the conceptual level, all thanks to the interpretations. These extensions, however, are not trivial and are not in the scope of this paper, where we only deal with modeling of inheritance and with operations for its evolution. In addition, the conceptual model can serve as quality documentation of the XML schemas, because it is clear to which concepts individual schema parts relate. Also, it would be possible to generate clickable HTML documentation of a system modeled in exolutio. With the model, it is also much easier and faster to grasp a system of multiple XML schemas when, for example, negotiating interfaces between two information systems.

III. ATOMIC OPERATIONS

In this section, we extend the atomic operations for editing PIM and PSM schemas, which we introduced in our previous work [1], with inheritance evolution, which is the main contribution of this paper. The atomic operations serve as a formal basis for describing user-friendly operations composed of them. Formally, we suppose a PIM schema $S$ and a set of PSM schemas $PSM = \{S'_1, \ldots, S'_n\}$, where each $S'_i$ has an interpretation $I_i$ against $S$. We also consider one specific PSM schema $S'$ from this set with an interpretation $I$ against $S$. For each atomic operation we specify input parameters together with a precondition and a postcondition. If a precondition is not satisfied, the operation cannot be performed. The postcondition describes the effect of the operation. When an operation is executed on $S$ or $S'$, we say that the schema evolved to a new version. This is denoted $S^+$ or $S'^+$, respectively.
The new version of the interpretation will be denoted $I^+$. In [1] we classified atomic operations into 4 categories: creation of classes, attributes and associations (denoted by the Greek letter α), their update (denoted by the Greek letter υ) and removal (denoted by the Greek letter δ), and we introduced a special synchronization operation (denoted by the Greek letter σ). For their detailed description we refer the reader to our previous work [1]. In this paper we extend them with operations for inheritance management.

A. Atomic Operations for PIM Schema Inheritance Evolution

In [1] we introduced basic operations for creation, change, movement and deletion of classes, attributes and associations. Now we extend these operations with new ones required for proper modeling and evolution of inheritance. The formal commands for the operations and their preconditions and postconditions, formalized according to our model, are in Table I. Note that so far, we do not describe propagation from the PIM level to the PSM level. There is basically one atomic operation which changes the isa function on PIM classes: $\upsilon^{isa}_c(C_s, C_g)$. However, there are four distinct cases of its use, so we describe each case individually, as it makes the description easier to formalize and understand. First, we have the simple addition and removal of generalizations. Formally, this means setting the isa function to a class when it was λ (addition) and setting it to λ when it was set to some class (removal). Next is resetting a generalization, i.e. setting it to a different class than it was originally set to. The preconditions state that these operations can be done only if there is a generalization between the source and the target general class. This means moving the specific class higher or lower in the inheritance hierarchy. In addition, there are operations which change the abstract and final functions, and that is all we need for the management of inheritance on the PIM level.
These operations are trivial so we do not provide examples for them. Then there are special operations for movement of attributes and association ends between classes which are in an inheritance relation. These differ from the operations for simple attribute or association end movement: their semantics, and therefore their propagation, is different.

TABLE I. ATOMIC OPERATIONS FOR PIM INHERITANCE MANAGEMENT

| Notation | Description | Precondition | Postcondition |
|---|---|---|---|
| $\upsilon^{isa}_c(C_s, C_g)$ | Sets conceptual inheritance relation. $C_g$ is the general class and $C_s$ is the specialized class. | $C_g, C_s \in S_c \wedge C_g \neq C_s \wedge final(C_g) = false \wedge C_s \notin isa^*(C_g) \wedge isa(C_s) = \lambda \wedge C_g \neq \lambda$ | $isa^+(C_s) = C_g$ |
| $\upsilon^{isa}_c(C_s, \lambda)$ | Unsets conceptual inheritance relation. | $C_s \in S_c \wedge isa(C_s) \neq \lambda$ | $isa^+(C_s) = \lambda$ |
| $\upsilon^{isa}_c(C_s, C_g)$ | Resets conceptual inheritance relation to a more general class. $C_g$ is the more general class, $C_{g0}$ is the original general class and $C_s$ is the specialized class. | $C_g, C_{g0}, C_s \in S_c \wedge C_g \neq C_s \wedge final(C_g) = false \wedge C_s \notin isa^*(C_g) \wedge isa(C_s) = C_{g0} \neq \lambda \wedge C_g \neq \lambda \wedge isa(C_{g0}) = C_g$ | $isa^+(C_s) = C_g$ |
| $\upsilon^{isa}_c(C_s, C_g)$ | Resets conceptual inheritance relation to a less general class. $C_g$ is the less general class, $C_{g0}$ is the original general class and $C_s$ is the specialized class. | $C_g, C_{g0}, C_s \in S_c \wedge C_g \neq C_s \wedge final(C_g) = false \wedge C_s \notin isa^*(C_g) \wedge isa(C_s) = C_{g0} \neq \lambda \wedge C_g \neq \lambda \wedge isa(C_g) = C_{g0}$ | $isa^+(C_s) = C_g$ |
| $\upsilon^{abstract}_c(C, b)$ | Sets the abstract property of a class $C$ to $b$, which is either true or false. | | $abstract^+(C) = b$ |
| $\upsilon^{final}_c(C, b)$ | Sets the final property of a class $C$ to $b$, which is either true or false. | $(b = true \wedge (\nexists D \in S_c)(isa(D) = C)) \vee b = false$ | $final^+(C) = b$ |
| $\upsilon^{gen}_a(A, C_g)$ | Move attribute to a general class $C_g$. | $A \in S_a \wedge C_0, C_g \in S_c \wedge class(A) = C_0 \wedge isa(C_0) = C_g$ | $class^+(A) = C_g$ |
| $\upsilon^{spec}_a(A, C_s)$ | Move attribute to a specific class $C_s$. | $A \in S_a \wedge C_0, C_s \in S_c \wedge class(A) = C_0 \wedge isa(C_s) = C_0$ | $class^+(A) = C_s$ |
| $\upsilon^{gen}_r(E, C_g)$ | Reconnect association end to a general class $C_g$. | $C_0, C_g \in S_c \wedge participant(E) = C_0 \wedge isa(C_0) = C_g$ | $participant^+(E) = C_g$ |
| $\upsilon^{spec}_r(E, C_s)$ | Reconnect association end to a specific class $C_s$. | $C_0, C_s \in S_c \wedge participant(E) = C_0 \wedge isa(C_s) = C_0$ | $participant^+(E) = C_s$ |

B. Atomic Operations for PSM Schema Inheritance Evolution

In this section we present operations for conceptual inheritance in PSM schemas. The operations are similar to those on the PIM level, but there are differences in preconditions, as on the PSM level we need to coordinate conceptual and structural inheritance. In Table II we summarize notation, description, preconditions and postconditions of operations working with the isa, abstract and final functions and operations for moving attributes and associations through the inheritance hierarchy.
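The preconditions of the atomic operations act as guards that reject an operation before any state changes. As an illustration, the sketch below encodes only the addition and removal of a PIM generalization (the first two rows of Table I). The `PimSchema` class and its method names are hypothetical and chosen for this example; they are not taken from exolutio.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class PimSchema:
    """Hypothetical, heavily simplified PIM schema: classes, isa and final only."""
    classes: Set[str] = field(default_factory=set)
    isa: Dict[str, str] = field(default_factory=dict)  # specific -> general
    final: Set[str] = field(default_factory=set)

    def ancestors(self, cls: str) -> Set[str]:
        """isa*(cls): every class reachable from cls by following isa."""
        result: Set[str] = set()
        current = self.isa.get(cls)
        while current is not None and current not in result:
            result.add(current)
            current = self.isa.get(current)
        return result

    def set_isa(self, specific: str, general: str) -> None:
        """Add a generalization where none existed (first row of Table I)."""
        precondition = (
            specific in self.classes
            and general in self.classes
            and specific != general
            and general not in self.final                  # final(C_g) = false
            and specific not in self.ancestors(general)    # C_s not in isa*(C_g)
            and specific not in self.isa                   # isa(C_s) = lambda
        )
        if not precondition:
            raise ValueError(f"cannot set isa({specific}) = {general}")
        self.isa[specific] = general                       # postcondition

    def unset_isa(self, specific: str) -> None:
        """Remove an existing generalization (second row of Table I)."""
        if specific not in self.classes or specific not in self.isa:
            raise ValueError(f"{specific} has no generalization to remove")
        del self.isa[specific]                             # isa+(C_s) = lambda


# Usage with the address classes of the running example.
schema = PimSchema(classes={"LocalAddress", "GlobalAddress", "ShippingAddress"})
schema.set_isa("GlobalAddress", "LocalAddress")
schema.set_isa("ShippingAddress", "GlobalAddress")
```

With this state, `schema.set_isa("LocalAddress", "ShippingAddress")` raises `ValueError`, because `LocalAddress` is already an ancestor of `ShippingAddress` and the new generalization would close a cycle.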
We also need to update some operations from [1] to take into account new requirements imposed by our conceptual inheritance extension to the interpretation definition. The redefined operations are in Table III. For example, we redefine the operation for setting a class as a structural representative so that it respects the rules of Definition 4. Next, we need to ensure that when we delete a PSM class, it is not part of any conceptual inheritance relation. The same goes for setting a PSM class's interpretation: when we do that, we do not want the class to be a part of a conceptual inheritance relation. Finally, we need to allow implicit inheritance when setting the interpretation of attributes and associations. The most complicated redefinition is $\upsilon'^{int}_c(C', C)$, where we make sure that when we update an interpretation of a class, there are no attributes or associations whose interpretation relies on that of $C'$ and that $C'$ does not participate in any inheritance relations. In the precondition, anc denotes an ancestor in the PSM tree.

IV. PROPAGATION OF ATOMIC OPERATIONS

An interpretation of a PSM schema $S'$ against a PIM schema $S$ must be consistent. When $S$ or $S'$ is modified by an atomic operation, one or more conditions necessary for consistency may be violated and, consequently, the interpretation or the other schema must be adapted accordingly. We call the process which ensures the adaptation propagation of the atomic operation. The propagation of the basic atomic operations was described in detail in our previous work [1] and therefore in this paper we focus on the new operations for inheritance evolution. We only allow changing the inheritance hierarchies on the PIM level, as we view conceptual inheritance as belonging to the PIM level. We therefore restrict the operations on the PSM level to the boundaries set by the PIM schema.
TABLE II. ATOMIC OPERATIONS FOR PSM INHERITANCE MANAGEMENT

| Notation | Description | Precondition | Postcondition |
|---|---|---|---|
| $\upsilon'^{isa}_c(C'_s, C'_g)$ | Sets conceptual inheritance relation. $C'_g$ is the general class and $C'_s$ is the specialized class. | $C'_g, C'_s \in S'_c \setminus \{C'_{S'}\} \wedge C'_g \neq C'_s \wedge C'_s \notin isarepr^*(C'_g) \wedge isa(C'_s) = \lambda \wedge C'_g \neq \lambda$ | $isa^+(C'_s) = C'_g$ |
| $\upsilon'^{isa}_c(C'_s, \lambda)$ | Unsets conceptual inheritance relation. | $C'_s \in S'_c \setminus \{C'_{S'}\} \wedge isa(C'_s) \neq \lambda$ | $isa^+(C'_s) = \lambda$ |
| $\upsilon'^{isa}_c(C'_s, C'_g)$ | Resets conceptual inheritance relation to a more general class. $C'_g$ is the more general class, $C'_{g0}$ is the original general class and $C'_s$ is the specialized class. | $C'_g, C'_{g0}, C'_s \in S'_c \setminus \{C'_{S'}\} \wedge C'_g \neq C'_s \wedge C'_s \notin isarepr^*(C'_g) \wedge isa(C'_s) = C'_{g0} \neq \lambda \wedge C'_g \neq \lambda \wedge isa(C'_{g0}) = C'_g$ | $isa^+(C'_s) = C'_g$ |
| $\upsilon'^{isa}_c(C'_s, C'_g)$ | Resets conceptual inheritance relation to a less general class. $C'_g$ is the less general class, $C'_{g0}$ is the original general class and $C'_s$ is the specialized class. | $C'_g, C'_{g0}, C'_s \in S'_c \setminus \{C'_{S'}\} \wedge C'_g \neq C'_s \wedge C'_s \notin isarepr^*(C'_g) \wedge isa(C'_s) = C'_{g0} \neq \lambda \wedge C'_g \neq \lambda \wedge isa(C'_g) = C'_{g0}$ | $isa^+(C'_s) = C'_g$ |
| $\upsilon'^{abstract}_c(C', b)$ | Sets the abstract property of a class $C'$ to $b$, which is either true or false. | | $abstract^+(C') = b$ |
| $\upsilon'^{final}_c(C', b)$ | Sets the final property of a class $C'$ to $b$, which is either true or false. | $(b = true \wedge (\nexists D' \in S'_c)(isa(D') = C')) \vee b = false$ | $final^+(C') = b$ |
| $\upsilon'^{gen}_a(A', C'_g)$ | Move attribute to a general class $C'_g$. | $A' \in S'_a \wedge C'_0, C'_g \in S'_c \wedge class(A') = C'_0 \wedge isa(C'_0) = C'_g$ | $class^+(A') = C'_g$ |
| $\upsilon'^{spec}_a(A', C'_s)$ | Move attribute to a specific class $C'_s$. | $A' \in S'_a \wedge C'_0, C'_s \in S'_c \wedge class(A') = C'_0 \wedge isa(C'_s) = C'_0$ | $class^+(A') = C'_s$ |
| $\upsilon'^{gen}_r(R', C'_g)$ | Reconnect parent association end of association $R'$ to a general class $C'_g$. | $C'_0, C'_g \in S'_c \wedge parent(R') = C'_0 \wedge isa(C'_0) = C'_g$ | $parent^+(R') = C'_g$ |
| $\upsilon'^{spec}_r(R', C'_s)$ | Reconnect parent association end of association $R'$ to a specific class $C'_s$. | $C'_0, C'_s \in S'_c \wedge parent(R') = C'_0 \wedge isa(C'_s) = C'_0$ | $parent^+(R') = C'_s$ |

TABLE III. BASIC ATOMIC OPERATIONS WITH INHERITANCE UPDATE

| Notation | Description | Precondition | Postcondition |
|---|---|---|---|
| $\upsilon'^{repr}_c(C', C'_r)$ | Set class $C'$ as structural representative of $C'_r$. | $C' \in S'_c \setminus \{C'_{S'}\} \wedge (C'_r = \lambda \vee (C'_r \in S'_c \setminus \{C'_{S'}\} \wedge I(C') = I(C'_r) \wedge C' \notin isarepr^*(C'_r)))$ | $repr^+(C') = C'_r$ |
| $\delta'_c(C')$ | Remove class $C'$. | $C' \in S'_c \wedge attributes'(C') = \emptyset \wedge content'(C') = \emptyset \wedge (\nexists C'_0 \in S'_c)(repr(C'_0) = C') \wedge (\nexists C'_1 \in S'_c)(isa(C'_1) = C')$ | $C' \notin S'^+_c$ |
| $\upsilon'^{int}_c(C', C)$ | Update interpretation of class $C'$ to class $C$. | $C' \in S'_c \setminus \{C'_{S'}\} \wedge (C = \lambda \vee C \in S_c) \wedge (\forall A' \in S'_a$ s.t. $intcontext(A') = intcontext(C') \wedge C' \in anc(A'))(I(A') = \lambda) \wedge (\forall R' \in S'_r$ s.t. $(intcontext(R') = intcontext(C') \wedge C' \in anc(R')) \vee child(R') = C')(I(R') = \lambda) \wedge (\nexists C'_0 \in S'_c)(repr(C'_0) = C') \wedge repr(C') = \lambda \wedge (\nexists C'_1 \in S'_c)(isa(C'_1) = C') \wedge isa(C') = \lambda$ | $I^+(C') = C$ |
| $\upsilon'^{int}_a(A', A)$ | Update interpretation of attribute $A'$ to attribute $A$. | $A' \in S'_a \wedge (A = \lambda \vee (A \in S_a \wedge class(A) \preceq I(intcontext(A'))))$ | $I^+(A') = A$ |
| $\upsilon'^{int}_r(R', O)$ | Update interpretation of association $R'$ to ordered image $O$ of association $R$. | $R' \in S'_r \wedge child(R') \in S'_c \wedge (O = \lambda \vee (O = (E_1, E_2) \wedge participant(E_1) \preceq I(intcontext(R')) \wedge participant(E_2) \preceq I(child(R'))))$ | $I^+(R') = O$ |

Here we describe how the introduced atomic operations executed on the PIM schema $S$ are propagated to each PSM schema $S' \in PSM$ and its interpretation $I$ against $S$. Creation of a generalization ($isa(C_s) = \lambda \wedge \upsilon^{isa}_c(C_s, C_g \neq \lambda)$) is not propagated at all. When the generalization is set, it is new and therefore it does not have an equivalent on the PSM level yet. Then there is setting of the abstract function, saying that a class cannot have instances in data ($\upsilon^{abstract}_c(C, b)$). This is a constraint that we cannot check on the modeling level and therefore we propagate it straight to the PSM schemas. We set the abstract function to the same value for each PSM class $C'$ that has $C$ as an interpretation. Formally, $(\forall C' : I(C') = C)(abstract(C') = b)$. From the PSM schemas this constraint is propagated to the actual PSM schema languages and it is up to their validators to check this constraint. Setting the final function is propagated in the same way as the setting of the abstract function. The propagation of resetting a generalization (isa function) to a more general or a more specific class is almost 1:1, which means that we perform similar operations on the PSM level to maintain consistency. There is one exception when moving a generalization to a more general class. The example is in Figure 2. The blue lines represent the interpretation of PSM classes against PIM classes. In the example of resetting the isa function to a more general class, the new general class for ShippingAddress is LocalAddress. On the PSM level, we already have the Address' class, a general class to Shipping', so it would seem that no propagation is necessary. However, we have an attribute Shipping'.country' and $I(Shipping'.country') = GlobalAddress.country$. This is the use of implicit inheritance (Definition 6, rule 1) and that rule would be broken. Therefore, we need to create GlobalAddress' on the PSM level and keep the country' attribute there. The opposite direction, moving a generalization to a more specific class, is really a 1:1 propagation: we simply perform the PSM counterpart operation. Another easy operation is removal of a generalization, which is in fact setting the isa function to λ ($\upsilon^{isa}_c(ShippingAddress, \lambda)$). When the PIM generalization is removed, we simply remove its PSM counterpart. Formally, $(\forall C' \in S'_c : I(C') = C_s)(isa(C') = \lambda)$. Next are the operations for movement of attributes and associations through the inheritance hierarchy. Because the cases for attributes and associations are similar, we show here only the cases for attributes. We start with moving an attribute to a more general class.
In the first case the more general class is also present in the PSM schema and the movement of the affected attribute follows the movement in the PIM schema. In the second case, the schema remains unchanged, because the more general class is not present in the PSM schema and the movement does not violate rule 1 of Definition 6. Finally, there is moving an attribute to a more specific class. This is demonstrated in Figure 3. Again, we have two cases. In PSM1, there is a more specific class ShippingAddress', whose interpretation inherits the target, more specific class GlobalAddress through interpretation. Formally, $GlobalAddress \preceq ShippingAddress = I(ShippingAddress')$. Therefore we move the corresponding attribute there. If there was no such class, we would have to create one, as can be seen in PSM2.

[Fig. 2. Propagation of moving a generalization to a more general class. Panels: PIM before, PSM before, PIM after, PSM after.]

[Fig. 3. Propagation of moving an attribute to a specific class. Panels: PIM before, PSM1 before, PSM2 before, PIM after, PSM1 after, PSM2 after.]

V. IMPLEMENTATION

We have implemented the proposed inheritance extension in a tool called exolutio. It is a proof-of-concept desktop application for conceptual XML data modeling (screenshot in Figure 4). It implements the PIM and PSM modeling languages described in [3] and operations for evolution of the PIM and PSM schemas described in [1]. It provides a designer with a set of operations which are composed of the atomic operations described earlier and it implements their propagation.
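To make the two simplest propagation rules of Section IV concrete, the sketch below applies them to a toy PSM schema: setting the abstract flag, $(\forall C' : I(C') = C)(abstract(C') = b)$, and removing a generalization, $(\forall C' \in S'_c : I(C') = C_s)(isa(C') = \lambda)$. It is an illustration under assumed data structures only and does not reproduce the code of exolutio; `PsmSchema`, `propagate_abstract` and `propagate_isa_removal` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class PsmSchema:
    """Hypothetical, simplified PSM schema used only for this illustration."""
    interpretation: Dict[str, str] = field(default_factory=dict)  # PSM class -> PIM class
    isa: Dict[str, str] = field(default_factory=dict)             # PSM child -> PSM parent
    abstract: Set[str] = field(default_factory=set)               # PSM classes flagged abstract


def propagate_abstract(psm: PsmSchema, pim_class: str, value: bool) -> None:
    """Copy the new abstract flag to every PSM class interpreted as pim_class."""
    for psm_class, target in psm.interpretation.items():
        if target == pim_class:
            if value:
                psm.abstract.add(psm_class)
            else:
                psm.abstract.discard(psm_class)


def propagate_isa_removal(psm: PsmSchema, pim_specific: str) -> None:
    """Unset isa for every PSM class interpreted as the specialized PIM class."""
    for psm_class, target in psm.interpretation.items():
        if target == pim_specific:
            psm.isa.pop(psm_class, None)


# Usage: Shipping' is interpreted as ShippingAddress, Address' as LocalAddress.
psm = PsmSchema(
    interpretation={"Shipping'": "ShippingAddress", "Address'": "LocalAddress"},
    isa={"Shipping'": "Address'"},
)
propagate_abstract(psm, "LocalAddress", True)
propagate_isa_removal(psm, "ShippingAddress")
```

Both functions visit only the PSM classes whose interpretation equals the affected PIM class, which mirrors the quantifier in the formal rules; uninterpreted PSM classes are left untouched.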
At the highest level, exolutio is based on the Model View Controller (MVC) design pattern. For the purpose of this paper, the atomic operations are implemented in exactly the way they are described here. We use the implementation to experimentally demonstrate that the proposed set of atomic operations is complete, i.e. that the atomic operations are sufficient for real-world situations. We do not prove completeness formally in this paper.

[Fig. 4. exolutio screenshot]

VI. EVALUATION

We have evaluated the inheritance extension to our conceptual model by modeling a medium-sized family of XML schemas of the Data Standard for ehealth in the Czech Republic (DASTA; source available in Czech only). Due to lack of space we only summarize our results. The PIM schema contained more than 70 classes, 100 associations including conceptual inheritance relations and hundreds of attributes. Mapped to the PIM schema were 12 XML formats (PSM schemas). Since 2006, the format has gone through multiple versions, as it evolves approximately four times a year. Our approach improved the speed with which programmers were able to orient themselves in the schemas and it also helped in revealing inconsistencies in the XML schemas. As to the evolution operations, we picked one of the evolution steps and performed the changes in exolutio using our operations formalism, and we confirmed that up to 60% of the operations that would have to be done manually by a domain expert can be done automatically using our approach.

VII. RELATED WORK

The XML schema languages for specification of XML schemas are not very user-friendly. Therefore, approaches for designing XML schemas at a conceptual level were introduced. In comparison to our approach, none of these approaches considers a formal binding between XML schemas and conceptual schemas like our interpretation. They only show how a conceptual schema (ER or UML) is translated to an XML schema or vice versa, but they do not consider the case when several XML schemas need to be designed, which is the motivation for our conceptual model and schema evolution. Therefore, the other approaches are limited when it comes to modeling multiple schemas and in particular the conceptual and structural inheritance hierarchies. For work related to conceptual modeling of XML in general see [3], [6], [7], [8].

Evolution management. The current approaches towards evolution management can be classified according to distinct aspects [9], [10]. Changes and transformations can likewise be expressed [11], [12] and divided [13] in various ways. To our knowledge there exists no general framework comparable to our conceptual model; particular cases and views of the problem have previously only been solved separately, superficially and mostly imprecisely, without any theoretical or formal basis. For a full survey of work related to schema evolution management refer to [1].
The need for simple, well-defined operations for change management has been identified for example in [14].

Evolution of inheritance. The problem of inheritance evolution in XML schemas has also been identified in [15] as future work. In [16], inheritance in UML to XML Schema translation is mentioned; however, the authors do not say how they translate a UML generalization change to a change in XML Schema. To the best of our knowledge, there is no other approach focusing on inheritance evolution management in XML. In [17], the authors propose a metric for improving modifiability of class inheritance hierarchies.

VIII. CONCLUSIONS

In this paper we briefly described our conceptual model for XML and our approach to evolution of XML schemas described by this model, which can be applied to the common case of multiple web service interfaces with a common data domain. We described how inheritance is modeled and then we extended our formal model of operations with the ones regarding inheritance management and their propagation. We did this formally and on examples. We briefly mentioned implementation, evaluation and related work.

ACKNOWLEDGEMENT

This work was supported in part by the Czech Science Foundation (GAČR), grant number P202/11/P455, and in part by the grant SVV.

REFERENCES

[1] M. Nečaský, J. Klímek, J. Malý, and I. Mlýnková, "Evolution and Change Management of XML-based Systems," Journal of Systems and Software, vol. 85, no. 3.
[2] M. Nečaský and I. Mlýnková, "On Different Perspectives of XML Schema Evolution," in FlexDBIST '09: Proc. of the 5th Int. Workshop on Flexible Database and Information System Technology. Linz, Austria: IEEE Computer Society.
[3] M. Nečaský, I. Mlýnková, J. Klímek, and J. Malý, "When conceptual model meets grammar: A dual approach to XML data modeling," Data & Knowledge Engineering, vol. 72, no. 0, pp. 1-30.
[4] J. Klímek and M. Nečaský, "Integration and Evolution of XML Data via Common Data Model," in Proceedings of the 2010 EDBT/ICDT Workshops, Lausanne, Switzerland, March 22-26. New York, NY, USA: ACM.
[5] J. Klímek and M. Nečaský, "Generating Lowering and Lifting Schema Mappings for Semantic Web Services," in 25th IEEE International Conference on Advanced Information Networking and Applications Workshops, WAINA 2010, Biopolis, Singapore, March. IEEE Computer Society.
[6] M. Nečaský, "Conceptual Modeling for XML: A Survey," in DATESO, ser. CEUR Workshop Proc., V. Snasel, K. Richta, and J. Pokorny, Eds. CEUR-WS.org.
[7] A. Yu, "An Overview of Research on Reverse Engineering XML Schemas into UML Diagrams," in ICITA '05, Volume 2. Washington, DC, USA: IEEE Computer Society, 2005.
[8] M. Bernauer, G. Kappel, and G. Kramler, "Representing XML Schema in UML: A Comparison of Approaches," in Web Engineering, ser. Lecture Notes in Computer Science, N. Koch, P. Fraternali, and M. Wirsing, Eds. Springer, 2004, vol. 3140.
[9] T. Mens and P. Van Gorp, "A Taxonomy of Model Transformation," Electron. Notes Theor. Comput. Sci., vol. 152.
[10] K. Czarnecki and S. Helsen, "Feature-Based Survey of Model Transformation Approaches," IBM Syst. J., vol. 45, no. 3.
[11] OMG, "MOF QVT Final Adopted Specification," Object Management Group, June.
[12] A. Boronat, J. A. Carsí, and I. Ramos, "Algebraic Specification of a Model Transformation Engine," in FASE '06: Proc. of the 9th Int. Conf. Fundamental Approaches to Software Engineering, Vienna, Austria, ser. LNCS. Springer, 2006.
[13] A. Cicchetti, D. D. Ruscio, and A. Pierantonio, "Managing Dependent Changes in Coupled Evolution," in Proc. of the 2nd Int. Conf. on Model Transformations, ICMT 2009, Zurich, Switzerland, ser. LNCS. Springer, 2009.
[14] M. Tan and A. Goh, "Keeping Pace with Evolving XML-Based Specifications," in EDBT '04 Workshops. Berlin, Heidelberg: Springer, 2005.
[15] F. Cavalieri, "EXup: an Engine for the Evolution of XML Schemas and Associated Documents," in EDBT '10: Proc. of the 2010 EDBT/ICDT Workshops. New York, NY, USA: ACM, 2010.
[16] E. Domínguez, J. Lloret, A. L. Rubio, and M. A. Zapata, "Evolving XML Schemas and Documents Using UML Class Diagrams," in DEXA '05: Proc. of the 16th Int. Conf. on Database and Expert Systems Applications, ser. LNCS. Springer, 2005.
[17] F. T. Sheldon, K. Jerath, and H. Chung, "Metrics for maintainability of class inheritance hierarchies," Journal of Software Maintenance, vol. 14, no. 3, May.

Chapter 7

Model-driven Approach to modeling and validating integrity constraints for XML with OCL and Schematron

Jakub Malý, Martin Nečaský

Published in Information Systems Frontiers, Springer.

Inf Syst Front

Model-driven approach to modeling and validating integrity constraints for XML with OCL and Schematron

Jakub Malý, Martin Nečaský

© Springer Science+Business Media New York 2013

J. Malý (✉), M. Nečaský: XML and Web Engineering Research Group, Faculty of Mathematics and Physics, Charles University in Prague, Prague, Czech Republic. E-mail: maly@ksi.mff.cuni.cz, necasky@ksi.mff.cuni.cz

Abstract The idea behind Model Driven Development (MDD) (Miller and Mukerji 2003) is to model the software system on several layers of abstraction. A designer starts from the very abstract specification (independent of the platform and language used) and progresses to more concrete models (using platform-specific constructs) and finally to code. Ideally, each step of the transformation of the model from the more abstract to the less abstract is achieved by a declarative transformation obtained (semi-)automatically. In our previous work, we have developed an approach for designing XML schemas based on MDD. We showed that a set of XML schemas representing different views of the same problem domain can first be modeled at a platform-independent level with a uniform conceptual schema expressed as a UML class diagram. Then each XML schema can be modeled as a view on this uniform UML class diagram. In this paper, we further extend our approach to modeling XML schemas using UML class diagrams with modeling of integrity constraints using the Object Constraint Language (OCL). We show that an integrity constraint expressed at the platform-independent level as an OCL expression can be translated to an expression at the XML schema level which can be used to validate XML documents. In particular, we propose a method which translates an OCL expression at the platform-independent level to a Schematron expression. Schematron is a language which makes it possible to express integrity constraints at the XML schema level.
We show that our approach saves time and prevents errors made by designers when expressing Schematron constraints manually.

Keywords Integrity constraints · OCL · UML · XML · Schematron · Conceptual modeling

1 Introduction

The idea behind Model Driven Development (MDD) (Miller and Mukerji 2003) is to model the software system on several layers varying in their degree of abstraction. A designer starts from the very abstract specification (independent of the platform and language used) and progresses to more concrete models (using platform-specific constructs) and finally to code. Ideally, each step of the transformation of the model from the more abstract to the less abstract is achieved by a declarative transformation obtained (semi-)automatically. The prominent modeling language in software engineering is the Unified Modeling Language (UML, Object Management Group 2007). In the world of both objects and relational databases, commercial (Sparx Systems) as well as non-commercial (Eclipse Model Development Tools (MDT)) tools offer generators of object and relational schemas from UML models. From the point of view of MDA, object and relational models are considered as platforms which can be modeled at the platform-independent level using UML. In our research, we focus on a different platform: XML. XML plays a key role in modern information systems. It is frequently applied for data exchange via web services and is gaining popularity as a primary means of data storage as well. In our previous work (Nečaský et al. 2012a, b), we have shown how to model XML schemas using UML class diagrams at the platform-independent and platform-specific levels. Our approach is effective in systems where many different XML schemas exist. These XML schemas, however, usually represent different views of the same data domain.

To give an example, let us suppose an information system which records data about projects, project teams and employees. It contains a web service which provides data from the database to authorized clients. Data is provided in the form of XML documents of different kinds. For example, it provides details about a project using one kind of XML document. It also provides information about team memberships and projects solved by a particular employee. For this, it uses another kind of XML document. Each kind of XML document is defined by a specific XML schema. Therefore, there are several different XML schemas in the system. Each of them, however, models a different view of the same domain.

Our approach makes it possible to model the domain as a platform-independent schema expressed as a UML class diagram. It models all important concepts, including employees, teams of employees and projects, and also the relationships between them. The schema is independent of particular XML schemas. It is rather a kind of conceptual schema of the domain. The level below makes it possible to model particular XML schemas as platform-specific schemas. They are, again, expressed as UML class diagrams. In this case, however, the schemas contain specific constructs which model XML schema constructs rather than conceptual constructs. A particular platform-specific schema models a concrete XML schema. It reuses the components of the platform-independent schema and specifies how they are represented in the XML schema. In our previous work, we have provided a set of constructs for modeling structural features of XML schemas.
However, we have not fully solved another issue which is also very important in the world of (not only XML) schemas: integrity constraints. Information systems often rely on certain invariants and data-integrity constraints (ICs) that are required for the system to function properly. However, these ICs often cannot be properly described by the visual languages of UML diagrams. In those cases where diagrams are not sufficient, the Object Constraint Language (OCL, Object Constraint Language Specification 2012) can be used to describe additional properties and conditions of the system. OCL is a formal language and thus its expressions defined at the abstract layer can be used by the transformations and propagated to the specific layers. In this paper, we extend our XML schema modeling approach, where only structural features of the schemas can be modeled, with support for OCL expressions and ICs. We show how ICs can be specified at the platform-independent level and how they can be translated to the XML schema level via the platform-specific level.

Structure-oriented languages, such as W3C (2012a), Tim Bray and Paoli (2000) and Murata (2002), are well suited to describing the dictionaries and nesting of elements, but to check more complex ICs over the contained data, a different technique must be used. It would be possible to use a procedural language (e.g. Java), but the XML platform has its own expression language, XPath (W3C 2011), and a rule-based schema language, Schematron (ISO/IEC 2006), which can be used to define validity conditions using XPath expressions. The combination of Schematron and XPath is preferable to a language such as Java because both are declarative, platform-independent and native XML technologies. When the system consumes XML as input (e.g. when Web Services (Booth 2007) are employed), invalid data can be recognized and rejected or sanitized right at the borders of the system.
Another benefit is that the resulting schemas can be published as part of the API of the system.

Motivation

If we consider data encoded in XML documents, we need to express ICs in a language which enables validation of the XML documents according to those ICs. For example, suppose an IC requiring that when an employee is a leader of a team, he or she cannot be a member of another team. Suppose also an XML document which contains the detail of a particular project: the teams participating in the project and the members of those teams, with team leaders marked explicitly. To be able to validate the XML document against the IC, we first need to express the IC in a suitable language (e.g., Schematron) and formulate the expression with respect to the hierarchical structure of the XML document. Only this expression can be used to validate the XML document. However, the same information can also be encoded in another XML document for the purposes of another service client who requires the information in a different context. For example, it can be an XML document which provides the detail of a particular employee with the list of the teams he or she participates in. To validate this XML document against the IC, we need to formulate an expression which respects the structure of this XML document. It is clear that in this case the expression will be different from the previous one.

As the example shows, we can have a whole set of ICs which are independent of particular XML schemas. If we want to validate XML documents against these ICs, we need to express the ICs in a way which enables the validation. However, it can be necessary to create several expressions of the same IC for the purposes of different XML schemas. This can be, of course, very time consuming and error prone for designers. We propose to avoid these problems by enabling designers to model the ICs at the platform-independent level using OCL. This enables the designers to

express each IC only once against a platform-independent schema. Then, we can automate the process of translating a chosen OCL expression to the corresponding Schematron expression via the platform-specific level. As we show in the paper, the translation cannot be fully automated, but we can help the designers with the translation and therefore make their work more effective.

Contribution

We have already described parts of the translation process outlined above in our previous research, namely the transformation of OCL expressions in platform-specific schemas to Schematron expressions (Malý and Nečaský 2012). The contribution of this paper is two-fold. First, we introduce a novel algorithm for transforming OCL expressions defined for a platform-independent schema to platform-specific schemas. We show when this part of the translation can be fully automated. We also show that there can exist several different possibilities of translating an IC. In that case, the possible translations are offered to the designer, who then decides which possibility is the right one (this is the only manual step required by our approach). Using our approach, the costs of development can be reduced (thanks to the large part of the process being automatic) and the development is less error-prone (the algorithm can identify relevant constraints for each XML schema used in the system, whereas a human designer may easily overlook some schema or constraint). Second, we extend the previously published algorithm for translating platform-specific OCL expressions to Schematron expressions. We propose a general algorithm for translating OCL into the expression language for XML, XPath. We describe our previously published algorithm in much more detail and reveal its formal background. We also extend the previous algorithm so that it can now work with inheritance and recursive structures.
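The Motivation example (a team leader must not be a member of another team) can be illustrated with a minimal Python sketch. It is only an illustration of why one IC needs a different check per document structure, not the Schematron output of our approach. Both document shapes, all element and attribute names (`project`, `team`, `member`, `empNo`, `leader`, `role`) and both function names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical project-centric document: teams are nested under a project.
PROJECT_DOC = """
<project name="P1">
  <team name="T1"><member empNo="1" leader="true"/><member empNo="2"/></team>
  <team name="T2"><member empNo="1"/><member empNo="3"/></team>
</project>
"""

# Hypothetical employee-centric document: teams are nested under an employee.
EMPLOYEE_DOC = """
<employee empNo="1">
  <team name="T1" role="leader"/>
  <team name="T2" role="member"/>
</employee>
"""


def violations_project_view(xml_text):
    """Return empNo values of team leaders that also occur in another team."""
    root = ET.fromstring(xml_text)
    teams = root.findall("team")
    bad = set()
    for team in teams:
        for leader in team.findall("member[@leader='true']"):
            emp = leader.get("empNo")
            for other in teams:
                if other is team:
                    continue
                if other.find(f"member[@empNo='{emp}']") is not None:
                    bad.add(emp)
    return sorted(bad)


def violates_employee_view(xml_text):
    """Return True if the employee leads a team and is listed in more than one team."""
    root = ET.fromstring(xml_text)
    teams = root.findall("team")
    leads = any(team.get("role") == "leader" for team in teams)
    return leads and len(teams) > 1
```

The two functions encode the same constraint, yet they share no navigation logic, because each one follows the nesting of its own document type. This is the duplication that expressing the IC once at the platform-independent level is meant to avoid.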
The rest of this work is organized as follows: in Section 2, we introduce our 2-layered model for modeling XML data. In Section 3, we enrich the model with formal integrity constraints using OCL. Section 4 describes an algorithm for propagation of ICs between the levels of the model. Section 5 deals with automatic translation of OCL constraints into the XML technologies Schematron and XPath. In Section 6, we show how we support advanced features of UML models: inheritance and recursive structures. In Section 7, we introduce the experimental implementation of our approach. In Section 8, we examine the related work and in Section 9 we conclude.

2 Conceptual model

In accordance with Model Driven Development (Miller and Mukerji 2003) principles, we use a 2-layered conceptual model in our approach. In this section, we introduce both the platform-independent and the platform-specific layer formally. Our conceptual model remains the same as in our previous work (Nečaský et al. 2012a, b).

2.1 Platform-independent model

The purpose of the platform-independent model is to describe the problem domain using general constructs, independent of the underlying implementation technology and language used. At the platform-independent level, we use the model of UML class diagrams (for the purposes of this paper, we will consider only classes, attributes and binary associations due to space limitations). Definition 2.1 formally introduces PIM schemas.

Definition 2.1 A platform-independent schema (PIM schema) is a tuple $S = (S_c, S_a, S_r, S_e)$ of disjoint sets of classes, attributes, associations and association ends, respectively.
A class $C \in S_c$ has a name assigned by function $name$.
An attribute $A \in S_a$ has a name, data type and cardinality assigned by functions $name$, $type$, and $card$, respectively. Moreover, $A$ is associated with a class from $S_c$ by function $class$.
An association $R \in S_r$ is a set $R = \{E_1, E_2\}$, where $E_1$ and $E_2$ are members of the set of association ends $S_e$ ($E_1$ and $E_2$ are called association ends of $R$).
Both $E_1$ and $E_2$ have a cardinality assigned by function $card$ and are associated with a class from $S_c$ by function $participant$. We will call $participant(E_1)$ and $participant(E_2)$ participants of $R$. Association ends have names, denoted $name(E)$.¹
For a class $C \in S_c$, $attributes(C)$ denotes the set of attributes of $C$ and $associations(C)$ denotes the set of associations with $C$ as a participant. A subset $K \subseteq attributes(C)$ can be declared a key for $C$.

In the text, we will usually refer to the schema constructs using their names, i.e. when we write class Employee, we speak about the class $C \in S_c$ s.t. $name(C) = Employee$. Note that an association is formally an unordered set of its two endpoints. In other words, associations are unordered in PIM schemas. Even though UML allows for directed associations, we do not consider them in this paper. As far as keys are concerned, the modeller should choose a subset of the class's attributes which uniquely identifies an instance of the class. There can be more than one key for a class, or none if it is not important for the model.

¹ When the name of an association end is the same as the name of the participant class (i.e. $name(E) = name(participant(E))$), we omit $name(E)$ in the diagrams.
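As an illustration only, Definition 2.1 can be mirrored by a few Python data classes. This is a hedged sketch of one possible in-memory encoding, not the data model of our implementation. The type names (`PimClass`, `PimAttribute`, `AssociationEnd`, `PimSchema`), the field `owner` (standing for the function $class$), and the cardinalities in the sample are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PimClass:
    name: str


@dataclass(frozen=True)
class PimAttribute:
    name: str
    data_type: str
    card: str
    owner: PimClass  # plays the role of the function `class` in Definition 2.1


@dataclass(frozen=True)
class AssociationEnd:
    name: str
    card: str
    participant: PimClass


@dataclass
class PimSchema:
    classes: set = field(default_factory=set)
    attributes: set = field(default_factory=set)
    # Each association is a frozenset of two ends, so it is unordered.
    associations: set = field(default_factory=set)
    # Maps a class to a list of keys; each key is a set of attributes.
    keys: dict = field(default_factory=dict)

    def attributes_of(self, cls):
        return {a for a in self.attributes if a.owner == cls}

    def associations_of(self, cls):
        return {r for r in self.associations
                if any(end.participant == cls for end in r)}


# A tiny fragment in the spirit of Fig. 1 (cardinalities are assumed).
employee = PimClass("Employee")
department = PimClass("Department")
emp_no = PimAttribute("empNo", "string", "1..1", employee)
employer_end = AssociationEnd("employer", "1..1", department)
employee_end = AssociationEnd("Employee", "0..*", employee)
works_in = frozenset({employer_end, employee_end})

schema = PimSchema(
    classes={employee, department},
    attributes={emp_no},
    associations={works_in},
    keys={employee: [{emp_no}]},
)
```

Representing an association as a `frozenset` of its two ends reflects the remark that associations are unordered at the PIM level: swapping the ends yields the same association.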


Fig. 1 A PIM schema with integrity constraints modeling an organization domain comprising organizations, departments, employees, teams and projects. Departments and employees have a key specified

Figure 1 depicts a sample PIM schema that we will use throughout this paper. It models an organization where employees are divided into departments and work in teams on projects. Let us note that cardinality constraints 1..1 are not displayed. When a cardinality constraint is not present, it is implicitly 1..1. There are several integrity constraints defined in the organization, but they cannot be formalized using diagrams alone. We will address them in the following sections of this paper.

2.2 Platform-specific model

The purpose of the platform-specific model is to describe the system using constructs more tightly coupled to the selected platform and implementation technology, while reusing and referring to the general concepts defined at the PIM layer. At the platform-specific level, we use slightly modified class diagrams. A PSM schema in our approach is a formal model which models the structure of XML documents of a given document type and also their semantics in terms of a PIM schema. A concrete PSM schema can be translated automatically to an XML schema written using one of the XML schema languages² (Nečaský et al. 2012b).

A distinctive feature of XML is its hierarchical structure: XML elements form a rooted labelled tree. This needs to be reflected in PSM. Thus, associations in our PSM schemas are all oriented and the schema has a distinctive skeleton: a forest. Also, there are two ways in which an atomic value can be stored in XML: as an element with simple text content or as an attribute. To distinguish these options, we introduce function $xform$, which specifies whether an attribute of a class is mapped to an XML attribute or to an XML element with simple text content.

² We currently support export to XSD and Relax NG.
We define PSM schemas formally in Definition 2.2. In Nečaský et al. (2012b), we proved that PSM schemas have the expressive power of regular tree grammars (RTGs) (Murata et al. 2005).

Definition 2.2 A platform-specific (PSM) schema is a tuple $S' = (S'_c, S'_a, S'_r, S'_e, C'_{S'})$ of disjoint sets of classes, attributes, associations and association ends, respectively, and one specific class $C'_{S'} \in S'_c$ called the schema class.
A class $C' \in S'_c$ has a name assigned by function $name$, a parent association assigned by partial function $parentassociation$ and a list of child associations assigned by function $childassociations$. Classes without a parent association are called root classes.
An attribute $A' \in S'_a$ has a name, data type, cardinality and XML form assigned by functions $name$, $type$, $card$ and $xform$, respectively, where $xform(A') \in \{e, a\}$. Moreover, it is associated with a class from $S'_c$ by function $class$ and has a position assigned by function $position$ within all the attributes associated with $class(A')$. $attributes(C')$ will denote the sequence of all attributes of $C'$ ordered by position.
An association $R' \in S'_r$ is a pair $R' = (E'_1, E'_2)$, where $E'_1, E'_2 \in S'_e$ ($E'_1$ and $E'_2$ are called association ends of $R'$). Both $E'_1$ and $E'_2$ have a cardinality assigned by function $card$ and each is associated with a class from $S'_c$ assigned by function $participant$. We will use $parent(R')$ to denote $participant(E'_1)$ (called the parent of $R'$) and $child(R')$ to denote $participant(E'_2)$ (called the child of $R'$). Association ends have names, denoted $name(E')$. If $E'$ is the parent end, its name is always parent.
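To make the role of $xform$ and of the ordered child associations more tangible, the following Python sketch builds an empty XML skeleton from a very reduced PSM-like structure. It is a simplified illustration under stated assumptions and not the export algorithm of Nečaský et al. (2012b): cardinalities, data types, unnamed associations and the schema class are ignored, and the names `PsmAttribute`, `PsmClass` and `skeleton` as well as the sample attribute `name` are hypothetical.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class PsmAttribute:
    name: str
    xform: str  # "a" for an XML attribute, "e" for an XML element


@dataclass
class PsmClass:
    name: str
    attributes: list = field(default_factory=list)  # ordered by position
    # Ordered child associations as (XML element name, child class) pairs.
    children: list = field(default_factory=list)


def skeleton(element_name, cls):
    """Build an empty XML element showing the structure modeled by cls."""
    elem = ET.Element(element_name)
    for attr in cls.attributes:
        if attr.xform == "a":
            elem.set(attr.name, "")          # modeled as an XML attribute
        else:
            ET.SubElement(elem, attr.name)   # modeled as a child XML element
    for child_name, child_cls in cls.children:
        elem.append(skeleton(child_name, child_cls))
    return elem


# Fragment in the spirit of Fig. 2a: empno is an XML attribute, name an element.
employee = PsmClass("Employee'", [PsmAttribute("empno", "a"),
                                  PsmAttribute("name", "e")])
department = PsmClass("Department'", [PsmAttribute("name", "e")],
                      [("emp", employee)])

root = skeleton("dpt", department)
print(ET.tostring(root, encoding="unicode"))
```

Because every association is oriented from parent to child, the recursion in `skeleton` simply follows the forest structure of the schema.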

Let us note that another feature of XML documents is a certain variability: XML schema languages are usually based on RTGs and thus provide means of choosing between several options in a certain location of the schema. However, Definition 2.2 does not provide any modeling construct for this feature. In Nečaský et al. (2012b), we introduce a full definition with an additional modeling construct called content model, which allows for modeling this feature. However, it is not important for this paper, so we omit it for now.

A concrete PSM schema models a set of XML documents with a certain structure defined by the schema. Table 1 shows how a specific XML format is modeled by a PSM schema. Each PSM schema is mapped to a PIM schema. The mapping specifies the semantics of the PSM schema in terms of the PIM schema. The mapping is realized as a function called interpretation. Due to space limitations, we will not include its full formal definition here; it can be found in Nečaský et al. (2012a), together with the description of algorithms for propagation of changes between the PIM and PSM levels. For the purposes of this paper, we will use the following simplified definition of interpretation:

Definition 2.3 The interpretation $I$ of a PSM schema $S'$ against a PIM schema $S$ is a partial function $I : S'_c \cup S'_a \cup S'_r \cup S'_e \to S_c \cup S_a \cup S_r \cup S_e$ s.t. $\forall X' \in S'_c$ ($S'_a$, $S'_r$, $S'_e$ resp.): $I(X') = X \neq \lambda \Rightarrow X \in S_c$ ($S_a$, $S_r$, $S_e$ resp.).

Informally, PSM model elements (classes, attributes, associations and association ends) can have a PIM element of the corresponding type as their interpretation. We will call a construct $X'$ from a PSM schema $S'$ s.t. $I(X') = X$ a representation of $X$ in $S'$, or say that $X'$ represents $X$ in $S'$. In the figures, PSM elements without an interpretation are shown in grey.
Please note that it is possible for two PSM constructs $X'_1$ and $X'_2$ to be two representations of the same PIM construct $X$, i.e. it is possible for multiple PSM constructs $X'_1, \ldots, X'_k$ to have the same interpretation $X$. Also, a PIM construct $X$ might not have a representation in a PSM schema $S'$.

Sample PSM schemas are depicted in Figs. 2a and 3a, b. Each shows how a part of the reality modeled by the PIM schema is represented in a particular kind of XML document. Let us explain in more detail the first PSM schema depicted in Fig. 2a. Let us note that while we refer to PIM components by their names in the text, we refer to PSM components by their names supplemented with the symbol ′ to distinguish them from PIM components. Class OrganizationSchema′ in the PSM schema is the schema class. All other classes are mapped to their corresponding classes in the PIM schema depicted in Fig. 1. Each association in the PSM schema has a name and, therefore, models an XML element of that name. The majority of attributes have their XML form set to XML element and, therefore, model XML elements. The only exception is attribute empno′ of all three classes Member′, Employee′ and Intern′. The XML form set to XML attribute is depicted by a symbol at the attribute. Interpretation is defined for all constructs besides the schema class OrganizationSchema′ and the association org′ (to the corresponding PIM constructs, usually with the same name, i.e. $I(Team') = Team$). Classes Member′, Employee′ and Intern′ all represent class Employee. Figure 2b depicts a sample XML document modeled by the PSM schema from Fig. 2a.
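The interpretation of the classes of Fig. 2a, as described above, can be written down as a small lookup table. The Python sketch below is only an illustration: representing $I$ as a dictionary keyed by class names, using `None` for a missing interpretation, and the helper name `representations` are assumptions made for this example.

```python
# Interpretation I restricted to the classes of the PSM schema in Fig. 2a.
# None stands for "no interpretation" (the schema class).
INTERPRETATION = {
    "OrganizationSchema'": None,
    "Organization'": "Organization",
    "Department'": "Department",
    "Team'": "Team",
    "Member'": "Employee",
    "Employee'": "Employee",
    "Intern'": "Employee",
}


def representations(pim_class, interpretation):
    """Return all PSM classes that represent the given PIM class."""
    return sorted(psm for psm, pim in interpretation.items() if pim == pim_class)


# Several PSM classes may represent one PIM class ...
print(representations("Employee", INTERPRETATION))
# ... and a PIM class may have no representation in this PSM schema at all.
print(representations("Project", INTERPRETATION))
```

The inverse lookup shows both situations mentioned in the text: Employee has three representations, while Project, which appears only in the PIM schema of Fig. 1, has none in this PSM schema.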
Table 1 XML modeled by PSM constructs

Model construct | Modeled XML construct
$C'_{S'}$ | Named child associations of the schema class $C'_{S'}$ model the allowed root XML elements. $C'_{S'}$ has no attributes by definition.
$C' \in S'_c \setminus \{C'_{S'}\}$ | A complex content which is a sequence of XML attributes and XML elements modelled by attributes in $attributes(C')$ followed by XML attributes and XML elements modelled by $childassociations(C')$
$A' \in S'_a$ s.t. $xform(A') = a$ | An XML attribute with name $name(A')$, data type $type(A')$ and cardinality $card(A')$
$A' \in S'_a$ s.t. $xform(A') = e$ | An XML element with name $name(A')$, simple content with data type $type(A')$ and cardinality $card(A')$
$R' = (E'_1, E'_2) \in S'_r$ s.t. $name(E'_2) \neq \lambda$ | An XML element with name $name(E'_2)$, complex content modeled by $child(R')$ and cardinality $card(R')$. If $parent(R') = C'_{S'}$ then the XML element is the root XML element.
$R' = (E'_1, E'_2) \in S'_r$ s.t. $name(E'_2) = \lambda$ | Complex content modeled by $child(R')$

3 Modeling integrity constraints

In this section, we show how the Object Constraint Language (OCL) (Object Constraint Language Specification 2012) can be used to express integrity constraints which cannot be expressed using the structural constructs defined in

Section 2. We concentrate on both PIM and PSM levels. At the PIM level, an OCL expression specifies an integrity constraint at the conceptual level. On the one hand, this OCL expression clearly specifies the constraint. On the other hand, however, it is not suitable for evaluation of the constraint over a given representation of data in a particular data model, in our case an XML document with structure modeled by a PSM schema. Therefore, we also consider OCL expressions at the PSM level.

Fig. 2 A PSM schema modeling a type of XML documents with the structure of an organization. An organization contains departments. A department contains teams, employees and interns. A team contains its members. a PSM schema. b Sample XML document

Fig. 3 Another 2 PSM schemas for departments, projects and teams. a A PSM schema modeling a type of XML documents with organizations which contain projects and departments in the organization. A project contains teams and teams contain hosting departments. b A PSM schema modeling a type of XML documents with teams. A team contains a hosting department and organization. It also contains its members. For each member, there is its employing department and, optionally, a department where the member is currently doing his or her internship

At the PSM level, an OCL expression is an augmentation of an OCL expression from the PIM level to the shape suitable for evaluating the constraint over the representation of the data in an XML

document with structure specified by a PSM schema. Our goal is to show how an OCL expression $O$ at the PIM level can be translated to an OCL expression $O'$ over a given PSM schema such that both $O$ and $O'$ specify the same integrity constraint.

3.1 OCL expressions at the PIM level

Since PIM schemas are UML class diagrams, we can directly use standard OCL (Object Constraint Language Specification 2012). In the rest of this article, we will delimit an OCL expression in the text using guillemets, e.g. «x + y > 1» is an OCL expression. We will use uppercase letters when we speak about OCL expressions at the PIM level (usually $O$). We add an apostrophe when we speak about OCL expressions at the PSM level (e.g. $O'$). Sample OCL expressions are depicted in Fig. 1. The syntax and semantics of OCL are formally introduced in Richters and Gogolla (2002). However, we do not need such a complex formalism in this paper. For our purposes, a subset of OCL suffices, and we formalize only those parts of OCL expressions which are important for the translation between the PIM and PSM levels. Let us explain the subset on the following OCL expression $O$ expressed over the PIM schema depicted in Fig. 1:

context Organization inv PIM_IC1:
  self.department.team->forall(t | t.member->size() < 0.1 * self.department.employee->size())

The constraint specifies that in an organization, a team in any department cannot have more than 10 % of the total number of employees of the organization. The first line of $O$ specifies the context of $O$, which is the class Organization. (OCL allows more classes and also class methods to be in the context. However, we consider only the simple case with only one class in the context for the purposes of this paper. Class methods are not interesting for our work because we model only data.) The rest of $O$ is a so-called invariant, which is an expression evaluated at the instance level to true or false.
Besides invariants, other kinds of expressions are allowed as well (e.g. value derivation), but we do not consider them in this paper. The invariant contains navigation paths, e.g. «self.department.team». A navigation path starts in a variable and navigates via one or more steps in the structure of the PIM schema. At the instance level, the evaluation of such a navigation path starts in an instance $c$ of the class which is assigned to the variable and navigates to a collection of instances associated with $c$ via the navigation path. A path may be followed by a collection function, e.g., forall or size, which is a kind of iteration function. It iterates over the collection resulting from the evaluation of the path. It may return another collection or a simple data value. Formally, a step is also a collection function which iterates over the result of the previous step, evaluates the specified navigation for each iteration and returns the new collection or value. The syntax of OCL also allows for navigation paths which do not start in a variable but, e.g., at a collection function. However, we do not consider this kind of navigation path in this paper for simplicity. Let us only note that these paths can be translated to those which start in a variable.

As we will show in the next section, having an OCL expression $O$ over a PIM schema $S$, we only translate its context class and navigation paths. The other parts of $O$ remain untouched. Therefore, we will use a simplified formal model which represents $O$ as a pair $(C, \{p_1, \ldots, p_n\})$ where $C$ is the context class and each $p_i$ is a navigation path in $S$. A navigation path $p$ in $S$ is a construction $s_0.\ldots.s_n$ where each $s_i$ is called a step of $p$. The first step $s_0$ is a variable. We distinguish the context variable (denoted self) and the other user-defined variables (usually denoted by a character, e.g.
v, k, etc.). A step $s_i$ with $i > 0$ specifies an attribute $A$ (then $s_i$ is the name of $A$) or an association endpoint $E$ (then $s_i$ is the name of $E$ or the name of the association $E$ belongs to). If $s_i$ is an attribute then $i = n$. Our sample OCL expression above is represented in our formalism as the pair (Organization, {«self.department.team», «t.member», «self.department.employee»}).

3.2 OCL expressions at the PSM level

Since we are using modified class diagrams at the PSM level, we can also use OCL to express integrity constraints similarly to the PIM level. It would be possible to define the translation of PIM constraints directly to an implementation/validation language, but we decided to use OCL at the PSM level as well: OCL is better suited for class diagrams (even at the PSM level), it can be managed more easily when schemas change, and the transformation of constraints from the PIM level is more transparent when the same language is used. However, since we modified class diagrams to fit the needs of XML modeling, we will extend the OCL language for the PSM level as well.

Traversing among PSM classes

Several PSM classes can be mapped to the same PIM class via the interpretation function (which is the case of classes Intern′ and Employee′ in Fig. 2a, both representing PIM class Employee). This property of PSM schemas requires special support in our PSM OCL expressions. Let us consider the following PIM OCL expression specifying an invariant ensuring that interns do their internships in departments other than their home department.

context e: Employee inv PIM_IC3:
  e.internship <> e.employer

Suppose that we want to transform this PIM OCL expression to a PSM OCL expression expressed over the PSM schema depicted in Fig. 2a. We must act cautiously because for class Intern′ there is no association end mapped to the end employer in the PIM schema, and for class Employee′ there is no association end mapped to internship. However, PSM classes Employee′ and Intern′ are linked semantically (they have the same interpretation, class Employee in the PIM schema). Therefore, a single instance of PIM class Employee may be represented by an instance of Employee′ and a different instance of Intern′ in the same XML document. Consequently, no straightforward variant of the OCL expression above will work for the PSM schema.

For this purpose, we introduce a new function that allows for traversing a PSM schema along this kind of semantic link. In fact, we introduce a function for each class $C'$ in the PSM schema. The name of the function is «to» followed by the name of $C'$. The function has no parameters. For example, class Employee′ has the name Employee. Therefore, we introduce function toEmployee(). Traversing is possible between classes with the same PIM interpretation (Employee′ and Intern′ in our example). Thus, the function can be called on a source expression of type $C'_1$ s.t. the interpretation of both $C'$ and $C'_1$ is the same PIM class $C$ (Employee in the example). The result of the function is of type $Set(C')$ (Set(Employee) in the example). The resulting set contains all instances of $C'$ which represent the same PIM concept as is represented by the source instance of $C'_1$. In our example, for an intern i, the expression «i.toEmployee()» returns the intern's employee record.

context i: Intern inv PSM_IC3:
  i.department <> i.toEmployee().department

Evaluation of the expression starts in the PSM class Intern′ and needs to evaluate two navigation paths.
The former path needs to navigate from an instance i of Intern′ to an associated instance di of Department′ via an association end which represents the end internship from the PIM schema. This is possible because the PSM association having Intern′ as its child represents the PIM association with internship as an end. However, the expression also needs to navigate from i to another associated instance de of Department′ via an association end which represents the association end employer from the PIM schema. This is not possible directly. The OCL expression has to traverse from the Intern′ instance to the corresponding Employee′ instance using the function toEmployee. From there, the required navigation is possible.

In the previous sample, the two navigation subexpressions («i.department» and «i.toEmployee().department») navigate the model in the upwards direction. Both refer to the parent association end using the name of the class. Another option is to use the name of the association end, which is always parent in the case of parent association ends. The previous PSM OCL constraint can be alternatively expressed as follows:

context i: Intern inv PSM_IC3:
  i.parent <> i.toEmployee().parent

Let us note that at the PIM and PSM levels, which are conceptual, we do not address how the function is implemented. We assume an object identity of the instances which allows the traversing. In the phase of translation of PSM OCL constraints to XPath (which is the subject of Sections 5 and 6), we show how we deal with the added functions at the XPath level (we use PIM keys).

4 Translating integrity constraints from PIM to PSM

In this section, we show how PIM OCL expressions are translated to their PSM OCL equivalents. For this section, let $S$ be a PIM schema and $S'_1, \ldots, S'_n$ be PSM schemas, each with an interpretation against $S$. Our goal is to translate a PIM OCL expression $O$ over $S$ to $O'_i$ for each $S'_i$ whenever it is meaningful. The resulting $O'_i$ must specify the same constraint over $S'_i$ as specified by $O$ over $S$.
Let us note that a PSM schema often does not contain representations of the whole problem domain described in the PIM (a PSM schema may contain several classes relevant for a certain component of the system, while others, which are not important for the component, are not represented in the schema). It is clear that it is meaningful to translate $O$ only when all classes, attributes and associations in $S$ restricted by $O$ are represented in $S'_i$. In that case, we will say that $S'_i$ covers $O$. Formally:

Definition 4.1 Let $\mathcal{X}$ be the set of all PIM constructs referenced from the PIM constraint $O$. We will say that a PSM schema $S'$ with an interpretation $I$ against $S$ covers $O$ iff $\forall X \in \mathcal{X}\ \exists X' \in S'_c \cup S'_a \cup S'_r \cup S'_e$ s.t. $I(X') = X$.

4.1 Direct translation

We will first discuss a direct translation of the PIM OCL expression $O$ to its equivalent PSM OCL expression $O'$ over a given PSM schema $S'$. Basically, the direct translation starts with replacing the context class $C$ of $O$ with its representation $C'$ in $S'$ (i.e., $I(C') = C$). $C'$ becomes the context class of $O'$. The translation then proceeds with translating all navigation paths in $O$. The paths of $O$ form a hierarchy. Each

path $p$ consisting of steps $s_0, \ldots, s_m$ is a node of this hierarchy. The first step $s_0$ is always a variable. When $s_0 = self$, then $p$ is at the top level of the hierarchy. Otherwise, $s_0$ is a variable $v$ declared by a collection function which follows another navigation path $\bar{p}$ in $O$, i.e., «$\bar{p}$->colf(v | Q)», where $Q$ is a PIM OCL expression referencing variable $v$ and containing path $p$, and colf denotes an iterator expression. (Let us note that we do not consider the OCL construct let.) In that case, $\bar{p}$ is superior to $p$ in the hierarchy and we will denote $\bar{p}$ as superior-path($p$). The translation of the paths in $O$ proceeds from the paths at the higher levels of this hierarchy to the lower levels, starting at the top level.

The correctness of the translation algorithm is based on the fact that the direct translation algorithm only translates paths in PIM OCL expressions to paths in PSM OCL expressions and strictly follows the hierarchical structure of the PSM schema. In other words, when a particular instance is targeted by a PIM OCL path then it is also targeted by the corresponding PSM OCL equivalent. However, as we discuss in Section 4.2, there are situations where the direct translation does not work correctly. For example, when there are several different PSM classes with the context class $C$ as their interpretation, the direct translation algorithm does not know which one to choose for the translation.

Let us now discuss the translation of each single path $p$ in $O$ s.t. $p$ has steps $s_0, \ldots, s_m$. The pseudo-code of the translation algorithm is depicted in Algorithm 1. The translation starts with step $s_0$, which remains unchanged. It establishes the so-called translation context for the next step $s_1$. The translation context is always a class in $S'$. When

$s_0 = self$, the translation context is the context class $C'$ of $O'$. In other cases, the translation context is the class targeted by the translation of superior-path($p$). The targeted class is located during the translation of superior-path($p$). The exact mechanism of how the targeted class is located during the translation is described in the following paragraphs.

The translation algorithm then proceeds consecutively through steps $s_1, \ldots, s_m$. Each step $s_i$ is translated as follows. Let a class $C'_i$ be the translation context of the step $s_i$. We locate the represented PIM class $C_i = I(C'_i)$. The step $s_i$ specifies an attribute of $C_i$ or an association endpoint of an association $R_i$ connected to $C_i$. We will denote the attribute or association endpoint with $X_i$. When $X_i$ is an attribute of $C_i$, we locate an attribute $X'_i \in S'_a$ of $C'_i$ which represents $X_i$. When $X_i$ is an association endpoint of an association $R_i$, we locate an association $X'_i \in S'_r$ which represents $R_i$ and connects $C'_i$ with another class $C'_{i+1}$. When the located $X'_i$ is an attribute, we replace $s_i$ with the name of $X'_i$ and the translation of $p$ is done. When $X'_i$ is an association and $parent(X'_i) = C'_i$, we replace $s_i$ with the name of $X'_i$ or the name of $child(X'_i)$ (when $X'_i$ does not have a name). When $child(X'_i) = C'_i$, we replace $s_i$ with the reserved word parent. When $i = m$, the translation is done and the other class $C'_{i+1}$ is the class targeted by the translation of $p$. When $i < m$, the translation proceeds to $s_{i+1}$ and the other class $C'_{i+1}$ becomes the translation context for the next step $s_{i+1}$.

Let us now demonstrate the direct translation algorithm on the PSM schema depicted in Fig. 2a and recall the OCL expression $O$ over the PIM schema depicted in Fig.
1 which we have already discussed in Section 3: context Organization inv PIM_IC1: self.department.team forall(t t.member size() < 0.1 * self.department.employee size()) First, the algorithm identifies the context class of O.It is the PSM class Organization which represents the PIM class Organization whichisthecontextclass ofo. Then it translates all navigation paths in O, i.e., p 1 :«self.department.team» p 2 :«t.member» p 3 :«self.department.employee» The paths p 1 and p 3 start with the context variable self and are therefore at the top level of the hierarchy of paths in O. The path p 2 starts with the variable t which is declared in the collection function forall which follows p 1.Therefore superior path(p 2 ) = p 1. The algorithm firstly translates paths at the top level of the hierarchy, i.e., p 1 and p 3. For both, the initial translation context is the PSM class Organization. The result is p 1 :«self.dpt.tm» p 3 :«self.dpt.emp» Then, it translates p 2. Its initial translation context is the class targeted during the translation of p 1, i.e., Team.The resulting path is as follows: p 2 :«t.mem» The algorithm translates only the context class and the navigation paths in O. The rest is not translated. The resulting PSM OCL O constraint is: context Organization inv PSM_IC1: self.dpt.tm forall(t t.mem size() < 0.1 self.dpt.emp size()) The direct translation also succeeds in translating the constraint PIM IC2 from Fig. 1: context Organization inv PSM_IC2: self.budget <= self.dpt.budget sum() 4.2 Problems with direct translation and its improvements The results of the direct translation algorithm are correct only in certain cases. Let us now analyze situations when the direct translation does not work correctly. The first problem is that a PIM class C S c may have more different representations C 1,...,C k S c. 
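To make the role of the context class concrete, the direct translation of the example above can be sketched as follows. This is only a minimal illustration, not Algorithm 1 itself: the table REPRESENTS is a hypothetical stand-in for the interpretation I restricted to the part of Fig. 2a used by PIM_IC1 (the target classes Employee' and Member' are assumptions), and the function names are ours. Note that the sketch receives the PSM context class as an argument, which is exactly the choice that becomes problematic when a PIM class has several representations.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    """Translation of one PIM step in a given PSM translation context."""
    psm_name: str            # name used in the PSM path ("parent" when moving up)
    target: Optional[str]    # PSM class reached, None when the step is an attribute


# Assumed fragment of the PSM schema of Fig. 2a:
# (PSM translation context, PIM step) -> translated step.
REPRESENTS = {
    ("Organization'", "department"): Step("dpt", "Department'"),
    ("Department'", "team"): Step("tm", "Team'"),
    ("Department'", "employee"): Step("emp", "Employee'"),
    ("Team'", "member"): Step("mem", "Member'"),
}


def translate_path(path, context):
    """Translate one navigation path; return (PSM path, targeted PSM class)."""
    head, *steps = path.split(".")
    out = [head]                         # s_0 is a variable and stays unchanged
    for name in steps:
        step = REPRESENTS.get((context, name))
        if step is None:
            raise LookupError(f"cannot navigate from {context} via {name}")
        out.append(step.psm_name)
        if step.target is None:          # attribute: the translation of p is done
            break
        context = step.target            # context for the next step
    return ".".join(out), context


def translate_all(paths, context_class):
    """paths: list of (path, superior path or None), superior paths first."""
    targets, result = {}, {}
    for path, superior in paths:
        ctx = context_class if superior is None else targets[superior]
        result[path], targets[path] = translate_path(path, ctx)
    return result


translated = translate_all(
    [("self.department.team", None),
     ("self.department.employee", None),
     ("t.member", "self.department.team")],
    "Organization'",
)
```

The paths are processed in the order of the hierarchy, so the class targeted by a superior path is always known before its subordinate paths are translated.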
Therefore, when C is the context class of O, it is hard to decide which of the representations should be the context class of O'. The ideal option is to choose a representation C' whose attributes and associations represent all attributes and associations of C restricted by O. However, such a representation may not always exist. Let us demonstrate the problem on the PSM schema depicted in Fig. 2a and the following OCL expression O over the PIM schema depicted in Fig. 1 (it specifies that an employee cannot be on an internship in the department which is his or her employer):

    context Employee inv PIM_IC3:
      self.internship <> self.employer

The context class of O is Employee. O restricts two associations of Employee: {Employee, employer} and {intern, internship}. There are three different representations of class Employee in the PSM schema: Member', Employee', and Intern'. We have to decide which of them will be the context class of the resulting O'. It is not meaningful to choose class Member' as the context class. Its associations do not represent anything from the two associations constrained by O. On the other hand, the parent association of Employee' represents association {Employee, employer}. The parent association of Intern' represents {Employee, internship}. At the same time, however, the former is missing an association which would represent {Employee, internship} and the latter is missing an association representing {Employee, employer}.
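The analysis of the three representations can be summarized by a small sketch. The dictionary REPRESENTED is a hypothetical encoding of what the text states about Fig. 2a (which constrained PIM associations are represented by the associations of each PSM class); the function name and the three labels are ours and are not part of the approach.

```python
# Assumed encoding of the relevant part of Fig. 2a: each PSM representation
# of the PIM class Employee is mapped to the PIM associations (named here by
# their association ends) that its own associations represent.
REPRESENTED = {
    "Member'": set(),
    "Employee'": {"employer"},
    "Intern'": {"internship"},
}


def classify(constrained, represented=REPRESENTED):
    """Sort representations by how much of the constrained part they cover."""
    report = {"complete": [], "partial": [], "unsuitable": []}
    for psm_class, assocs in represented.items():
        covered = constrained & assocs
        if covered == constrained:
            report["complete"].append(psm_class)    # the ideal option
        elif covered:
            report["partial"].append(psm_class)     # covers only some of them
        else:
            report["unsuitable"].append(psm_class)  # represents nothing needed
    return report


# PIM_IC3 constrains the associations reached via employer and internship.
report = classify({"employer", "internship"})
```

For PIM_IC3 no representation is complete and two are partial, so no automatic choice between them is justified by coverage alone.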

As the example shows, automatic selection of the context class of O' cannot be achieved, because there may be several equally suitable candidates. As a first heuristic, we may restrict the candidates only to those which are meaningful. A candidate C' is meaningful only when its attributes and associations represent some of the attributes and associations of C restricted by O. Formally, there must exist a navigation path p = s_0...s_n in O s.t. s_0 = self and one of the following conditions is satisfied:

If s_1 leads to an attribute A, then there is an attribute A' ∈ attributes(C') s.t. I(A') = A.
If s_1 leads to an association end E_1 of an association R = {E_1, E_2}, then there is an association R' ∈ associations(C') s.t. I(R') = (E_1, E_2) or I(R') = (E_2, E_1).

(Let us note that a step of a navigation path always leads to an attribute or association end because of the restrictions to the syntax of navigation paths given in Section 3.)

It is clear that there may exist several different meaningful candidates C'_cand1,...,C'_candm. However, we are not able to choose the candidate automatically (in the presented case, neither candidate is better). We therefore need a user who decides which candidate will be chosen. In our sample, we have two candidates: Employee' and Intern'. The user may choose either of them as the context of the resulting translation.

The chosen candidate C' may be insufficient because its attributes and associations may not represent all attributes and associations of C restricted by O. In other words, there may be a navigation path p = s_0.s_1...s_n in O s.t. it is not possible to translate the step s_1 with Algorithm 1. This is because it is not possible to navigate from the chosen translation context C' according to s_1. In that case, it is necessary to change the current translation context to another PSM class D' s.t. I(D') = C.
The attributes and associations of D' must represent some of the attributes and associations of C restricted by O. A similar situation may occur later during the translation of a step s_i of p because it is not possible to navigate from the translation context C'_i of s_i as specified by s_i. Again, we need to change the translation context to a class D'_i similarly. In both scenarios, the translation context is changed by applying the collection function toD'_i. This means appending .toD'_i() to the end of the already translated part of O before the translation of s_i itself. Similarly to the problem of choosing the correct context class, it may be a problem to choose the correct D'_i. Again, when there are several possibilities, we need a user who makes the selection.

As a demonstration, suppose the PSM schema depicted in Fig. 3a. In this schema, interpretation I(Department') = I(Host') = Department. We will attempt to translate the following OCL expression O over the PIM schema depicted in Fig. 1 (it specifies that project teams may be hosted only by departments in a sponsoring organization):

    context Project inv PIM_IC4:
      self.team->forall(t | t.host.organization = self.sponsor)

In this case, the direct translation does not work. It automatically selects class Project' as the context class of O'. (There is no other class in the PSM schema representing Project.) O contains the following navigation paths:

    p_1: «self.team»
    p_2: «t.host.organization»
    p_3: «self.sponsor»

The path p_1 is translated to a path «self.team». Its translation targets the class Team'. Because p_1 is the superior path to p_2, the translation of p_2 starts with the translation context set to Team'. The translation of the first step of p_2, host, moves from Team' to Host'. From here it needs to move to Department'. However, there is no association in the PSM schema connected to Host' representing the PIM association connecting the PIM classes Department and Organization.
Therefore, we need to change the translation context to class Department' whose parent association represents the PIM association. For this we call the collection function toDepartment. The last step of p_2 is then translated to a step navigating from Department' to its parent class Organization'. The path p_3 is translated directly and the resulting translation of O is

    context Project inv PSM_IC4:
      self.team->forall(t | t.host.toDepartment().parent = self.parent)

Changing the translation context is also necessary when O compares instances of the same PIM class which are targeted by two different navigation paths in O. Suppose the following OCL expression O over the PIM schema depicted in Fig. 1:

    context Department inv PIM_IC5:
      self.employee.team.host->forall(h | h = self)

It specifies that teams of employees of a given department can be hosted only by that department. It contains two navigation paths:

    p_1: «self.employee.team.host»
    p_2: «self»

Both target class Department in the PIM schema but through two different navigation paths p_1 and p_2. Suppose the PSM schema depicted in Fig. 3b. In this schema, interpretation I(Department') = I(Employer') = I(Internship') = Department. The direct algorithm translates O to the following PSM OCL expression O':

    context Employer inv:
      self.parent.parent.host = self

During the translation, the user had to decide that the context class of O' will be Employer' because there were two possible context classes (the other is Department'). The resulting OCL expression is not correct because it compares instances of the target classes of two different navigation paths in the PSM schema: Department' and Employer'. Instances of two different classes are not comparable. Therefore, one of the navigation paths needs to be supplemented with the to function. With this function, the resulting OCL expression compares instances of the same class:

    context Employer inv:
      self.parent.parent.host.toEmployer() = self

(Let us note that this OCL expression is still not correct because there may be several different instances of class Employer' which represent the same instance of the PIM class Department. Therefore, it compares a collection of instances with a single instance, which is not correct. We will discuss this in the rest of this section.)

The second problem is the hierarchical nature of S', which may lead to redundant occurrences of instances of classes of S in XML documents. Suppose a class C' in S' which represents a class C in S. As we explained in Section 2, C' specifies how instances of C are represented in XML documents. In other words, an instance of C' models an occurrence of the instance of C in XML documents. Let R' be the parent association of C' which represents an association R = (E_C, E_D) ∈ S_r, participant(E_C) = C, participant(E_D) = D (association R connects classes C and D in S). Let the cardinality of E_D in R be m..n s.t. n > 1. In that case, there may be several different instances of C' in the same XML document modeling an occurrence of the same instance of C. We say that C' leads to redundant occurrences of C.
The problem with redundant occurrences is that a PIM navigation expression cannot be directly translated to a corresponding PSM navigation expression where the direction of navigation is upwards. The association between Team and Employee in the schema in Fig. 3b is an example of this issue. When t is an instance of Team', expression t.member returns all members of the team. However, when m is an instance of Employee', expression m.parent does not return all teams having m as a member. It returns only one team (because in XML, a node has only one parent).

In the general case, C' leading to redundant occurrences of C complicates the translation of O if O navigates via R from C to the other end D. For a given instance of C, O retrieves a collection of all associated instances of D. On the other hand, O' navigates from an instance of C', i.e., from one occurrence of the instance of C, in the upwards direction via R'. Because of the hierarchical structure, an instance of C' has only one parent instance. Therefore, O' navigates only to an occurrence of one associated instance of D instead of occurrences of all associated instances of D, and the O' directly translated from O does not work correctly.

For demonstration, suppose the PSM schema depicted in Fig. 3b and the following PIM OCL expression O over the PIM schema depicted in Fig. 1 (it specifies that an employee must be a member of fewer than 5 teams):

    context Employee inv PIM_IC6:
      self.team->size() < 5

This simple OCL expression is translated by the direct algorithm to the following PSM OCL expression O':

    context Employee inv:
      self.parent->size() < 5

The problem is that while O works with an instance of Employee, O' works with an occurrence of that instance. Because of the cardinality of Employee in the association, there may be several different occurrences of the instance. Therefore, O' does not work correctly: it cannot count the number of teams an employee is a member of.
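The effect can be reproduced on a small document. The following sketch uses a hypothetical XML instance shaped like the Teams/Team/Employee part of Fig. 3b; the element names, the attribute empno used as a key, and the two helper functions are assumptions made only for this illustration.

```python
import xml.etree.ElementTree as ET

# Assumed sample document: employee 1 is a member of two teams and therefore
# occurs twice (two redundant occurrences of the same PIM instance).
DOC = ET.fromstring("""
<teams>
  <team name="A"><employee empno="1"/><employee empno="2"/></team>
  <team name="B"><employee empno="1"/></team>
</teams>
""")

# ElementTree keeps no parent pointers, so build a child -> parent map once.
PARENT = {child: parent for parent in DOC.iter() for child in parent}


def teams_via_parent(occurrence):
    """Upward navigation from one occurrence: always exactly one team."""
    return [PARENT[occurrence].get("name")]


def teams_via_key(empno):
    """Top-down navigation that identifies the employee by its key."""
    return [team.get("name") for team in DOC.findall("team")
            if any(e.get("empno") == empno for e in team.findall("employee"))]


first_occurrence = DOC.find("team/employee[@empno='1']")
```

Starting from one occurrence, the parent axis yields a single team, while the employee identified by the key value 1 belongs to two teams. A size() test evaluated on the parent of an occurrence can therefore never see more than one team.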
Another complication caused by redundancies may appear when O compares instances of the same PIM class which are targeted by two different navigation paths in O. We have already discussed this kind of OCL expression and we have used the following sample expression:

    context Department inv PIM_IC5:
      self.employee.team.host->forall(h | h = self)

We have shown that the direct algorithm extended with the to function results in the following OCL expression O' (the user chooses Employer' as the context class):

    context Employer inv:
      self.parent.parent.host.toEmployer() = self

The problem is that Employer' leads to redundant occurrences of Department. Therefore, function toEmployer leads to a collection of several different instances of Employer' which represent the same instance of Department. However, O' compares this collection with a single instance of Employer' assigned to the self variable.

On the basis of the previous discussion, we can see that the navigation principles of PIM OCL expressions are different from those of PSM OCL expressions. As a consequence, some directly translated PSM OCL expressions are not correct. A simple direct solution to this problem is to forbid redundancies. More precisely, a navigation path cannot navigate in a PSM schema in the upwards direction via an

association where the maximum cardinality of the child in the association is greater than 1. In that case, the direct translation is not performed and the user is notified.

A more advanced but more complex solution is to extend the direct translation algorithm so that it translates not only the context class and navigation paths in PIM OCL expressions to their PSM equivalents but also works with other OCL constructs. However, this is out of the scope of this paper and we leave it as future work. Let us only demonstrate this extension on a few examples.

First, let us again consider the last PIM OCL expression and its translation. The solution to the problem of comparing the collection of instances of Employer' with a single instance is to change the comparison operator = to the collection function forall. The resulting PSM OCL follows.

    context Employer inv:
      self.parent.parent.host.toEmployer()->forall(h | h = self)

Second, let us again consider the PIM OCL expression which restricts the number of teams an employee can participate in:

    context Employee inv PIM_IC6:
      self.team->size() < 5

As we have already discussed, the direct translation algorithm could not translate this expression to a PSM OCL expression over the PSM schema depicted in Fig. 3b. The solution is to completely change the logic of the expression at the PSM level so that it moves only in the top-down direction. Then the problems with redundancies do not occur. We need to change the context to a class whose instance contains all instances of Team'. This class is Teams'. From here, we need to get to all contained team members. For each member, we need to check whether the number of teams which contain the member is lower than 5. Because we are at the PSM level, each member can be represented by several different instances of Employee'.
Therefore, we will not compare instances of Employee', but we utilize the key for Employee (empno) instead:

    context Teams inv PSM_IC5:
      self.team.member->forall(m |
        self.team->select(t | t.member.empno->includes(m.empno))->size() < 5)

Let us finally discuss the time complexity of the translation algorithm. The time complexity of the direct translation algorithm is linear with respect to the total length of navigation paths in a given PIM OCL expression. However, as we have already discussed, the direct translation does not work in many practical cases because classes in the PIM schema may have several different representations in the PSM schema. Then we have to apply rules which choose the correct representation for a given PIM class. Because of these rules the time complexity becomes O(M · N) in the worst case, where M and N are the numbers of classes in the PIM schema and PSM schema, respectively (for each PIM class C, we have to check all PSM classes which represent C). However, the worst case will hardly be reached in practice. Usually, a PIM class has only one or a few (i.e., << N) representations in the PSM schema. Moreover, the number of constructs in real-world schemas is not high and, therefore, the time complexity of the translation is not a limiting factor.

Fig. 4 Simple Schematron rule

5 Validation of integrity constraints in XML documents

In this section, we present an algorithm for automatic translation of PSM OCL scripts into Schematron schemas, which can be used to validate integrity constraints in XML documents. A Schematron schema complements an XSD/RNG/DTD schema which prescribes the overall structure of XML documents.³ Schematron is a straightforward rule-based language.
It consists of rule declarations, where every portion of a document matching a rule is tested for the assertions defined in that rule. Match patterns follow the same syntax as in XSLT (W3C 2012c) templates. Assertions are expressed as XPath tests; an assertion is violated when the effective boolean value of the expression is false. The example rule in Fig. 4 tests whether every element person has a subelement name.

XSLT patterns for rule contexts and XPath expressions for assertion tests were chosen because these technologies are well established in the XML ecosystem. This choice also facilitates Schematron validation using an XSLT processor: an XSLT pipeline can be used to translate a Schematron schema S into a validation XSLT transformation T_S. T_S is executed upon a validated XML document and outputs structured information⁴ comprising successfully checked constraints, violated constraints, and the locations of the errors.⁵ The power of Schematron is thus determined by the power of XPath. As we will show later in this section, for some classes of OCL expressions, a corresponding construct in XPath does not exist.

³ As a matter of fact, the recommendation of XML Schema 1.1 (W3C 2012a) allows some of the Schematron constructs to be included directly in the XSD.
⁴ Results produced by T_v are formatted using SVRL, the Schematron Validation Report Language, which is part of the Schematron specification (ISO/IEC 2006).
⁵ offers an implementation of XSLT pipeline to generate T_S for public use.

For such cases, we created a library

of XSLT functions, called OclX. Functions from OclX can be used in XPath expressions to provide sufficient expressive power. Since OclX is implemented using pure XSLT, our approach does not require modification of Schematron validators: if the validator uses XSLT internally, its logic can be preserved provided that T_v imports the OclX library (details of the validation workflow are described later in Section 7).

To start off, we show how the principal OCL constructs can be translated to a Schematron schema. This step creates a skeleton of the schema. It is apparent that rule contexts and asserts in Schematron play the same role as contexts and invariants in OCL. Thus, by creating a rule for each OCL context declaration and adding an assert in the rule for each invariant, we can create a schema verifying the validity of PSM ICs. Table 2 outlines the rules for translation, Fig. 5 shows a concrete example. The crucial step of the algorithm (emphasized in the table), which will be explained in the rest of this section, is how to translate a PSM OCL invariant body expression O' to an XPath expression X_O'. To achieve the desired property that a Schematron assert really verifies the validity of the corresponding OCL expression, the translation must follow the following principle:

Principle 1 (consistency) Let X_O' be the XPath expression obtained by translating a PSM OCL invariant O'. Then, the effective boolean value of X_O' is true iff invariant O' holds.

We will construct the expressions so that their value is always an atomic value of XSD type xs:boolean. In that case the effective boolean value equals the value of the expression.

We will now look at the different kinds of OCL expressions depicted in Fig. 6. We will elaborate how they can be expressed equivalently in XPath. From now on, we will apply some restrictions on the class of considered OCL expressions.
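The skeleton construction described above (one rule per OCL context declaration, one assert per invariant) can be sketched with the Python standard library. The sketch is only an illustration under stated assumptions: the input format (a list of dictionaries), the function name skeleton, and the sample XPath test are ours, and the invariant bodies are assumed to be already translated to XPath. The example context follows the rule of Fig. 4, which requires every person to have a name.

```python
import xml.etree.ElementTree as ET

SCH = "http://purl.oclc.org/dsdl/schematron"   # ISO Schematron namespace
ET.register_namespace("sch", SCH)


def skeleton(contexts):
    """Build a Schematron schema skeleton.

    contexts: list of dicts with keys 'xpath' (rule context), 'var' (name of
    the OCL context variable) and 'invariants', a list of pairs
    (translated XPath test, error message).
    """
    schema = ET.Element(f"{{{SCH}}}schema")
    for ctx in contexts:
        pattern = ET.SubElement(schema, f"{{{SCH}}}pattern")
        rule = ET.SubElement(pattern, f"{{{SCH}}}rule", context=ctx["xpath"])
        # The context variable is bound to the node matched by the rule.
        ET.SubElement(rule, f"{{{SCH}}}let", name=ctx["var"], value=".")
        for test, message in ctx["invariants"]:
            ET.SubElement(rule, f"{{{SCH}}}assert", test=test).text = message
    return schema


schema = skeleton([{
    "xpath": "person",
    "var": "self",
    "invariants": [("exists($self/name)", "Every person must have a name.")],
}])
xml_text = ET.tostring(schema, encoding="unicode")
```

The hard part of the translation is not this skeleton but the content of the test attribute, i.e., the XPath expression obtained from the invariant body.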
We will omit StateExp and MessageExp, since the notions of state and message (signal) have no counterparts in our domain (XML data). Due to space restrictions, we will also omit TypeExp, which deals with casting, and we will treat all collections as sequences. Due to the architecture of the XPath data model, we will also not allow nested sequences in expressions. We will get back to these restrictions in Section 5.6 and to the problem of nested sequences and different types of collections in Section 9.

Table 2 Translation of principal OCL constructs

    OCL construct                      Schematron construct
    OCL script                         Schematron schema
    Constraint block                   Pattern with a rule
    Context classifier                 Pattern id
    Context variable                   let instruction for a variable
    Invariant                          Assert
    Invariant body                     Expression in assert test
    Error message                      Failed assert text
    Subexpression in error message     value-of instruction in assert

These conditions leave us with LiteralExp, IfExp, VariableExp, LetExp, two kinds of LoopExp (IteratorExp and IterateExp) and FeatureCallExp (which encompasses operations (and operators) and references to attributes and associations defined in the UML model). We also have to define consistent handling of variables. In the rest of this paper, a translation of an OCL expression O into XPath will be denoted by X_O.

5.1 Variables, literals, let and if expressions

Variables  There are three ways a variable can be defined in an OCL expression. Each invariant has a context variable, which holds the validated object. It can be named explicitly (such as t in Fig. 5) or, when no name is given, the name of the context variable is self. Iterator expressions (described later in this section) declare iteration variables (such as t in the expression «self.team->forall(t | ...)» in PSM IC4 in the previous section). Let expressions (described later in this section as well) define a local variable.
We will construct the expression in such a way that the following principle holds:

Principle 2 Every OCL variable used in O' corresponds to an XPath variable of the same name in X_O'.

References to OCL variables (VariableExp) are translated as references to XPath variables. The OCL context variable (with the default name self or named explicitly) is common to all invariants declared for the context. Therefore, to declare the corresponding variable, we can utilize the Schematron sch:let instruction in each rule (line <sch:let name="t" value="."/> in the example). Declarations of XPath variables for the other OCL variables (declared as a part of LetExp or LoopExp) will be created in accordance with Principle 2, as we will demonstrate later in this section.

LetExp defines a variable and initializes it with a value. The variable can be referenced via VariableExp in the subexpression of the given LetExp. XPath 3.0 added a corresponding construct, the let/return expression. Thus, the following principle is in accord with Principle 2.

Principle 3 Let O' be a LetExp expression «let x : Type = initexp in subexp», where initexp and subexp are both OCL expressions and the latter is allowed to reference variable x. Then O' is translated to an XPath expression X_O': let $x := X_initexp return X_subexp, where

X_initexp and X_subexp are translations of initexp and subexp, respectively.

Fig. 5 Example of translation of principal OCL constructs

LiteralExp  OCL allows literals for the predefined types, collection literals (e.g., «Sequence{1, 2, 3}»), tuple literals and the special literals «null» (representing a missing value) and «invalid» (representing an erroneous expression).

Principle 4 OCL literals are translated according to the following table.

    OCL                                                  XPath
    predefined type literal (literals for Real,          corresponding XSD primitive type literal
      e.g. «1.23», String, e.g. «'hello'», etc.)           (1.23, 'hello', etc.)
    sequence literal, e.g. «Sequence{1, 2, 3}»           XPath sequence literal, e.g. (1, 2, 3)
    tuple literal, e.g.                                  XPath map literal (more about tuples in
      «Tuple{name = 'John', age = 10}»                     Section 5.4)
    «null» literal                                       empty sequence literal ()
    «invalid» literal                                    call of OclX function invalid() (more about
                                                           error handling in Section 5.5)

IfExp  The conditional expression in OCL has the same semantics as in XPath, so it can be translated directly.

Principle 5 Let O' be an IfExp expression «if cond then thenexp else elseexp». Then O' is translated to an XPath expression X_O': if (X_cond) then X_thenexp else X_elseexp.

5.2 Translating feature calls

In the following subsections, we will show how we translate FeatureCallExp. There are two types of features in UML, properties and operations, which can be referenced via the respective FeatureCallExps, as depicted in a separate diagram in Fig. 7. We will elaborate on both types, PropertyCallExp and OperationCallExp, separately.

PropertyCallExp  Examples of navigation expressions via PropertyCallExp are «self.budget» (in PSM IC2), «e.internship» (in PSM IC3) or «self.team.member» (in PSM IC6). The first one navigates to an attribute budget of class Organization, the second one navigates to an association end internship which is a part of the association between classes Employee and Department. Every FeatureCallExp has a source (inherited from CallExp, see Fig.
6). The source in the first example is a VariableExp «self»; in the second example, the source is a VariableExp «e». The third example is a chain of two PropertyCallExps. The source in the third example is a PropertyCallExp «self.team» whose source is a VariableExp «self». The whole navigation starts in class Teams, goes through Team and ends in Member.

Navigation to properties can be translated by appending an XPath step which uses either the child or the attribute axis. Translation of navigation to association ends depends on the direction of the association and on whether the association has a name (an association without a name means that the subtree under the association is not enclosed by a wrapping tag, thus no XPath navigation is added). The following principle thus follows the rules from Table 1.

Principle 6 PropertyCallExp is translated by appending an XPath step to the translation of the source expression. Let O' be a PropertyCallExp expression «source.p» and X_source be a translation of the subexpression source. If p navigates to an attribute A' ∈ S'_a and n = name(A'), then O' is translated to X_O' as follows:

\[
X_{O'} =
\begin{cases}
X_{source}/\mathrm{child}{::}n & \text{if } xform(A') = e \\
X_{source}/\mathrm{attribute}{::}n & \text{if } xform(A') = a
\end{cases}
\]
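Principles 2 to 5 and the attribute case of Principle 6 can be illustrated by a small recursive translator. This is a sketch under assumptions, not the implementation used in the paper: OCL expressions are encoded as nested tuples of our own design, only numeric and string literals are handled, and the dictionary xform is a hypothetical stand-in for the xform function of the PSM schema (attributes not listed are assumed to be represented as elements).

```python
def to_xpath(expr, xform=None):
    """Translate a tiny OCL expression tree into an XPath 3.0 string."""
    xform = xform or {}
    kind = expr[0]
    if kind == "var":                      # Principle 2: same variable name
        return f"${expr[1]}"
    if kind == "lit":                      # Principle 4: numbers and strings only
        value = expr[1]
        if isinstance(value, str):
            return "'" + value.replace("'", "''") + "'"
        return str(value)
    if kind == "null":                     # Principle 4: empty sequence
        return "()"
    if kind == "let":                      # Principle 3
        _, name, init, body = expr
        return (f"let ${name} := {to_xpath(init, xform)} "
                f"return {to_xpath(body, xform)}")
    if kind == "if":                       # Principle 5
        _, cond, then, other = expr
        return (f"if ({to_xpath(cond, xform)}) then {to_xpath(then, xform)} "
                f"else {to_xpath(other, xform)}")
    if kind == "attr":                     # Principle 6, attribute case
        _, source, name = expr
        axis = "attribute" if xform.get(name, "e") == "a" else "child"
        return f"{to_xpath(source, xform)}/{axis}::{name}"
    raise ValueError(f"unsupported expression kind: {kind}")


# «self.budget» with budget represented as an element and as an XML attribute.
as_element = to_xpath(("attr", ("var", "self"), "budget"))
as_attribute = to_xpath(("attr", ("var", "self"), "budget"), {"budget": "a"})
```

Each case only wraps the translations of its subexpressions, which is why the translation can be defined compositionally, one principle per expression kind.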

Fig. 6 Different kinds of OCL expressions (source: OCL specification (Object Constraint Language Specification 2012), Chapter 8.3)

If R' = (E'_1, E'_2) is an association and p navigates to one of its ends E' ∈ {E'_1, E'_2}, O' is translated as follows:

\[
X_{O'} =
\begin{cases}
X_{source} & \text{if } name(R') = \lambda \\
X_{source}/\mathrm{child}{::}n & \text{if } E' = E'_2 \wedge name(R') = n \\
X_{source}/\mathrm{parent}{::}\mathrm{node}() & \text{if } E' = E'_1
\end{cases}
\]

OperationCallExp  Application of predefined infix and prefix operators, calls of OCL standard library operations and calls of methods defined by the designer in the UML model all come under OperationCallExp. For a majority of predefined operators (such as «+», «and», etc.), a corresponding XPath operator exists as well. For those where no corresponding operator exists (e.g., «xor»), we provide a corresponding function in the OclX library. (We do not include the exhaustive list in the paper. It can be found in the documentation for OclX.)⁶

⁶ OclX documentation:

Principle 7 Every OperationCallExp O' is translated into a call of the corresponding operation/operator (with the same number of parameters; the translation of the source expression in O' becomes the first argument in XPath in X_O', followed by the translation of the operation arguments in O'). The corresponding operation/operator is either a built-in XPath expression or it is defined in the OclX library.

In Section 3, we introduced a new function D'.toE', which is defined for each pair of PSM classes D' and E' with the same interpretation (C = I(D') = I(E')). This function call can be translated when there is a key defined for C and the key is represented both in D' and E'. The key can be utilized to locate the instances of E' in the schema, as is described by Principle 8.

Principle 8 Let D' and E' be a pair of PSM classes s.t. C = I(D') = I(E') and let K_C = {A_1,...,A_n} ⊆ attributes(C) be a key for C, K_D' = {D'_1,...,D'_n} ⊆ attributes(D') a set of PSM attributes representing attributes from K_C and K_E' = {E'_1,...,E'_n} ⊆ attributes(E') s.t.
I(D'_1) = I(E'_1),..., I(D'_n) = I(E'_n). OperationCallExp O' = S.toE'() (where S is of type «Collection(D')») can be translated as X_O' = (for $p in X_S return X_A[$p/name(D'_1) eq ./name(E'_1) and ... and $p/name(D'_n) eq ./name(E'_n)]), where X_A is an XPath expression returning all instances of E' in the schema.⁷

⁷ The syntax also presumes that all the key PSM attributes are represented as PSM elements (xform = e). For those attributes which are not, the attribute axis is used to access them.

5.3 Translating iterator expressions

Loop expressions, e.g., «self.team->forall(t | t.host.organization = self.sponsor)», are archetypal for OCL: they perform the task of joins, quantification, maps and iterations. They are called using the -> operator. Instead of a list of parameters, the caller specifies the list of local variables and the body subexpression (see Fig. 6). There are several important facts regarding loop expressions:

1. There are two kinds of LoopExp, a general IterateExp and IteratorExp. The general syntax of IterateExp is: «iterate(i : Type; acc : Type = accinit | body)», where i is the iteration variable, acc is the accumulator variable, accinit is the accumulator initialization expression and body is an expression which can refer to variables i and acc. The result is obtained by calling the body expression repeatedly for each member of the collection (which is assigned to i, and acc is assigned the result of the previous iteration). The value of acc after the

169 Inf Syst Front from body expression as well either the context variable (self ) or variables defined by outer LetExp or LoopExp expressions. Variables except iteration variables (and accumulator in iterate) are free in the body expression. Fig. 7 Different kinds of FeatureCallExp expressions last call is the result of the operation. The general syntax of IteratorExp is: «iteratorname(i:type body )», where iteratorname is one of the predefined OCL iterator expressions (such as exists, closure, etc.) or may be defined in a user extension, i is the iteration variable and body is an expression, which can refer to the iteration variable i (and all other variables valid in the place where the iterator expression is used). The semantics of the IteratorExp depends on the concrete iterator. The semantics for the predefined operators is given by the specification. 2. Except closure, all other predefined iterator expressions (and a majority of collection operations) can be defined in terms of the fundamental iterator expression iterate, e.g., «exists(it body)» is defined as «iterate(it; acc : Boolean = false acc or body)». Semantics of user-defined iterator expressions can be defined using iterate as well. 3. Iterator expressions forall and exists (serving as quantifiers) together with boolean operators not and implies make OCL expressions at least as powerful as the first order logic. Operation closure increases the expressive power with the possibility to compute transitive closures. Operation iterate allows to compute primitively recursive functions (for more on the expressive power, see Mandel and Cengarle 1999). 4. Iterator expression collect is often used implicitly, because PropertyCallExp «source.property», where source is a collection (e.g., «team.member») is in fact a syntactic shortcut for «source-> collect (t t.property)» (e.g., «team->collect(t t. member)»). 5. 
Multiple iteration variables, such as in «c->forall (v1,v2 v1 <> v2)», are allowed for some expressions, but that is just a syntactic shortcut for nested calls: «c->forall(v1 c->forall(v2 v1 <> v2))». 6. Collection operations define variables (iterators and accumulator) which are local (they are valid in the subexpression only). Other variables can be referenced For translation to XPath, property 2 implies that it is sufficient to show, how to translate closure and iterate, other operations can be defined using iterate. If we succeed, property 3 ensures that we can check constraints with nontrivial expressive power, incl. transitive closures. Property 5 relieves us from the necessity to deal with expressions with multiple iterators. However, property 6 implies that we have to deal with local variables for iterator expressions. There is no operation similar to iterate in XPath. However, we will show that iterate, and consequently all the other iterator expressions, can be implemented as XSLT higher-order functions. Higher-order functions (HOFs) are a new addition proposed for the drafts of the common XPath/XQuery/XSLT 3.0 data model, which introduces a new kind of item function item. With function items, it is possible to: 1. assign functions to variables, pass them as parameters and return them from functions, 2. function items can be called, 3. declare anonymous functions in expressions. HOF is a function, which expects a function item as a parameter or returns a function item as a result. OCL loop expression can be looked upon as HOF as well they all expect a subexpression (body, seefig.6), which is evaluated (called) repeatedly for each member of a collection. Property 6 mentioned above is important for the semantics body subexpression can have free variables, which are, when evaluated, bound to variables defined in the source of the loop expression. 
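The relationship between iterate and the derived iterator expressions (property 2), and the way a body with free variables behaves as a function item, can be sketched in Python. This is only an illustration of the semantics, not the OclX implementation: the names iterate, exists, for_all, sponsor and teams are hypothetical, and the dictionaries merely stand in for instances of the model.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
A = TypeVar("A")


def iterate(source: Iterable[T], acc_init: A, body: Callable[[T, A], A]) -> A:
    """OCL-style iterate: call body once per member, threading the accumulator."""
    acc = acc_init
    for item in source:
        acc = body(item, acc)  # body receives the member and the current accumulator
    return acc


def exists(source, body):
    # exists(it | body) == iterate(it; acc : Boolean = false | acc or body)
    return iterate(source, False, lambda it, acc: acc or bool(body(it)))


def for_all(source, body):
    # forAll(it | body) == iterate(it; acc : Boolean = true | acc and body)
    return iterate(source, True, lambda it, acc: acc and bool(body(it)))


# Hypothetical data: 'sponsor' plays the role of a free variable (like self.sponsor);
# it is bound in the calling scope, not inside the body.
sponsor = "ACME"
teams = [{"host_organization": "ACME"}, {"host_organization": "ACME"}]
ic3_holds = for_all(teams, lambda t: t["host_organization"] == sponsor)
```

The lambda passed to for_all closes over sponsor in the same way that an anonymous function item closes over the variables of the calling expression.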
For example, in the expression PSM IC3 «self.team->forAll(t | t.host.organization = self.sponsor)», the body expression refers to two variables, self and t. Variable t is the iteration variable; variable self is free in body.

Principle 9 IterateExp defines two variables, an accumulator and an iteration variable. IteratorExp defines one variable, an iteration variable. The translation of both IterateExp and IteratorExp must correspond to Principle 2, i.e., these variables must be available as XPath variables in the translation of the body expression.

Figure 8a shows how iterate is implemented in OclX. It is a higher-order function, expecting a function item of two arguments in its third parameter body. The draft of XSLT 3.0 (W3C 2012c) introduces a new instruction, xsl:iterate, which we can use to our advantage. The

function item is called repeatedly for each member of the collection (line 10), with the two expected arguments: a member of the collection and the current value of the accumulator. When body is defined as an anonymous function item, the free variables it contains are bound to the variables available in the calling expression, which is in accord with the semantics of loop expressions in OCL. The second part of Fig. 8a shows how the HOF exists can be defined in terms of the HOF iterate. The definition utilizes an anonymous function (line 23), which calls the function item passed as argument.

Fig. 8 OclX implementations of LoopExp translations. a Functions iterate and exists. b Function closure

Principle 10 Every IterateExp (call of iterate) is translated as a call of the OclX HOF iterate. Every IteratorExp (call of some iterator expression, such as exists etc.) is translated as a call of an OclX HOF of the same name. OclX contains a HOF definition for each predefined iterator expression. Subexpression body is translated separately and the resulting X_body is passed as an anonymous function item to the HOF call (footnote 8).

To conclude the part about iterator expressions, we address the operation closure. The syntax of closure is the same as for the other iterator expressions, but its semantics is not defined in terms of iterate: whereas the number of iterations needed to compute iterate is fixed, closure computes a transitive closure of the body subexpression (the resulting collection must be in depth-first preorder); thus, it is not known in advance how many calls of body will be required. Again, there is no equivalent construct in standard XPath. The implementation of closure in OclX is depicted in Fig. 8b.

Footnote 8: Some iterator expressions can in some cases be translated using native XPath constructs without the need to call a HOF, e.g., exists can be translated using a some/satisfies expression. Due to space limitations, we do not discuss this sort of query rewriting in this paper. Our experimental implementation (Klímek et al. 2012) allows the user to choose where several translations are possible.

5.4 Tuples

In this subsection, we show how OCL expressions using tuples (anonymous types) can be translated to XPath. OCL allows the modeller to combine values in expressions into tuples. Tuples have a finite number of named parts and are created using TupleLiteralExp, a specialization of LiteralExp. An example of a tuple literal is «Tuple{firstName = 'Jakub', lastName = 'Maly', age = 26}». The values of the parts may be of arbitrary type, including collections and other tuples. The names of tuple parts (firstName, lastName, age in the example) must be unique and are used to access the parts of the tuple in expressions, similarly to attributes of classes (using the dot notation), i.e., it is possible to write, e.g.,

«employees->collect(e | Tuple{name = e.name, salary = e.salary})->select(t | t.salary > 2000)»

The result of this expression would be a collection of tuples. Tuples are also closely related to the operation product, which computes a Cartesian product of two collections:

product(c1 : Collection(T1), c2 : Collection(T2)) = self->iterate(e1; acc = Set{} | c2->iterate(e2; acc2 = acc | acc2->including(Tuple{first = e1, second = e2})))

The result of product is a collection of type «Collection(Tuple(first : T1, second : T2))», which contains all possible pairs where the first component comes from collection c1 and the second from collection c2. This operation thus completes the suite of equivalents of the constructs required for a language to be relationally complete (see Codd (1972)):

1. Select can be expressed using the select iterator expression,
2. Project can be expressed using a collect iterator expression that creates a tuple with the projected attributes (as in the employees example above, which, in fact, performs a projection to attributes name and salary),
3. Union: OCL has a union operation as well,
4. Set difference: OCL has the operation - working on sets,
5. Cartesian product can be expressed using product,
6. Rename can be expressed using collect in the same manner as the project operation.

Thus, tuples are not only a means of writing more concise expressions. Together with the operation product, they increase the expressive power of the language to relational completeness (see Mandel and Cengarle (1999) for more on the expressive power of OCL).

We represent tuples in XPath using map items. A map item is an additional kind of XPath item which was added in the Working Draft of XSLT 3.0 (W3C 2012c) (and is already implemented in Saxonica (2012)). Map items use atomic values for keys and allow items of any type as values. These properties make map items good candidates for representing tuples. Strings containing the name of a tuple part can be used as keys (the names of parts must be distinct in an OCL tuple as well). The tuple from the example would be represented as map{'firstName' := 'Jakub', 'lastName' := 'Maly', 'age' := 26}, and the expression «t.firstName» would be represented as $t('firstName'). A value in a map can also be another map or a sequence, which is consistent with the semantics of OCL tuples.
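The tuple-as-map idea, together with the select, project and product operations listed above, can be mimicked with Python dictionaries keyed by part name. This is a rough analogy only: the employee records and the helper name product are invented for the illustration and do not come from the paper's schemas.

```python
# Hypothetical sample data; each dict stands in for a model instance.
employees = [
    {"name": "Alice", "salary": 2500, "dept": "R&D"},
    {"name": "Bob", "salary": 1800, "dept": "Sales"},
]

# collect(e | Tuple{name = e.name, salary = e.salary}): projection to two parts.
projected = [{"name": e["name"], "salary": e["salary"]} for e in employees]

# select(t | t.salary > 2000): selection over the projected tuples.
selected = [t for t in projected if t["salary"] > 2000]


def product(c1, c2):
    """Cartesian product as a list of two-part tuples keyed 'first' and 'second'."""
    return [{"first": e1, "second": e2} for e1 in c1 for e2 in c2]


pairs = product([1, 2], ["a", "b"])
```

Accessing a part, such as selected[0]["salary"], corresponds to indexing the map item with the part name.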
Operation product can be defined either by translating the definition from the specification (using two nested iterates) or via a much more succinct XPath expression:

for $e1 in $collection1 return for $e2 in $collection2 return map{'first' := $e1, 'second' := $e2}

Principle 11 summarizes the translation of tuples.

Principle 11 A tuple literal is translated into an XPath map item literal. Every tuple part is translated as a key/value pair in the item literal; the type of the key is string and the value of the key equals the name of the tuple part. Access to tuple parts is translated as indexing the tuple with a string corresponding to the accessed part.

None of the examples in the previous section uses LetExp or tuples. We demonstrate their usage on another example here. The expression PSM IC7 in Fig. 9 verifies that an employee has at most two concurrent internships in different departments. The expression first computes an auxiliary variable internships, which contains a tuple for each employee in the organization. The tuple has two parts: employee and departments (the set of departments where the employee works as an intern). The type of internships is thus «Set(Tuple(employee : Employee, departments : Set(Department)))», which we abbreviate to InternshipsSet in the expression. The full translation of constraint PSM IC7 is depicted in Fig. 11 later in this section, together with the translations of the other constraints from the previous section.

5.5 Error recovery

OCL as a language has a direct approach to run-time errors or exceptions. Errors in computation cause the result of the expression to be invalid, a special value that is the sole instance of type OclInvalid. It conforms to all other types (i.e., it can be assigned to any variable and can be the result of any expression), and any further computation with invalid results in invalid, except for the operation oclIsInvalid (footnote 9). It returns true when the computation results in invalid and false otherwise.
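The behaviour of oclIsInvalid can be approximated in Python by modelling invalid as a raised exception. This is a sketch under that assumption, not the OclX code; the name ocl_is_invalid is hypothetical. The expression under test has to be passed unevaluated, here as a zero-argument function, because otherwise the error would escape before the check could observe it.

```python
def ocl_is_invalid(thunk) -> bool:
    """Return True if evaluating the delayed expression fails, False otherwise.

    Assumption: an OCL 'invalid' result is modelled as any Python exception.
    """
    try:
        thunk()  # evaluate the delayed expression
    except Exception:
        return True
    return False


division_fails = ocl_is_invalid(lambda: 1 / 0)  # the division raises ZeroDivisionError
addition_fails = ocl_is_invalid(lambda: 1 + 1)  # the addition evaluates normally
```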
Operation oclIsInvalid thus provides a very coarse-grained error-checking mechanism in OCL. Unlike an OCL computation, an XPath/XSLT 2.0 processor halts when it encounters a dynamic error, and there is no equivalent of oclIsInvalid. It is also not possible to instruct it to jump to the validation of the next IC when the computation of one expression fails. XSLT 3.0, however, introduces new instructions xsl:try and xsl:catch, which provide means of recovery from dynamic errors. With these instructions, it is possible to implement oclIsInvalid as depicted in Listing 10. We again utilize the capabilities of higher-order functions: the expression is evaluated in a function call wrapped in try/catch. The OCL expression «oclIsInvalid(1/0)» can be translated to oclx:oclIsInvalid(function() { 1 div 0 }). Optionally, our validation pipeline (fully introduced in Section 7) allows safeguarding the evaluation of each expression using try/catch, so that the validation of another constraint may continue if a run-time error occurs and is not contained by oclIsInvalid. In debug mode, detailed information is given using xsl:message (Fig. 10).

Footnote 9: To be accurate, another operation, oclIsUndefined, behaves the same as oclIsInvalid when the argument is invalid, but it also returns true when the result of the computation is null.

Fig. 9 LetExp and tuples example

Principle 12 Calls of functions oclIsInvalid and oclIsUndefined are translated into calls of the corresponding OclX HOFs, implemented using try/catch instructions. Usages of the invalid literal are translated into calls of invalid().

To conclude this section, we show the translation of the integrity constraints from the previous sections into Schematron schemas (one schema for each PSM schema from Figs. 2a and 3a, b). The output of the automatic translation is depicted in Fig. 11.

5.6 Discussion of completeness and correctness of the translation

In this part we discuss the limits of the approach and the correctness of the translation of PSM OCL expressions to XPath expressions. Firstly, the translation algorithm described in this section does not support some expressions allowed by the specification. Here, we identify the supported subset. The diagram in Fig. 6 depicts the kinds of expressions the OCL specification supports. From these, we excluded MessageExp and StateExp, because we deal with static models, and OCL messages and states do not have any counterparts in expressions over XML documents. We also did not elaborate on TypeExp, which can be used both in type introspection and type casting. Our implementation does allow the use of TypeExp for casting, but we only support such expressions where the type can be known statically. The standard XPath/XQuery data model (W3C 2012b) does not contain constructs for type introspection, but this area is a subject of recent research (Holstege 2012). If it is standardised, our algorithm may be extended in this direction in our future work.

Fig. 10 Implementation of oclIsInvalid using xsl:try/xsl:catch

We also do not support nested collections. The problem of nested collections is that they do not have a straightforward representation in XPath.
The XPath data model knows only flat sequences, and this is unlikely to change in the future. It would be possible to work around this limitation by representing nested sequences using nodes, e.g., ((1,2),(3),()) would be represented as:

<Sequence>
  <Sequence>1 2</Sequence>
  <Sequence>3</Sequence>
  <Sequence />
</Sequence>

but this approach bears the same problems that were enumerated in Section 5.4 (i.e., it works only for atomic types; with nodes, node identity and navigating to parent nodes become an issue). Another solution would be to use map items and represent the sequence above as a map literal:

map{'s' := (map{'s' := (1,2)}, map{'s' := (3)}, map{'s' := ()})}

The expression returning the number two would be written as:

(map{'s' := (map{'s' := (1,2)}, map{'s' := (3)}, map{'s' := ()})})('s')[1]('s')[2]

This approach does not have the problems of the one with nodes (because maps have no identity and nodes in the map still are children of their original parent), but the double indexing would prohibit us from using standard XPath operations in many cases (because those flatten nested sequences), and the result would be very far from a genuine XPath expression. Our goal was to create a translation that is readable and modifiable even in the target language.

Apart from sequences, OCL uses other collection types (sets, bags and ordered sets). Sequences can be mapped naturally to XPath; the other types could be represented using either sequences or maps. We leave this for our future work (see Section 9).

To verify that the translation algorithm is correct, we can examine each type of expression and its translation and determine that the translation preserves the semantic meaning in the target language (XPath). Intuitively, we proceed

with the translation so that we preserve the structure of the whole expression. The translation of each expression is created from the translations of its subexpressions, so the expression trees of the source and translated expressions are isomorphic. Subexpressions may contain free variables, which are defined higher in the expression tree. There are two types of expressions which allow this: let expressions and iterator expressions. Their translation using the XPath let/return expression and HOFs has the same interpretation of free variables, as we discussed earlier in this section. This allows us to build the expression bottom-up, from its subexpressions, with the semantic meaning of the expression preserved, provided that the semantic meaning of the subexpressions is preserved during their respective translations.

Fig. 11 Translation of the sample schemas

Besides atomic types (which all have corresponding types in XPath), we have to examine the translation of collections. Putting an item into an XPath map or sequence does not affect the item's identity (the item is not, e.g., copied) or its parent/child relationship in the tree; thus, XPath maps and sequences are good representations of OCL tuples and sequences.
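The difficulty with nested collections discussed earlier in this subsection, and the map-based workaround, can be imitated in Python. The helper xpath_sequence below is hypothetical: it only mimics the flattening behaviour of XPath sequence construction, with lists standing in for sequences and dictionaries for map items.

```python
def xpath_sequence(*items):
    """Build a sequence the way XPath does: nested sequences are flattened."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(item)  # a sequence inside a sequence loses its boundary
        else:
            flat.append(item)  # maps (dicts) and atomic values are kept as single items
    return flat


# ((1,2),(3),()) collapses into a flat sequence.
nested = xpath_sequence([1, 2], [3], [])

# Wrapping each inner sequence in a map under the key 's' preserves the nesting.
wrapped = xpath_sequence({"s": [1, 2]}, {"s": [3]}, {"s": []})

# Python indexes from 0, XPath from 1: this corresponds to (...)[1]('s')[2].
number_two = wrapped[0]["s"][1]
```

The double indexing needed to reach number_two is the readability cost mentioned in the text.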

That leaves us with if expressions, which exist in both languages with the same interpretation, and feature calls (operations and properties). It is beyond the scope of this paper to discuss the correctness of the translation of each function and operator from the OCL standard library, but their translation is usually straightforward. The special cases, the toClass function (which we have added for PSM OCL) and property calls, were discussed in depth earlier in this section.

6 Validation of inheritance and recursion

In this section, we show how constraints using inheritance and recursion can be validated using OclX (by recursion we mean navigation along cycles in PSM schemas using closure). We demonstrate them on the PSM schema in Fig. 12a; the sample constraints are in Fig. 12b.

Inheritance is a common feature in UML diagrams, and OCL supports inheritance by allowing calls of inherited features (operations and properties, via FeatureCallExp) and by rules of type conformance. The subexpression «m.phone» from Fig. 12b is legal because class Manager inherits attribute phone from class Employee. The semantics of OCL also defines that invariants defined in the superclass apply to all its subclasses. Thus, the invariant PSM E2 defined for class Employee must also be met by instances of class Manager.

At the PSM level, we support inheritance as well (Klímek and Nečaský 2012): a class can inherit from another class, which means that the element corresponding to the specific class will have all the attributes and subelements defined by the attributes and content of the general class (the inherited subelements come before the specific class's own subelements). This corresponds to the requirement that inherited features can be used in OCL feature call expressions.

Inheritance in modeling requires defining how inheritance conflicts are resolved. We only deal with XML data in our model, so we do not need to be concerned with conflicts of operations.
Conflicts of attributes may appear (i.e., when a subclass defines an attribute with the same name as some attribute in the parent class), but these will be identified when the PSM schema is translated into an XML schema (XSD or RELAX NG). Such a schema is invalid and we consider the model erroneous. Similarly, we rely on the XML schema validator for other conflicts; our definitions do not, e.g., require attributes of the same class to have distinct names, even though it is not possible to translate such a class into a valid XML schema.

When translating OCL expressions over a schema which contains a class hierarchy, we must ensure that an invariant O defined for a superclass C is also checked for its subclasses C_x. This can be achieved by:

1. copying the rule R obtained from the translation of O for every subclass. (This makes the resulting schema larger and less transparent.)
2. combining all the occurrences of C and C_x into the context of R using the XPath union operator (e.g., using the expression //employee | //manager). The translation of R is not repeated, but the resulting schema does not visibly show that the PSM schema utilizes inheritance.
3. utilizing the feature provided by Schematron for reuse of rule logic: abstract patterns. Unlike the previous options, this one does not require the context variables to be named the same in the general and in the specific invariants.

We chose the last option since it preserves the nature of inheritance. The rules for shared invariants are declared in abstract patterns and called by the patterns for derived classes. For every class participating in a generalization as a superclass, an abstract pattern is generated. For every non-abstract class which inherits from the class (and for the superclasses themselves), an instance pattern is generated.

Principle 13 Rules obtained by translation of invariants where the context is a class for which specialized classes exist are placed in abstract Schematron patterns.
An instance pattern calling the abstract pattern is created for each subclass.

Listing 13 shows a translation of invariants from Fig. 12a. Constraint PSM E2 for class Employee is translated into an abstract pattern Employee. This abstract pattern is called both for instances of Employee (via pattern Employee-as-Employee) and Manager (via Manager-as-Employee).

The schema in Fig. 1 also contains a recursive association (class Department is the only participant in the association), which defines the hierarchy of departments. This association is also represented in the PSM schema in Fig. 12a by the cycle Department-Subdepartments-Department. This must be reflected in validation. The expression defining the context of rule PSM R1, company/departments/descendant::department, utilizes the descendant axis. OCL constraints concerning recursive structures often utilize closure iterator expressions. We have shown how closure is implemented in OclX earlier in this paper (see Listing 8b).
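How a closure over a recursive department hierarchy differs from iterate can be sketched in Python. This is an illustrative reconstruction, not the OclX function from Listing 8b: the name closure, the sample hierarchy and the decision to return only elements reached through body (leaving out the starting elements unless they are reachable) are assumptions made for the example.

```python
def closure(source, body):
    """Collect everything reachable through body, in depth-first preorder.

    Assumptions: items are hashable, and only elements reached via body are
    returned. The number of body calls depends on the data, unlike iterate.
    """
    result = []
    seen = set()

    def visit(item):
        for nxt in body(item):
            if nxt not in seen:  # guards against cycles in the data
                seen.add(nxt)
                result.append(nxt)
                visit(nxt)  # descend before moving to the next sibling

    for item in source:
        visit(item)
    return result


# Hypothetical department hierarchy: department name -> subdepartments.
subdepartments = {"HQ": ["R&D", "Sales"], "R&D": ["Lab"], "Sales": [], "Lab": []}
all_below_hq = closure(["HQ"], lambda d: subdepartments[d])
```

On tree-shaped data such as an XML document, this traversal visits the same elements, in the same order, as the descendant axis used in the context of rule PSM R1.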

Fig. 12 Sample company schema PSM level. a PSM schema company hierarchy. b PSM ICs for company schema

7 Evaluation and implementation

In this paper, we have shown how PIM OCL ICs can be used to validate XML constraints. We start by translating the constraints defined at the PIM level into PSM level constraints in those schemas where the constraints are relevant (see Section 4). The PSM level constraints can be automatically translated into Schematron schemas and validated (see Sections 5 and 6).

7.1 Implementation

As has already been mentioned, we have implemented the proposed technique in a tool called eXolutio (Klímek et al. 2012). In general, it is a proof-of-concept desktop application for conceptual XML data modeling. It implements the PIM and PSM modeling languages and the operations for evolution of the PIM and PSM schemas described in Nečaský et al. (2012a), as well as support for IC modeling at both the PIM and the PSM level.

The implementation of our approach has several parts. We extended our tool with support for ICs. The PIM schema editor was provided with an OCL editor and a syntax parser following the OCL specification. Similarly, the PSM level editor was provided with an OCL editor, but for the PSM level, we extended the grammar of OCL expressions with support for PSM-specific constructs (content models) and also added new possibilities of navigating association ends utilizing the PSM tree hierarchy. The PSM OCL parser outputs an abstract syntax tree, which is consumed by the translation component and translated into Schematron schemas.

The workflow is depicted in Fig. 14. When the user specifies ICs at the PIM level (1), the tool can help the user transfer them to the PSM level: the tool offers automatic translation where possible (2). Apart from ICs transferred from the PIM level, it is possible to create expressions solely for the PSM level (3). The next step is to generate a Schematron schema from PSM OCL (4).
This step is automatic; the user may choose between a schema-aware and a non-schema-aware schema (the latter adds data conversions for extracting typed values from the XML document). The schema generated in step (4) can be further tweaked: the tool looks for possible expression rewritings and offers alternative translations where possible. The user can use a GUI to select the translation of each subexpression. The generated schema can then be used to validate an XML document. An XProc (W3C 2010) pipeline is used to perform the validation. It first executes the transformation steps from the standard Schematron pipeline (5), adds includes for the OclX library (6) and then validates the document (7) with the resulting XSLT. The pipeline expects the schema (5) and the validated document (8) on its input ports and writes the validation result, an SVRL document, to its output port (9).

The tool itself (incl. examples), the OclX library and the XProc pipeline are all available for free download on the tool's website (Klímek et al. 2012). Figures 11 and 13 depict the translation of the integrity constraints from the examples in this paper (schema-aware version of the translation).

The efficiency (performance) of validation of XML documents using our approach in fact depends on the capabilities and internal design of the validator. Usually, Schematron validators use an XSLT stylesheet for validation, and XSLT

processors load the whole source XML document into internal memory. The XSLT 3.0 Working Draft introduces streaming processing (W3C 2012c), and pioneering implementations are available (Saxonica 2012) (Fig. 14).

Fig. 13 Translation of constraints from the company hierarchy schema (Fig. 12a)

7.2 Evaluation case study: National register for public procurement

Our case study is the National Register for Public Procurement (NRPP) (footnote 11). It is a governmental information system intended for publishing data about public contracts by public authorities in the Czech Republic. Publishing a contract is only obligatory when the contracted price exceeds a level given by the current legislation; otherwise, it is optional. Authorities send contract information formatted according to one of the 17 XML formats accepted by the NRPP. These include, e.g., XML formats for contract notifications, supplier selection notifications, etc. Currently, the NRPP only provides textual documentation for the XML formats and a set of sample XML documents. The integrity constraints are described in the documentation of the individual XML formats.

Footnote 11: (in Czech only)

We created a PIM schema and PSM schemas for the individual XML formats. Let us note that the actual schemas are more extensive than the excerpts in Figs. 15 and 16; due to space limitations, we present here only those parts that relate to this paper. We also omit data types in the figures. The resulting PIM schema is depicted in Fig. 15.

Fig. 14 XML validation using OCL OclX and Schematron pipeline

The PIM schema contains classes which model public contracts (class Contract), contracting authorities, and suppliers (class Organization). There are also some additional concepts modeled: prices (class Price) and contact information (class Contact). There are several relationships modeled with associations. A supplier is associated with a contract via the win association end. A contracting authority is associated with a contract by a path of association ends main and contact_org. Each contract has additional contact information stating where documentation for the contract is provided (docs) and where bids for the contract are collected (bids). Finally, there are 4 different prices (modeled by class Price): the expected price (expected), the best offered price (offered), the price agreed by a selected supplier and the contracting authority (agreed), and a final real price known after finishing the contract (final). For the selected part of the problem domain, the following integrity constraints apply:

1. If there is a winning bid for a contract, the bid must contain the offered price and the contract must also specify the agreed price.
2. If there is a winning bid for a contract, then the offered price must not be lower than the agreed price. This constraint aims against potential suppliers who bid a very low price for an offer, but later agree with the contracting authority on a price that is actually higher.
3. The winning bid is not enumerated among the other offers in Contract.offer.
4. An organization must not play the role of a contracting authority and supplier in the same contract.
5. If a supplier S supplies a contracting authority B, then there must not be another contract where B plays the role of a supplier and S plays the role of a contracting authority.
6. All offers must be submitted before the contract's deadline.
7. The main contact for a contract must be connected to an organization registered in the system and at least one contact person.
8. If a supplier has exceeded the agreed price for a contract in the past (the final price was higher), his further bids for other contracts require examination.

Fig. 15 Sample company schema PIM level

Fig. 16 PSM schemas modeling the NRPP domain

Figure 15 shows how these constraints are expressed in OCL. We have selected two PSM schemas for the formats used by the NRPP. They are depicted in Figs. 16b and c. Because many PSM schemas share common constructs, we put the reused constructs into one schema that is referenced from the others (Fig. 16a). The PSM schemas were created according to the textual documentation and the XML examples. Let us note that we use an alternative graphical notation for inheritance and non-tree associations. When a class inherits from another class, the name of the general class is in the upper-right corner of the class header (e.g., in schema (b), Supply inherits from Bid). Non-tree associations are shown as regular associations, but the child class has dashed borders (e.g., there is only one class Price in schema (a); the other Price rectangles in schemas (b) and (c) are references to this one class).

The PSM schema depicted in Fig. 16b models an XML format for notifications about the supplier selected for the contract; it contains the main contract contact and information about the offered bids (with offered prices), the selected supplier, and the agreed price. The PSM schema depicted in Fig. 16c models an XML format for documents containing details of one contracting authority, the list of its published contracts and the details of each of the contracts (incl. offered bids and the winning bid).

Figure 17 depicts how our algorithm deals with the PIM constraints and adapts them for the two PSM schemas. We can see that not all the constraints were translated for both of the schemas. Constraints 1 and 2 are not covered by schema (c), because the agreed price is not represented in the schema. Constraints 5 and 8 cannot be translated, because this would require traversing the association between classes Supplier and Bid (resp. Organization and Bid) upwards, and the represented PIM association end has cardinality 0..*. Indeed, checking constraints 5 and 8 in these two schemas is not meaningful, since the constraints operate on all contracts in the database and neither

of the two schemas models a document which contains all contracts. Constraint 6 cannot be translated, because the schemas do not cover Contract.deadline. Constraint 7 cannot be translated for schema (c), because it does not cover class Person.

Fig. 17 PSM constraints for the NRPP domain. a Constraints for schema 16b. b Constraints for schema 16c

The remaining cases were translated successfully. For some (e.g., 1), the direct translation algorithm (Algorithm 1) is sufficient. In many other cases, the filtering of meaningful candidates must be performed (e.g., for Constraint 3, the algorithm selects class Supply as a context class, because the other candidate, class Bid, does not have an association end which represents the PIM association end win). During the translation of constraints 3 and 4, the traversal extension functions toSupply resp. toSupplier were utilized in order to achieve a valid OCL expression. Both these cases can alternatively be translated by comparing keys instead of comparing instances.

We do not include the constraints for the NRPP in the form of Schematron schemas in this paper. The reader can download them from the project's web site (Klímek et al. 2012).

7.3 Discussion

In total, there are currently 16 different XML schemas used by the NRPP system. Their manual maintenance, and the maintenance of the integrity constraints for each of them, is very time consuming and error prone. Our approach saves a significant part of this work, since the integrity constraints need to be expressed only once, at the PIM level. In our case study we have selected 8 constraints. They can be expressed only once at the PIM level instead of expressing each of them repeatedly for each relevant XML schema. We showed that some of them can be translated automatically using our approach to Schematron expressions for particular XML schemas.
However, our approach saves work even in those cases where a constraint cannot be translated automatically and a manual intervention of a designer is necessary. A single PIM expression of each integrity constraint unambiguously specifies the constraint independently of its concrete translations to Schematron for particular XML schemas. Without the common PIM expression, there would be many different Schematron expressions of the same constraint, one for each specific XML schema, without any obvious relation to each other. This saves a significant portion of the designer's time and provides a better overview of the system. Even though it is not possible to quantify the time saved explicitly (because it depends on the skills of the designer), we can analyze in more detail at which steps the designer saves time when using our approach:

Our approach automatically decides whether a given XML schema is relevant for a given integrity constraint expressed at the PSM level and, therefore, whether the constraint should be translated to a Schematron expression for the XML schema. Hence, the designer does not have to think about each integrity constraint and each XML schema. In our case study, there are 128 pairs (16 XML schemas times 8 constraints) which would have to be decided manually by the designer without our approach.

Certain integrity constraints can be translated automatically for certain XML schemas to Schematron expressions. In our case study, 37 pairs (of 128 in total) were translated automatically. However, this cannot be generalized: the number of such pairs clearly depends on the domain, the particular integrity constraints and the structure defined by the XML schemas.

The rest cannot be translated automatically, but our approach saves time for the designer by offering possible translations. The designer only chooses the correct one.

The designer works with UML class diagrams and OCL expressions at the platform-independent and platform-specific levels. Translation of OCL expressions at the platform-specific level to Schematron expressions is automatic. Therefore, the designer does not have to switch to Schematron and can stay at more abstract levels where constraints can be expressed more straightforwardly.

Without our approach, when a new XML schema appears or an existing one changes, the designer would have to recover the semantics of each constraint from the existing Schematron expressions, which would be very hard. With our approach, the designer gets the semantics of the constraint from the expression at the PIM level. It is then easier for the designer to use this expression to derive the required Schematron expression.

8 Related work

Existing academic works in the area of integrity constraints for XML, e.g. (Wenfei and Jerome 2003; Arenas et al. 2008), focus mainly on the fundamental integrity constraints known from relational databases (keys, unique constraints, foreign keys and inverse constraints) and their mathematical properties, such as decidability, consistency and tractability (with separate results for one-attribute vs. multi-attribute and relative vs. absolute keys). Paper (Fan and Libkin 2002) studies the problem of consistency of a set of ICs and a DTD, i.e. whether there exists a finite XML document which is valid against a given DTD and satisfies the ICs. The problem is proven to be undecidable for the most general class of ICs, but for the class of unary keys and foreign keys, it is proven NP-complete. Authors of Bouchou et al. (2011) propose validation of ICs in XML using attribute grammars and automata.
Their path language is less expressive than XPath or OCL, but thanks to this limitation, the validation can be achieved in one pass over the XML document in linear time. The aim of our approach was to support the largest possible subset of OCL and to translate OCL constraints into XPath constraints with a similar structure. ICs spanning several XML documents were studied in Nentwich et al. (2002). The project converts ICs into XLink (W3C 2001) links, and the consistency can be checked by verifying the validity of the generated links. Several approaches for modeling XML using UML were proposed (Conrad et al. 2000; Dominguez et al. 2011; Routledge et al. 2002), but they deal mainly with modeling the structure of the schemas, without considering the integrity constraints present in the model. In (Badica et al. 2006), the authors propose an algorithm to translate L-wrappers, declarative language structures combining tree patterns with logical expressions, into XSLT. In contrast with our approach, this approach is designed for semi-structured documents, to look for patterns and extract information, whereas in our model the structure of XML documents is precisely defined by the schema. Using OCL in our approach provides us with greater expressive power. OCL, UML and related technologies are being researched (Hussmann et al. 2000) at Technische Universität Dresden, which is also the coordinator of the leading open-source implementation Dresden OCL (Technische Universität Dresden). Dresden OCL research mainly targeted the relational database platform (Demuth and Hussmann 1999). A generic framework for translating OCL expressions into other expression languages was proposed in (Zschaler et al. 2014). It mentions 2 applications: OCL-to-SQL translation and OCL-to-XQuery (Boag et al. 2007) translation. The expressions are translated into the target language via patterns.
It expects a much tighter mapping between the UML model and the XML schema (unlike PIM/PSM schemas used in our approach, it does not consider regular properties of schemas). The paper does not deal with the problems presented in Section 4.2. The OCL-to-SQL patterns are based on Demuth and Hussmann (1999), the OCL-to-XQuery patterns on Gaafar and Sakr (2004). The authors support constructs corresponding to projections, cartesian products and restrictions in the expressions (omitting the general iteration and closure facilities). Authors of Gaafar and Sakr (2004) examine the fundamental similarities of the two expression languages OCL and XQuery. They propose a mapping from XQuery queries to OCL constraints (a bottom-up approach). They show how the parts of elementary XQuery expressions can be mapped to OCL constructs, but they do not elaborate on translating definitions of and references to (local) variables, which would be interesting for queries with multiple variables (such queries correspond to more complex OCL iterator expressions, which are not mentioned in the paper, and which we translate using XSLT higher-order functions). In consequence, the full expressive power of OCL is not harnessed (for more on the expressive power of OCL, see Mandel and Cengarle (1999)). An approach called Active XML (Abiteboul et al. 2008; Salem et al. 2013; Phan et al. 2013) allows for embedding Web Service calls into XML documents and can serve as a platform for utilising business logic and dynamic data. An extension for validation/verification of more complex business rules using the algorithms described in this paper is feasible.

9 Conclusion and future work

In this paper, we addressed the possibility of checking complex integrity constraints in XML data using the Object Constraint Language. We have shown how to write OCL constraints for XML and how to obtain them from existing constraints defined over the platform-independent model. We proposed a translation from OCL into Schematron schemas, which can be used directly for validation. There are several contributions of our approach. Using our tool, it is possible to reuse the definition of an integrity constraint from the conceptual level to generate actual verification/validation code for all the places where the constraint is relevant. Thanks to the automated translation process, the system designer may focus on the work at the conceptual level and make the model as accurate as possible using UML and OCL. This may appeal to the designer more than working with platform-specific languages (Schematron and XPath). Also, when the constraints are defined using OCL over a UML model, it is much easier both to create and to maintain them. This is possible because of the tight connection between the two languages. When the model changes, it is very easy to find out which constraints were influenced by the change. A smart tool may even suggest or perform corrections automatically. Last, but not least, it is evident that constraints described at the PIM level are much clearer and easier to understand (even for a non-technical person) than their translation to XML schemas (compare the constraints in Fig. 1 and their translation in Fig. 11). This is in fact a specific case of a general motivation for Model Driven Development as a whole.

Improving constraint conversion

In our future work, we want to further enhance our PIM-to-PSM translation algorithm. We want to focus on solving the problem of upwards navigation from redundant instances (as suggested in the last part of Section 4).
The possible solutions are 1) rewriting the expression for a different context class while preserving the semantic meaning or 2) preserving the context, but rewriting the problematic navigation step using other OCL constructs. Comparing keys may be utilized in those cases where comparing instances does not work because there are redundant instances. We may not find a way to translate every PIM constraint, but we may be able to further extend the class of constraints we support. Besides the PIM-to-PSM direction, we will also examine the reverse direction (investigating the PSM constraints and extracting PIM constraints from them) and also the possibility of extracting the PSM constraints from existing Schematron schemas (reverse-engineering the constraints).

Support more OCL constructs

Our OCL-to-XPath conversion does not support all OCL constructs. We excluded nested collections (because they are too alien to XPath and adding them would force us to redefine all fundamental XPath operations, such as collection concatenation), but other types of collections besides sequences could be supported. Also, there are type expressions, which could be useful as well.

Document adaptation

The follow-up research will also examine the possibilities of using OCL to address the problem of document adaptation required after schema evolution. We examined this topic in our previous work (Malý et al. 2012; Malý et al. 2011). We proposed an approach for structural adaptation based on a mapping between the versions of schemas. We also identified several scenarios where the adaptation algorithm cannot adapt the documents without user interaction. These are related to changes between the versions which deal more with content than with structure. These scenarios could be solved by using integrity constraints and OCL to annotate the mappings. In this article, we have shown how OCL and integrity constraints can be used for validation.
In future research, we want to utilize them for adaptation as well.

Acknowledgments This work was supported by the Czech Science Foundation (GAČR), grant number P202/11/P455 and by Charles University, grant number /

References

Abiteboul, S., Benjelloun, O., Milo, T. (2008). The Active XML project: an overview. VLDB Journal, 17(5), dblp.uni-trier.de/db/journals/vldb/vldb17.html#abiteboulbm08.
Arenas, M., Fan, W., Libkin, L. (2008). On the complexity of verifying consistency of XML specifications. SIAM Journal on Computing, 38,
Badica, A., Badica, C., Popescu, E. (2006). Implementing logic wrappers using XSLT stylesheets. International multi-conference on computing in the global information technology (Vol. 0, p. 31).
Boag, S., Chamberlin, D., Fernández, M.F., Florescu, D., Robie, J., Siméon, J. (2007). XQuery 1.0: an XML Query Language, W3C.
Bouchou, B., Ferrari, M.H., Lima, M.A.V. (2011). Attribute grammar for XML integrity constraint validation. In Proceedings of the 22nd international conference on database and expert systems applications, volume part I, DEXA 11 (pp ). Springer, Berlin, Heidelberg. id=
Booth, C.K.L.D. (2007). Web Services Description Language (WSDL) version 2.0 part 0: Primer, W3C. wsdl20-primer/.
Codd, E.F. (1972). Relational completeness of data base sublanguages. In R. Rustin (Ed.), Database systems: 65-98, Prentice Hall and IBM Research Report RJ 987. San Jose, California.
Conrad, R., Scheffner, D., Christoph Freytag, J. (2000). XML conceptual modeling using UML. In Conceptual modeling ER Lecture Notes in Computer Science.
Demuth, B., & Hussmann, H. (1999). Using UML/OCL constraints for relational database design. In Proceedings of the 2nd international conference on The unified modeling language: beyond the standard, UML 99. Springer, Berlin, Heidelberg. citation.cfm?id=
Dominguez, E., Lloret, J., Perez, A., Rodriguez, B., Rubio, A.L., Zapata, M.A. (2011). Evolution of XML schemas and documents from stereotyped UML class models: a traceable approach. Information and Software Technology, 53,
Eclipse Model Development Tools (MDT). modeling/mdt/.
Fan, W., & Libkin, L. (2002). On XML integrity constraints in the presence of DTDs. Journal of ACM, 49(3), doi: /
Gaafar, A., & Sakr, S. (2004). Towards a framework for mapping between UML/OCL and XML/XQuery. In UML.
Holstege, M. (2012). Type introspection in XQuery. In Proceedings of Balisage: the markup conference 2012, Balisage series on markup technologies. Mulberry Technologies.
Hussmann, H., Demuth, B., Finger, F. (2000). Modular architecture for a toolset supporting OCL. In UML: advancing the standard, UML 00. Springer.
ISO/IEC (2006). Information technology, Document Schema Definition Languages (DSDL), part 3: rule-based validation, Schematron. ISO/IEC
Klímek, J., Malý, J., Nečaský, M. (2012). exolutio project. exolutio.com.
Klímek, J., & Nečaský, M. (2012). Formal evolution of XML schemas with inheritance. In Proceedings of international conference on web services
Malý, J., Mlýnková, I., Nečaský, M. (2011). XML data transformations as schema evolves. In ADBIS 11: proceedings of the 15th advances in databases and information systems. Springer, Vienna.
Malý, J., & Nečaský, M. (2012). Utilizing new capabilities of XML languages to verify OCL constraints. In Proceedings of Balisage: the markup conference 2012, Balisage series on markup technologies.
Malý, J., Nečaský, M., Mlýnková, I. (2012). Efficient adaptation of XML data using a conceptual model. Information Systems Frontiers, doi: /s
Mandel, L., & Cengarle, M. (1999). On the expressive power of OCL. In FM 99 Formal Methods, Springer (Vol. 1708).
Miller, J., & Mukerji, J. (2003). MDA Guide Version 1.0.1, Object Management Group. pdf.
Murata, M. (2002). RELAX (Regular Language Description for XML), ISO/IEC DTR
Murata, M., Lee, D., Mani, M., Kawaguchi, K. (2005). Taxonomy of XML schema languages using formal language theory. ACM Transactions, 5(4),
Nečaský, M., Klímek, J., Malý, J., Mlýnková, I. (2012a). Evolution and change management of XML-based systems. Journal of Systems and Software, 85(3), com/science/article/pii/s
Nečaský, M., Mlýnková, I., Klímek, J., Malý, J. (2012b). When conceptual model meets grammar: a dual approach to XML data modeling. Data & Knowledge Engineering, 72, sciencedirect.com/science/article/pii/s x x.
Nentwich, C., Capra, L., Emmerich, W., Finkelstein, A. (2002). xlinkit: a consistency checking and smart link generation service. ACM Transactions on Internet Technology, 2(2), doi: /
Object Management Group (2007). UML Specification.
Object Constraint Language Specification (2012). OMG. omg.org/spec/ocl/2.3.1/.
Phan, B., Pardede, E., Rahayu, W. (2013). On the improvement of active XML (AXML) representation and query evaluation. Information Systems Frontiers, 15(2), doi: /s z.
Richters, M., & Gogolla, M. (2002). OCL: syntax, semantics, and tools. In Object modeling with the OCL, Springer. (Vol. 2263, pp ).
Routledge, N., Bird, L., Goodchild, A. (2002). UML and XML schema. In ADC 02, ACS.
Salem, R., Boussaïd, O., Darmont, J. (2013). Active XML-based Web data integration. Information Systems Frontiers, 15(3), doi: /s
Saxonica (2012). Saxon XSLT Processor 9.4., sourceforge.net/.
Sparx Systems. Enterprise Architect. au/products/ea/index.html.
Technische Universität Dresden. Dresden OCL. dresden-ocl.org.
Tim Bray, C.M.S.-M., & Paoli, J. (2000). Document type declaration.
W3C (2001). XML Linking Language (XLink) version 1.0 recommendation.
W3C (2010). XProc: an XML pipeline language recommendation.
W3C (2011). XML Path Language (XPath) 3.0, working draft
W3C (2012). XML Schema schema-1/.
W3C (2012). XQuery and XPath data model 3.0, working draft.
W3C (2012). XSL Transformations (XSLT) version 3.0, working draft
Wenfei, F., & Jerome, S. (2003). Integrity constraints for XML. Journal of Computer and System Sciences (JCSS), 66(1),
Zschaler, S., Demuth, B., Schmitz, L. (2014). Salespoint: A Java framework for teaching object-oriented software development. Science of Computer Programming, 79, doi: / j.scico ,

Jakub Malý received his Ph.D. degree in Computer Science in 2013 from the Charles University in Prague, Czech Republic, where he studied as a Ph.D. student at the Department of Software Engineering. His research areas involve conceptual modeling of XML data and evolution of XML applications, integrity constraints in models and the Object Constraint Language. He has published 4 journal and 10 conference papers (one received a Best Paper Award).

Martin Nečaský received his Ph.D. degree in Computer Science in 2008 from the Charles University in Prague, Czech Republic, where he currently works at the Department of Software Engineering as an assistant professor. He is an external member of the Department of Computer Science and Engineering of the Faculty of Electrical Engineering, Czech Technical University in Prague. His research areas involve XML and Linked Data design, integration and evolution. He is an organizer/PC chair/member of more than 10 international events. He has published more than 50 papers (three received a Best Paper Award). He has published 3 book chapters and a book.

Chapter 8 Methodology Evaluation - Case Studies

In this chapter, we provide two case studies to evaluate the methodology presented in Section 1.4. Section 8.1 presents the case study from the electronic health (e-health) domain. It evaluates all three parts of the methodology and measures the amount of work which can be saved by XML schema designers when the methodology is applied. Section 8.2 contains the second case study from the domain of public procurement. This study shows how three different qualities of the XML schemas of a real-world software system could be improved when the methodology is applied. The three qualities are readability, integrability and adaptability of XML schemas. We explain the qualities in the respective section. Let us note that the study from the domain of public procurement has been partly published in [37]. This part is therefore covered at the end of Chapter 5. Here, in Section 8.2, we present the other part of the case study, which we published in [35].

8.1 Case Study - ehealth

In this case study, we experiment with the Data Standard for ehealth in the Czech Republic (DASTA; see footnote 1). The data standard comprises tens of XML formats for exchanging information about patients, drugs, hospitals, medical examinations, etc. Only XML schemas and textual documentation are provided by the DASTA authors.

Footnote 1: (in Czech only)

Figure 8.1: PIM schema of ehealth domain in the Czech Republic (class diagram with the classes Drug, Person, Vaccination, Doctor, Certificate, Diagnosis, WorkDisability, Physiology, Insurance, Patient, Address and InsuranceCompany)

To be able to evaluate the methodology presented in Section 1.4, our first goal was to design the PIM schema on the basis of the textual documentation provided by the DASTA authors. A part of it is shown in Figure 8.1. It models patients and doctors, insurance, diagnoses, vaccination and working disabilities. The whole PIM schema contains 76 classes, 124 associations and more than 300 attributes.

We first evaluated our forward-engineering methodology. We chose 4 different XML formats for representing various kinds of XML messages related to the manipulation of drugs (prescription, prescribed drugs, dispensation and vaccination detail). We did the experiment with 2 XML schema designers. Each created PSM schemas of two of the XML formats from the provided PIM schema. The resulting PSM schemas contain 42 classes, 38 associations and 76 attributes in total. We discussed the results with the designers as well as with the authors of DASTA. Their experience and evaluation are summarized below:

1. A designer saves a significant amount of time.
The 4 XML schemas contain 106 declarations of XML elements and attributes (the declarations are modeled by all attributes and some associations as explained in Section 1.3.2). Instead of encoding them manually, which takes a lot of coding time, the designer just does drag and drop operations. He drags and drops classes and associations from the pre-defined PIM schema to the PSM schemas (i.e. drag and drop operations). For each of the 42 classes he chooses the required PIM attributes from an automatically displayed list. These are automatically moved to the PSM schema.

2. The number of errors in the modeled XML schemas and of semantic inconsistencies among them, caused by different understandings of the domain by the XML schema developers, is reduced for each of the 156 PIM components thanks to the common PIM schema.

3. The accuracy of the resulting XML schemas can be easily checked by a domain expert, since he can validate each declaration in the XML schemas against the PIM schema visually, without studying complex and technical XML schema code. The tool automatically shows a mapping of each chosen declaration to the corresponding PIM component. Therefore, the time needed to find the corresponding PIM component is significantly reduced.

4. It can be easily checked whether and how any of the more than 500 medicinal concepts modeled in the whole PIM schema is represented in the XML schemas. Vice versa, given one of the declarations from the resulting XML schemas, it is easy to see in the PIM schema what medicinal concept it represents.

Even though it is hard to measure exactly the time saved and the errors avoided, the experiment demonstrates that the savings are significant. The time the designer needs to manually encode an XML element or attribute declaration is much longer than the time required to drag and drop components from the PIM schema to the PSM schemas. Even more time is saved when the domain expert checks the validity of the created XML schemas against the conceptual representation expressed in the form of the PIM schema.

We also evaluated our reverse-engineering methodology.
We chose 8 XML schemas from the rest of the DASTA XML schemas (i.e. from those we did not use for the forward-engineering evaluation). According to the methodology, we converted them to their PSM equivalents. This is an automatic process without participation of the designer. The resulting PSM schemas contained 62 classes, 54 associations and 93 attributes. Most of them were then mapped to their PIM equivalents automatically. This is not surprising, because the PIM schema was created manually according to the XML schemas. Therefore, the names of the XML elements and attributes declared in the XML schemas correspond to the names of classes, attributes and associations in the PIM schema. For a discussion of how successful our reverse-engineering method is in cases where the names in the PSM schemas do not exactly correspond to the PIM schema, we refer the reader to [28]. There were several inconsistencies which could not be resolved automatically, and the mapping had to be created manually. This concerned approx. 20 % of the PSM components. The revealed inconsistencies resulted in proposals of changes to the original XML schemas so that they could be made semantically consistent with the rest of the DASTA XML schemas. For example, it turned out that 3 different kinds of medical doctor identifiers were used in three different XML schemas. We discussed the PIM and PSM schemas with the DASTA authors, and they discovered other inconsistencies which they had not seen while reading the XML schemas (different structures for addresses, identifiers, etc.).

The evaluation shows that the reverse-engineering process helps to reveal various errors and semantic inconsistencies among the XML schemas in a given set. The inconsistencies are discovered mainly during the mapping of automatically generated PSM schemas to the PIM schema. The components which cannot be mapped automatically are candidate inconsistencies and need to be inspected by the designer. Most of them are hard to reveal when only the XML schemas exist and need to be explored manually. On the other hand, our approach requires the existence of the PIM schema, and its creation takes some time.

And, finally, we also evaluated the evolution methodology.
The DASTA standard is changed four times a year by the authors. This makes 24 versions since [...]. We do not discuss all the changes which appeared in these 24 versions. Instead, we select one of the recent versions, which included an interesting change: the replacement of one of the original DASTA XML formats related to working disabilities with another XML format for the same kind of messages used by the Czech Social Security Administration (CSSA). In this experiment, we evolved the PSM schema of the original XML format to a new version which modeled the XML format used by CSSA. The old one is depicted in Figure 8.2(a). The new one is depicted in Figure 8.2(b). The figures contain only parts of the PSM schemas; the original PSM schemas are more complex (they contain 10/29 classes, 9/28 associations, and 14/41 attributes, respectively).

Figure 8.2: PSM schema of (a) old XML format for working disabilities and (b) CSSA XML format for working disabilities

There are many differences between the two versions. Some of them are displayed in Figure 8.2 in red. For example, the red arrow connecting the attribute interrupted on the left with the attributes interruption_date and interruption_note on the right shows that the original attribute was split by the designer into 2 new attributes. On the other hand, the attribute injury has been removed. Some components were renamed, e.g., start_diag_code and end_diag_code were both renamed to code. Some components have been newly created, e.g., class Hospitalization on the right.

Many changes made at the PSM level needed to be propagated to the PIM schema because they affected the conceptual representation of the medical domain. For example, the split of the attribute interrupted was propagated in this way. From there, they needed to be propagated to the other PSM schemas. (For the experiment, we had 12 PSM schemas modeled from the previous evaluation of the forward and reverse engineering methodologies

where the changes had to be propagated from the PIM level.)

The designer performs edit operations on a chosen schema. The edit operations are expressed as sequences of atomic operations of four kinds: creation, removal, update and synchronization. The designer does not use these atomic operations directly, as they are too simple. Instead, he performs operations like split attribute, which is defined as a sequence of atomic operations of all kinds. Therefore, the designer performs a single operation but, in fact, executes a sequence of several atomic operations. In the experiment, we measure the number of atomic operations which had to be performed to evolve the old PSM schema to the new one. We also measure the number of atomic operations which were performed by our propagation mechanism to ensure that the PIM schema and all related PSM schemas are evolved correspondingly. We show the total number of atomic operations, the number of operations which were performed by the designer manually, and the number of operations which were performed automatically by our propagation mechanism.

Figure 8.3: Numbers of atomic operations performed during the evolution of the XML format depicted in Figure 8.2.

The result is depicted in Figure 8.3. Figure 8.3(a) shows the number of operations performed manually by the designer to evolve the old version of the PSM schema to the new version. Figure 8.3(b) shows the number of operations automatically propagated to the PIM schema on the basis of the manually performed operations from Figure 8.3(a). Figure 8.3(c) shows the number of operations automatically propagated from the PIM level to all other PSM schemas. The columns of each graph represent the kinds of operations: creation, update, removal, synchronization. For example, Figure 8.3(a) says that there were 67 creation operations, 142 updates, etc.

Let us now analyze the numbers of operations in more detail. The numbers show that 67 creation operations were performed manually by the designer. However, the propagation mechanism resulted in only 52 creation operations performed automatically on the PIM schema. This is because not all creation operations at the PSM level lead to creation operations at the PIM level. The designer may create a PSM component which only models a grammatical rule without a semantic equivalent at the PIM level. The same holds for the other kinds of operations. The most significant difference is in update operations. Only 1/3 of the update operations were propagated to the PIM level. This is because the designer often only renamed XML components without changing the names of their PIM equivalents. In the case of removal and synchronization operations, the reason is that the removed or synchronized components have no equivalent at the PIM level.

Figure 8.3(b) demonstrates the amount of work saved by our propagation mechanism in keeping the PIM schema consistent with the changes made by the designer in the PSM schema. Without propagation, the designer would have to perform all operations summarized in Figure 8.3(b) manually. If we compare the numbers with the numbers in Figure 8.3(a), we see that we saved more than 1/3 of the operations. Figure 8.3(c) shows the number of operations performed automatically by our propagation mechanism on the other PSM schemas after the PIM schema has been automatically updated.
These operations are necessary to keep the other PSM schemas semantically consistent with the changes made by the designer in the primary updated PSM schema. The numbers show that without our propagation mechanism the designer would have to perform many operations manually on the other XML schemas to keep them semantically consistent. We can see that more than 1/2 of the necessary operations were performed automatically by our change propagation mechanism. However, the share of time saved is much larger than 1/2: for each change in the primary XML schema, the designer would otherwise have to think about how to propagate the change to the other XML schemas. Using our approach, the designer works at the PSM level and the changes are propagated automatically through the PIM schema.
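The counting used in this experiment can be illustrated with a minimal sketch. It assumes a hypothetical decomposition of the split attribute operation into atomic operations and a hypothetical flag `affects_pim` marking whether an atomic operation has a semantic equivalent at the PIM level; neither is taken from our tool, and the attribute names follow the example of Figure 8.2 with underscores assumed.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

# The four kinds of atomic operations named in the text.
KINDS = ("creation", "update", "removal", "synchronization")


@dataclass(frozen=True)
class AtomicOp:
    kind: str          # one of KINDS
    component: str     # name of the affected PSM component
    affects_pim: bool  # assumption: True if the change has a PIM-level equivalent

    def __post_init__(self):
        if self.kind not in KINDS:
            raise ValueError(f"unknown kind of atomic operation: {self.kind}")


def split_attribute(old: str, new_names: List[str]) -> List[AtomicOp]:
    """Hypothetical expansion of 'split attribute' into atomic operations."""
    ops = []
    for name in new_names:
        ops.append(AtomicOp("creation", name, True))  # create the new attribute
        ops.append(AtomicOp("update", name, True))    # set its properties
    ops.append(AtomicOp("synchronization", old, True))  # link old and new attributes
    ops.append(AtomicOp("removal", old, True))          # drop the old attribute
    return ops


def rename_xml_component(name: str) -> List[AtomicOp]:
    """A rename of the XML name only: no change to the PIM equivalent."""
    return [AtomicOp("update", name, False)]


def propagate_to_pim(ops: List[AtomicOp]) -> List[AtomicOp]:
    """Keep only the operations that must be repeated at the PIM level."""
    return [op for op in ops if op.affects_pim]


def count_by_kind(ops: List[AtomicOp]) -> Counter:
    return Counter(op.kind for op in ops)


manual_ops = (
    split_attribute("interrupted", ["interruption_date", "interruption_note"])
    + rename_xml_component("start_diag_code")
)
pim_ops = propagate_to_pim(manual_ops)

print("manual:", dict(count_by_kind(manual_ops)))
print("propagated to PIM:", dict(count_by_kind(pim_ops)))
```

In this toy run the designer triggers two composite operations, while the number of update operations reaching the PIM level is smaller than the number performed at the PSM level, which mirrors the difference between Figure 8.3(a) and Figure 8.3(b).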

<contract_notification>
  <cont_oficial_title>Charles Univ.</cont_oficial_title>
  <cont_postal_address>Ovocný trh 3</cont_postal_address>
  <cont_city>Praha 1</cont_city>
  <cont_zip>11636</cont_zip>
  <cont_country>cz</cont_country>
  <title>Úklid vybraných objektů...</title>
  <price_total> </price_total>
  <currency>czk</currency>
  <VAT>20</VAT>
  <docs_oficial_title>Otidea a.s.</docs_oficial_title>
  <docs_postal_address>Na Příkopě</docs_postal_address>
  <docs_city>Praha 1</docs_city>
  <docs_zip>110 00</docs_zip>
  <docs_country>cz</docs_country>
</contract_notification>

Figure 8.4: Sample XML document which demonstrates low readability, integrability and adaptability of NRPP XML formats.

8.2 Case Study - Public Procurement

The Czech National Register for Public Procurement (NRPP)³, maintained by the national government of the Czech Republic, is intended for publishing data about public contracts by various public authorities in the Czech Republic. Publishing a contract is obligatory when the contracted price exceeds a threshold set by the current legislation; otherwise, it is optional. An authority may send contract information formatted as an XML document to NRPP. NRPP provides various XML formats intended for different situations, e.g. contract notification, supplier selection or contract finalization. Fig. 8.4 shows a sample XML document with a concrete contract notification in an XML format currently accepted by NRPP.

We made a detailed study of the XML formats used by NRPP from three different points of view: readability, integrability and adaptability of the XML formats. We describe the viewpoints in detail in the rest of this section. In the end, we summarize the properties of the XML formats from these three viewpoints (see Tab. 8.1). For our purposes, by the term XML format we mean one of the XML formats supported by NRPP and by XML document we mean an XML document formatted according to one of the XML formats, if not specified otherwise.

³ (in Czech only)

Readability of XML Formats

To be able to process XML documents, developers need to understand the syntax and also the semantics of the XML formats. They need to know what part of reality each particular XML element or XML attribute represents and, vice versa, how a selected part of reality is represented in the XML formats. For example, they need to know that XML element title in our sample XML format represents the contract title. They also need to know that docs_postal_address represents a postal address where an interested supplier may obtain documentation for the contract, while cont_postal_address represents the main contact address for the contract. It is also important to know that price_total represents the total price expected by the contractor and not the final contract price or another price (4 kinds of price are considered). By the term XML formats readability we denote this ability of developers to understand the XML formats.

First, readability may be increased by exploiting the hierarchical nature of XML, which allows for using separate XML elements to nest XML content representing different objects. However, this is not the case for the studied XML formats of NRPP. For example, instead of nesting different addresses in separate XML elements, the XML document depicted in Fig. 8.4 uses the prefixes cont and docs, which is not very transparent.

Second, XML schemas of the XML formats (expressed in, e.g., the XML Schema language) should be provided. However, XML schemas only allow for describing the syntax. The semantics is not expressed explicitly and must be deduced intuitively by developers. A simple solution is to provide the developers with textual documentation of the XML formats. A more advanced solution is to also provide them with a conceptual schema of the problem domain which describes the semantics more precisely.
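The first point above, exploiting the hierarchical nature of XML, can be illustrated with a small Python sketch that regroups the flat, prefixed elements of a document like the one in Fig. 8.4 into nested elements. It is only an illustration under stated assumptions: the wrapper element names contact and documentation and the function nest_by_prefix are hypothetical and are not part of any NRPP format.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from the prefixes used in Fig. 8.4 to wrapper
# element names; the wrapper names are assumptions made for this sketch.
PREFIX_TO_WRAPPER = {"cont_": "contact", "docs_": "documentation"}


def nest_by_prefix(flat_root):
    """Return a copy of flat_root in which prefixed children are nested."""
    nested_root = ET.Element(flat_root.tag)
    wrappers = {}
    for child in flat_root:
        for prefix, wrapper_name in PREFIX_TO_WRAPPER.items():
            if child.tag.startswith(prefix):
                # Create the wrapper on first use so that it appears at the
                # position of the first element carrying the prefix.
                if prefix not in wrappers:
                    wrappers[prefix] = ET.SubElement(nested_root, wrapper_name)
                moved = ET.SubElement(wrappers[prefix], child.tag[len(prefix):])
                moved.text = child.text
                break
        else:
            # Elements without a known prefix stay at the top level.
            kept = ET.SubElement(nested_root, child.tag)
            kept.text = child.text
    return nested_root


flat = ET.fromstring(
    "<contract_notification>"
    "<cont_postal_address>Ovocný trh 3</cont_postal_address>"
    "<cont_city>Praha 1</cont_city>"
    "<title>Úklid vybraných objektů...</title>"
    "<docs_postal_address>Na Příkopě</docs_postal_address>"
    "<docs_city>Praha 1</docs_city>"
    "</contract_notification>"
)
nested = nest_by_prefix(flat)
print(ET.tostring(nested, encoding="unicode"))
```

In the nested form, the two addresses share the element names postal_address and city, and the role of each address is carried by its parent element rather than by a name prefix.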
Integrability of XML Formats

Another problem arises when the data needs to be converted from the XML formats to another data representation (e.g. a local database inside an authority's information system or other XML formats) and vice versa. Ideally, both representations are the same and, therefore, no export/import scripts are necessary. However, this is a rare case, and transformation scripts usually need to be written by developers. These scripts might be, e.g., SQL/XML [4] scripts for integration with a relational database or XSLT [24] scripts for integration with other XML formats. By the term XML formats integrability we denote the measure of how easy it is to develop such scripts.

To increase the integrability, it is useful when the same concept (e.g. address) is represented in the same way in the XML formats (e.g. by XML elements street, city, zip, etc.). In that case, parts of the scripts might be reused for those parts of the XML formats. However, this is not always possible (e.g. at some places of the XML formats only zip itself is present). In other cases, the name of an XML element or XML attribute needs to be different (e.g. XML elements supplier-city and contractor-city, both representing city, in an XML format which we use to report on the distribution of suppliers and contractors across the cities in the Czech Republic). In these cases, a mapping of XML elements and XML attributes of the XML formats to the conceptual schema (e.g. mapping supplier-city and contractor-city to a shared attribute city in the conceptual schema) might be useful. This would allow the developers to map their local data representation schema (e.g. relational database schema) to the conceptual schema instead of mapping the schema to many XML formats. By composing the mappings, it would be possible to generate the transformation scripts automatically or, at least, to help the developers with writing them.

Adaptability of XML Formats

The last, but not least, problem arises when a change needs to be made in one or more XML formats. We distinguish two kinds of changes. First, it may be necessary to make a change in an existing XML format, such as adding an XML element or XML attribute or moving it from its current location in the XML format to another. Second, a requirement may appear to create a completely new XML format or to remove an existing one.
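Returning to the mapping idea from the integrability discussion above, the following Python sketch shows how composing two mappings to a shared conceptual schema can yield a direct transformation. It is a minimal sketch under assumptions: only the element names supplier-city and contractor-city come from the text, while the role names, the column names sup_town and con_town, the root element record and both functions are hypothetical.

```python
import xml.etree.ElementTree as ET

# Mapping of XML element names to conceptual-schema attributes, written as
# (role, attribute). The element names come from the text; the role names
# are assumptions made for this sketch.
XML_TO_CONCEPT = {
    "supplier-city": ("supplier", "city"),
    "contractor-city": ("contractor", "city"),
}

# Hypothetical mapping of local relational columns to the same
# conceptual-schema attributes.
LOCAL_TO_CONCEPT = {
    "sup_town": ("supplier", "city"),
    "con_town": ("contractor", "city"),
}


def compose(xml_to_concept, local_to_concept):
    """Compose both mappings into a direct XML element -> column mapping."""
    concept_to_local = {c: col for col, c in local_to_concept.items()}
    return {
        element: concept_to_local[concept]
        for element, concept in xml_to_concept.items()
        if concept in concept_to_local
    }


def export_row(row, element_to_column, root_tag="record"):
    """Serialize one relational row using the composed mapping."""
    root = ET.Element(root_tag)
    for element, column in element_to_column.items():
        ET.SubElement(root, element).text = str(row[column])
    return root


element_to_column = compose(XML_TO_CONCEPT, LOCAL_TO_CONCEPT)
doc = export_row({"sup_town": "Kladno", "con_town": "Praha 1"}, element_to_column)
print(ET.tostring(doc, encoding="unicode"))
```

The local schema is mapped to the conceptual schema once; supporting another XML format then only requires that format's own mapping to the conceptual schema, not a new hand-written script per pair of representations.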
Changes to the XML formats may be required for various technical reasons (e.g. to increase their readability or integrability) or because of changes in the domain (i.e. at the conceptual level). The second case is what is currently happening in the Czech Republic. New legislation is being prepared and it will result in changes to the XML formats. Not only will new kinds of documents have to be sent to NRPP (which means creating completely new XML formats); legislative changes will also result in changes to existing XML formats. For example, a complete list of bids for a contract will be mandatorily published instead of their total number, which is required by the current legislation.

Table 8.1: Summarization of readability, integrability and adaptability properties of NRPP

Property        Evaluation of NRPP
Readability     - hierarchical nature of XML not exploited
                - XML schema definitions missing
                - conceptual schema of public procurement domain missing
                - documentation of XML formats provided (in form of Excel sheet)
Integrability   - same concepts (e.g. address) represented in different ways
                - integration via a common schema not possible
                - automatic generating of transformation scripts is not possible
Adaptability    - adaptability via a common schema and automatic propagation to XML schemas not possible
                - adaptability of integration and transformation scripts not possible

By the term XML formats adaptability we denote the measure of how easy it is to change the XML formats and react to these changes. For example, let us consider a requirement to have a street name and number for each address in two separate strings instead of a less detailed postal address in one string. In our sample XML format depicted in Fig. 8.4, this means replacing XML element cont_postal_address with the more detailed cont_street and cont_street_no. Similarly, XML element docs_postal_address must be replaced, and other XML elements representing a postal address in other XML formats need to be adapted accordingly as well. A more complex change is the one which will be caused by the new legislation. Wherever the number_of_bids attribute from the PIM schema is represented in the XML formats, it will be necessary to include detailed information about each particular bid and the bidder and to distinguish the winning bid.

To increase adaptability in this case, it is useful to have only one XML representation for a given concept (e.g. one sequence of XML element declarations for addresses) and share it across all XML formats. In that case, a change may be implemented only at this one place instead of being repeated at various places of the XML schemas. However, this is not always possible (similarly to the case of integrability). In these cases, the mappings of the XML schemas to the conceptual one would help. A change could then be made only at one place of the conceptual schema (e.g. in a class Address) and propagated to all XML schemas which involve XML representations of addresses.

The adaptability is also important for the developers of information systems which communicate with NRPP. First, when a new XML format appears, they need to develop scripts which export/import their data from/to the XML format. This is a problem similar to the integrability described above. Second, when an existing XML format is changed, they need to adapt their scripts and, possibly, their internal data representation (e.g. relational database). For example, replacement of XML element cont_postal_address with cont_street and cont_street_no may result in a corresponding change of the local relational storage and also of the SQL/XML scripts which translate the relational representation into the XML formats. Similarly, replacing the number of bids with details of particular bids may lead to changes in the local storage and scripts. As in the previous case, adaptability could be increased if a mapping of the XML formats required by NRPP and of the local database schema to the conceptual schema existed. This could inform the developer about what parts of the local storage are affected by the change or whether the local storage needs to be extended. This could also help with adapting the transformation scripts.

We summarize the properties of NRPP in Tab. 8.1 from the three viewpoints.
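The address example from the adaptability discussion above can be sketched in Python as a change made once at the conceptual level and propagated to every mapped XML element. This is a simplified illustration, not the propagation mechanism of the thesis: each XML format is reduced to a flat list of element names, and the attribute identifiers such as Address.postal_address, the mapping table and the function propagate_split are assumptions made for the sketch.

```python
# Each XML format is modeled as an ordered list of element names, and each
# element is mapped to an attribute of the conceptual schema. The format
# contents are abridged and the "Address.*" identifiers are assumptions.
FORMATS = {
    "contract_notification": [
        "cont_postal_address", "cont_city", "title",
        "docs_postal_address", "docs_city",
    ],
    "contractor_detail": ["oficial_title", "postal_address", "city"],
}

MAPPING = {
    "cont_postal_address": "Address.postal_address",
    "docs_postal_address": "Address.postal_address",
    "postal_address": "Address.postal_address",
    "cont_city": "Address.city",
    "docs_city": "Address.city",
    "city": "Address.city",
}


def propagate_split(formats, mapping, old_attr, new_attrs):
    """Replace every element mapped to old_attr by one element per new attribute.

    The sketch assumes that a mapped element name ends with the attribute
    name, so the prefix (e.g. "cont_") can be kept for the new elements.
    """
    old_name = old_attr.split(".")[1]
    updated = {}
    for format_name, elements in formats.items():
        new_elements = []
        for element in elements:
            if mapping.get(element) == old_attr:
                prefix = element[: -len(old_name)]
                new_elements.extend(prefix + attr.split(".")[1] for attr in new_attrs)
            else:
                new_elements.append(element)
        updated[format_name] = new_elements
    return updated


evolved = propagate_split(
    FORMATS, MAPPING,
    old_attr="Address.postal_address",
    new_attrs=["Address.street", "Address.street_no"],
)
for name, elements in evolved.items():
    print(name, elements)
```

The designer states the change once (postal_address is split into street and street_no), and the mapping determines which elements in which formats are affected, including cont_postal_address and docs_postal_address from Fig. 8.4.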
In the following sections, we demonstrate how to improve them by adding a common conceptual schema.

Basic PIM and PSM Schemas

A small portion of the PIM schema of the public procurement domain is depicted in Fig. 8.5. There is a class Contract which models public contracts. Further, there is a class Contact which models contact information. It is associated with the Organization class, which models organizations. Each contact is associated with one and only one organization and with many contracts. For a contract there is the main contact, contacts where documentation may be acquired, and a contact where bids should be sent. These relationships between contracts and contacts are modeled by associations main, doc, and bids, respectively.

Figure 8.5: PIM Schema of Public Procurement Domain

Two sample PSM schemas from the procurement domain are depicted in Fig. 8.6(a) and 8.6(b). They model two particular XML formats. From the conceptual perspective, components of both PSM schemas are mapped to the components of the PIM schema depicted in Fig. 8.5. The mapping specifies the semantics of the mapped components of both PSM schemas in terms of the single PIM schema. We do not display the mapping explicitly. In our example it can be deduced intuitively (but not in general). Mapped PSM components are shown in brown and the others in grey.

Improving Readability, Integrability and Adaptability

We are now ready to show how we exploited these schemas and their mapping to increase the readability, integrability and adaptability of the XML formats for public procurement in the Czech Republic. In this section we suppose the public procurement domain modeled by the PIM schema depicted in Fig. 8.5. NRPP uses 17 different XML formats for communication with other information systems. Therefore, we have created 17 PSM schemas specifying these XML formats and mapped them to the PIM schema depicted in Fig. 8.5. An example of an XML document in one of the XML formats is depicted in Fig. 8.4, as we have already discussed. One more XML format is discussed later in this section. For modeling we used our experimental implementation eXolutio which may

Figure 8.6: Two sample PSM schemas of (a) XML format for contracts viewed by regions, and (b) XML format for contractor details
Labels legible in panel (b): ContractorDetail, contractor_detail, Organization, oficial_title, postal_address, city, 0..*, Contact, 1..*, contract, Contract, title, number_of_bids, PriceTotal, PriceRange, price_total, price_from, price_to

<contracts>
  <region>prg</region>
  <contract>
    <title>Úklid...</title>
    <number_of_bids> 4 </number_of_bids>
    <contractor>
      <postal_address> Ovocny trh 3 </postal_address>
      <city>Praha 1</city>
    </contractor>
    <supplier>
      <postal_address> Kladenska 43 </postal_address>
      <city>Kladno</city>
    </supplier>
  </contract>
  <contract>...</contract>
</contracts>

<contractor_detail>
  <oficial_title> Charles Uni </oficial_title>
  <postal_address> Ovocný trh 3/5 </postal_address>
  <city>Praha 1</city>
  <contract>
    <title>Úklid...</title>
    <number_of_bids> 4 </number_of_bids>
    <price_total> </price_total>
  </contract>
  <contract>...</contract>
  <contract>...</contract>
</contractor_detail>

Figure 8.7: Two sample XML documents valid against the XML formats specified by the PSM schemas depicted in Fig. 8.6(a) and 8.6(b), respectively
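To make the relationship between the PIM schema and one PSM schema more tangible, the Python sketch below models a fragment of the PIM schema as data classes and serializes it in the shape of the contractor_detail sample document in Fig. 8.7. It is a hedged illustration only: the class and association names follow the text, but the placement of postal_address and city on Contact, the omission of the doc and bids associations and of prices, and the function to_contractor_detail are assumptions of this sketch.

```python
from dataclasses import dataclass, field
from typing import List
import xml.etree.ElementTree as ET


@dataclass
class Organization:
    oficial_title: str  # spelling follows the element name in the source


@dataclass
class Contract:
    title: str
    number_of_bids: int


@dataclass
class Contact:
    # Each contact belongs to exactly one organization and may be the main
    # contact of many contracts (association "main" in the text). Placing
    # the address attributes here is an assumption.
    organization: Organization
    postal_address: str
    city: str
    main: List[Contract] = field(default_factory=list)


def to_contractor_detail(contact):
    """Serialize a contact in the shape of the contractor_detail sample."""
    root = ET.Element("contractor_detail")
    ET.SubElement(root, "oficial_title").text = contact.organization.oficial_title
    ET.SubElement(root, "postal_address").text = contact.postal_address
    ET.SubElement(root, "city").text = contact.city
    for contract in contact.main:
        node = ET.SubElement(root, "contract")
        ET.SubElement(node, "title").text = contract.title
        ET.SubElement(node, "number_of_bids").text = str(contract.number_of_bids)
    return root


contact = Contact(
    organization=Organization("Charles Uni"),
    postal_address="Ovocný trh 3/5",
    city="Praha 1",
    main=[Contract("Úklid...", 4)],
)
detail = to_contractor_detail(contact)
print(ET.tostring(detail, encoding="unicode"))
```

The data classes play the role of the platform-independent level, while the serialization function fixes one particular XML shape, which is the role a PSM schema plays for one XML format.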
