A Methodology of A Database Schema Design Using The Subschemas

Ivan Luković
University of Novi Sad, Faculty of Technical Sciences, Novi Sad, Serbia and Montenegro
ivan@iis.ns.ac.yu

Sonja Ristić
Business College Novi Sad, Novi Sad, Serbia and Montenegro
sdristic@uns.ns.ac.yu

Pavle Mogin
Victoria University of Wellington, School of Mathematical and Computing Sciences, Wellington, New Zealand
pmogin@mcs.vuw.ac.nz

Abstract. The initial assumption is that a database schema is produced by the integration of simultaneously and independently designed subschemas. The notion of a subschema is defined using the concepts of the relational data model, on the basis of the appropriate external schema, which expresses a user view of a future database at the conceptual level. The simultaneous design of subschemas may lead to collisions in expressing the real system constraints and business rules. If such collisions exist, some of the subschemas are not formally consistent with the database schema obtained by integrating the set of subschemas. Programs made over inconsistent subschemas do not guarantee safe database updates. The aim of the paper is to present a process of database schema design using subschemas that assures the design of consistent subschemas and a formally correct database schema. A common algorithm for checking the constraint consistency between the database schema and the subschemas, at the level of a single constraint type, is also presented.

I. INTRODUCTION

Some of the main problems of database schema design are: i) how to define a set of attributes and a set of constraints that faithfully represent a real system and its business rules; ii) how to apply the complex techniques of database schema design; and iii) how to overcome the limited perception power of a designer when a real system is too complex.
To relieve or even overcome these problems, it is necessary to apply a methodological approach. This paper presents a methodology of database schema design using subschemas. After groups of similar end-user business tasks are identified, an external schema is designed for each group as the first step of the database design process. An external schema is based on a data model that is convenient for conceptual design [2]. It is difficult, and sometimes even impossible, to formalize the process of integrating external schemas: the quality of the resulting database schema depends heavily on the designer's knowledge and skill. In contrast, the integration of subschemas in the relational data model can be formalized and automated on the basis of the synthesis algorithm [2].

A subschema is defined by means of the concepts of the relational data model and is designed from the appropriate external schema. Integrating a set of subschemas creates the potential database schema. If subschemas collide in expressing the real system constraints and business rules, some of them are not formally consistent with the potential database schema.

A subschema is a part of the transaction program specification. It represents the data structure over which a transaction program is executed, while the operations of the transaction program are executed directly against the database. To allow safe database updates by a program made over a subschema, the subschema has to satisfy the conditions of formal consistency between a subschema and a database schema. Using programs made over inconsistent subschemas may lead to logically incorrect database updates. One of the aims of the paper is to show that the presented methodology can assure the design of mutually consistent subschemas and a formally correct database schema.
The consistency conditions may be expressed separately for each type of constraint. The paper also presents a common algorithm for detecting constraint inconsistencies between the database schema and a subschema, at the level of a single constraint type.

Apart from the Introduction and Conclusion, the paper has five sections. Section II formally introduces the notion of a subschema. Section III describes the principles of a database update using subschema concepts. The formal consistency conditions are presented in Section IV. Section V is concerned with the integration of subschemas, while a common algorithm for the detection of inconsistencies is presented in Section VI.

II. THE NOTION OF A SUBSCHEMA

A relational database schema is a pair $(S, I)$, where $S$ is a set of relation schemes and $I$ is a set of interrelation constraints. It is assumed in the paper that the database schema is produced using a well-defined methodological approach. Each relation scheme from $S$ is a named triple $N(R, C, K_p(R))$, where $N$ is a unique name, $R$ is an attribute set, and $C$ is a specification of constraints. A relation scheme will often be referred to simply by its name $N$. The specification of constraints $C$ is a triple $(K, \tau(N), Uniq(N))$, where $K$ is a set of keys, $\tau(N)$ is called the tuple integrity constraint, and $Uniq(N)$ is a (possibly empty) set of uniqueness constraints $Unique(N, X_i)$, where $X_i$ is a proper subset of $R$ that does not contain any key from $K$.

The tuple integrity constraint is a pair $\tau(N) = (\{\tau(N, A) \mid A \in R\}, Con(N))$, whose first component contains the attribute domain constraint $\tau(N, A)$ of each attribute $A \in R$. Each $\tau(N, A)$ is of the form $\tau(N, A) = (Dom(N, A), Null(N, A))$, where $Dom(N, A)$ is the domain constraint of attribute $A \in R$ and $Null(N, A) \in \{\bot, \top\}$ is the null-value constraint of $A$. The second component of $\tau(N)$, $Con(N)$, is a logical expression defined over the attributes from $R$ and their domain values. It must be satisfied by each tuple of an instance over $N$. A uniqueness constraint $Unique(N, X_i)$ means that each non-null value of $X_i$ must be unique in a relation over $N$. More details concerning the specification of constraints $C$ may be found in [1]. $K_p(R) \in K$ denotes the primary key of the relation scheme $N$. The interrelation constraint set $I$ may contain various types of constraints, of which the frequently used referential integrity constraint is just one.

An external schema is a formal and abstract definition of the data and constraints that are needed to make a transaction program that supports the end-user business tasks. It is based on a conceptual data model, like the ER model or a model based on form types [3]. The external schemas are designed simultaneously and independently by a number of designers. On the basis of each external schema, the appropriate subschema is designed using the relational data model. It consists of a set of relation schemes and a set of interrelation constraints. Each relation scheme of a subschema consists of a set of attributes and a set of local constraints. A role and a set of modifiable attributes are also assigned to each relation scheme.

Formally, a subschema is a named pair $P_k(S_k, I_k)$, where $P_k$ is a subschema name, $S_k$ is a set of relation schemes, and $I_k$ is a set of interrelation constraints. The set of relation schemes of a subschema $P_k$ is $S_k = \{N_i^k(R_i^k, C_i^k, K_p(R_i^k), Role(P_k, N_i^k), Mod(P_k, N_i^k)) \mid i \in \{1, \ldots, n\}\}$, where $N_i^k$ is a scheme name, $R_i^k$ is an attribute set, and $C_i^k$ is a specification of relation constraints of the form $(K_i^k, \tau(N_i^k), Uniq(N_i^k))$. Here $K_i^k$ is a set of keys, $\tau(N_i^k)$ is the tuple integrity constraint, and $Uniq(N_i^k)$ is a (possibly empty) set of uniqueness constraints $Unique(N_i^k, X_i)$, where $X_i$ is a proper subset of $R_i^k$ that does not contain any key from $K_i^k$.
The tuple integrity constraint $\tau(N_i^k)$ is a pair whose first component contains the attribute domain constraints of each attribute $A \in R_i^k$. The second component of the pair is a logical expression defined over the attributes from $R_i^k$ and their domain values. $K_p(R_i^k)$ is the primary key. A detailed explanation of the relation constraint specifications is beyond the scope of the paper and can be found in [1], [2] and [3].

$Role(P_k, N_i^k)$ is a set of relation scheme roles. It defines the operations that may be performed on an instance of the relation scheme $N_i^k$. Only these operations may be built into a transaction program made using the concepts of the subschema $P_k$. A set of relation scheme roles is a nonempty set for which $Role(P_k, N_i^k) \subseteq \{r, i, m, d\}$ holds, where r stands for data reading (i.e. referencing), i for insertion, m for modification and d for deletion. A subschema $P_k$ is intended for database querying only if $(\forall N_i^k \in S_k)(Role(P_k, N_i^k) = \{r\})$ holds. Otherwise, it is intended for both updating and querying. The set $Mod(P_k, N_i^k)$ contains those attributes of the relation scheme $N_i^k$ that may be modified. If $m \in Role(P_k, N_i^k)$, then $Mod(P_k, N_i^k) \neq \emptyset$ must hold.

A subschema is a part of a program specification. The transaction program specification should be designed so that it supports the development of a functionally correct transaction program. More information concerning the notion and the role of a subschema may be found in [1], [2], [3], [4] and [5].

Example 1. Let subschema $P_1$ be associated with a business process whose task is to control domestic orders. It is intended for the entry of domestic orders and shipments within one transaction. In this process, users are not interested in customer data, except for the customer id. The specification of subschema $P_1$ follows.
$S_1 = \{Order\_d\_s(R_1^1, C_1^1, \ldots), Shipment(R_2^1, C_2^1, \ldots)\}$;
$R_1^1 = \{OrdId, OrDate, CustId, Origin, Total\}$;
$C_1^1 = (\{\{OrdId\}\}, \tau(Order\_d\_s), \emptyset)$; $K_p(R_1^1) = \{OrdId\}$;
$R_2^1 = \{ShipId, OrdId, ShipDate, ShipTotal\}$;
$C_2^1 = (\{\{ShipId\}\}, \tau(Shipment), \emptyset)$; $K_p(R_2^1) = \{ShipId\}$;
$Role(P_1, Order\_d\_s) = \{i, r\}$, $Role(P_1, Shipment) = \{i, r\}$;
$Mod(P_1, Order\_d\_s) = Mod(P_1, Shipment) = \emptyset$;
$I_1 = \{Shipment[OrdId] \subseteq Order\_d\_s[OrdId], Order\_d\_s[OrdId] \subseteq Shipment[OrdId]\}$.

The domain of the attribute Origin in the relation scheme Order_d_s is $dom(Origin) = \{d\}$. Its only value, d, denotes domestic orders. $P_1$ also contains the inclusion dependency $Order\_d\_s[OrdId] \subseteq Shipment[OrdId]$. These two constraints denote that the process related to the subschema $P_1$ deals only with shipped orders issued by a domestic customer.

A database schema and a subschema contain the following concepts:
- A set of database schema attributes: $U = \bigcup_{N_j \in S} R_j$;
- A set of subschema attributes: $U_k = \bigcup_{N_i^k \in S_k} R_i^k$;
- Sets of relation scheme attribute sets: $\mathcal{S} = \{R_j \mid j \in \{1, \ldots, t\}\}$ for the database schema, and $\mathcal{S}_k = \{R_i^k \mid i \in \{1, \ldots, n\}\}$ for the subschema; and
- Sets of constraints: $O = I \cup \left(\bigcup_{j=1}^{t} C_j\right)$ for the database schema, and $O_k = I_k \cup \left(\bigcup_{i=1}^{n} C_i^k\right)$ for the subschema.

The sets $\mathcal{S}_k$ and $O_k$ determine the set of instances of the subschema $P_k$, whereas the sets $\mathcal{S}$ and $O$ determine the set of instances of the database schema $(S, I)$. In the paper, $O^+$ will denote the set of all logical consequences of a set of constraints $O$, and $O^+|_X$ will denote all the constraints from $O^+$ that are defined using only the attributes from $X$.
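The structure of a subschema described in this section can be sketched as a small data model. The following Python sketch is only an illustration under stated assumptions: the class names `RelationScheme` and `Subschema`, the helper `fs` and the method names are hypothetical and do not come from the paper, and the tuple integrity and uniqueness constraints are omitted. It encodes subschema $P_1$ from Example 1 and checks the rules stated above for $Role$ and $Mod$.

```python
from dataclasses import dataclass

ROLES = frozenset({"r", "i", "m", "d"})  # read, insert, modify, delete


def fs(*items):
    """Shorthand for building a frozenset from its elements."""
    return frozenset(items)


@dataclass(frozen=True)
class RelationScheme:
    """A subschema relation scheme with Role and Mod (tau and Uniq are omitted)."""
    name: str
    attributes: frozenset
    keys: frozenset               # set of keys; each key is a frozenset of attributes
    primary_key: frozenset        # K_p(R)
    role: frozenset               # Role(P_k, N)
    mod: frozenset = frozenset()  # Mod(P_k, N)

    def problems(self):
        """Return the well-formedness rules this scheme violates (empty if none)."""
        found = []
        if not self.role or not self.role <= ROLES:
            found.append("Role must be a nonempty subset of {r, i, m, d}")
        if self.primary_key not in self.keys:
            found.append("the primary key must be one of the keys")
        if not self.mod <= self.attributes:
            found.append("Mod must contain only attributes of the scheme")
        if "m" in self.role and not self.mod:
            found.append("m in Role requires a nonempty Mod")
        return found


@dataclass(frozen=True)
class Subschema:
    """A named pair P_k(S_k, I_k); each dependency is (lhs, attrs, rhs, attrs)."""
    name: str
    schemes: tuple
    inclusion_dependencies: tuple = ()

    def is_query_only(self):
        """True if every relation scheme has exactly the role {r}."""
        return all(s.role == fs("r") for s in self.schemes)

    def attributes(self):
        """U_k: the union of the attribute sets of all relation schemes."""
        return frozenset().union(*(s.attributes for s in self.schemes))


# Subschema P_1 from Example 1.
order_d_s = RelationScheme(
    name="Order_d_s",
    attributes=fs("OrdId", "OrDate", "CustId", "Origin", "Total"),
    keys=fs(fs("OrdId")),
    primary_key=fs("OrdId"),
    role=fs("i", "r"),
)
shipment = RelationScheme(
    name="Shipment",
    attributes=fs("ShipId", "OrdId", "ShipDate", "ShipTotal"),
    keys=fs(fs("ShipId")),
    primary_key=fs("ShipId"),
    role=fs("i", "r"),
)
p1 = Subschema(
    name="P1",
    schemes=(order_d_s, shipment),
    inclusion_dependencies=(
        ("Shipment", ("OrdId",), "Order_d_s", ("OrdId",)),
        ("Order_d_s", ("OrdId",), "Shipment", ("OrdId",)),
    ),
)

print(order_d_s.problems(), shipment.problems())
print("query only:", p1.is_query_only())
print("U_1:", sorted(p1.attributes()))
```

Since both relation schemes of $P_1$ carry the role $\{i, r\}$, the sketch reports that $P_1$ is not a query-only subschema, and the empty $Mod$ sets are acceptable because m is not among the roles.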

III. ON THE SAFE DATABASE UPDATES

A potential database schema is created by the integration of the designed set of subschemas. It is a relational database schema, which consists of a set of relation schemes and a set of interrelation constraints. The set of relation schemes is generated by the synthesis algorithm. The sets of relation and interrelation constraints are generated from the appropriate sets of subschema constraints. The simultaneous and independent design of each subschema may lead to collisions in expressing the real system constraints and business rules in different subschemas. If such collisions exist, some of the subschemas are not formally consistent with the potential database schema. Consequently, the programs made over the inconsistent subschemas do not allow safe database updates, i.e. their use may lead to logically incorrect database updates. Accordingly, such a potential schema must not be accepted as the resulting database schema after the integration of the set of subschemas. The explanation of the notion of a safe database update follows.

A subschema is a description of the data of a relatively small part of the database. Each relation scheme of a subschema may be considered as a view on a single database relation scheme. Subschema instances are not materialized. A subschema instance may be obtained by applying the appropriate join, select and project operations to a database instance. A transaction program issues queries and updates that are executed by a database management system (DBMS). Let $T_k$ be a transaction program based on subschema concepts, and let $T$ be a transaction program that is equivalent to $T_k$ but based on database schema concepts. To consider database updates initiated by $T_k$ as safe updates, the subschema and the database schema should satisfy the following two conditions at the abstraction level of instances:
1. A unique (hypothetical) subschema instance, named the corresponding subschema instance, may be produced by applying the appropriate relational join, project and select operations to a database schema instance; and
2. If an update of a hypothetical subschema instance issued by $T_k$ would be successful, then $T$ must be committed by the DBMS.

If a subschema is intended for queries only, it has to satisfy only Condition 1. In the paper, these conditions are called the principles of a database update using subschema concepts. Their formal definition at the abstraction level of instances is given in [2] and [12]. A subschema that satisfies these conditions is said to be consistent with the corresponding database schema. Let $\mathcal{P}$ be the set of subschemas and let the potential database schema be the result of the integration of the subschemas from $\mathcal{P}$. A potential database schema that is consistent with all of the subschemas from $\mathcal{P}$ may be declared a database schema.

Example 2. Let us consider the subschema $P_1$ from Example 1 and a potential database schema $(S, I)$. Let $S$ be the result of the integration of $P_1$ and some other subschemas. The structure of those subschemas is not relevant in the example.

$S = \{ORDER(R_1, C_1, \ldots), SHIPMENT(R_2, C_2, \ldots), CUSTOMER(R_3, C_3, \ldots)\}$;
$R_1 = \{OrdId, OrDate, CustId, Origin, Total\}$;
$C_1 = (\{\{OrdId\}\}, \tau(ORDER), \emptyset)$; $K_p(R_1) = \{OrdId\}$;
$R_2 = \{ShipId, OrdId, ShipDate, ShipTotal\}$;
$C_2 = (\{\{ShipId\}\}, \tau(SHIPMENT), \emptyset)$; $K_p(R_2) = \{ShipId\}$;
$R_3 = \{CustId, CustName, CustAdrr\}$;
$C_3 = (\{\{CustId\}\}, \tau(CUSTOMER), \emptyset)$; $K_p(R_3) = \{CustId\}$;
$I = \{ORDER[CustId] \subseteq CUSTOMER[CustId], SHIPMENT[OrdId] \subseteq ORDER[OrdId], CUSTOMER[CustId] \subseteq ORDER[CustId]\}$.

$P_1$ and $(S, I)$ satisfy Condition 1. For each relation scheme $N^1$ from $P_1$, there is a corresponding relation scheme $N$ from $S$ such that $R^1 \subseteq R$ holds, where $R^1$ and $R$ are the attribute sets of $N^1$ and $N$, respectively.
The corresponding relation scheme for Order_d_s is ORDER, and SHIPMENT is the corresponding scheme for Shipment. Let $T_1$ be a transaction program based on the concepts of subschema $P_1$, aimed at the insertion of tuples into an instance of Order_d_s. Let $T$ be a transaction program that is equivalent to $T_1$ but based on the database schema concepts. $T$ inserts tuples into the instance of ORDER that corresponds to the instance of Order_d_s. $T_1$ allows the insertion of a tuple with any domain value for the attribute CustId. Suppose that an instance over Order_d_s is successfully updated by $T_1$ with a CustId value for which the database relation CUSTOMER does not contain a tuple. However, the set of database constraints contains the referential integrity constraint $ORDER[CustId] \subseteq CUSTOMER[CustId]$. The equivalent transaction $T$ over the database instance would be rejected, since otherwise this constraint would be violated. Thus, there is a successful update of a hypothetical subschema instance executed by $T_1$ that would not be committed by a DBMS. Consequently, $P_1$ is not consistent with the potential database schema $(S, I)$, and $(S, I)$ cannot be declared a database schema.

IV. THE FORMAL CONSISTENCY

A subschema and a database schema are formally consistent if:
1. The set of attributes of each subschema relation scheme is a subset of the attribute set of the corresponding database relation scheme;
2. Each set of attributes $X$ with the unique value property (as defined in [1] and [2]) in a subschema relation scheme has the same property in the corresponding database relation scheme; and
3. All the constraints that can be inferred from the database schema and that are relevant for the subschema are embedded into it.

A formalization of the first and the second condition can be found in [2]. Their satisfaction is a prerequisite for validating the third condition, which can be expressed by the logical implication

(1) $O_k \models O_r^{P_k}$,

where $O_k$ is the set of all constraints of the subschema $P_k$, and $O_r^{P_k}$ is the set of all database schema constraints that are relevant for $P_k$.

The most important components of the specification of a constraint $o \in O$ are expressed by a set $T(o)$ in the following way:

$T(o) = \{(N_1, \rho_1, At_1, \{(op_1^{i_1}, act_1^{i_1}) \mid i_1 \geq 1\}), \ldots, (N_m, \rho_m, At_m, \{(op_m^{i_m}, act_m^{i_m}) \mid i_m \geq 1\})\}$.

In the four-tuple $(N_j, \rho_j, At_j, \{(op_j^{i_j}, act_j^{i_j}) \mid i_j \geq 1\})$, $N_j$ is the name of a relation scheme that is spanned by $o$, $\rho_j \in \{referenced, referencing, \ldots\}$ is the role of $N_j$ in $o$, $At_j$ is a set or sequence of attributes from $R_j$ that are relevant for $o$, and $\{(op_j^{i_j}, act_j^{i_j}) \mid i_j \geq 1\}$ is a set of pairs (critical operation, activity). An attribute $A$ is relevant for $o$ if $o$ is used to check values of $A$. An operation $op_j^{i_j} \in \{insert, delete, update\}$ is critical if it can violate the constraint, and $act_j^{i_j} \in \{NoAction, Cascade, SetDefault, SetNull\}$ is an activity that preserves data consistency when a violation is attempted.

A constraint $o$ should belong to the set of relevant constraints for a subschema $P_k(S_k, I_k)$ if an operation that might violate $o$ is allowed in $P_k$. There are two kinds of relevant constraints: the inclusive, denoted by $Ini(O, P_k)$; and the extensible, denoted by $Exi(O, P_k)$. Suppose a constraint $o \in O^+$ is relevant for a subschema $P_k$ that satisfies Conditions 1 and 2. The constraint $o$ belongs to $Ini(O, P_k)$ if it can be expressed by the concepts of subschema $P_k$. This means that for each relation scheme $N_j$ appearing in $T(o)$ there is a subschema relation scheme $N_i^k \in S_k$ such that $R_i^k \subseteq R_j$ and $At_j \subseteq R_i^k$ hold.
A constraint $o$ belongs to $Exi(O, P_k)$ if and only if it is relevant for $P_k$ and $o \notin Ini(O, P_k)$ holds.

Definition 1. A subschema $P_k$ is formally consistent with a database schema if Conditions 1 and 2 hold, and if the following conditions are satisfied:

(2) $O_k \models Ini(O, P_k)$,
(3) $Exi(O, P_k) = \emptyset$.

It is proved in [2] that the formal consistency of a subschema and a database schema is a necessary condition for the satisfaction of the update principles. It follows that a database schema design process should adhere to the formal consistency conditions to assure the design of consistent subschemas and a formally correct database schema.

V. THE INTEGRATION OF SUBSCHEMAS

A general solution of the implicational problem in the presence of different constraint types is very hard to find, if it is possible at all. Testing the formal consistency may be relaxed by considering the implicational problem for each constraint type separately. We believe that the approach, relaxed in this way, may lead to good database schema design practice. After the potential database schema is created, the automatic detection of collisions is performed. Then, the designers are directed to redesign the formally incorrect external schemas. The process is iterative and stops when no collisions remain. The procedure of the database schema integration is outlined in Figure 1.

[Figure 1. The elements of the database schema integration. Diagram labels: the set of external schemas; the set of subschemas; information requests; the specifications (diagrams) of processes and dataflows; potential database schema; decision "The formal consistency?" with branches NO and YES; implementation database schema; the definition of the database schema and internal schema in DDL; the set of conceptual specifications of transaction programs and applications; internal database schema; the set of implementation specifications of transaction programs and applications.]

VI. AN ALGORITHM FOR THE DETECTION OF INCONSISTENCIES

A common algorithm for the detection of constraint inconsistencies between a subschema and the database schema at the level of a single constraint type, named the DCI algorithm, is presented in Figure 2. $Ini(O, T, P_k)$ denotes the set of inclusive relevant constraints of the specified constraint type $T$ for the subschema $P_k$, while $Exi(O, T, P_k)$ denotes the set of extensible relevant constraints of the type $T$ for $P_k$. $Ini(O, T, P_k)$ contains the database constraints that can be expressed by the concepts of the subschema $P_k$. To each $o \in Ini(O, T, P_k)$, the function $tc(o, P_k)$ associates the corresponding constraint, expressed by means of the concepts of the subschema $P_k$.
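The split of relevant constraints into inclusive and extensible ones can be illustrated on the data of Examples 1 and 2. The Python sketch below is a simplification under several assumptions that are not taken from the paper: the names `Span`, `Constraint`, `SubScheme`, `classify` and `OP_TO_ROLE` are hypothetical, each database relation scheme has at most one corresponding subschema relation scheme, the activities of $T(o)$ are omitted, and the critical operations of an inclusion dependency are taken to be insert and update on the referencing side and delete and update on the referenced side.

```python
from dataclasses import dataclass

# Assumed mapping from critical operations to the subschema roles that allow them.
OP_TO_ROLE = {"insert": "i", "update": "m", "delete": "d"}


def fs(*items):
    return frozenset(items)


@dataclass(frozen=True)
class Span:
    """One four-tuple of T(o), without the activities."""
    scheme: str              # N_j, a database relation scheme spanned by o
    role: str                # rho_j, e.g. "referencing" or "referenced"
    attrs: frozenset         # At_j
    critical_ops: frozenset  # operations on N_j that can violate o


@dataclass(frozen=True)
class Constraint:
    name: str
    spans: tuple


@dataclass(frozen=True)
class SubScheme:
    """A subschema relation scheme and the database scheme it corresponds to."""
    name: str
    corresponds_to: str
    attrs: frozenset
    role: frozenset


def is_relevant(o, subschema):
    """o is relevant if some critical operation of o is allowed by the subschema."""
    views = {s.corresponds_to: s for s in subschema}
    return any(
        OP_TO_ROLE[op] in views[span.scheme].role
        for span in o.spans if span.scheme in views
        for op in span.critical_ops
    )


def is_inclusive(o, subschema, db_attrs):
    """Every N_j in T(o) has a subschema scheme with R_i <= R_j and At_j <= R_i."""
    views = {s.corresponds_to: s for s in subschema}
    return all(
        span.scheme in views
        and views[span.scheme].attrs <= db_attrs[span.scheme]
        and span.attrs <= views[span.scheme].attrs
        for span in o.spans
    )


def classify(constraints, subschema, db_attrs):
    """Split the relevant constraints into (Ini, Exi), returned as lists of names."""
    ini, exi = [], []
    for o in constraints:
        if is_relevant(o, subschema):
            (ini if is_inclusive(o, subschema, db_attrs) else exi).append(o.name)
    return ini, exi


def inclusion(lhs, rhs, attr):
    """Inclusion dependency lhs[attr] <= rhs[attr] with assumed critical operations."""
    return Constraint(
        name=f"{lhs}[{attr}] <= {rhs}[{attr}]",
        spans=(
            Span(lhs, "referencing", fs(attr), fs("insert", "update")),
            Span(rhs, "referenced", fs(attr), fs("delete", "update")),
        ),
    )


DB_ATTRS = {
    "ORDER": fs("OrdId", "OrDate", "CustId", "Origin", "Total"),
    "SHIPMENT": fs("ShipId", "OrdId", "ShipDate", "ShipTotal"),
    "CUSTOMER": fs("CustId", "CustName", "CustAdrr"),
}
DB_CONSTRAINTS = [
    inclusion("ORDER", "CUSTOMER", "CustId"),
    inclusion("SHIPMENT", "ORDER", "OrdId"),
    inclusion("CUSTOMER", "ORDER", "CustId"),
]
P1 = [
    SubScheme("Order_d_s", "ORDER", DB_ATTRS["ORDER"], fs("i", "r")),
    SubScheme("Shipment", "SHIPMENT", DB_ATTRS["SHIPMENT"], fs("i", "r")),
]

ini, exi = classify(DB_CONSTRAINTS, P1, DB_ATTRS)
print("Ini:", ini)
print("Exi:", exi)
```

Under these assumptions the referential integrity constraint on CustId is classified as extensible for $P_1$, because $P_1$ has no relation scheme corresponding to CUSTOMER. This agrees with the inconsistency discussed in Example 2.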

THE ALGORITHM FOR THE DETECTION OF CONSTRAINT INCONSISTENCIES

Input:
    $T$: a constraint type
    $\{O_T^k \mid P_k \in \mathcal{P}\}$: $O_T^k$ is the set of globally valid constraints of type $T$ of the subschema $P_k$; $\mathcal{P}$ is the set of all subschemas
    $O_T$: the set of database schema constraints of type $T$
Output:
    $Po$: a set of potentially inconsistent constraints
    $Psh$: a set of triples $(P_k, o, A)$, where $P_k$ is a potentially inconsistent subschema, $o$ is the database schema constraint that causes $P_k$ to be potentially inconsistent, and $A \in \{\top, \bot\}$ is $\top$ if $P_k$ contains $o$ and $\bot$ if $P_k$ does not contain $o$
    $Error$: a Boolean indicator; $\top$ means that the design process has to be stopped, $\bot$ means that the process goes on

BEGIN PROCESS consistency_checking
    $Po \leftarrow \emptyset$
    $Psh \leftarrow \emptyset$
    DO subschema_checking $(\forall P_k \in \mathcal{P})$
        DO ini_constraint_checking $(\forall o \in Ini(O, T, P_k))$
            IF $tc(o, P_k) \notin (O_T^k)^+$ THEN
                $Po \leftarrow Po \cup \{o\}$
                $Psh \leftarrow Psh \cup \{(P_k, o, \bot)\}$
        END DO ini_constraint_checking
        DO exi_constraint_checking $(\forall o \in Exi(O, T, P_k))$
            $Po \leftarrow Po \cup \{o\}$
            $Psh \leftarrow Psh \cup \{(P_k, o, \bot)\}$
        END DO exi_constraint_checking
    END DO subschema_checking
    IF $Po = \emptyset$ THEN
        $Error \leftarrow \bot$
    ELSE
        DO if_inconsistent_subschema $(\forall P_k \in \mathcal{P})$
            DO for_constraint $(\forall o \in Po \setminus \{o \in Po \mid (P_k, o, \bot) \in Psh\})$
                IF $o \in O_T|_{S_k}$ THEN
                    IF $tc(o, P_k) \in (O_T^k)^+$ THEN
                        $Psh \leftarrow Psh \cup \{(P_k, o, \top)\}$
            END DO for_constraint
        END DO if_inconsistent_subschema
        $Error \leftarrow \top$
END PROCESS consistency_checking

Figure 2. The pseudo code of DCI Algorithm

A subschema $P_k$ is potentially inconsistent with the database schema if:

(4) $(\exists o \in Ini(O_T, T, P_k))(\neg(O_T^k \models o))$; or
(5) $Exi(O_T, T, P_k) \neq \emptyset$.

A database constraint $o$ for which $o \in Ini(O_T, T, P_k)$ and $\neg(O_T^k \models o)$ hold, or for which $o \in Exi(O_T, T, P_k)$ holds, for any subschema $P_k$, is potentially inconsistent. For each potentially inconsistent constraint, the designer has to decide whether it should be embodied into the database schema. If the decision is positive, the potentially inconsistent constraint must be embodied into all the subschemas for which it is relevant. Otherwise, the potentially inconsistent constraint must not be embodied into the set of database constraints.
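The two passes of the DCI algorithm can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `dci` and its parameters are hypothetical, the sets $Ini$ and $Exi$ are supplied as precomputed dictionaries, membership in the closure $(O_T^k)^+$ is delegated to a caller-supplied predicate `implies`, and the test on $o$ in the second pass is read as "$o$ is expressible in $P_k$" and delegated to a predicate `expressible`. The subschema `P2` in the usage example is invented for illustration.

```python
def dci(subschemas, ini, exi, tc, implies, expressible):
    """Sketch of the DCI algorithm for one constraint type T.

    subschemas  -- dict: subschema name -> set of its globally valid constraints
    ini, exi    -- dict: subschema name -> inclusive / extensible relevant constraints
    tc          -- tc(o, p): constraint o expressed in the concepts of subschema p
    implies     -- implies(constraints, c): True if c follows from constraints
    expressible -- expressible(o, p): assumed reading of the test on o in pass two
    Returns (po, psh, error).
    """
    po, psh = set(), set()

    # First pass: collect the potentially inconsistent constraints.
    for p, own in subschemas.items():
        for o in ini[p]:
            if not implies(own, tc(o, p)):
                po.add(o)
                psh.add((p, o, False))
        for o in exi[p]:
            po.add(o)
            psh.add((p, o, False))

    if not po:
        return po, psh, False

    # Second pass: mark the subschemas that already embody such a constraint.
    for p, own in subschemas.items():
        for o in po:
            if (p, o, False) in psh:
                continue
            if expressible(o, p) and implies(own, tc(o, p)):
                psh.add((p, o, True))
    return po, psh, True


# Constraints are plain strings in database schema terms (an assumption).
A = "ORDER[CustId] <= CUSTOMER[CustId]"
B = "SHIPMENT[OrdId] <= ORDER[OrdId]"
C = "CUSTOMER[CustId] <= ORDER[CustId]"

# P1 follows Example 1; P2 is a hypothetical subschema that embodies A and C.
subschemas = {"P1": {B}, "P2": {A, C}}
ini = {"P1": [B], "P2": [A, C]}
exi = {"P1": [A], "P2": []}

po, psh, error = dci(
    subschemas, ini, exi,
    tc=lambda o, p: o,                # constraints already share one vocabulary
    implies=lambda own, c: c in own,  # closure approximated by set membership
    expressible=lambda o, p: o in ini[p],
)
print(sorted(po))
print(sorted(psh))
print(error)
```

In this run the constraint on CustId is reported as potentially inconsistent. `P1` is recorded as not containing it, while the hypothetical `P2` is recorded as containing it, so the designer has to decide whether the constraint stays in the database schema.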
It must be emphasized that subschema constraints may be stronger, but not weaker, than the corresponding database constraints. Consequently, some of the subschema constraints may not be embodied into the database schema. A subschema constraint is considered locally valid if it is embodied into the subschema but must not be embodied into the database schema. Subschema constraints that are embodied into the database schema are considered globally valid.

Let us consider a potentially inconsistent constraint and a subschema into which it has already been embodied as a relevant one. There are two possible solutions: the potentially inconsistent constraint may be excluded from the subschema; or it may be declared a locally valid constraint for the subschema. In the first step of the integration process, all subschema constraints are declared globally valid. In the subsequent iterations, some of them may be declared locally valid.

Finally, it can be concluded that there are three possible kinds of relationships between a subschema $P_k$ and a potentially inconsistent database constraint $o$:
- A potentially inconsistent constraint $o$ is not relevant for $P_k$ and, consequently, $P_k$ is not potentially inconsistent with the database schema with respect to $o$. The designer need not redesign the subschema $P_k$, but will probably need to redesign some other subschema.
- A potentially inconsistent constraint $o$ is relevant for $P_k$, but it is not embodied into the set of subschema constraints of $P_k$. $P_k$ is potentially inconsistent, and the designer may redesign it by embedding $o$ into its set of constraints.
- A potentially inconsistent constraint $o$ is relevant for $P_k$ and is embodied into the set of subschema constraints of $P_k$, but there is some other subschema $P_l$ for which $o$ is also relevant and into which it is not embodied. $P_k$ has "carried" $o$ into the set of database constraints. Accordingly, $P_k$ is potentially inconsistent.
A designer may redesign it by excluding $o$ from its set of constraints or by declaring $o$ a locally valid constraint for the subschema $P_k$.

VII. CONCLUSION

The most important advantages of the presented concept of database schema design using subschemas are:
- A set of user requests is divided into groups of similar end-user business tasks, to reduce the complexity of the real system and to overcome the limited perception power of designers;
- For each of those task groups, an external schema is designed using a data model convenient for conceptual design, to relieve the problem of defining a set of attributes and a set of constraints that faithfully represent a real system and its business rules;
- A subschema is defined using the concepts of the relational data model, according to the appropriate external schema. The potential database schema is created by the integration of the designed set of subschemas. The relational data model enables the automation of the process of subschema integration. Consequently, the designers can concentrate on the semantics of the external schemas, rather than on the formal problems and complex techniques of designing an integrated database schema.

The relationship between a database schema and a subschema is formalized by the notion of the database update principles and by the formal consistency of a subschema and the database schema. The formal consistency is a necessary condition for the database update principles. A subschema is a component of the transaction program specification. Thus, the process of its design should adhere to the formal consistency conditions. One of the consequences is that the set of subschema constraints must imply all those database schema constraints that might be violated by the update operations allowed in the subschema.

Checking the formal consistency is relaxed by considering the implicational problem for each constraint type separately. The common algorithm for the detection of constraint inconsistencies between the database schema and a subschema, presented in the paper, operates at the level of a single constraint type.

Future work should lead towards: the sufficient conditions that will imply the database update principles; and specific algorithms for different constraint types, based on the common algorithm presented in the paper.

VIII. REFERENCES

[1] Luković I., Mogin P., Govedarica M., Ristić S., "The Structure of A Subschema and Its XML Specification", in Proceedings of the XIII International Conference on Information and Intelligent Systems, Varaždin, Croatia, September 2002, pp. 45-56.
[2] Ristić S., A Research of Subschema Consolidation Problem, PhD Thesis, University of Novi Sad, Faculty of Economics, Subotica, Yugoslavia, 2003.
[3] Mogin P., Luković I., Govedarica M., Database Design Principles, University of Novi Sad, Faculty of Technical Sciences & MP "Stylos", Novi Sad, Yugoslavia, 2000.
[4] Luković I., Mogin P., "On The Role of Subschema as A Component of The Implementation Specification of A Program", in Proceedings of the VI Symposium on Computer Science and Information Technologies YUINFO, Kopaonik, Yugoslavia, March 2000, on CD ROM.
[5] Mogin P., Luković I., "An Approach to Database Design", International Journal of INDUSTRIAL SYSTEMS, Vol. 1, No. 2, Novi Sad, Yugoslavia, December 1999, pp. 59-68.
[6] Codd E. F., The Relational Model for Database Management Version 2, Addison-Wesley Publishing Company, USA, 1990.
[7] Langerak R., "View Updates in Relational Databases with An Independent Scheme", ACM Transactions on Database Systems, Vol. 15, No. 1, 1990, pp. 40-66.
[8] Dayal U., Bernstein P., "On the Correct Translation of Updates on the Relational Views", ACM Transactions on Database Systems, Vol. 8, No. 3, 1988, pp. 339-365.
[9] Bancilhon F., Spyratos N., "Update Semantics of Relational Views", ACM Transactions on Database Systems, Vol. 6, No. 4, 1981, pp. 557-575.
[10] Luković I., Mogin P., Ristić S., "A Database Schema Design Using The Subschemas", in Proceedings of the XII International Conference Industrial Systems IS 2002, November 22-23, 2002, Vrnjačka Banja, Yugoslavia, pp. 340-347.
[11] Ristić S., Luković I., Mogin P., "The Detection of Database Constraint Inconsistencies", in Proceedings of the XII International Conference Industrial Systems IS 2002, November 22-23, 2002, Vrnjačka Banja, Yugoslavia, pp. 348-353.
[12] Ristić S., Mogin P., Luković I., "Specifying Database Updates Using A Subschema", in Proceedings of the VII International Conference On Intelligent Engineering Systems INES 2003, March 4-6, 2003, Assiut-Luxor, Egypt, pp. 203-212.