3 Data Models and Data Sublanguages


3.1 INTRODUCTION

The data model and the associated data sublanguage are the basic means by which data independence is achieved: the data model (DM) is the user's view of what is in the database, and the data sublanguage (DSL) is the user's language for transferring data between the data model and his workspace. (For simplicity we shall ignore the data submodel throughout this chapter.) The question is: What form should the DM and accompanying DSL take? There are currently three favoured approaches to this problem.

1. The relational approach
2. The hierarchical approach
3. The network approach

In this chapter we shall investigate these three approaches, at least in outline. That is, we shall explain the three types of DM and consider in each case the features which must be available in the accompanying DSL. Note, however, that each of the three will be covered in considerably more detail in later parts of the book.

As explained in the preface, the emphasis of this book is on the relational approach. The present chapter is intended to demonstrate some (not all) of the advantages which are claimed for this approach. In fairness to the other two approaches, however, the major disadvantage should also be stated here (it was mentioned in the preface). It is simply this: As yet, no large-scale implementations exist; therefore, we lack practical experience in the effectiveness of the relational approach. The other two approaches, for which many implementations do exist, are to some extent the historical consequence of tradeoffs which had to be made between what was desirable at the user interface and what could be efficiently implemented. However, as explained in the preface, it now seems feasible to produce an efficient implementation of the relational approach, too. Thus this approach may be viewed not only as a useful basis for understanding database systems in general but also as a serious candidate for future implementation.
(The utility of such an implementation would, of course, depend on considerations of performance and compatibility with installed systems, over and above the aspects discussed in this book.)

For ease of presentation, the relational model and the corresponding DSL are treated separately in Sections 3.2 and 3.6, respectively.

3.2 THE RELATIONAL MODEL

The relational approach is based on the mathematical theory of relations. This clearly provides a sound theoretical foundation; it means that all the results of relation theory may be applied directly to such problems as the design of the DSL (for example). On the other hand, it is perhaps slightly unfortunate that the terminology employed is taken directly from the theory, so that in some places the user is faced with having to learn new terms for familiar concepts. However, this should not prove a major barrier to understanding what is, after all, a very simple set of ideas.

In mathematics the term relation may be defined as follows: Given sets D1, D2, ..., Dn (not necessarily distinct), R is a relation on these n sets if it is a set of ordered n-tuples <d1, d2, ..., dn> such that d1 belongs to D1, d2 belongs to D2, ..., dn belongs to Dn. Sets D1, D2, ..., Dn are called the domains of R. The value n is called the degree of R.

Figure 3.1 illustrates a relation called PART which is defined on domains P# (part number), PNAME (part name), COLOR (part color) and WEIGHT (part weight).
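The definition above can be made concrete with a small sketch. It assumes, purely for illustration, that domains are modeled as Python sets and a relation as a set of tuples; the names `DOMAINS`, `is_relation_on` and `degree` are hypothetical, and only the first three PART tuples of Fig. 3.1 are used.

```python
from itertools import product

# Domains modeled as Python sets (an illustrative assumption); the values
# cover the first three PART tuples of Fig. 3.1.
P_NUM = {"P1", "P2", "P3"}
PNAME = {"Nut", "Bolt", "Screw"}
COLOR = {"Red", "Green", "Blue"}
WEIGHT = {12, 17}

DOMAINS = (P_NUM, PNAME, COLOR, WEIGHT)

# A relation is a set of ordered n-tuples, one component per domain.
PART = {
    ("P1", "Nut", "Red", 12),
    ("P2", "Bolt", "Green", 17),
    ("P3", "Screw", "Blue", 17),
}


def is_relation_on(tuples, domains):
    """True if every tuple draws its i-th component from the i-th domain."""
    return all(
        len(t) == len(domains) and all(v in d for v, d in zip(t, domains))
        for t in tuples
    )


def degree(domains):
    """The degree of a relation is the number of underlying domains."""
    return len(domains)


# Equivalent formulation: a relation is a subset of the Cartesian product
# of its domains.
IS_SUBSET_OF_PRODUCT = PART <= set(product(*DOMAINS))
```

Because `PART` is a set, duplicate tuples cannot occur and no ordering of tuples is implied.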

PART
P#   PNAME   COLOR   WEIGHT
P1   Nut     Red     12
P2   Bolt    Green   17
P3   Screw   Blue    17
P4   Screw   Red     14
P5   Cam     Blue    12
P6   Cog     Red     19

Fig. 3.1 The relation PART

COMPONENT
MAJOR_P#   MINOR_P#   QUANTITY
P1         P2         2
P1         P4         4
P5         P3         1
P3         P6         3
P6         P1         9
P5         P6         8
P2         P4         3

Fig. 3.2 The relation COMPONENT

The domain COLOR, for example, is the set of all valid part colors. The degree of PART is 4. As Fig. 3.1 illustrates, it is convenient to represent a relation as a table, each row representing one n-tuple. (Henceforth we shall generally refer to tuples¹, dropping the n-prefix, unless there is a chance of ambiguity.) The relation PART consists of 6 tuples (4-tuples in this case), shown as 6 rows. It is important to observe the following.

1. No two rows (tuples) are identical.
2. The ordering of rows (tuples) is insignificant.

Both these points are immediate consequences of the fact that a relation is a set. Strictly speaking, the ordering of columns is significant, representing as it does the ordering of the underlying domains. However, we shall always refer to any individual column by the appropriate domain name, never by its relative position. With this proviso we may make another statement.

3. The ordering of columns is insignificant.

A difficulty arises when two or more of the underlying domains are actually the same set, so that two or more columns have the same domain name associated with them. In such a case we distinguish the separate roles played by the domain by prefixing each appearance of the common domain name with a distinct role name, as illustrated in the relation COMPONENT of Fig. 3.2. Here the domains are P# (used twice) and QUANTITY. The role-name prefixes MAJOR_ and MINOR_ identify the roles played by P# in its two appearances. The significance of a tuple of the relation COMPONENT is that the major part includes the minor part (in the indicated quantity) as an immediate component.

To return to our first example, the domain P# is the primary key of the PART relation. This means that each PART tuple contains a distinct P# value, and that this value may be used to distinguish that tuple from all others in the relation. For COMPONENT the primary key consists of MAJOR_P# and MINOR_P# in combination; that is, a major and a minor part number together are required to identify a unique COMPONENT tuple. In general, the primary key of a relation may involve any number of domains (qualified by role name if necessary), from 1 to n (the degree of the relation). We assume that the primary key is nonredundant; i.e., none of its constituents is superfluous for the purpose of unique identification. Thus, for example, (P#, COLOR) would not be a nonredundant primary key for PART. Note that the primary key always exists, by virtue of property 1 above².

Finally, we shall make an assumption.

4. Every value within a relation (i.e., each domain value in each tuple) is an atomic (nondecomposable) data item (e.g., a number or a character string).

To put it another way, at every row-and-column position within the table there exists precisely one value, never a set of values. (We allow the possibility of null values, e.g., for "maiden name" of a male employee.) A relation satisfying property 4 is said to be normalized.

It is a trivial matter to cast an unnormalized relation into normalized form. A single example will suffice to illustrate the procedure. Relation BEFORE (see Fig. 3.3) is defined on domains S# (supplier number) and PQ (part-quantity); the elements of PQ are themselves relations defined on domains P# (part number) and QTY (quantity), and thus BEFORE is unnormalized³. Relation AFTER is an equivalent normalized relation. (The meaning of each of these relations is that the indicated suppliers supply the indicated parts in the indicated quantities.)

BEFORE
S#   PQ
     P#   QTY
S1   P1   3
     P2   2
     P3   4
     P4   2
     P5   1
     P6   1
S2   P1   3
     P2   4
S3   P3   4
     P5   2
S4   P2   2
     P4   3
     P5   4
S5   P5   5

Figure 3.3a An example of normalization

¹ Usually pronounced to rhyme with couples.
² In some relations there may be several candidates for the primary key. This would be so with PART, for example, if part names were always unique. We would then arbitrarily choose one of the candidates as the primary key for the relation.
³ BEFORE is actually a simple example of a hierarchy. See Section
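The step from BEFORE to AFTER can be sketched in a few lines. This is only an illustration under stated assumptions: BEFORE is modeled as a Python dict from supplier number to a set of (P#, QTY) pairs, and the helper names `normalize` and `is_key` are hypothetical.

```python
# BEFORE (data of Fig. 3.3a): each supplier number is paired with a *set* of
# (P#, QTY) pairs, so the PQ component is not atomic. Modeling it as a dict
# is an assumption made for this sketch.
BEFORE = {
    "S1": {("P1", 3), ("P2", 2), ("P3", 4), ("P4", 2), ("P5", 1), ("P6", 1)},
    "S2": {("P1", 3), ("P2", 4)},
    "S3": {("P3", 4), ("P5", 2)},
    "S4": {("P2", 2), ("P4", 3), ("P5", 4)},
    "S5": {("P5", 5)},
}


def normalize(before):
    """Flatten the relation-valued component: one atomic tuple per pair."""
    return {(s, p, qty) for s, pq in before.items() for (p, qty) in pq}


def is_key(relation, positions):
    """True if the components at `positions` identify each tuple uniquely."""
    keys = [tuple(t[i] for i in positions) for t in relation]
    return len(keys) == len(set(keys))


# AFTER is a normalized relation of degree 3 on S#, P# and QTY.
AFTER = normalize(BEFORE)
```

In the result, S# alone no longer identifies a tuple; the combination of S# and P# does, which is what `is_key(AFTER, (0, 1))` checks.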

We can now define the relational model: it is a user view of a database as a collection of (time-varying) normalized relations of assorted degrees. It is necessary to specify "time-varying" to allow for the insertion, deletion, and modification of tuples. We can see that, in traditional terminology, a relation corresponds to a (homogeneous) file, a domain to a (single-valued) field, and a tuple to a data model record (occurrence). These correspondences are somewhat approximate, however (for example, files are usually sequenced, whereas a relation is not).

AFTER
S#   P#   QTY
S1   P1   3
S1   P2   2
S1   P3   4
S1   P4   2
S1   P5   1
S1   P6   1
S2   P1   3
S2   P2   4
S3   P3   4
S3   P5   2
S4   P2   2
S4   P4   3
S4   P5   4
S5   P5   5

Figure 3.3b An example of normalization

This concludes our discussion of what the relational model is. The associated question of a data sublanguage for the relational model is deferred to Section 3.6. Before then we shall examine the hierarchical and network approaches in some detail, using the sample data presented in Section 3.3 as a basis for all examples.

3.3 SAMPLE DATA

In its relational form the sample data consists of three normalized relations: S (supplier), P (part), and SP (supplier-part). Relations P and SP are actually relations PART and AFTER of Figs. 3.1 and 3.3, respectively; S consists of the supplier data of Fig. 2.2 (which we can now see is a relational view of the data used as the basis of the examples in Chapter 2). For convenience, all three relations are shown in Fig. 3.4. The primary keys for the sample data are S# (for S), P# (for P), and the combination S# and P# (for SP).

Note that a tuple of relation SP represents an association between a supplier and a part. This is how such associations are handled in the relational approach, i.e., by means of a "linking" relation (but note that there is no real difference between a "linking" relation and an "ordinary" relation). The "entity" represented by an SP tuple is the supplier-part link.
See the concluding remarks in Section

Note, incidentally, that the two appearances of S# (in S and in SP) really do represent two uses of the same domain; similarly for P#.

The essential feature of this example is that there is a many-to-many correspondence between the two principal types of entity involved, namely, suppliers and parts (each supplier supplies many parts, and each part is supplied by many suppliers, in general). We select this example as being typical of a wide class of application areas (e.g., courses and students in an education system, customers and items in an order-entry system, species and locations in an ecological survey, and so on). Note, moreover, that frequently there will be several types of entity involved, not just two; e.g., there may be courses, students, and supervisors in an education system. It is easy to extend what follows to cover such cases.
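The many-to-many correspondence can be read directly off the linking relation. The following sketch assumes SP is held as a Python set of (S#, P#, QTY) tuples (the tuples of relation AFTER); the function name `correspondence` is hypothetical.

```python
from collections import defaultdict

# SP tuples (S#, P#, QTY), as in relation AFTER of Fig. 3.3b.
SP = {
    ("S1", "P1", 3), ("S1", "P2", 2), ("S1", "P3", 4),
    ("S1", "P4", 2), ("S1", "P5", 1), ("S1", "P6", 1),
    ("S2", "P1", 3), ("S2", "P2", 4),
    ("S3", "P3", 4), ("S3", "P5", 2),
    ("S4", "P2", 2), ("S4", "P4", 3), ("S4", "P5", 4),
    ("S5", "P5", 5),
}


def correspondence(sp):
    """Group the linking relation in both directions."""
    parts_of = defaultdict(set)      # supplier number -> part numbers supplied
    suppliers_of = defaultdict(set)  # part number -> supplier numbers
    for s, p, _qty in sp:
        parts_of[s].add(p)
        suppliers_of[p].add(s)
    return dict(parts_of), dict(suppliers_of)


parts_of, suppliers_of = correspondence(SP)

# Many-to-many: some supplier has several parts AND some part has several
# suppliers.
is_many_to_many = (
    any(len(v) > 1 for v in parts_of.values())
    and any(len(v) > 1 for v in suppliers_of.values())
)
```

A one-to-many correspondence would be the special case in which every part number appears with at most one supplier number.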


It should also be pointed out that there are many applications in which the correspondence between entity types is not many-many but merely one-many (or possibly even one-one); e.g., departments and employees in a personnel records system (assuming that each employee works in one and only one department). However, we may consider this type of situation as a special case of the more general one typified by our suppliers-and-parts example.

3.4 THE HIERARCHICAL APPROACH

For historical reasons this approach is very popular; it is used in many existing database systems, including, for example, IBM's Information Management System (IMS). It has its origins in the storage structures which were prevalent when most data processing was performed with purely sequential media, and there was only a minimal distinction between the data model and the storage structure. For our suppliers-and-parts example it would be possible to present the user with a hierarchical model in which suppliers are superior to parts, as illustrated in Fig. 3.5.
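Such a model can be pictured as a nested structure. The sketch below is an assumption-laden illustration, not IMS itself: segments are plain Python dicts, only suppliers S2 and S4 are included, and the name `build_hierarchy` is hypothetical.

```python
# Flat sample data restricted to suppliers S2 and S4.
SUPPLIERS = {"S2": ("Jones", 10, "Paris"), "S4": ("Clark", 20, "London")}
PARTS = {
    "P1": ("Nut", "Red", 12),
    "P2": ("Bolt", "Green", 17),
    "P4": ("Screw", "Red", 14),
    "P5": ("Cam", "Blue", 12),
}
SP = [("S2", "P1", 3), ("S2", "P2", 4),
      ("S4", "P2", 2), ("S4", "P4", 3), ("S4", "P5", 4)]


def build_hierarchy(suppliers, parts, sp):
    """One occurrence per supplier: a root segment plus dependent part segments."""
    tree = {}
    for s, (sname, status, city) in suppliers.items():
        tree[s] = {
            "segment": {"S#": s, "SNAME": sname, "STATUS": status, "CITY": city},
            "parts": [],
        }
    for s, p, qty in sp:
        pname, color, weight = parts[p]
        # The part description is copied under every supplier that supplies it.
        tree[s]["parts"].append(
            {"P#": p, "PNAME": pname, "COLOR": color, "WEIGHT": weight, "QTY": qty}
        )
    return tree


HIERARCHY = build_hierarchy(SUPPLIERS, PARTS, SP)
```

In this picture a part segment is reachable only through its supplier, and the description of part P2 is stored twice, once under S2 and once under S4.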

This model would permit the user to see five hierarchical occurrences, one for each supplier (only those for suppliers S2 and S4 are shown in Fig. 3.5). Each occurrence consists of one supplier "segment" occurrence (to use IMS terminology) together with one part "segment" occurrence for each part supplied. Note that each part occurrence includes the appropriate quantity value. The unit of access (i.e., the smallest amount of data which may be transferred by one DSL statement) in a hierarchical data model such as this one is normally the segment occurrence.

It is fundamental to the hierarchical view that any given segment occurrence takes on its full significance only when seen in its context; indeed, no segment occurrence can exist without its superior⁴. Thus, to retrieve a particular part occurrence, for example, the user must state not only which part he is interested in but also under which supplier that part is to be found. The DSL will therefore include a statement ("get unique" in IMS) to retrieve any segment occurrence directly, provided the user supplies sufficient information in the statement to identify the entire hierarchical path involved. Another statement ("get next" in IMS) will permit the user to move forward from his current position (i.e., the segment occurrence last accessed) to retrieve the next segment occurrence in sequence, or possibly the next satisfying some condition specified in the statement. [Note here, incidentally, that the sequence of data in the data model may not be the sequence the user desires; most systems give the user no (dynamic) control over this.]

The major advantage of the hierarchical approach is that it obviously provides a very natural way of modeling a hierarchical structure from the real world. However, difficulties arise when we try to operate on such a model using a DSL such as that outlined above. By way of illustration, consider the following sample queries (Fig.
3.6) on the suppliers-and-parts data model and the DSL statements required to answer them.

Q1: Find part numbers for parts supplied by supplier S2.

      Get unique supplier with S# = S2.
Next: Get next part for this supplier.
      Part found? If not, exit.
      Print P#.
      Go to Next.

Q2: Find supplier numbers for suppliers who supply part P2.

      Get to start of data.
Next: Get next supplier.
      Supplier found? If not, exit.
      Get next part for this supplier with P# = P2.
      Part found? If not, go to Next.
      Print S#.
      Go to Next.

Fig. 3.6 Two sample queries on the hierarchical model

We can see that, even though the two original queries are symmetric, in the sense that one is the inverse of the other, the DSL procedures required to answer them are certainly not symmetric. This illustrates one of the drawbacks of the hierarchical model: unnecessary complexity. Specifically, the user is forced to devote time and effort to solving problems which are introduced by the model and are not intrinsic to the questions being asked. It is clear that matters will rapidly become worse as new types of segment are introduced and the hierarchy becomes more complex.

This is not a trivial matter. It means that programs (assuming programmer-level users) are more complicated than they need be, with the consequence that program writing, debugging, and maintenance will all require more programmer time than they should.

The hierarchical model for the suppliers-and-parts example also suffers from a number of anomalies in connection with storage operations (adding, deleting, updating). These are a consequence of the fact that we are dealing with a situation involving a many-to-many correspondence, whereas hierarchies by definition really cater for one-to-many situations; the difficulties would not arise in a "genuine" (one-to-many) hierarchical situation⁵. The anomalies are noted briefly here.

Adding. It is not possible, without introducing a special dummy supplier, to insert data concerning a new part (P7, say) until some supplier supplies it.

Deleting. If we delete the only supplier of a particular part, data concerning that part is lost, too. For example, deleting supplier S1 causes all information on part P6 to disappear, because deletion of any segment occurrence automatically causes all subordinate segment occurrences to be deleted, too (in keeping with the hierarchical philosophy).

Updating. If we need to change the description of a part (e.g., to change the color of part P2 to yellow) we are faced with either the problem of searching the entire data model to find every occurrence of part P2 or the possibility of rendering the data model inconsistent (part P2 might be given as yellow in one place, green in another).

Incidentally, it is for reasons similar to these (among others) that we do not permit unnormalized relations in the relational model.

To conclude this section, we should in fairness make some mention of the "logical database" facilities of IMS (discussed in detail in Chapter 14). Briefly, logical databases allow IMS to overcome many of the difficulties discussed above (retrieval complexity, storage anomalies). However, they are very definitely a feature of a particular implementation (IMS), not of the hierarchical approach as such. In this chapter we are concerned only with what might be termed the basic hierarchical approach.

⁴ Occurrences of the "root" segment, i.e., the segment at the top of the hierarchy, do of course exist without any superior segment.

3.5 THE NETWORK APPROACH

The network approach is typified by the system proposed by the Data Base Task Group (DBTG) of CODASYL. Figure 3.7 shows how (part of) the suppliers-and-parts example might be represented in this system. The nodes of a DBTG network are individual record occurrences (and the record occurrence is the unit of access).
A network is a more general structure than a hierarchy because a given node may have any number of immediate superiors (as well as any number of immediate subordinates); we are not limited to a maximum of one, as we are with a hierarchy. This enables us to represent a many-to-many correspondence in a reasonably direct manner, as the example illustrates. In addition to the record occurrences (nodes) representing the suppliers and the parts themselves, we introduce a third type of record which we may call the link. A link record occurrence represents the connection between one supplier and one part, and contains data describing the connection (in the example, the quantity of the part supplied). All link occurrences for a given supplier are placed on a chain⁶ starting at and returning to that supplier; similarly for all link occurrences for a given part. Figure 3.7 shows the chains for suppliers S2 and S4, and also the chains (incomplete) for parts P1, P2, P4, and P5 (the latter chains are incomplete because additional link occurrences connecting the parts to other suppliers have not been shown). For example, Fig. 3.7 shows that supplier S2 supplies 3 of part P1 and 4 of part P2. (Note, incidentally, that the correspondence between, say, one supplier and the corresponding links is one-to-many, which shows that hierarchies may easily be represented in such a system.)

⁵ We shall return to the question of presenting a hierarchical view of a many-to-many situation in Section 10.4 and in Chapter 14. It is worth pointing out here that (a) the retrieval problem discussed earlier (illustrated in Fig. 3.6) still arises in a "genuine" hierarchical situation, and (b) even "genuine" hierarchical situations tend to develop into more complex many-to-many situations with time.
⁶ These chains may be physically represented in storage by actual chains or by some functionally equivalent method (see [15.9]). However, the user may always think of the chains as physically existing, regardless of the actual implementation.


The data sublanguage for the network model must obviously permit the user to traverse the various connecting chains. In the DBTG system this is handled by means of various forms of the "find" statement. In addition, it must be possible to access at least some nodes directly, to provide the necessary startpoints for subsequent chain-traversing operations; again, this function is provided by (another form of) the DBTG "find" statement. A "get" statement will retrieve the record occurrence just found.

Now let us consider the queries Q1 and Q2 of Fig. 3.6 and the DSL statements required to answer them in the network approach. See Fig. 3.8.

Q1: Find part numbers for parts supplied by supplier S2.

      Find supplier with S# = S2.
Next: Find next link for this supplier.
      Link found? If not, exit.
      Find part for this link.
      Get part.
      Print P#.
      Go to Next.

Q2: Find supplier numbers for suppliers who supply part P2.

      Find part with P# = P2.
Next: Find next link for this part.
      Link found? If not, exit.
      Find supplier for this link.
      Get supplier.
      Print S#.
      Go to Next.

Fig. 3.8 Two sample queries on the network model

The DSL statements shown in Fig. 3.8 require a word of explanation. Consider Q1. The first "find" is the direct access form. The second (labeled "Next") is an example of a chain-traversing statement; each time it is executed it moves to the next link occurrence in sequence along the chain emanating from supplier S2. The third "find" is a different chain-traversing case; starting from the current link, it looks along the other chain to find the corresponding part. The "get" then retrieves that part. Similarly for Q2.

We can see that with the network approach, symmetric questions require symmetric answers, an advantage over the basic hierarchical approach. (Perhaps we should point out, however, that there are strategy problems with such questions as "Find the quantity of part P2 supplied by supplier S2." Do we start from supplier S2 and traverse its chain, or from part P2 and traverse its chain? Ideally this should not be the user's decision.)

The network model of Fig. 3.7 also overcomes all the difficulties encountered with storage operations in the basic hierarchical model, as noted below.

Adding. It is trivial to add a new part, say part P7. Initially there will be no links for the new part; its chain will consist of a single pointer from the part to itself.

Deleting. We can delete supplier S1 without losing part P6, though what should happen to the links for supplier S1 is a moot point.

Updating. We can change the color of part P2 to yellow without search problems and without the possibility of inconsistency, because the color of part P2 appears at precisely one place in the model.

To be fair, however, the difficulties do not disappear simply because of the network approach but rather because of the particular form the network takes. The problem is really one of normalization, the details of which are beyond the scope of the present chapter (see Chapter 6).

The major disadvantage of the network model is simply that it is too close to a storage structure. The user has to be thoroughly aware of which chains do and do not exist, and his DSL programming rapidly becomes extremely complex. (For example, consider what the user must do to adjust the data model if the quantity of part P1 supplied by supplier S1 is now supplied by supplier S2 instead.) More significantly, the chains are directly visible to the user and hence must be directly represented in storage somehow (see footnote 6). There is thus a risk that the user will become locked into a particular storage structure, contrary to the aim of data independence.
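The chain traversals of Fig. 3.8 can be imitated with a small in-memory sketch. It is not DBTG syntax: the classes `Record` and `Link`, the function `connect`, and the use of Python lists to stand in for chains are all assumptions made for illustration, and only the data for suppliers S2 and S4 is loaded.

```python
class Record:
    """A supplier or part record occurrence; `chain` holds its link occurrences."""

    def __init__(self, key):
        self.key = key
        self.chain = []  # a list stands in for the chain (an assumption)


class Link:
    """A link record occurrence connecting one supplier and one part."""

    def __init__(self, supplier, part, qty):
        self.supplier = supplier
        self.part = part
        self.qty = qty


def connect(supplier, part, qty):
    """Create a link and place it on both the supplier's and the part's chain."""
    link = Link(supplier, part, qty)
    supplier.chain.append(link)
    part.chain.append(link)
    return link


suppliers = {k: Record(k) for k in ("S2", "S4")}
parts = {k: Record(k) for k in ("P1", "P2", "P4", "P5")}
for s, p, q in [("S2", "P1", 3), ("S2", "P2", 4),
                ("S4", "P2", 2), ("S4", "P4", 3), ("S4", "P5", 4)]:
    connect(suppliers[s], parts[p], q)


def parts_supplied_by(supplier):
    """Q1: walk the supplier's chain, following each link to its part."""
    return [link.part.key for link in supplier.chain]


def suppliers_of(part):
    """Q2: walk the part's chain, following each link to its supplier."""
    return [link.supplier.key for link in part.chain]
```

The two query functions are mirror images of each other, which is the symmetry noted in the text. A new part such as P7 would simply be a `Record` with an empty chain.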

3.6 A DATA SUBLANGUAGE FOR THE RELATIONAL MODEL

Figure 3.9 shows the results of four queries against the relational model.

a) Find part numbers for parts supplied by supplier S2 (Q1 of Figure 3.6).
b) Find part names (as opposed to numbers) for parts supplied by supplier S2.
c) Find supplier numbers and status for suppliers in London.
d) For each part, find part number and names of all cities from which the part may be obtained.

The reader should ensure that he agrees with the results shown. Note in particular that in (d) three duplicate (P#,CITY) pairs have been eliminated.

These examples illustrate the fact that the result of any query is a set; in fact, it is a relation, derived in some way from those of the data model. This is always true, even for a very simple query such as "Find the status of supplier S1" (where the result is a relation of degree one containing a single tuple). In general, the result may be extracted from a single data model relation, as in Fig. 3.9 (a), (b), and (c), or it may involve two or more of the data model relations as in Fig. 3.9 (d).

(a) P#
    P1
    P2

(b) PNAME
    Nut
    Bolt

(c) S#   STATUS
    S1   20
    S4   20

(d) P#   CITY
    P1   London
    P1   Paris
    P2   London
    P2   Paris
    P3   London
    P3   Paris
    P4   London
    P5   London
    P5   Paris
    P5   Athens
    P6   London

Fig. 3.9 Results of four queries on the relational model (Fig. 3.4)

For query purposes, therefore, the DSL should permit the user to specify the relation he wants derived (and retrieved). There are at least two ways in which we might allow the user to do this: (1) he could actually specify the sequence of relational algebra operations to be performed to produce the desired result; (2) he could simply state a definition of the desired result in terms of the relational calculus, leaving the system to determine which operations are necessary.
The difference between the two approaches is analogous to the difference between (1) actually constructing a set by performing a sequence of set operations (union, intersection, etc.), and (2) simply stating the "defining property" of the set in the form of a predicate; in other words, it is the difference between procedurality and nonprocedurality. It is thus possible to provide a procedural (algebra-based) or nonprocedural (calculus-based) DSL for the relational model.

Relational algebra

A relational algebra operation [3.1, 3.3] is an operation which takes one or more relations as its operand(s) and produces a relation as its result. The only operations we shall discuss here are projection and join; we shall content ourselves with an informal definition and some examples of each.

To project a relation over specified domains, we strike out the domains (columns) not required and remove redundant duplicate tuples (rows) from what remains. Figure 3.10 illustrates (a) the projection of S over (STATUS, CITY) and (b) the projection of P over P#.
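Projection as just described can be sketched as follows. The representation is an assumption for illustration: a relation is a set of tuples accompanied by a tuple of domain names, and the function name `project` is hypothetical. The supplier values are those of the sample data.

```python
# Relation S with its domain names, in column order.
S_ATTRS = ("S#", "SNAME", "STATUS", "CITY")
S = {
    ("S1", "Smith", 20, "London"),
    ("S2", "Jones", 10, "Paris"),
    ("S3", "Blake", 30, "Paris"),
    ("S4", "Clark", 20, "London"),
    ("S5", "Adams", 30, "Athens"),
}


def project(relation, attrs, over):
    """Keep only the columns named in `over`.

    Building a set removes the redundant duplicate tuples automatically.
    """
    idx = [attrs.index(a) for a in over]
    return {tuple(t[i] for i in idx) for t in relation}


STATUS_CITY = project(S, S_ATTRS, ("STATUS", "CITY"))
```

Five supplier tuples yield only four (STATUS, CITY) tuples, because S1 and S4 both project to (20, London).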

(a) STATUS   CITY        (b) P#
    20       London          P1
    10       Paris           P2
    30       Paris           P3
    30       Athens          P4
                             P5
                             P6

Fig. 3.10 Two sample projections

Two relations with a common domain, D, can be joined over that domain. The result is a relation in which each tuple consists of a tuple from the first relation concatenated with a tuple from the second relation which contains the same D-value (except that we eliminate one of the two identical D-values). Figure 3.11(a) shows the join of S and SP over S# (the domains have been reordered for clarity).

(a) CITY     STATUS   SNAME   S#   P#   QTY
    London   20       Smith   S1   P1   3
    London   20       Smith   S1   P2   2
    London   20       Smith   S1   P3   4
    London   20       Smith   S1   P4   2
    London   20       Smith   S1   P5   1
    London   20       Smith   S1   P6   1
    Paris    10       Jones   S2   P1   3
    Paris    10       Jones   S2   P2   4
    Paris    30       Blake   S3   P3   4
    Paris    30       Blake   S3   P5   2
    London   20       Clark   S4   P2   2
    London   20       Clark   S4   P4   3
    London   20       Clark   S4   P5   4
    Athens   30       Adams   S5   P5   5

(b) S#   P#   QTY
    S2   P1   3
    S2   P2   4

Fig. 3.11 Two sample joins

C1  S#        C2  CITY
    S2            London

Fig. 3.12 Two constant relations

If a value in the common domain appears in one relation and not in the other, tuples containing that value do not participate in the join. Figure 3.11(b) shows the join of two relations over the domain S#. One is the relation SP; the other is a unary (degree 1) relation consisting of the single supplier number S2 (shown as C1 in Fig. 3.12).

Note: As defined here, join does not quite correspond to either the natural join of [3.1] or the equi-join of [3.3].

We now show how the queries of Fig. 3.9 may be expressed in a DSL based on the relational algebra. Note that we require the "constant" relations C1 and C2 of Fig. 3.12.

a) Join SP and C1 over S#. Project the result over P#.
b) Join SP and C1 over S#. Join the result and P over P#. Project the result over PNAME.
c) Join S and C2 over CITY. Project the result over (S#, STATUS).
d) Join S and SP over S#. Project the result over (P#,CITY).

Relational calculus

The relational calculus [3.3] is simply a notation for expressing the definition of a relation which is to be derived from the data model. The only feature that may be unfamiliar to some readers is the use of the quantifier symbols ∃ ("there exists") and ∀ ("for all"). Again we consider the queries of Fig. 3.9.

a) {SP.P# : SP.S# = 'S2'}

This may be read, "Retrieve the set of P# values from the SP relation which are such that the corresponding S# value is S2." The braces { } indicate a set definition; the colon stands for "such that"; the expression on the left of the colon indicates what is to be retrieved, and the expression on the right is a qualification (or predicate). Note that this definition is completely nonprocedural. We defer discussion of (b) for a moment.

c) {(S.S#, S.STATUS) : S.CITY = 'LONDON'}

This is straightforward.

d) {(SP.P#, S.CITY) : SP.S# = S.S#}

This may be read, "Retrieve all part-number-and-city pairs such that the part number comes from an SP tuple and the city comes from the S tuple with the same S# value as that SP tuple." Now for (b).

b) {P.PNAME : ∃SP (SP.P# = P.P# ∧ SP.S# = 'S2')}

This may be read, "Retrieve the set of PNAME values from the P relation which are such that there exists an SP tuple with the same P# value as that in the P tuple and with an S# value equal to S2." One way of thinking about this is to consider each PNAME value in turn and to see whether it satisfies the qualification. Thus the first PNAME value (as listed in Fig. 3.4) is 'NUT'; the corresponding P# value is P1; does there exist an SP tuple with P# equal to P1 and S# equal to S2? If the answer is yes, 'NUT' is one of the values retrieved. Similarly for the remaining PNAME values.

We do not intend to discuss the question of a DSL for the relational model in any more detail here.
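The two styles can be compared side by side in a short sketch. Everything here is an illustrative assumption rather than the syntax of either language: relations are Python sets of tuples with a separate tuple of domain names, `join` and `project` are hypothetical helpers following the informal definitions above, the calculus expressions are approximated by set comprehensions, and P carries only its P# and PNAME columns for brevity.

```python
S_ATTRS = ("S#", "SNAME", "STATUS", "CITY")
S = {("S1", "Smith", 20, "London"), ("S2", "Jones", 10, "Paris"),
     ("S3", "Blake", 30, "Paris"), ("S4", "Clark", 20, "London"),
     ("S5", "Adams", 30, "Athens")}
P_ATTRS = ("P#", "PNAME")  # COLOR and WEIGHT omitted for brevity
P = {("P1", "Nut"), ("P2", "Bolt"), ("P3", "Screw"),
     ("P4", "Screw"), ("P5", "Cam"), ("P6", "Cog")}
SP_ATTRS = ("S#", "P#", "QTY")
SP = {("S1", "P1", 3), ("S1", "P2", 2), ("S1", "P3", 4), ("S1", "P4", 2),
      ("S1", "P5", 1), ("S1", "P6", 1), ("S2", "P1", 3), ("S2", "P2", 4),
      ("S3", "P3", 4), ("S3", "P5", 2), ("S4", "P2", 2), ("S4", "P4", 3),
      ("S4", "P5", 4), ("S5", "P5", 5)}
C1_ATTRS = ("S#",)
C1 = {("S2",)}  # constant relation C1 of Fig. 3.12


def project(attrs, rows, over):
    """Strike out unwanted columns; the set drops duplicate tuples."""
    idx = [attrs.index(a) for a in over]
    return tuple(over), {tuple(r[i] for i in idx) for r in rows}


def join(attrs1, rows1, attrs2, rows2, on):
    """Concatenate tuples that agree on `on`, keeping one copy of that value."""
    i, j = attrs1.index(on), attrs2.index(on)
    rest = [k for k in range(len(attrs2)) if k != j]
    attrs = tuple(attrs1) + tuple(attrs2[k] for k in rest)
    rows = {r1 + tuple(r2[k] for k in rest)
            for r1 in rows1 for r2 in rows2 if r1[i] == r2[j]}
    return attrs, rows


# Query (b), algebra style: a stated sequence of operations.
a1, r1 = join(SP_ATTRS, SP, C1_ATTRS, C1, "S#")
a2, r2 = join(a1, r1, P_ATTRS, P, "P#")
_, algebra_b = project(a2, r2, ("PNAME",))

# Query (b), calculus style: a defining predicate with "there exists".
calculus_b = {(p[1],) for p in P
              if any(sp[1] == p[0] and sp[0] == "S2" for sp in SP)}

# Query (d) in both styles.
a3, r3 = join(S_ATTRS, S, SP_ATTRS, SP, "S#")
_, algebra_d = project(a3, r3, ("P#", "CITY"))
calculus_d = {(sp[1], s[3]) for sp in SP for s in S if sp[0] == s[0]}
```

Both formulations of each query produce the same relation; the algebra version spells out the steps, while the comprehension only states the condition each result tuple must satisfy.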
The relative advantages and disadvantages of the two approaches are argued in [3.3]; the main argument in favor of the algebra is ease of implementation, whereas several arguments (simplicity, easy extensibility, etc.) can be advanced in favor of the calculus. However, it should be obvious that, regardless of the approach chosen, the DSL can provide total data independence, at least so far as retrieval is concerned.

As for the storage operations, it is sufficient to observe that, provided the correct normalized relations have been chosen for the data model, no difficulties arise (the situation is the same as with the network approach). We consider each storage operation briefly.

Adding. It is trivial to add a new part, say part P7. Doing so will involve adding a new tuple to the P relation; initially there will be no SP tuples for this part in the SP relation.

Deleting. We can delete supplier S1 without losing part P6 simply by removing supplier S1's tuple from the S relation. The SP tuples for supplier S1 will not be deleted, at least in the basic relational approach.

Updating. We can change the color of part P2 to yellow without search problems and without the possibility of inconsistency, because the color of part P2 appears in precisely one tuple.

The question of choosing the right normalized relations is discussed in Chapter 6. Chapters 4 and 5 give fuller descriptions of a calculus- and an algebra-based language; in particular, Chapter 4 contains examples of the use of the quantifier "for all" (∀), which was not required for any of the examples above.

3.7 SUMMARY

At the end of Section 1.1 we pointed out that a database system must be able to represent two types of object, namely, "entities" and "relationships." We also pointed out that fundamentally there is no real difference between the two; a relationship is merely a special type of entity. The three approaches (hierarchical, network, and relational) differ in the way they permit the user to view and to manipulate relationships.

In the hierarchical approach, relationships are entirely implicit. That is, the relationship between two entities is represented in some way by the relative position of the segment occurrences concerned. For example, the relationship between supplier S2 and part P2 is represented by the fact that a segment occurrence for P2 is subordinate to the segment occurrence for S2 (in the data model of Fig. 3.5). As a consequence, the corresponding data sublanguage relies heavily on the concept of position within the database; in fact, the DSL must provide two types of operation, one for positioning and one for retrieving or storing, although the distinction may not be very clear in a real system. The only way the user can retrieve the information contained in a relationship is by building the information up dynamically as the result of a sequence of DSL operations.
For example, to retrieve the information that S2 supplies P2, the user must first position himself at the segment occurrence for S2, and then examine the corresponding subordinate segment occurrences to see if one exists for P2. (In practice these two steps may be combined into a single DSL operation, but fundamentally both are required.)

In the network approach, on the other hand, relationships are represented explicitly by means of pointers. However, the very fact that pointers are used means that relationships and entities are considered as different things. Again the DSL must provide two types of operation, and again retrieving information from a relationship involves building the information up dynamically. (This is not an argument against using pointers in the storage structure; it is an argument against using pointers to carry information in the data model.)

In the relational approach, relationships are again represented explicitly. Here, however, they are represented in exactly the same way as entities, i.e., by means of tuples. In other words, relationships and entities are considered as the same type of object in the relational approach. It is thus possible to provide a DSL with a uniform set of operations for manipulating both, an obvious simplification. It can be argued, in fact, that the relational model is a view of the database in terms of its natural structure only: it contains absolutely no consideration of storage and access details such as position or pointers (as Codd [4.1] puts it, it contains no "representation clutter"). As for the DSL, there are two candidates, the calculus and the algebra, both of which are simple and completely data-independent.

EXERCISES

A database is to contain information about persons and skills. At a particular time the following persons are represented in the database, and their skills are as indicated.
Person    Skills
Arthur    Programming
Bill      Operating and Programming
Charlie   Engineering, Programming, and Operating
Dave      Operating and Engineering

For each person the database contains various personal details, such as address. For each skill it contains an identification of the appropriate basic training course, an associated job grade code, and other information. The database also contains the date each person attended each course, where applicable (the assumption is that attendance at the course is essential before the skill can be said to be acquired).

3.1 Sketch two hierarchical models for this data.
3.2 Sketch a network model for this data.
3.3 Sketch a relational model for this data.
3.4 For each of your answers to the first three questions, give an outline procedure for finding the names of all persons having (a) a specified skill, (b) at least one skill in common with a specified person.

REFERENCES AND BIBLIOGRAPHY

See also [1.3].

3.1 E. F. Codd. "A Relational Model of Data for Large Shared Data Banks." CACM 13, No. 6 (June 1970).
The current interest in the relational approach is largely due to the work of E. F. Codd. This is the first of his published papers in the field. It contains an explanation of the relational model (on which Section 3.2 is heavily based) and a definition of some relational algebra operations. A seminal paper.

3.2 E. F. Codd. "Normalized Data Base Structure: A Brief Tutorial." Proc. ACM SIGFIDET Workshop on Data Description, Access and Control. Available from ACM.
This paper is probably the best starting point for reading on the relational approach. It contains a simple explanation of the relational model, considers its advantages with respect to the hierarchical and network models (somewhat along the lines of the present chapter), and provides an introduction to the concept of further normalization (see Chapter 6).

3.3 E. F. Codd. "Relational Completeness of Data Base Sublanguages." In Data Base Systems, Courant Computer Science Symposia Series, Vol. 6, Prentice-Hall (1972).
This paper provides a rigorous definition of a relational algebra and a relational calculus, and proves that the algebra has at least the retrieval power of the calculus (that is, any relation which may be derived from the data model in one statement of the calculus may also be so derived in one statement of the algebra). An algorithm is presented for converting an arbitrary calculus expression into an equivalent algebraic expression. The two approaches are compared and contrasted as candidates for a data sublanguage. Storage operations are not considered.

Fig. 7.1 Levels of normalization

7 Normalization

7.1 INTRODUCTION

The concept of normalization was introduced in Section 3.2. A normalized relation may be defined as one for which each of the underlying domains contains atomic (nondecomposable)