Relational Design: Characteristics of Well-designed DB

1. Minimal duplication

Consider table newfaculty (result of $\mathit{Faculty} \bowtie \mathit{Teach} \bowtie \mathit{Course}$):

  Id     Lname   Off  Bldg    Phone  Salary  Numb  Dept  Lvl  MaxSz
  20000  Cotts   103  DuPont  1234   45000   867   EE    5    25
  20000  Cotts   103  DuPont  1234   45000   652   CIS   5    25
  00333  Garth   423  DuPont  4321   87000   323   EE    3    25
  00333  Garth   423  DuPont  4321   87000   413   EE    4    25
  55555  Jones   211  Ewing   9876   55000   230   MATH  2    60
  20001  Clarke  103  DuPont  1235   96000   120   AST   1    60
  20001  Clarke  103  DuPont  1235   96000   450   AST   4    15
  ...

Problems:
  (a) Wasted storage
  (b) Insertion anomalies
      - Problem adding a new faculty member
      - Many NULLs when adding a new course
  (c) Deletion anomalies
      - Problem deleting MATH 230 (also results in firing of Jones)
  (d) Update anomalies
      - If Cotts' office number changes, must change it in multiple rows

The above problems result because the semantics of newfaculty are not clear: it represents info about more than a single entity type.

2. Represent all info in specs

newfaculty cannot represent info about a non-teaching faculty member.

Relational Design: Characteristics of Well-designed DB (2)

3. Prevent info from being lost

Duplication can be minimized by decomposition: breaking a table into several new ones.

Consider tables newfac:

  Id     Lname   Off  Bldg    Phone  Salary
  20000  Cotts   103  DuPont  1234   45000
  00333  Garth   423  DuPont  4321   87000
  55555  Jones   211  Ewing   9876   55000
  20001  Clarke  103  DuPont  1235   96000
  ...

and newcourse:

  Off  Bldg    Numb  Dept  Lvl  MaxSz
  103  DuPont  867   EE    5    25
  103  DuPont  652   CIS   5    25
  423  DuPont  323   EE    3    25
  423  DuPont  413   EE    4    25
  211  Ewing   230   MATH  2    60
  103  DuPont  120   AST   1    60
  103  DuPont  450   AST   4    15
  ...

Relational Design: Characteristics of Well-designed DB (3)

Now, consider their join $\mathit{newfac} \bowtie \mathit{newcourse}$:

  Id     Lname   Off  Bldg    Phone  Salary  Numb  Dept  Lvl  MaxSz
  20000  Cotts   103  DuPont  1234   45000   867   EE    5    25
  20000  Cotts   103  DuPont  1234   45000   652   CIS   5    25
  20000  Cotts   103  DuPont  1234   45000   120   AST   1    60
  20000  Cotts   103  DuPont  1234   45000   450   AST   4    15
  00333  Garth   423  DuPont  4321   87000   323   EE    3    25
  00333  Garth   423  DuPont  4321   87000   413   EE    4    25
  55555  Jones   211  Ewing   9876   55000   230   MATH  2    60
  20001  Clarke  103  DuPont  1235   96000   120   AST   1    60
  20001  Clarke  103  DuPont  1235   96000   450   AST   4    15
  20001  Clarke  103  DuPont  1235   96000   867   EE    5    25
  20001  Clarke  103  DuPont  1235   96000   652   CIS   5    25
  ...

Result has more rows than the original table newfaculty.
This is a loss of info, as we cannot tell which rows are valid.
Note that the problem is due to the fact that the intersection of the two tables' attributes (Off, Bldg) does not contain a key.
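The lossy join above can be reproduced with a small Python sketch. The helper `natural_join` and the trimmed-down rows (only a few columns and tuples of newfac and newcourse) are assumptions made for illustration, not part of the original slides.

```python
def natural_join(r, s):
    """Natural join of two relations, each given as a list of dicts."""
    out = []
    for t in r:
        for u in s:
            common = set(t) & set(u)
            if all(t[a] == u[a] for a in common):
                out.append({**t, **u})
    return out

# Trimmed projections of newfaculty (illustrative subset of columns and rows).
newfac = [
    {"Lname": "Cotts", "Off": 103, "Bldg": "DuPont"},
    {"Lname": "Clarke", "Off": 103, "Bldg": "DuPont"},
    {"Lname": "Garth", "Off": 423, "Bldg": "DuPont"},
]
newcourse = [
    {"Off": 103, "Bldg": "DuPont", "Numb": 867, "Dept": "EE"},
    {"Off": 103, "Bldg": "DuPont", "Numb": 120, "Dept": "AST"},
    {"Off": 423, "Bldg": "DuPont", "Numb": 323, "Dept": "EE"},
]

# (Lname, Numb) pairs that were actually present before decomposing.
original = {("Cotts", 867), ("Clarke", 120), ("Garth", 323)}

joined = natural_join(newfac, newcourse)
pairs = {(t["Lname"], t["Numb"]) for t in joined}
spurious = pairs - original   # tuples the join invented

print(len(joined))   # 5 rows, although only 3 facts were stored
print(spurious)
```

Cotts and Clarke share office 103 DuPont, so joining on (Off, Bldg) pairs each of them with the other's course.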

Relational Design: Functional Dependencies (FDs)

An FD expresses a constraint on the values of a set of attributes imposed by another set.

Formally: Let $X, Y \subseteq R$
  $Y$ is functionally dependent on $X$, denoted $X \to Y$, iff for all tuples $t_1, t_2 \in r(R)$, $t_1[Y] = t_2[Y]$ whenever $t_1[X] = t_2[X]$

FDs
  - Represent common sense rules on data
  - Are referred to as semantic constraints
  - Are time invariant

An FD holds on a relation (equivalently, the relation satisfies the FD) if every row of the relation meets the FD's constraint
  - Cannot infer an FD from a data set
  - Can infer what are not FDs from a data set

If $X \to Y$ and $Y \subseteq X$, then $X \to Y$ is trivial

Equivalent notations:

Reasons FDs are important:
  1. Can be used to eliminate undesirable characteristics from DBs
  2. Can be used as constraints on data
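A minimal Python sketch of checking whether a given data set satisfies an FD. The function name `satisfies_fd` and the sample rows are illustrative assumptions. As noted above, a `True` result only means the data is consistent with the FD, while a `False` result shows the FD cannot hold.

```python
def satisfies_fd(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for t in rows:
        x = tuple(t[a] for a in lhs)
        y = tuple(t[a] for a in rhs)
        # Remember the first rhs value seen for each lhs value.
        if seen.setdefault(x, y) != y:
            return False
    return True

# A few rows in the style of the newfaculty table (illustrative subset).
rows = [
    {"Id": "20000", "Lname": "Cotts", "Numb": 867},
    {"Id": "20000", "Lname": "Cotts", "Numb": 652},
    {"Id": "00333", "Lname": "Garth", "Numb": 323},
]

print(satisfies_fd(rows, ["Id"], ["Lname"]))  # True: data is consistent with Id -> Lname
print(satisfies_fd(rows, ["Id"], ["Numb"]))   # False: data refutes Id -> Numb
```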

Relational Design: Closure of a Set of FDs

Generally, a set of FDs implies additional FDs.

The closure of a set of FDs $F$: the set of FDs logically implied by $F$
  - Denoted $F^+$

Armstrong's Axioms allow computation of $F^+$
  1. Reflexivity rule: Given sets of attributes $X$ and $Y \subseteq X$, then $X \to Y$
  2. Augmentation rule: If $X \to Y$, and $Z$ is a set of attributes, then $XZ \to YZ$
     (OR, if $X \to Y$, then $XZ \to Y$)
  3. Transitivity rule: If $X \to Y$, and $Y \to Z$, then $X \to Z$

Armstrong's axioms are sound and complete.

Supplemental axioms:
  1. Union rule: If $X \to Y$, and $X \to Z$, then $X \to YZ$
  2. Decomposition rule: If $X \to YZ$, then $X \to Y$ and $X \to Z$
  3. Pseudotransitivity rule: If $X \to Y$, and $YW \to Z$, then $XW \to Z$

Full family of FDs: A set of FDs $F$ is said to be a full family of FDs if $F = F^+$

Relational Design: Closure of a Set of Attributes

Closure of $X$ under $F$: the set of attributes functionally determined by $X$ under a set of FDs $F$
  - Denoted $X^+$

To determine $X^+$, the closure of $X$ under $F$:

    $X^+ \leftarrow X$
    do {
        $\mathit{old}X^+ \leftarrow X^+$
        for (each $Y \to Z \in F$)
            if ($Y \subseteq X^+$)
                $X^+ \leftarrow X^+ \cup Z$
    } until ($\mathit{old}X^+ = X^+$)

$K \subseteq$ schema $R$ is a superkey of $R$ if for all tuples $t_1, t_2 \in r(R)$, $t_1 = t_2$ whenever $t_1[K] = t_2[K]$
  - I.e., $K \to R$

Full functional dependency: $Y$ is fully functionally dependent on $X$ in FD $X \to Y$ if there is no proper subset of $X$ on which $Y$ is dependent
  - I.e., for any $Z \subset X$, $Z \not\to Y$
  - $X$ is said to be irreducible

$C \subseteq R$ is a candidate key of $R$ if $C \to R$ and $C$ is irreducible

To find a key $K$ for scheme $R$:

    $K \leftarrow R$
    for (each $a_i \in K$) {
        $T \leftarrow (K - a_i)^+$
        if ($T = R$)
            $K \leftarrow K - a_i$
    }
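The two procedures above can be sketched in Python as follows. The representation of an FD as a `(lhs, rhs)` pair of attribute sets, the function names, and the supplier example are assumptions made for illustration.

```python
def closure(attrs, fds):
    """Closure of the attribute set attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:                      # repeat until a pass adds nothing
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def find_key(schema, fds):
    """Shrink the full schema to one candidate key by dropping removable attributes."""
    key = set(schema)
    for a in sorted(schema):            # fixed order, so the result is deterministic
        if closure(key - {a}, fds) == set(schema):
            key.remove(a)
    return key

schema = {"Snum", "City", "Status"}
fds = [({"Snum"}, {"City"}), ({"City"}, {"Status"})]

print(closure({"City"}, fds))   # contains City and Status
print(find_key(schema, fds))    # {'Snum'}
```

Which candidate key is found depends on the order in which attributes are tried; the sketch returns one key, not all of them.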

Relational Design: Equivalence of Sets of FDs

Given sets of FDs $F$ and $G$
  - $F$ covers $G$ if $G \subseteq F^+$
  - $F$ and $G$ are equivalent if $F$ covers $G$ and $G$ covers $F$
    I.e., $F^+ = G^+$

To determine whether $X \to Y \in F^+$, compute $X^+$ wrt $F$
  - If $Y \subseteq X^+$, then $X \to Y \in F^+$

To determine equivalence of $F$ and $G$
  1. For each $X \to Y \in F$, compute $X^+$ wrt $G$
     (a) If $Y \subseteq X^+$, then $X \to Y \in G^+$
     (b) If this fails for any FD, stop: $G$ does not cover $F$
  2. For each $A \to B \in G$, compute $A^+$ wrt $F$
     (a) If $B \subseteq A^+$, then $A \to B \in F^+$
     (b) If this fails for any FD, stop: $F$ does not cover $G$
  3. If the test succeeds for all FDs, $F \equiv G$
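A short Python sketch of the cover and equivalence tests, using attribute closure. FDs are written as `(lhs, rhs)` pairs of single-letter attribute strings; this encoding, the function names, and the sets `F`, `G`, `H` are illustrative assumptions.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def covers(f, g):
    """F covers G if every FD of G follows from F."""
    return all(set(rhs) <= closure(lhs, f) for lhs, rhs in g)

def equivalent(f, g):
    return covers(f, g) and covers(g, f)

F = [("A", "B"), ("B", "C")]
G = [("A", "B"), ("B", "C"), ("A", "C")]   # adds an FD already implied by F
H = [("A", "B")]

print(equivalent(F, G))   # True
print(covers(F, H))       # True
print(covers(H, F))       # False: B -> C does not follow from H
```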

Relational Design: Minimal (Canonical) Covers

Minimal cover of a set of FDs $F$: smallest set of FDs that is equivalent to $F$
  - Denoted $F_c$
  - Importance: represents the smallest set of constraints needed to test against when inserting or modifying data

A minimal cover has the following properties
  1. Every FD has a single attribute on the right side
  2. No left-hand side has extraneous attributes
     - I.e., every left-hand side is irreducible
     - $a$ is extraneous in $X$ if $(F_c - \{X \to y\}) \cup \{(X - a) \to y\} \equiv F$
  3. No FD is redundant
     - I.e., $X \to y$ is redundant if $(F_c - \{X \to y\}) \equiv F$

To compute the minimal cover $G$ for $F$:
  1. $G \leftarrow F$
  2. Replace each FD of the form $X \to a_1, a_2, \ldots, a_n$ by $X \to a_1, X \to a_2, \ldots, X \to a_n$
  3. For each FD $X \to a \in G$, delete all extraneous attributes from $X$
  4. Delete each redundant FD $X \to a$ from $G$
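The four steps above can be sketched in Python as below. This is one possible implementation under stated assumptions: FDs are `(lhs, rhs)` pairs of attribute collections, extraneous attributes are detected with the usual closure test ($a \in (X - b)^+$), and the function names are hypothetical. Minimal covers are not unique, so a different processing order may yield a different but equivalent result.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def minimal_cover(fds):
    # Steps 1-2: copy F and split every right-hand side into single attributes.
    g = [(frozenset(lhs), frozenset([a])) for lhs, rhs in fds for a in rhs]
    g = list(dict.fromkeys(g))                  # drop duplicates, keep order
    # Step 3: remove extraneous left-hand-side attributes.
    for i in range(len(g)):
        lhs, rhs = g[i]
        for b in sorted(lhs):
            reduced = lhs - {b}
            if rhs <= closure(reduced, g):      # b is not needed to derive rhs
                lhs = reduced
                g[i] = (lhs, rhs)
    g = list(dict.fromkeys(g))
    # Step 4: remove FDs that follow from the remaining ones.
    for fd in list(g):
        rest = [x for x in g if x != fd]
        if fd[1] <= closure(fd[0], rest):
            g = rest
    return g

F = [("A", "BC"), ("B", "C"), ("AB", "C")]
Fc = minimal_cover(F)
print(Fc)   # A -> B and B -> C remain
```

In this example $AB \to C$ reduces to $B \to C$ (A is extraneous), and $A \to C$ is dropped because it follows from $A \to B$ and $B \to C$.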

Relational Design: Decomposition

A decomposition of a scheme $R$ is a set of subschemas derived from $R$
  - Subschemas are called projections of $R$

Formally: Let $R$ be a relational scheme
  Then $\{R_1, R_2, \ldots, R_n\}$ is a decomposition of $R$ if $R_1 \cup R_2 \cup \ldots \cup R_n = R$

Relational Design: Lossless Join Decomposition

A lossy join decomposition creates more rows (when its tables are joined) than the original table had before the decomposition
  - Info is lost because there is no way of knowing which tuples are valid
  - Added tuples are called spurious tuples

A lossless join decomposition exactly reproduces the original table from which the decomposition was generated

Formally: Given
  1. Scheme $R$,
  2. Relation $r(R)$,
  3. Decomposition $D = \{R_1, R_2, \ldots, R_n\}$, and
  4. Relations $r_1(R_1), r_2(R_2), \ldots, r_n(R_n)$, where $r_i = \pi_{R_i}(r)$
Then $D$ is a lossless join decomposition of $R$ if $r_1 \bowtie r_2 \bowtie \ldots \bowtie r_n = r$

A decomposition $\{R_1, R_2\}$ of $R$ is lossless if either
  1. $R_1 \cap R_2 \to (R_1 - R_2)$, or
  2. $R_1 \cap R_2 \to (R_2 - R_1)$

Relational Design: Lossless Join Decomposition (2)

Algorithm to determine whether a decomposition is lossless:

Given
  1. A set of FDs $F$,
  2. schema $R(A_1, A_2, \ldots, A_n)$, and
  3. decomposition $D = \{R_1, R_2, \ldots, R_k\}$

Steps:
  1. Construct a table with $n$ columns and $k$ rows
     - Rows correspond to the $k$ subschemas $R_i$
     - Columns correspond to the $n$ attributes $A_j$
  2. In table[$i$, $j$], put $a_j$ if $A_j \in R_i$
     - Otherwise, put $b_{ij}$
  3. For each FD $\alpha \to \beta \in F$
     - Look for 2 rows that have matching values for every $A_j \in \alpha$
     - Set the column values that correspond to the attributes in $\beta$ to the same values for these 2 rows
     - The goal is to replace $b_{ij}$ with $a_j$
  4. Continue until either
     (a) No more changes can be made, or
     (b) A row contains $a_1, a_2, \ldots, a_n$
  5. If a row contains $a_1, a_2, \ldots, a_n$, the decomposition is lossless; otherwise it is lossy
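A Python sketch of the table test above. The symbol encoding (`"a"` for $a_j$, the pair `("b", i)` for $b_{ij}$), the function name, and the supplier example are assumptions for illustration. When two symbols are equated, the sketch renames the dropped symbol everywhere in that column, which is the usual way to keep the table consistent.

```python
def is_lossless(schema, decomposition, fds):
    """Table (chase) test for a lossless join; fds is a list of (lhs, rhs) pairs."""
    cols = list(schema)
    # Row i holds "a" in column A if A is in R_i, else the unique symbol ("b", i).
    table = [{A: ("a" if A in Ri else ("b", i)) for A in cols}
             for i, Ri in enumerate(decomposition)]
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            for i in range(len(table)):
                for j in range(i + 1, len(table)):
                    r1, r2 = table[i], table[j]
                    if not all(r1[A] == r2[A] for A in lhs):
                        continue
                    for B in rhs:
                        v1, v2 = r1[B], r2[B]
                        if v1 != v2:
                            # Prefer the "a" symbol; the dropped one is always a b.
                            keep, drop = (v1, v2) if v1 == "a" else (v2, v1)
                            for row in table:
                                if row[B] == drop:
                                    row[B] = keep
                            changed = True
    return any(all(row[A] == "a" for A in cols) for row in table)

schema = ["Snum", "City", "Status"]
fds = [({"Snum"}, {"City"}), ({"City"}, {"Status"})]

print(is_lossless(schema, [{"Snum", "City"}, {"City", "Status"}], fds))    # True
print(is_lossless(schema, [{"Snum", "Status"}, {"City", "Status"}], fds))  # False
```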

Relational Design: Dependency Preservation - Motivation

Consider R:

  Snum  City    Status
  s1    London  20
  s2    Paris   10
  s3    Paris   10
  s4    London  20

and FDs
  1. $\mathit{Snum} \to \mathit{City}$
  2. $\mathit{City} \to \mathit{Status}$

Now consider the following decompositions

1. S1:                 S2:

  Snum  City           City    Status
  s1    London         London  20
  s2    Paris          Paris   10
  s3    Paris
  s4    London

2. T1:                 T2:

  Snum  City           Snum  Status
  s1    London         s1    20
  s2    Paris          s2    10
  s3    Paris          s3    10
  s4    London         s4    20

Both decompositions are LLJ

Relational Design: Dependency Preservation - Motivation (2)

Suppose you wanted to insert the data (s5, London, 30) into each decomposition

For decomposition S
  This would require inserting
    1. $\langle s5, \mathit{London} \rangle$ into S1, and
    2. $\langle \mathit{London}, 30 \rangle$ into S2
  The insert into S2 would violate FD 2 ($\mathit{City} \to \mathit{Status}$), since S2 already contains $\langle \mathit{London}, 20 \rangle$

For decomposition T
  This would require inserting
    1. $\langle s5, \mathit{London} \rangle$ into T1, and
    2. $\langle s5, 30 \rangle$ into T2
  The fact that FD 2 is violated is not obvious from an examination of the individual tables
  The only way to determine whether FD 2 is violated in T is to join T1 and T2

Relational Design: Dependency Preservation

Two subschemas are dependency preserving (independent) if updates can be made to either without involving the other
  - If subschemas are interdependent, the concern is that updating one could violate an FD
  - The only way to check would be to join the subschemas

DP is desirable because we only need to worry about constraints that apply to a single scheme, and not inter-scheme constraints
  - Concern applies to non-transitive spanning FDs
  - Will need to join tables to verify that such an FD is satisfied

Restriction of a set of FDs
  Given
    1. set of FDs $F$,
    2. schema $R$,
    3. decomposition $D = \{R_1, R_2, \ldots\}$
  The restriction of $F$ to $R_i$ is the set of FDs in $F^+$ that are wholly contained in $R_i$
    - I.e., $X \to Y \in F_i$ if $XY \subseteq R_i$ and $X \to Y \in F^+$
    - Denoted $F_i$
    - $F_i$ is the set of FDs that are easy to check wrt $R_i$
  Let $F' = \bigcup_{i=1}^{n} F_i$
    - Generally, $F' \neq F$
    - But if $(F')^+ = F^+$, checking against $F'$ is equivalent to checking against $F$

Dependency preservation
  A decomposition is dependency preserving if $(F')^+ = F^+$

Rissanen's Theorem: A decomposition $\{R_1, R_2\}$ of $R$ is DP if
  1. $(F_1 \cup F_2)^+ = F^+$, and
  2. $R_1 \cap R_2$ is a candidate key of $R_1$ or $R_2$

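Relational Design: Dependency Preservation - Algorithms

Given scheme $R$, decomposition $D = \{R_1, R_2, \ldots\}$, and $F$

To determine dependency preservation of $F$:

    compute $F^+$
    for (each $R_i$ in $D$)
        $F_i \leftarrow$ restriction of $F$ to $R_i$
    $F' \leftarrow \bigcup_i F_i$
    compute $(F')^+$
    if ($(F')^+ = F^+$) return TRUE
    else return FALSE

To determine dependency preservation of $\alpha \to \beta$ in $F$ (closures taken wrt $F$):

    $\mathit{oldresult} \leftarrow \emptyset$
    $\mathit{result} \leftarrow \alpha$
    while ($\mathit{oldresult} \neq \mathit{result}$) {
        $\mathit{oldresult} \leftarrow \mathit{result}$
        for (each $R_i$) {
            $I \leftarrow \mathit{result} \cap R_i$
            $C \leftarrow I^+$
            $T \leftarrow C \cap R_i$
            $\mathit{result} \leftarrow \mathit{result} \cup T$
        }
    }
    if ($\beta \subseteq \mathit{result}$) return TRUE
    else return FALSE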
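The per-FD test above avoids computing $F^+$ and can be sketched in Python as follows. The function names and the use of the supplier decompositions S and T from the motivation slides are illustrative assumptions; a decomposition is reported as dependency preserving when every FD of $F$ passes the test.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def preserves(alpha, beta, decomposition, fds):
    """True if alpha -> beta can be checked using the restrictions F_i alone."""
    result = set(alpha)
    old = None
    while old != result:
        old = set(result)
        for Ri in decomposition:
            # Attributes derivable from result while staying inside R_i.
            result |= closure(result & set(Ri), fds) & set(Ri)
    return set(beta) <= result

def is_dependency_preserving(decomposition, fds):
    return all(preserves(lhs, rhs, decomposition, fds) for lhs, rhs in fds)

fds = [({"Snum"}, {"City"}), ({"City"}, {"Status"})]
S = [{"Snum", "City"}, {"City", "Status"}]
T = [{"Snum", "City"}, {"Snum", "Status"}]

print(is_dependency_preserving(S, fds))   # True
print(is_dependency_preserving(T, fds))   # False: City -> Status spans T1 and T2
```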

Relational Design: Normalization - Intro

A normal form is a set of constraints on a DB schema

Normal forms:

  Form           Alt Name      Restrictiveness  Duplication
  1NF                          least            most
  2NF
  3NF
  Boyce-Codd NF
  4NF
  5NF            Project-Join
  6NF            Domain-Key    most             least

Normalization is the process of converting schemas to higher normal forms
Denormalization is the process of converting to a lower normal form

Relational model only requires 1NF

Except for 1NF, normal forms are based on dependencies (2NF, 3NF, BCNF: FDs; 4NF: MVDs; 5NF: JDs; 6NF: DK)

Normalization is a bottom-up approach to design
  - General approach is to start with a single schema containing all attributes
  - Normalization iteratively refines decompositions

When using a top-down approach, normalization can be used to verify that existing schemas conform to a particular normal form

Relational Design: Normalization - 1NF

A schema $R$ is in 1NF if all attributes are atomic

Composites: flatten (as in ER-RM mapping)

Multivalued:
  1. Decompose into 2 tables (as in ER-RM mapping):
     - $R_1$ contains PK + MV attribute
     - $R_2$ contains $R$ - MV attribute
  2. Use 1 table:
     - For each key, have one row for each value of the MV attribute
     - Results in lots of duplication

Relational Design: Normalization - 2NF

A non-prime attribute is one that is not part of any candidate key

An attribute is fully dependent on a set of attributes if it is not dependent on a proper subset of those attributes

A schema $R$ is in 2NF if it is in 1NF and every non-prime attribute is fully functionally dependent on the PK of $R$

Alternative definition: No non-prime attribute is partially dependent on the PK

To normalize to 2NF:
  1. For every schema $R$ in which FD $X \to Y \in F^+$ violates 2NF
     (a) Replace schema $R$ with schemas
         i. $R_1 = X \cup Y$
         ii. $R_2 = R - Y$

Relational Design: Normalization - 3NF

Transitive dependency
  $Y$ is transitively dependent on the PK $X$ if there is a $Z$ such that
    1. $X \to Z$,
    2. $Z \to Y$, and
    3. $Z \not\subseteq PK$
  $X \to Y$ is then a transitive dependency on the PK

A schema $R$ is in 3NF if it is in 2NF and every non-prime attribute is nontransitively dependent on the PK of $R$

To normalize to 3NF:
  1. For every table $R$ in which FD $Z \to Y$ violates 3NF
     (a) create 2 tables:
         i. $R_1 = Z \cup Y$
         ii. $R_2 = R - Y$

Codd's definition of 3NF is based on 2NF

Given schema $R$ and set of FDs $F$, create a 3NF decomposition directly from 1NF by:
  1. Find a minimal cover $F_c$ of $F$
  2. For each unique set of attributes $X$ appearing on the left-hand side of FDs $X \to Y_i \in F_c$
     (a) Create a schema consisting of $X \cup \bigcup_{i=1}^{n} Y_i$
  3. Create a schema containing any attributes of $R$ not included in the previous step
  4. If none of the schemas created contains a candidate key
     (a) Create a schema containing a candidate key

Resulting DB schema is guaranteed to be
  1. Lossless join
  2. Dependency preserving
but is not unique
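A Python sketch of the direct 3NF construction (steps 2 to 4), assuming the input FDs already form a minimal cover. The function names, the single-letter attribute encoding, and the example schema are hypothetical. The sketch follows the steps literally, so it does not merge a schema that happens to be contained in another.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def find_key(schema, fds):
    """Shrink the full schema to one candidate key."""
    key = set(schema)
    for a in sorted(schema):
        if closure(key - {a}, fds) == set(schema):
            key.remove(a)
    return key

def synthesize_3nf(schema, min_cover):
    # Step 2: one schema per distinct left-hand side, X plus all its Y_i.
    groups = {}
    for lhs, rhs in min_cover:
        groups.setdefault(frozenset(lhs), set(lhs)).update(rhs)
    schemas = [frozenset(s) for s in groups.values()]
    # Step 3: attributes of R not mentioned in any FD.
    leftover = set(schema) - set().union(*schemas)
    if leftover:
        schemas.append(frozenset(leftover))
    # Step 4: make sure some schema contains a candidate key (is a superkey of R).
    if not any(closure(s, min_cover) == set(schema) for s in schemas):
        schemas.append(frozenset(find_key(schema, min_cover)))
    return schemas

# Assumed to be a minimal cover already: A -> B, B -> C on R(A, B, C, D).
result = synthesize_3nf("ABCD", [("A", "B"), ("B", "C")])
print(result)   # schemas AB, BC, D, and the key AD
```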

Relational Design: Normalization - General Definitions for 2NF and 3NF

A schema $R$ is in 2NF if it is in 1NF and every non-prime attribute is fully functionally dependent on every CK of $R$

A schema $R$ is in 3NF if it is in 2NF and every non-prime attribute is nontransitively dependent on every CK of $R$

Alternative 3NF def: A schema $R$ is in 3NF if for every non-trivial FD $X \to Y \in F^+$, either
  1. $X$ is a superkey of $R$, or
  2. Every attribute in $Y$ is prime

Relational Design: Normalization - BCNF

A schema $R$ is in BCNF if every attribute is fully functionally dependent on every CK of $R$

Alternative BCNF def: A schema $R$ is in BCNF if, for every non-trivial FD $X \to Y \in F^+$, $X$ is a superkey of $R$

To normalize to BCNF:
  1. For every schema $R$ in which FD $X \to Y \in F^+$ violates BCNF
     (a) Replace schema $R$ with schemas
         i. $R_1 = X \cup Y$
         ii. $R_2 = R - Y$

Resulting DB schema is guaranteed to be
  - Lossless join

Resulting DB schema is not guaranteed to be
  - Dependency preserving
  - Unique

A 3NF schema can fail to be in BCNF only when
  1. Multiple CKs exist
  2. CKs overlap
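A Python sketch of the BCNF decomposition loop. To find violations among the FDs of $F^+$ that apply to a subschema, it tries every attribute subset $X$ and takes $Y = (X^+ \cap R_i) - X$; this brute-force search is an assumption of the sketch (exponential, fine only for small schemas), as are the function names and the example.

```python
from itertools import combinations

def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def find_violation(r, fds):
    """Return (X, Y) with X -> Y non-trivial on r and X not a superkey of r, or None."""
    for k in range(1, len(r)):
        for x in combinations(sorted(r), k):
            y = (closure(x, fds) & r) - set(x)
            if y and set(x) | y != r:       # X determines something, but not all of r
                return frozenset(x), frozenset(y)
    return None

def bcnf_decompose(schema, fds):
    todo, done = [frozenset(schema)], []
    while todo:
        r = todo.pop()
        violation = find_violation(r, fds)
        if violation is None:
            done.append(r)
        else:
            x, y = violation
            todo += [x | y, r - y]          # R1 = X u Y, R2 = R - Y
    return done

fds = [({"Snum"}, {"City"}), ({"City"}, {"Status"})]
parts = bcnf_decompose({"Snum", "City", "Status"}, fds)
print(parts)   # the schemas {Snum, City} and {City, Status}, in some order
```

Here $\mathit{City} \to \mathit{Status}$ violates BCNF on the full schema, so the schema splits into (City, Status) and (Snum, City), both of which are in BCNF.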

Relational Design: Normalization - Multivalued Dependencies (MVDs)

FDs represent constraints that specify what should not appear in a table
Multivalued dependencies represent constraints that specify what must appear in a relation

Generally arise when 1 schema represents more than one 1:n relationship

Expressed as $X \twoheadrightarrow Y$

Formally: Given
  1. schema $R$,
  2. $r(R)$,
  3. $X \subseteq R$,
  4. $Y \subseteq R$,
  5. $t_1, t_2, t_3, t_4 \in r$, and
  6. $Z = R - (X \cup Y)$
MVD $X \twoheadrightarrow Y$ holds on $R$ if for all tuples $t_1, t_2$ where $t_1[X] = t_2[X]$, there exist tuples $t_3, t_4 \in r$ such that
  1. $t_3[X] = t_4[X] = t_1[X] = t_2[X]$
  2. $t_3[Y] = t_1[Y]$ and $t_4[Y] = t_2[Y]$
  3. $t_3[Z] = t_2[Z]$ and $t_4[Z] = t_1[Z]$

If $X \twoheadrightarrow Y$ holds, then $X \twoheadrightarrow Z$ holds

FDs are equality-generating; MVDs are tuple-generating
  - If an MVD is not satisfied, can add tuples to amend the situation

MVD $X \twoheadrightarrow Y$ is trivial if $Y \subseteq X$ or $X \cup Y = R$

Closure of a set of FDs and MVDs $F$, denoted $F^+$, is the set of all dependencies implied by $F$
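A Python sketch of testing whether a relation instance satisfies an MVD, directly following the formal definition. The function name and the employee/project/dependent rows are illustrative assumptions, and the sketch assumes $X$ and $Y$ are disjoint. Checking every ordered pair $(t_1, t_2)$ for the tuple $t_3$ also covers $t_4$ by symmetry.

```python
def satisfies_mvd(rows, schema, x, y):
    """True if the rows satisfy X ->> Y, where Z is the rest of the schema."""
    z = [a for a in schema if a not in x and a not in y]

    def proj(t, attrs):
        return tuple(t[a] for a in attrs)

    present = {(proj(t, x), proj(t, y), proj(t, z)) for t in rows}
    for t1 in rows:
        for t2 in rows:
            if proj(t1, x) == proj(t2, x):
                # Required t3: X and Y values from t1, Z values from t2.
                if (proj(t1, x), proj(t1, y), proj(t2, z)) not in present:
                    return False
    return True

schema = ["Ename", "Pname", "Dname"]
emp = [
    {"Ename": "Smith", "Pname": "X", "Dname": "John"},
    {"Ename": "Smith", "Pname": "Y", "Dname": "Anna"},
    {"Ename": "Smith", "Pname": "X", "Dname": "Anna"},
    {"Ename": "Smith", "Pname": "Y", "Dname": "John"},
]

print(satisfies_mvd(emp, schema, ["Ename"], ["Pname"]))      # True
print(satisfies_mvd(emp[:3], schema, ["Ename"], ["Pname"]))  # False: (Smith, Y, John) is missing
```

The second call shows the tuple-generating nature of MVDs: adding the missing tuple repairs the violation.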

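Relational Design: Normalization - MVDs (2)

Inference rules for MVDs:
  IR1-3  Armstrong's Axioms
  IR4 (complementation)  $\{X \twoheadrightarrow Y\} \models \{X \twoheadrightarrow (R - (X \cup Y))\}$
  IR5 (augmentation)  If $X \twoheadrightarrow Y$ and $Z \subseteq W$, then $WX \twoheadrightarrow YZ$
  IR6 (transitivity)  $\{X \twoheadrightarrow Y, Y \twoheadrightarrow Z\} \models \{X \twoheadrightarrow (Z - Y)\}$
  IR7 (replication)  $\{X \to Y\} \models \{X \twoheadrightarrow Y\}$
  IR8 (coalescence)  If $X \twoheadrightarrow Y$ and there exists $W$ such that $W \cap Y$ is empty, $W \to Z$, and $Z \subseteq Y$, then $X \to Z$
  IR9 (union)  $\{X \twoheadrightarrow Y, X \twoheadrightarrow Z\} \models \{X \twoheadrightarrow YZ\}$
  IR10 (intersection)  $\{X \twoheadrightarrow Y, X \twoheadrightarrow Z\} \models \{X \twoheadrightarrow (Y \cap Z)\}$
  IR11 (difference)  $\{X \twoheadrightarrow Y, X \twoheadrightarrow Z\} \models \{X \twoheadrightarrow (Y - Z)\}$ and $\{X \twoheadrightarrow (Z - Y)\}$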

Relational Design: Normalization - 4NF

A schema $R$ is in 4NF wrt $F$ if for every $X \twoheadrightarrow Y$ in $F^+$
  1. $X \twoheadrightarrow Y$ is trivial, or
  2. $X$ is a superkey of $R$

Theorem: Decomposition $\{R_1, R_2\}$ of $R$ is LLJ if
  1. $R_1 \cap R_2 \twoheadrightarrow R_1$, or
  2. $R_1 \cap R_2 \twoheadrightarrow R_2$

To normalize to 4NF:
  1. For every schema $R$ in which MVD $X \twoheadrightarrow Y$ violates 4NF
     (a) create 2 schemas
         i. $R_1 = X \cup Y$
         ii. $R_2 = R - Y$

The restriction of $F$ to $R_i$: Let
  1. $F$ be a set of FDs and MVDs on $R$,
  2. $D = \{R_1, R_2, \ldots, R_n\}$ be a decomposition of $R$
The restriction of $F$ to $R_i$ consists of
  1. The set of FDs $X \to Y$ in $F^+$ that include only attributes of $R_i$
  2. The set of MVDs $X \twoheadrightarrow Y \cap R_i$, where $X \subseteq R_i$ and $X \twoheadrightarrow Y \in F^+$
Denoted $F_i$

A decomposition $D = \{R_1, R_2, \ldots, R_n\}$ is dependency preserving wrt $F$ if for every $r_1(R_1), r_2(R_2), \ldots, r_n(R_n)$, where $r_i$ satisfies $F_i$, there exists $r(R)$ that satisfies $F$ and for which $r_i = \Pi_{R_i}(r)$ for all $i$

Relational Design: Normalization - Join Dependencies (JDs)

JD of schema $R$ with respect to decomposition $\{R_1, R_2, \ldots\}$
  - Specifies a constraint on $r(R)$ such that the only legal relations $r(R)$ are those for which $\Pi_{R_1}(r) \bowtie \Pi_{R_2}(r) \bowtie \ldots = r(R)$
  - Denoted $\bowtie(R_1, R_2, \ldots)$

A trivial JD is one in which at least 1 $R_i = R$

Relation between JDs and MVDs:
  For JDs consisting of 2 subschemas, i.e. $R$ and JD $\bowtie(R_1, R_2)$
    1. $R_1 = (R_1 \cap R_2) \cup (R_1 - R_2)$
    2. $R_2 = (R_1 \cap R_2) \cup (R_2 - R_1)$
  This represents 2 independent relationships of $(R_1 \cap R_2)$
    - One with $(R_1 - R_2)$, and one with $(R_2 - R_1)$
  Therefore, $(R_1 \cap R_2) \twoheadrightarrow (R_1 - R_2)$, and $(R_1 \cap R_2) \twoheadrightarrow (R_2 - R_1)$

Every JD with $n = 2$ represents an MVD
  - Not true for $n > 2$

Relational Design: Normalization - 5NF (Project Join NF)

Schema $R$ is in 5NF wrt a set $F$ of FDs, MVDs, and JDs if for every non-trivial JD $\bowtie(R_1, R_2, \ldots) \in F^+$, every $R_i$ is a superkey of $R$

To convert to 5NF:
  1. Decompose $R$ into $\{R_1, R_2, \ldots\}$

Relational Design: Normalization - 6NF (Domain-Key NF)

In DKNF, all constraints can be enforced by enforcing only the domain and key constraints

Formally: Let
  1. $G$ be a set of general constraints on schema $R$,
  2. $D$ be the domain constraints and $K$ the key constraints,
  3. $D, K \subseteq G$
Then $R$ is in DKNF if $D \cup K \models G$

Relational Design: Normalization - Inclusion Dependencies

Inclusion dependencies are a way of representing inter-relational constraints
  - Cannot use FDs or MVDs to represent referential constraints
  - Cannot use FDs or MVDs to represent superclass-subclass constraints

Formally: Given
  1. schemas $R$, $S$,
  2. $r(R)$, $s(S)$,
  3. $X \subseteq R$,
  4. $Y \subseteq S$, and
  5. $X$ and $Y$ are union compatible
$R.X < S.Y$ holds if $\pi_X(r(R)) \subseteq \pi_Y(s(S))$ for every $r(R)$, $s(S)$

Inference rules:
  1. Reflexive: $R.X < R.X$
  2. Attribute Correspondence: If $R.X < S.Y$ where
     (a) $X = \{a_1, a_2, \ldots, a_n\}$,
     (b) $Y = \{b_1, b_2, \ldots, b_n\}$, and
     (c) $a_i$ corresponds to $b_i$
     Then $R.a_i < S.b_i$, $1 \le i \le n$

Relational Design: Normalization - Template Dependencies

Template dependencies are a formalism for representing any type of constraint