Database Systems. Basics of the Relational Data Model

Database Systems
Relational Design Theory
Jens Otten, University of Oslo
INF3100, Spring 18

Basics of the Relational Data Model

    title               year  length  genre
    Star Wars           1977  124     SciFi
    Gone With the Wind  1939  231     drama
    Wayne's World       1992  95      comedy

relation: a two-dimensional table representing data
attributes: the first line (header row), describing the meaning of the entries
tuples: the rows of a relation other than the header row containing the attribute names; one component for each attribute
schema: the name of a relation together with a set of attributes, e.g. Movies(title, year, length, genre); in general: R(A, B, C)
domains: determine the set of values the components of a tuple can belong to; each attribute has an associated domain
key: a set of attributes such that no two tuples have the same values in all the attributes of the key
relational database (DB): consists of one or more relations

Relational Algebra

The query language SQL incorporates relational algebra at its core.

set operations: union (∪), intersection (∩), difference (\)
projection: π_{A_1,...,A_n}(Relation-Name) results in a relation that has only the attributes A_1, ..., A_n
selection: σ_C(Relation-Name) with condition C results in a relation that has only those tuples satisfying C
cartesian product (R × S): pairs every tuple of R with every tuple of S to form a new relation
natural join (R ⋈ S): pairs only those tuples that agree on the common attributes of R and S
theta-join (R ⋈_C S): the product considers only pairs of tuples satisfying C

Design Theory for Relational DBs

Goal: designing good relations without flaws
- identify dependencies between entries in a relation
- define functional dependency: a generalization of the idea of a key
- anomalies: problems that occur because of dependencies
- use the notion of functional dependencies to define normal forms (that do not have these anomalies)
- normalization: decompose relations into two or more relations (in order to remove anomalies)
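The projection, selection, and natural join operators above can be sketched in a few lines of Python. This is only an illustration: representing a relation as a list of dicts, and the function names `project`, `select`, and `natural_join`, are assumptions made here and do not come from the slides.

```python
# Assumption: a relation is a list of dicts mapping attribute names to values.

def project(relation, attrs):
    """pi_attrs(relation): keep only attrs and drop duplicate tuples."""
    seen, result = set(), []
    for t in relation:
        row = tuple(t[a] for a in attrs)
        if row not in seen:
            seen.add(row)
            result.append(dict(zip(attrs, row)))
    return result


def select(relation, condition):
    """sigma_C(relation): keep only the tuples satisfying the condition."""
    return [t for t in relation if condition(t)]


def natural_join(r, s):
    """Pair only those tuples that agree on all common attributes."""
    result = []
    for t in r:
        for u in s:
            common = set(t) & set(u)
            if all(t[a] == u[a] for a in common):
                result.append({**t, **u})
    return result


movies = [
    {"title": "Star Wars", "year": 1977, "length": 124, "genre": "SciFi"},
    {"title": "Gone With the Wind", "year": 1939, "length": 231, "genre": "drama"},
    {"title": "Wayne's World", "year": 1992, "length": 95, "genre": "comedy"},
]

long_movies = select(movies, lambda t: t["length"] > 100)
print([t["title"] for t in long_movies])  # ['Star Wars', 'Gone With the Wind']

lengths = project(movies, ["title", "length"])
genres = project(movies, ["title", "genre"])
print(natural_join(lengths, genres)[0])
# {'title': 'Star Wars', 'length': 124, 'genre': 'SciFi'}
```

If two relations share no attribute, `natural_join` as written degenerates to the cartesian product, which matches the usual convention.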

Functional Dependencies

Definition (Functional Dependency (FD)). Let A_1, ..., A_n, B_1, ..., B_m be attributes (n, m ≥ 1) of a relation R. Then A_1, ..., A_n functionally determine B_1, ..., B_m, written

    A_1 A_2 ... A_n -> B_1 B_2 ... B_m

iff the following condition holds: if two tuples of R have the same values on all attributes A_1, ..., A_n, then they also have the same values on all attributes B_1, ..., B_m.

(Figure: two tuples t and u; if t and u agree on A_1 ... A_n, then they must agree on B_1 ... B_m.)

- resembles regular functions f: x_1 = x_2 ⇒ f(x_1) = f(x_2)
- a relation R satisfies an FD if the FD is true for every instance of R
- A_1 A_2 ... A_n -> B_1 B_2 ... B_m is equivalent to the set of FDs A_1 A_2 ... A_n -> B_1, A_1 A_2 ... A_n -> B_2, ..., A_1 A_2 ... A_n -> B_m

Example: Functional Dependencies

Relation: Movies1(title, year, length, genre, studioname, starname)

    title               year  length  genre   studioname  starname
    Star Wars           1977  124     SciFi   Fox         Carrie Fisher
    Star Wars           1977  124     SciFi   Fox         Harrison Ford
    Gone With the Wind  1939  231     drama   MGM         Vivien Leigh
    Wayne's World       1992  95      comedy  Paramount   Dana Carvey
    Wayne's World       1992  95      comedy  Paramount   Mike Meyers

- the following is an FD: title year -> length genre studioname (assumption: no two movies in the same year have the same title)
- the following is not an FD: title year -> starname (there might be more than one star in a movie)
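The condition in the FD definition can be checked mechanically on a single instance. The sketch below is illustrative only; the function name `satisfies_fd` and the list-of-dicts representation are assumptions. Note that one instance can only refute an FD: a relation satisfies an FD only if every instance does.

```python
def satisfies_fd(relation, lhs, rhs):
    """Return True if this instance satisfies the FD lhs -> rhs.

    relation: list of dicts (attribute -> value); lhs, rhs: lists of names.
    """
    seen = {}  # values on the left side -> values on the right side
    for t in relation:
        left = tuple(t[a] for a in lhs)
        right = tuple(t[a] for a in rhs)
        if seen.setdefault(left, right) != right:
            return False  # two tuples agree on lhs but differ on rhs
    return True


ATTRS = ["title", "year", "length", "genre", "studioname", "starname"]
ROWS = [
    ("Star Wars", 1977, 124, "SciFi", "Fox", "Carrie Fisher"),
    ("Star Wars", 1977, 124, "SciFi", "Fox", "Harrison Ford"),
    ("Gone With the Wind", 1939, 231, "drama", "MGM", "Vivien Leigh"),
    ("Wayne's World", 1992, 95, "comedy", "Paramount", "Dana Carvey"),
    ("Wayne's World", 1992, 95, "comedy", "Paramount", "Mike Meyers"),
]
movies1 = [dict(zip(ATTRS, row)) for row in ROWS]

print(satisfies_fd(movies1, ["title", "year"], ["length", "genre", "studioname"]))  # True
print(satisfies_fd(movies1, ["title", "year"], ["starname"]))  # False
```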

Keys of Relations

Goal: Use a key as an identifier for an entry in a relation R.

Definition (Key). A set of attributes A = {A_1, A_2, ..., A_n} is a key for a relation R iff:
1. A functionally determines all other attributes of R, i.e. A_1 A_2 ... A_n -> B_i for all other attributes B_i of R (two distinct tuples do not agree on all of A_1, ..., A_n).
2. No proper subset of A functionally determines all other attributes of R, i.e. a key must be minimal.

- sometimes a key is also called a candidate key
- there might be more than one key (in this case it is common to designate one of the keys as the primary key)

Example: {title, year, starname} is a key for Movies1; but {title, year}, {year, starname}, and {title, starname} are not keys for Movies1.

Superkey

Definition (Superkey). A set of attributes A = {A_1, A_2, ..., A_n} is a superkey iff it is a superset of a key A', i.e. A' ⊆ A for some key A'.

- every key is a superkey, but not vice versa, as a superkey satisfies the first condition but not necessarily the second (minimality)

Example: Any set of attributes that is a superset of {title, year, starname} is a superkey, e.g. {title, year, starname, length, studioname} is a superkey.
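The parenthetical in condition 1 (two distinct tuples do not agree on all key attributes) can be tested on a sample instance. The sketch below is a hedged illustration; `distinguishes_all` is a hypothetical helper name, and a small sample can only rule candidates out, never prove that a set of attributes is a key.

```python
def distinguishes_all(relation, attrs):
    """True if no two tuples of this instance agree on all of attrs."""
    projected = [tuple(t[a] for a in attrs) for t in relation]
    return len(set(projected)) == len(projected)


ATTRS = ["title", "year", "length", "genre", "studioname", "starname"]
ROWS = [
    ("Star Wars", 1977, 124, "SciFi", "Fox", "Carrie Fisher"),
    ("Star Wars", 1977, 124, "SciFi", "Fox", "Harrison Ford"),
    ("Gone With the Wind", 1939, 231, "drama", "MGM", "Vivien Leigh"),
    ("Wayne's World", 1992, 95, "comedy", "Paramount", "Dana Carvey"),
    ("Wayne's World", 1992, 95, "comedy", "Paramount", "Mike Meyers"),
]
movies1 = [dict(zip(ATTRS, row)) for row in ROWS]

print(distinguishes_all(movies1, ["title", "year"]))              # False
print(distinguishes_all(movies1, ["title", "year", "starname"]))  # True
# Caution: the sample is too small to rule out {starname}, although an
# actor can star in several movies, so starname alone is not a key.
print(distinguishes_all(movies1, ["starname"]))                   # True
```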

Rules about Functional Dependencies

- how to reason about FDs, e.g. assume that a relation satisfies a set of FDs; can we deduce other FDs the relation must satisfy?
- discovering additional FDs is essential when designing good schemas

Example: Relation R(A, B, C) satisfies A -> B and B -> C. Then we can deduce that R also satisfies A -> C.
Proof: Given tuples (a, b_1, c_1) and (a, b_2, c_2) that agree on attribute A. They also agree on B, i.e. b_1 = b_2; hence, they also agree on C, i.e. c_1 = c_2.

Definition (S Follows From T). A set of FDs S follows from a set of FDs T if every relation instance that satisfies all FDs in T also satisfies all FDs in S.

- in general: s follows from t, t(x) ⇒ s(x), if every element x that has property t also has property s

Equivalence, Splitting, Combining

Definition (Equivalence of FDs). Two sets of FDs S and T are equivalent iff the sets of relation instances satisfying S and T are exactly the same, i.e. S follows from T and T follows from S.

Theorem (Splitting/Combining Rule). The FD A_1 A_2 ... A_n -> B_1 B_2 ... B_m is equivalent to the set of FDs A_1 A_2 ... A_n -> B_i for i = 1, ..., m.

- we can replace the FD A_1 ... A_n -> B_1 ... B_m by the set of FDs A_1 ... A_n -> B_i for 1 ≤ i ≤ m ("splitting"), and vice versa ("combining")

Example: title year -> length, title year -> genre, and title year -> studioname together are equivalent to title year -> length genre studioname. But splitting the left side, i.e. turning title year -> length into title -> length and year -> length, is not allowed.
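The splitting/combining rule only touches right sides, which a short sketch makes explicit. Representing an FD as a pair `(lhs, rhs)` of attribute tuples, and the names `split_fd` and `combine_fds`, are assumptions made for this illustration.

```python
def split_fd(lhs, rhs):
    """Splitting: lhs -> B1 ... Bm becomes lhs -> B1, ..., lhs -> Bm."""
    return [(tuple(lhs), (b,)) for b in rhs]


def combine_fds(fds):
    """Combining: merge the right sides of FDs with the same left side."""
    merged = {}  # frozenset(lhs) -> (lhs as first written, right-side list)
    for lhs, rhs in fds:
        first_lhs, bucket = merged.setdefault(frozenset(lhs), (tuple(lhs), []))
        for b in rhs:
            if b not in bucket:
                bucket.append(b)
    return [(lhs, tuple(bucket)) for lhs, bucket in merged.values()]


fd = (("title", "year"), ("length", "genre", "studioname"))
parts = split_fd(*fd)
print(parts[0])                    # (('title', 'year'), ('length',))
print(combine_fds(parts) == [fd])  # True
```

Neither function ever changes a left side, in line with the remark that splitting title year -> length into title -> length and year -> length is not allowed.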

Trivial FDs

Definition (Trivial FD). An FD A_1 A_2 ... A_n -> B_1 B_2 ... B_m is trivial iff {B_1, ..., B_m} ⊆ {A_1, ..., A_n}.

Example: title year -> title is trivial.

Theorem (Trivial-Dependency Rule). Let A_1 A_2 ... A_n -> B_1 B_2 ... B_m C_1 C_2 ... C_k be an FD with {B_1, B_2, ..., B_m} ⊆ {A_1, A_2, ..., A_n} and {A_1, A_2, ..., A_n} ∩ {C_1, C_2, ..., C_k} = {}. Then this FD is equivalent to the non-trivial FD A_1 A_2 ... A_n -> C_1 C_2 ... C_k.

- we can replace FDs by appropriate non-trivial FDs

Example: Replace title year -> title length by title year -> length.

Closure of Attributes

The closure of attributes is an important property when dealing with keys.

Definition (Closure of Attributes). Given a set of attributes A = {A_1, A_2, ..., A_n} and a set S of FDs. The closure of A is the set of all attributes B such that A_1 A_2 ... A_n -> B follows from the set S. The closure of {A_1, A_2, ..., A_n} is denoted by {A_1, A_2, ..., A_n}+.

- {A_1, A_2, ..., A_n} ⊆ {A_1, A_2, ..., A_n}+ always holds, as A_1 A_2 ... A_n -> A_i is a trivial FD for i = 1, ..., n
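The trivial-dependency rule from the Trivial FDs slide amounts to removing from the right side every attribute that already occurs on the left. A minimal sketch, with the hypothetical helper names `is_trivial` and `remove_trivial_part`:

```python
def is_trivial(lhs, rhs):
    """An FD is trivial iff its right side is a subset of its left side."""
    return set(rhs) <= set(lhs)


def remove_trivial_part(lhs, rhs):
    """Drop right-side attributes that already appear on the left side."""
    return tuple(lhs), tuple(b for b in rhs if b not in lhs)


print(is_trivial(("title", "year"), ("title",)))  # True
print(remove_trivial_part(("title", "year"), ("title", "length")))
# (('title', 'year'), ('length',))
```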

Algorithm to Compute Closure of Attributes

Idea: Start with the given set of attributes and repeatedly expand it by adding the right sides of FDs whose left-side attributes are all included.

INPUT: A set of attributes A = {A_1, A_2, ..., A_n} and a set S of FDs (such that all FDs contain only one attribute on the right side)
OUTPUT: {A_1, A_2, ..., A_n}+

1. Set X = {A_1, A_2, ..., A_n}.
2. Select an FD B_1 ... B_m -> C from S with {B_1, ..., B_m} ⊆ X and C ∉ X.
3. Add C to the set X and go to step 2.
4. If there is no such FD, then X is the set {A_1, A_2, ..., A_n}+.

Example: Relation with FDs AB -> C, BC -> A, BC -> D, D -> E, CF -> B. What is the closure {A, B}+ of {A, B}?
a. X = {A, B},
b. X = {A, B, C} (because of AB -> C),
c. X = {A, B, C, D} (because of BC -> D),
d. X = {A, B, C, D, E} (because of D -> E).
Thus, {A, B}+ = {A, B, C, D, E}.

Consequences of FDs

By computing the closure of any set of attributes, we can test whether any given FD A_1 A_2 ... A_n -> B follows from a set of FDs S.

Theorem (Consequences of FDs). An FD A_1 A_2 ... A_n -> B follows from a set S of FDs iff B ∈ {A_1, A_2, ..., A_n}+. An FD A_1 A_2 ... A_n -> B_1 ... B_m follows from a set S of FDs iff {B_1, ..., B_m} ⊆ {A_1, A_2, ..., A_n}+.

Proof: We need to prove that the algorithm neither computes too few nor too many FDs; it computes exactly those FDs that follow from S (induction on the number of times we extend the set X).

Example (continued from previous slide):
1. Does AB -> D follow from the given FDs? Yes, since {A, B}+ = {A, B, C, D, E} includes D.
2. Does AB -> F follow from the given FDs? No, since {A, B}+ = {A, B, C, D, E} does not include F.
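The closure algorithm and the consequence test translate almost directly into code. The sketch below assumes single-letter attribute names written as strings (so "AB" stands for {A, B}) and represents each FD as a pair `(lhs, rhs)`; the names `closure` and `follows` are chosen here for illustration.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) pairs."""
    x = set(attrs)                      # step 1
    changed = True
    while changed:                      # steps 2 and 3, repeated
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= x and not set(rhs) <= x:
                x |= set(rhs)
                changed = True
    return x                            # step 4: no applicable FD is left


def follows(lhs, rhs, fds):
    """lhs -> rhs follows from fds iff rhs lies in the closure of lhs."""
    return set(rhs) <= closure(lhs, fds)


FDS = [("AB", "C"), ("BC", "A"), ("BC", "D"), ("D", "E"), ("CF", "B")]
print(sorted(closure("AB", FDS)))  # ['A', 'B', 'C', 'D', 'E']
print(follows("AB", "D", FDS))     # True
print(follows("AB", "F", FDS))     # False
```

The loop also accepts FDs with several attributes on the right side, which by the splitting rule is equivalent to the single-attribute form the slide assumes.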

Transitivity

Theorem (Transitivity Rule). If A_1 ... A_n -> B_1 ... B_m and B_1 ... B_m -> C_1 ... C_k hold in relation R, then A_1 ... A_n -> C_1 ... C_k holds in R as well.

Proof: Compute the closure of {A_1, ..., A_n}. Then B_1, ..., B_m are in {A_1, ..., A_n}+ (1st FD); then C_1, ..., C_k are in {A_1, ..., A_n}+ (2nd FD).

Example:

    title          year  length  genre   studioname  studioaddr
    Star Wars      1977  124     SciFi   Fox         Hollywood
    Eight Below    2005  120     drama   Disney      Buena Vista
    Wayne's World  1992  95      comedy  Paramount   Hollywood

We have title year -> studioname and studioname -> studioaddr. Hence title year -> studioaddr holds as well.

Fact: {A_1, A_2, ..., A_n}+ is the set of all attributes of a relation if and only if {A_1, A_2, ..., A_n} is a superkey for the relation.

Test whether A = {A_1, ..., A_n} is a key for relation R by checking that A+ is the set of all attributes and that there is no proper subset X of A such that X+ is the set of all attributes.

A Complete Set of Inference Rules

If we want to know whether one FD follows from some given set of FDs, the closure algorithm can be used. An alternative way is through Armstrong's Axioms, from which any FD that follows from a given set of FDs can be derived.

Definition (Armstrong's Axioms).
1. Reflexivity (trivial FDs): If {B_1, ..., B_m} ⊆ {A_1, A_2, ..., A_n} then A_1 A_2 ... A_n -> B_1 ... B_m.
2. Augmentation: If A_1 A_2 ... A_n -> B_1 B_2 ... B_m then A_1 A_2 ... A_n C_1 ... C_k -> B_1 B_2 ... B_m C_1 ... C_k for any set of attributes {C_1, ..., C_k}.
3. Transitivity: If A_1 A_2 ... A_n -> B_1 ... B_m and B_1 ... B_m -> C_1 ... C_k then A_1 A_2 ... A_n -> C_1 ... C_k.
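The superkey fact and the key test from the Transitivity slide can be sketched with the attribute closure. The function names `is_superkey` and `is_key` are assumptions made for this illustration; FDs are pairs of attribute-name tuples. Because the closure is monotone, it is enough to test the subsets obtained by removing a single attribute.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) pairs."""
    x = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= x and not set(rhs) <= x:
                x |= set(rhs)
                changed = True
    return x


def is_superkey(attrs, all_attrs, fds):
    """attrs is a superkey iff its closure contains every attribute."""
    return closure(attrs, fds) == set(all_attrs)


def is_key(attrs, all_attrs, fds):
    """A key is a superkey with no proper subset that is a superkey."""
    attrs = list(attrs)
    if not is_superkey(attrs, all_attrs, fds):
        return False
    # Any smaller superkey would be contained in one of these subsets.
    return not any(
        is_superkey(attrs[:i] + attrs[i + 1:], all_attrs, fds)
        for i in range(len(attrs))
    )


MOVIES1 = ("title", "year", "length", "genre", "studioname", "starname")
FDS = [(("title", "year"), ("length", "genre", "studioname"))]

print(is_key(("title", "year", "starname"), MOVIES1, FDS))            # True
print(is_superkey(("title", "year"), MOVIES1, FDS))                   # False
print(is_key(("title", "year", "starname", "length"), MOVIES1, FDS))  # False
```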

Anomalies in Relation Schemas

Goal: Avoid redundancy and anomalies in database schemas.

Anomalies in relation schemas:
1. Redundancy: Information may be repeated in several tuples (e.g. length or genre)
2. Update Anomalies: Changing information in one tuple but not in another one (e.g. changing the length of Star Wars)
3. Deletion Anomalies: If a value of a tuple becomes empty, we may lose other information as a side effect (e.g. delete Vivien Leigh from Gone With the Wind)

Example: Relation Movies1

    title               year  length  genre   studioname  starname
    Star Wars           1977  124     SciFi   Fox         Carrie Fisher
    Star Wars           1977  124     SciFi   Fox         Harrison Ford
    Gone With the Wind  1939  231     drama   MGM         Vivien Leigh
    Wayne's World       1992  95      comedy  Paramount   Dana Carvey
    Wayne's World       1992  95      comedy  Paramount   Mike Meyers

Decomposing Relations

Decomposition: breaking a relation schema (set of attributes) into smaller schemas in order to eliminate anomalies.

Definition (Decomposing Relations). Given a relation R(A_1, ..., A_n), we may decompose R into two relations S(B_1, ..., B_m) and T(C_1, ..., C_k) such that:
1. {A_1, ..., A_n} = {B_1, ..., B_m} ∪ {C_1, ..., C_k}
2. S = π_{B_1,...,B_m}(R)
3. T = π_{C_1,...,C_k}(R)

Example: Decompose relation Movies1 into
1. a relation called Movies2, whose schema consists of all the attributes except starname, and
2. a relation called Movies3, whose schema consists of the attributes title, year, and starname.
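The decomposition of Movies1 into Movies2 and Movies3 is just two projections. A small sketch under the assumption that a relation is a list of dicts; the helper `project` is written here for illustration.

```python
def project(relation, attrs):
    """pi_attrs(relation): keep only attrs and drop duplicate tuples."""
    seen, result = set(), []
    for t in relation:
        row = tuple(t[a] for a in attrs)
        if row not in seen:
            seen.add(row)
            result.append(dict(zip(attrs, row)))
    return result


ATTRS = ["title", "year", "length", "genre", "studioname", "starname"]
ROWS = [
    ("Star Wars", 1977, 124, "SciFi", "Fox", "Carrie Fisher"),
    ("Star Wars", 1977, 124, "SciFi", "Fox", "Harrison Ford"),
    ("Gone With the Wind", 1939, 231, "drama", "MGM", "Vivien Leigh"),
    ("Wayne's World", 1992, 95, "comedy", "Paramount", "Dana Carvey"),
    ("Wayne's World", 1992, 95, "comedy", "Paramount", "Mike Meyers"),
]
movies1 = [dict(zip(ATTRS, row)) for row in ROWS]

movies2 = project(movies1, ["title", "year", "length", "genre", "studioname"])
movies3 = project(movies1, ["title", "year", "starname"])

# length and genre are now stored once per movie instead of once per star.
print(len(movies1), len(movies2), len(movies3))  # 5 3 5
```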

Decomposing Relations

Example: The relations Movies2 and Movies3.

    title               year  length  genre   studioname
    Star Wars           1977  124     SciFi   Fox
    Gone With the Wind  1939  231     drama   MGM
    Wayne's World       1992  95      comedy  Paramount

    title               year  starname
    Star Wars           1977  Carrie Fisher
    Star Wars           1977  Harrison Ford
    Gone With the Wind  1939  Vivien Leigh
    Wayne's World       1992  Dana Carvey
    Wayne's World       1992  Mike Meyers

This decomposition eliminates the anomalies mentioned before: length and genre appear only once per movie, and the risk of an update anomaly (e.g. if we change length) and of a deletion anomaly (e.g. if we delete the only star from Gone With the Wind) is gone.

Boyce-Codd Normal Form

Condition under which anomalies can be guaranteed not to exist.

Definition (Boyce-Codd Normal Form (BCNF)). A relation R is in Boyce-Codd Normal Form (BCNF) iff the following condition holds: whenever there is a non-trivial FD A_1 ... A_n -> B_1 ... B_m for R, then {A_1, ..., A_n} is a superkey of R.

Example: In Movies1, {title, year, starname} is the only key. Now consider the FD title year -> length genre studioname for Movies1. {title, year} is not a superkey (as title and year do not determine starname); hence, Movies1 is not in BCNF.
In Movies2 the only key is {title, year}, and every non-trivial FD, e.g. title year -> length genre studioname, has at least title and year on its left side; hence, their left sides must be superkeys. Thus, Movies2 is in BCNF.

Fact: Any two-attribute relation is in BCNF.
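The BCNF condition can be tested with the attribute closure: a non-trivial FD violates BCNF when the closure of its left side is not the full attribute set. The sketch below is illustrative; `bcnf_violations` is a hypothetical name, and it inspects only the FDs it is given, assuming they are stated over exactly the attributes of the relation.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) pairs."""
    x = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= x and not set(rhs) <= x:
                x |= set(rhs)
                changed = True
    return x


def bcnf_violations(all_attrs, fds):
    """Given FDs that are non-trivial and whose left side is no superkey."""
    everything = set(all_attrs)
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not set(rhs) <= set(lhs) and closure(lhs, fds) != everything
    ]


FD = (("title", "year"), ("length", "genre", "studioname"))
MOVIES1 = ("title", "year", "length", "genre", "studioname", "starname")
MOVIES2 = ("title", "year", "length", "genre", "studioname")

print(bcnf_violations(MOVIES1, [FD]) == [FD])  # True: Movies1 is not in BCNF
print(bcnf_violations(MOVIES2, [FD]))          # []: no violation in Movies2
```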

Decomposition into BCNF

We can break any relation schema into a collection of subsets of its attributes, such that
1. these subsets are schemas of relations in BCNF;
2. the data in the relations resulting from the decomposition faithfully represent the data in the original relation.

Idea:
1. Identify a non-trivial FD A_1 A_2 ... A_n -> B_1 ... B_m that violates BCNF, i.e. {A_1, A_2, ..., A_n} is not a superkey.
2. Break the attributes into two overlapping relation schemas
   a. {A_1, A_2, ..., A_n, B_1, ..., B_m} and
   b. {A_1, A_2, ..., A_n, C_1, ..., C_k}, where {C_1, ..., C_k} are all attributes not involved in the FD.

Example: In Movies1 the FD title year -> length genre studioname violates the BCNF condition; we decompose into
1. the schema {title, year, length, genre, studioname} and
2. the schema {title, year, starname}.

The BCNF Decomposition Algorithm

INPUT: A relation R_0 with a set of FDs S_0
OUTPUT: A decomposition of R_0 into a collection of relations that are all in BCNF

Apply the following steps recursively. Initially set R = R_0, S = S_0.
1. If R is in BCNF, then return {R}; otherwise continue with step 2.
2. Let X -> Y be an FD that violates the BCNF condition. Compute X+. Choose R_1 = X+ as one relation schema and let R_2 = X ∪ {C_i | C_i is an attribute of R with C_i ∉ X+} be the other.
3. Compute S_1 and S_2, the sets of FDs for R_1 and R_2.
4. Recursively decompose R_1 and R_2 and return the union of the results of these decompositions.
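The recursive algorithm can be sketched as follows. This is a hedged illustration, not the slides' implementation: step 3 is realized by the hypothetical helper `project_fds`, which computes the closure of every proper non-empty attribute subset and is therefore exponential in the number of attributes, and FDs with an empty left side are assumed not to occur.

```python
from itertools import combinations


def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) pairs."""
    x = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= x and not set(rhs) <= x:
                x |= set(rhs)
                changed = True
    return x


def project_fds(attrs, fds):
    """Step 3: the non-trivial FDs that hold on the attribute subset attrs."""
    attrs = sorted(attrs)
    projected = []
    for size in range(1, len(attrs)):
        for lhs in combinations(attrs, size):
            rhs = (closure(lhs, fds) & set(attrs)) - set(lhs)
            if rhs:
                projected.append((frozenset(lhs), frozenset(rhs)))
    return projected


def bcnf_decompose(attrs, fds):
    """Return a list of attribute sets (schemas) that are all in BCNF."""
    attrs = frozenset(attrs)
    for lhs, rhs in fds:
        x_plus = closure(lhs, fds) & attrs
        if not set(rhs) <= set(lhs) and x_plus != attrs:  # BCNF violation
            r1 = frozenset(x_plus)                        # R1 = X+
            r2 = frozenset(lhs) | (attrs - x_plus)        # R2 = X and the rest
            return (bcnf_decompose(r1, project_fds(r1, fds))
                    + bcnf_decompose(r2, project_fds(r2, fds)))
    return [attrs]                                        # step 1: R is in BCNF


MOVIES1 = ("title", "year", "length", "genre", "studioname", "starname")
FDS = [(("title", "year"), ("length", "genre", "studioname"))]
for schema in bcnf_decompose(MOVIES1, FDS):
    print(sorted(schema))
# ['genre', 'length', 'studioname', 'title', 'year']
# ['starname', 'title', 'year']
```

Different choices of the violating FD in step 2 can lead to different, equally valid decompositions; this sketch simply takes the first violation it finds.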

Consequences of Decomposing

- decomposing a relation schema into BCNF will avoid anomalies
- but decomposing can also have some negative consequences

Requirements for decompositions:
1. Elimination of anomalies (as mentioned before)
2. Recoverability of information (can we recover the original relation?)
3. Preservation of dependencies (when reconstructing the original relation, the result will satisfy the original FDs)

The presented decomposition algorithm will guarantee 1. and 2., but not necessarily 3.

Recovering Information from a Decomposition

Definition (Lossless Join). A decomposition has a lossless join iff the original relation R can be reconstructed from the relations of the decomposition.

Example: R(A, B, C) with the FD B -> C, which is a BCNF violation. The decomposition separates the attributes into the relations R_1(A, B) and R_2(B, C).

Let t = (a, b, c) be a tuple of R. Then (a, b) is in R_1(A, B) and (b, c) is in R_2(B, C). When computing the natural join R_1 ⋈ R_2, the result again contains the tuple t = (a, b, c).

Now, consider the tuples t = (a, b, c) and v = (d, b, e) of R. Then (a, b) is in R_1(A, B) and (b, e) is in R_2(B, C). In the natural join this leads to the tuple (a, b, e). Could (a, b, e) fail to be a tuple of R? Answer: no, as we assumed B -> C, i.e. if two tuples agree on B, they also agree on C; hence, c = e and (a, b, e) is indeed (a, b, c), which is in R.

Fact 1: If we decompose a relation according to the algorithm, then the original relation can be recovered exactly by natural join.
Fact 2: The natural join is associative and commutative.
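The role of the FD B -> C in the lossless-join argument can be seen on two tiny instances. The sketch below assumes R(A, B, C) is a set of 3-tuples; the function name `decompose_and_rejoin` and the sample values are made up for illustration.

```python
def decompose_and_rejoin(r):
    """Project R(A, B, C) onto (A, B) and (B, C), then natural-join on B."""
    r1 = {(a, b) for a, b, c in r}
    r2 = {(b, c) for a, b, c in r}
    return {(a, b, c) for a, b in r1 for b2, c in r2 if b == b2}


# B -> C holds: both tuples with B = "b" also agree on C.
with_fd = {("a", "b", "c"), ("d", "b", "c")}
# B -> C is violated: the tuples agree on B but differ on C.
without_fd = {("a", "b", "c"), ("d", "b", "e")}

print(decompose_and_rejoin(with_fd) == with_fd)               # True
print(sorted(decompose_and_rejoin(without_fd) - without_fd))
# [('a', 'b', 'e'), ('d', 'b', 'c')]  spurious tuples not in R
```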

The Chase Test for Lossless Join

Not every decomposition has a lossless join.

Question: Given a general decomposition of R into relations with sets of attributes S_1, ..., S_k, can we recover R by taking the natural join of all these relations, i.e. is it true that π_{S_1}(R) ⋈ ... ⋈ π_{S_k}(R) = R?

We need to show:
1. Any tuple t in R is in π_{S_1}(R) ⋈ ... ⋈ π_{S_k}(R).
2. Any tuple t in π_{S_1}(R) ⋈ ... ⋈ π_{S_k}(R) is in R.

Answer:
1. Clear, as the projection of t onto S_i is in π_{S_i}(R) for each i, and therefore t is in the resulting relation of the join.
2. Use the chase test and consider the FDs that hold for R.

The Chase Test

Example: Suppose we have a relation R(A, B, C, D), which we have decomposed into relations with sets of attributes S_1 = {A, D}, S_2 = {A, C}, and S_3 = {B, C, D}. Then the so-called tableau for this decomposition is

    A    B    C    D
    a    b_1  c_1  d
    a    b_2  c    d_2
    a_3  b    c    d

- the first row corresponds to the set of attributes A and D; we use the unsubscripted letters a and d, and add the subscript 1 to b and c for the other attributes in order to indicate arbitrary values
- the row (a, b_1, c_1, d) represents a tuple of R that contributes to t = (a, b, c, d), but we know nothing about its values for the attributes B and C; the procedure is similar for the other rows
- we chase the tableau by applying the FDs to equate symbols in the tableau whenever we can; if one of the rows becomes the same as t (all symbols are unsubscripted), then we have proved that any tuple t in the join of projections is actually a tuple of R
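Building the initial tableau is mechanical: row i gets the unsubscripted symbol for every attribute in S_i and a symbol subscripted with i otherwise. A minimal sketch, assuming single-letter attribute names and writing b_1 as "b1"; the function name `initial_tableau` is chosen here for illustration.

```python
def initial_tableau(attrs, decomposition):
    """One row per attribute set S_i of the decomposition."""
    rows = []
    for i, s in enumerate(decomposition, start=1):
        rows.append([a.lower() if a in s else f"{a.lower()}{i}" for a in attrs])
    return rows


for row in initial_tableau("ABCD", ["AD", "AC", "BCD"]):
    print(row)
# ['a', 'b1', 'c1', 'd']
# ['a', 'b2', 'c', 'd2']
# ['a3', 'b', 'c', 'd']
```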

The Chase Test

Example: Suppose the given FDs are A -> B, B -> C, and CD -> A. The initial tableau is

    A    B    C    D
    a    b_1  c_1  d
    a    b_2  c    d_2
    a_3  b    c    d

- the first two rows agree on A; hence, they must also agree on B (A -> B)

    A    B    C    D
    a    b_1  c_1  d
    a    b_1  c    d_2
    a_3  b    c    d

- the first two rows now agree on B; hence, they must also agree on C (B -> C)

    A    B    C    D
    a    b_1  c    d
    a    b_1  c    d_2
    a_3  b    c    d

- the first and third rows agree on C and D; hence, they must agree on A (CD -> A)

    A    B    C    D
    a    b_1  c    d
    a    b_1  c    d_2
    a    b    c    d

- then the last row is equal to t = (a, b, c, d)
- hence, if R satisfies the given FDs and we project R onto {A, D}, {A, C}, and {B, C, D} and rejoin, then we only get tuples that have been in R

Why the Chase Works?

Questions:
1. When the chase results in a row that matches t, why must the join be lossless?
2. When, after applying the FDs, there is no row of all unsubscripted variables, why must the join not be lossless?

Proof:
1. The process itself provides the answer.
2. Think of the final tableau as an instance of the relation R. It satisfies the given FDs, because none of them can be applied any more. When we project this relation onto the S_i and take the natural join, we get the tuple t with only unsubscripted variables, which is not in R. Hence, the join is not lossless.
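The chase itself can be sketched as a loop that equates symbols until no FD applies. This is an illustrative sketch under stated assumptions: attribute names are single letters, b_1 is written "b1", and the function name `chase` and the tie-breaking rule for two subscripted symbols (keep the alphabetically smaller one) are choices made here, not taken from the slides.

```python
def chase(attrs, decomposition, fds):
    """Return True iff the decomposition has a lossless join under fds."""
    col = {a: i for i, a in enumerate(attrs)}
    rows = [[a.lower() if a in s else f"{a.lower()}{i}" for a in attrs]
            for i, s in enumerate(decomposition, start=1)]
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            for r in rows:
                for u in rows:
                    if not all(r[col[a]] == u[col[a]] for a in lhs):
                        continue
                    for b in rhs:  # r and u agree on lhs: equate rhs symbols
                        x, y = r[col[b]], u[col[b]]
                        if x != y:
                            # Prefer the unsubscripted symbol if there is one.
                            keep, drop = sorted(
                                (x, y), key=lambda s: (s != b.lower(), s))
                            for row in rows:
                                if row[col[b]] == drop:
                                    row[col[b]] = keep
                            changed = True
    # Lossless iff some row consists of unsubscripted symbols only.
    return [a.lower() for a in attrs] in rows


print(chase("ABCD", ["AD", "AC", "BCD"],
            [("A", "B"), ("B", "C"), ("CD", "A")]))      # True
# Decomposition {A,B}, {B,C}, {C,D} with the single FD B -> AD:
print(chase("ABCD", ["AB", "BC", "CD"], [("B", "AD")]))  # False
```

Each equating step reduces the number of distinct symbols in the tableau, so the loop terminates.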

Example: Why the Chase Works?

Consider R(A, B, C, D) with the FD B -> AD and the proposed decomposition {A, B}, {B, C}, and {C, D}.

    A    B    C    D
    a    b    c_1  d_1
    a_2  b    c    d_2
    a_3  b_3  c    d

Applying B -> AD, we deduce the final tableau

    A    B    C    D
    a    b    c_1  d_1
    a    b    c    d_1
    a_3  b_3  c    d

Hence, this decomposition does not have a lossless join; the resulting natural join has two more tuples, (a, b, c, d) and (a_3, b_3, c, d_1), that are not in R.

Summary

- functional dependencies (FDs) are useful for defining appropriate keys
- several rules about FDs, e.g. transitivity, closure of attributes
- anomalies: redundancy, update anomalies, deletion anomalies
- eliminate anomalies by decomposing into smaller relation schemas
- the resulting relations should represent exactly the data in the original relation (after joining them)
- for relations that are decomposed into Boyce-Codd Normal Form (BCNF), a lossless join is always possible
- use the chase test (for arbitrary decompositions) to find out whether a lossless join is possible