From Murach Chap. 9, second half. Schema Refinement and Normal Forms

Similar documents
Introduction to Data Management. Lecture #7 (Relational Design Theory)

Schema Refinement and Normal Forms

Lecture 5 Design Theory and Normalization

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 4 - Schema Normalization

Course Content. Database Design Theory. Objectives of Lecture 2 Database Design Theory. CMPUT 391: Database Design Theory

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2009 Lecture 3 - Schema Normalization

CSE 562 Database Systems

Review: Attribute closure

Informal Design Guidelines for Relational Databases

SCHEMA REFINEMENT AND NORMAL FORMS

Normalization. Murali Mani. What and Why Normalization? To remove potential redundancy in design

Database Management System Prof. Partha Pratim Das Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur

UNIT 3 DATABASE DESIGN

CS411 Database Systems. 05: Relational Schema Design Ch , except and

Elmasri/Navathe, Fundamentals of Database Systems, Fourth Edition Chapter 10-2

Chapter 10. Chapter Outline. Chapter Outline. Functional Dependencies and Normalization for Relational Databases

Chapter 8: Relational Database Design

COSC Dr. Ramon Lawrence. Emp Relation

Chapter 10. Normalization. Chapter Outline. Chapter Outline(contd.)

CS 2451 Database Systems: Database and Schema Design

Relational Design: Characteristics of Well-designed DB

Schema Refinement: Dependencies and Normal Forms

Unit 3 : Relational Database Design

Functional Dependencies and Normalization for Relational Databases Design & Analysis of Database Systems

How to design a database

Schema Refinement: Dependencies and Normal Forms

Schema Refinement: Dependencies and Normal Forms

Database Design Theory and Normalization. CS 377: Database Systems

Functional Dependencies and. Databases. 1 Informal Design Guidelines for Relational Databases. 4 General Normal Form Definitions (For Multiple Keys)

FUNCTIONAL DEPENDENCIES

Chapter 14. Database Design Theory: Introduction to Normalization Using Functional and Multivalued Dependencies

Functional Dependencies & Normalization for Relational DBs. Truong Tuan Anh CSE-HCMUT

Database Normalization

Lecture 6a Design Theory and Normalization 2/2

Lecture 9: Database Design. Wednesday, February 18, 2015

Functional Dependencies and Finding a Minimal Cover

Database Management System

This lecture. Databases -Normalization I. Repeating Data. Redundancy. This lecture introduces normal forms, decomposition and normalization.

Lecture #8 (Still More Relational Theory...!)

Normalization. VI. Normalization of Database Tables. Need for Normalization. Normalization Process. Review of Functional Dependence Concepts

Databases -Normalization I. (GF Royle, N Spadaccini ) Databases - Normalization I 1 / 24

The Relational Data Model

Relational Database design. Slides By: Shree Jaswal

Normalization is based on the concept of functional dependency. A functional dependency is a type of relationship between attributes.

Functional Dependencies CS 1270

IS 263 Database Concepts

Relational Database Design (II)

Relational Database Design Theory. Introduction to Databases CompSci 316 Fall 2017

Lecture #16 (Physical DB Design)

A Deeper Look at Data Modeling. Shan-Hung Wu & DataLab CS, NTHU

Database Systems: Design, Implementation, and Management Tenth Edition. Chapter 6 Normalization of Database Tables

Relational Design Theory. Relational Design Theory. Example. Example. A badly designed schema can result in several anomalies.

CS 338 Functional Dependencies

The Entity-Relationship Model. Overview of Database Design

Conceptual Design. The Entity-Relationship (ER) Model

Administrivia. Physical Database Design. Review: Optimization Strategies. Review: Query Optimization. Review: Database Design

MODULE: 3 FUNCTIONAL DEPENDENCIES

Chapter 7: Relational Database Design

Database Management Systems Paper Solution

Schema Refinement & Normalization Theory 2. Week 15

Mapping ER Diagrams to. Relations (Cont d) Mapping ER Diagrams to. Exercise. Relations. Mapping ER Diagrams to Relations (Cont d) Exercise

Lecture 11 - Chapter 8 Relational Database Design Part 1

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Functional dependency theory

Database Design Principles

Databases The theory of relational database design Lectures for m

Steps in normalisation. Steps in normalisation 7/15/2014

Functional Dependencies and Normalization

CSCI 403: Databases 13 - Functional Dependencies and Normalization

NORMAL FORMS. CS121: Relational Databases Fall 2017 Lecture 18

Lectures 12: Design Theory I. 1. Normal forms & functional dependencies 2/19/2018. Today s Lecture. What you will learn about in this section

Normalisation Chapter2 Contents

The Relational Model and Normalization

Relational Database Systems 1

Schema And Draw The Dependency Diagram

Typical relationship between entities is ((a,b),(c,d) ) is best represented by one table RS (a,b,c,d)

Introduction to Data Management. Lecture #17 (Physical DB Design!)

Design Theory for Relational Databases

Database Design and Tuning

V. Database Design CS448/ How to obtain a good relational database schema

More Normalization Algorithms. CS157A Chris Pollett Nov. 28, 2005.

Unit IV. S_id S_Name S_Address Subject_opted

6.830 Lecture PS1 Due Next Time (Tuesday!) Lab 1 Out end of week start early!

Database design III. Quiz time! Using FDs to detect anomalies. Decomposition. Decomposition. Boyce-Codd Normal Form 11/4/16

Database Tables and Normalization

CS 461: Database Systems. Final Review. Julia Stoyanovich

Part II: Using FD Theory to do Database Design

CS352 Lecture - Conceptual Relational Database Design

Physical Database Design and Tuning. Review - Normal Forms. Review: Normal Forms. Introduction. Understanding the Workload. Creating an ISUD Chart

Relational Database Systems 1. Christoph Lofi Simon Barthel Institut für Informationssysteme Technische Universität Braunschweig

Friday Nights with Databases!

Applied Databases. Sebastian Maneth. Lecture 5 ER Model, Normal Forms. University of Edinburgh - January 30 th, 2017

CS352 Lecture - Conceptual Relational Database Design

customer = (customer_id, _ customer_name, customer_street,

Administrivia. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Relational Model: Keys. Correction: Implied FDs

Introduction to Data Management. Lecture #6 E-Rà Relational Mapping (Cont.)

Chapter 6: Relational Database Design

We shall represent a relation as a table with columns and rows. Each column of the table has a name, or attribute. Each row is called a tuple.

Relational Design Theory

SQL Workshop. Database Design. Doug Shook

Transcription:

From Murach Chap. 9, second half The need for normalization A table that contains repeating columns Schema Refinement and Normal Forms A table that contains redundant data (same values repeated over and over) CS430/630 Lecture 19 Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke Slide 2 The accounts payable system in third normal form vendors line_item_description default_ general_ledger_accounts default_ account_description _description _due_days This is our goal: third normal form for the tables The first three normal forms First (1NF) The value stored at the intersection of each row and column must be a scalar value, and a table must not contain any repeating columns. Second (2NF) Every non-key column must depend on the entire primary key. Third (3NF) Every non-key column must depend only on the primary key. Most designers stop at the third normal form We need to work on what depends on really means here. Also the 3NF definition assumes only one key in the table, see R&G, pg. 617 for complete definition that covers cases of multiple keys for a table. Slide 3 Slide 4 The next four normal forms Boyce-Codd (BCNF) A non-key column can t be dependent on another non-key column. Fourth (4NF) A table must not have more than one multivalued dependency, where the primary key has a one-to-many relationship to non-key columns. Fifth (5NF) The data structure is split into smaller and smaller tables until all redundancy has been eliminated. Domain-key (DKNF) or sixth (6NF) Every constraint on the relationship is dependent only on key constraints and domain constraints, where a domain is the set of allowable values for a column. Of these, we ll cover only BCNF. The above definition of BCNF isn't clear see R&G, pg 616. Note that BCNF is the same as 3NF for tables with one key. The benefits of normalization More tables, and each table with an index for its primary key. That makes data retrieval more efficient. Each table contains information about a single entity. That makes data retrieval and insert, update, and delete operations more efficient. Data redundancy is minimized, which simplifies maintenance and reduces storage. In cases of huge tables (data warehousing) the benefits of normalization may be outweighed by need to keep related data nearby for fast access. Note that nosql databases like MongoDB usually keep related data together in "documents", i.e., use denormalized models. See MongoDB doc. Slide 5 Slide 6

Normalization examples The invoice data with a column that contains repeating values The invoice data in first normal form The invoice data with repeating columns Only one data item in each cell, and no repeating columns These examples are not even first normal form First (1NF) The value stored at the intersection of each row and column must be a scalar value, and a table must not contain any repeating columns. Slide 7 Slide 8 The invoice data in second normal form Every non-key column must depend on the entire primary key. The first normal form with keys added The s are wrong here. The first 3 rows are invoice 1, then the next 2 rows are invoice 2, and the last row is invoice 3. Each value should have a particular value. Fix this on pg. 307. Why this isn't in second normal form: The key is (, ), but depends only on, part of the key. Slide 9 Top table: PK:, indivisible, so no way to depend on part of PK, thus 2NF. Bottom table: PK = (, ). The only other column depends on whole PK, so 2NF. For 2NF, couldn t have here because it depends only on, not whole PK This has gotten rid of a lot of redundancy, but some remains, like the repetition of the vendor name Slide 10 Y depends on X idea We will formalize this idea soon If Y depends on X, then each X value is associated with precisely one Y value in the table. i.e., if 2 rows agree on X, they agree on Y Here: depends on 2 rows that agree on must agree on That makes sense: the invoice is talking about one vendor selling things. On the other hand, Item_description depends on (, ) OK: the item being described is different for each (each line item of the invoice) The A/P system in second normal form invoice_line_item_description item_quantity item_unit_price By this design each invoice from a vendor has,, creating lots of redundancy (need 3NF ) But it is 2NF. Again, we could break 2NF by putting over in, which has PK (, ), while depends only on. Slide 12

The A/P system in second normal form Why is it not third normal form? The accounts payable system in third normal form Every non-key column must depend only on the primary key. invoice_line_item_description item_quantity item_unit_price Third normal form: Every non-key column must depend only on the primary key, not some other column for example. (Here assuming the table has only one key, "the" primary key) Here depends on (assuming that s a good id), not on just, the table s PK. We clearly want a, a sure unique id, so we can say the depends on the, etc. vendors default_ default description _due_days line_item_description general_ledger_accounts account_description Now the vendor attributes are in the vendors table, where they belong. Also has its own table. Surely should have quantity. Slide 13 Slide 14 The accounts payable system in fifth normal form vendors line_item_qty vendor_area_code_id line_item_unit_price line_item_description_id default_ line_item_descriptions default_ line_item_description_id invoice_line_item_description zip_codes zip_code general_ledger_accounts city state account_description area_codes _description area_code_id _due_days area_code Don t worry about this! Why Schema Refinement? We have learned the advantages of relational tables but how to decide on the relational schema? At one extreme, store everything in single table Huge redundancy Leads to anomalies! We need to break the information into several tables How many tables, and with what structures? Having too many tables can also cause problems E.g., performance, difficulty in checking constraints Slide 15 Sample Relation Hourly_Emps (ssn, name, lot, rating, wage, hrs_worked) Denote relation schema by attribute initials: SNLRWH Constraints (functional dependencies, or FDs for short) ssn is the key: S SNLRWH rating determines wage: R W E.g., worker with rating 10 receives 20$/hr Here X Y means Y depends on X: if 2 rows agree on X, they also agree on Y, or equivalently, disagree on Y => disagree on X It s always true that key other cols Here also R W: agree on R means agree on W Anomalies Problems due to R W being in Hourly_Emps Update anomaly: Change value of W only in a tuple, end up with a dependency violation Insertion anomaly: How to insert employee if we don t know hourly wage for that rating? Deletion anomaly: If we delete all employees with rating 5, we lose the information about the wage for rating 5! S N L R W H 123-22-3666 Attishoo 48 8 10 40 231-31-5368 Smiley 22 8 10 30 131-24-3650 Smethurst 35 5 7 30 434-26-3751 Guldu 35 5 7 32 612-67-4134 Madayan 35 8 10 40

Removing Anomalies Hourly_Emps2 Wages S N L R H R W 123-22-3666 Attishoo 48 8 40 8 10 231-31-5368 Smiley 22 8 30 131-24-3650 Smethurst 35 5 30 5 7 434-26-3751 Guldu 35 5 32 612-67-4134 Madayan 35 8 40 Create 2 smaller tables! Updating rating of employee will result in the wage changing accordingly Note that there is no physical change of W, just a pointer change Deleting employee does not affect rating-wages data Finding another FD lying in a table From hw4, and createdb.sql: create table flights( flno int primary key, origin varchar(20) not null, destination varchar(20) not null, distance int, departs varchar(20), arrives varchar(20), price decimal(7,2)); What s distance? it s the distance between the origin and destination airports, so the FD: origin, destination distance lies in the table 20 We can check the data for FD consistency Maybe there s only one each? SQL> select origin, destination, min(distance), max(distance) from flights 2 group by origin, destination; ORIGIN DESTINATION MIN(DISTANCE) MAX(DISTANCE) -------------------- -------------------- ------------- ------------- Los Angeles Chicago 1749 1749 Detroit New York 470 470 Pittsburgh New York 303 303 Los Angeles Sydney 7487 7487 Chicago New York 802 802 Los Angeles Honolulu 2551 2551 The rest check out too: there s only one distance between two cities in the table. SQL> select origin, destination, count(*) from flights 2 group by origin, destination; ORIGIN DESTINATION COUNT(*) -------------------- -------------------- ---------- Los Angeles Chicago 1 Detroit New York 1 Pittsburgh New York 1 Los Angeles Sydney 1 Chicago New York 1 Los Angeles Honolulu 2 Madison Pittsburgh 1 Los Angeles Tokyo 1 Madison Detroit 1 The rest are 1s, so only LA to Honolulu tests the FD, but it passed. LA to Honolulu: 2 rows with the same origin, destination, same distance. Why do we care? Normalization in practice 23 Possible anomalies in this situation: Delete anomaly: Delete all flights from Boston to Ithaca End up losing distance information on this link Insert anomaly: Add a flight from Boston to Ithaca Need to check if the distance is consistent with other rows Update anomaly: Correct the distance: need to check for all the cases, or break the dependency. As a database professional, need to keep an eye on table designs and possibly point out potential problems, esp. early, before the group has invested a lot of development work in their design. 24 We can create another table to hold this FD information: create table links( origin varchar(20), destination varchar(20), distance int, primary key(origin,destination) ); create table flights( flno int primary key, origin varchar(20) not null, destination varchar(20) not null, departs varchar(20), arrives varchar(20), price decimal(7,2), foreign key (origin, destination) references links );

Dealing with Redundancy Redundancy is at the root of redundant storage, insert/delete/update anomalies Integrity constraints, in particular functional dependencies, can be used to help identify redundancy Main refinement technique: decomposition (replacing ABCD with, say, AB and BCD, or ACD and ABD) Decomposition should be used judiciously: Decomposition may sometimes affect performance. Why? What problems (if any) does decomposition cause? Incorrect data Loss of dependencies Functional Dependencies (FDs) A functional dependency X Y holds over relation R if for every instance r of R t1, t2 r, X (t1) = X (t2) implies Y (t1) = Y (t2) given two tuples in r, if the X values agree, Y values must also agree In R W example, all emps with rating 8 have wage 10, etc. FD is a statement about all allowable relations. Identified based on semantics of application (business logic) Given an instance r of R, we can check if it violates some FD f, but we cannot tell if f holds over R! FDs and Keys FDs are a generalization of keys A key uniquely identifies all attribute values in a tuple That is a particular case of FD but not all FDs must determine ALL attributes K is a key for R means that K R However, K R does not require K to be minimal! K can be a superkey as well Reasoning About FDs Given FD set F, we can usually infer additional FDs: F = closure of F is the set of all FDs that are implied by F Armstrong s Axioms (X, Y, Z are sets of attributes): Reflexivity: If Y X, then X Y Augmentation: If X Y, then XZ YZ for any Z Transitivity: If X Y and Y Z, then X Z These are sound and complete inference rules for FDs! Reasoning About FDs (cont d) Additional rules Not necessary, but helpful Can be proved from Armstrong s axioms Union and decomposition (splitting) X Y and X Z => X YZ X YZ => X Y and X Z An Example of FD Inference Pg. 613: a contract with id C is an agreement that a supplier S will supply Q items of part P to project J associated with department D: the value of this contract is V Contracts(cid, sid, jid, did, pid, qty, value), and: Contract id, supplier, project, department, part C is the key: C CSJDPQV Project purchases each part using single contract: JP C A certain project J and part P purchased determines a particular contract C Dept purchases at most one part from a supplier: SD P A certain department D and supplier determine a certain part P, if any

An Example of FD Inference Contracts(cid, sid, jid, did, pid, qty, value), and: Contract id, supplier, project, department, part C is the key: C CSJDPQV Project purchases each part using single contract: JP C Dept purchases at most one part from a supplier: SD P JP C, C CSJDPQV imply JP CSJDPQV SD P implies SDJ JP SDJ JP, JP CSJDPQV imply SDJ CSJDPQV So SDJ is a superkey for the table. Another Example of FD Inference Relation Hourly(ssn,rating,wage) Given S->R, R->W Then S-> W by transitivity S->R, S->W implies S->RW by union SS -> RWS by augmentation S -> RWS by definition of SS as union of attributes Thus S is a superkey, and indivisible, so a key And R->W is a FD lying in the table outside the key This non-key FD drives a decomposition, as we will see. Attribute Closure Attribute closure of X (denoted X + ) wrt FD set F: Set of all attributes A such that X A is in F + Set of all attributes that can be determined starting from attributes in X and using FDs in F Algorithm: X + = X Repeat Y=X + (remember starting X + value for this pass) Search all FDs in F with LHS completely included in X + Add RHS of each such FD to X + Until Y=X + (same as at start of this pass) Attribute Closure X + = X Repeat Y=X + (remember starting X + value) Find FDs in F with LHSs completely included in X + Add RHS of each such FD to X + Until Y=X + (same as at start of this pass) C is the key: C CSJDPQV (i.e., C -> R) Project purchases each part using single contract: JP Dept purchases at most one part from a supplier: SD Example: Compute JPD + JPD + = JPD, look for FDs with LHS in JPD, find JP -> C JPD+ = JPDC, look again for FDs with LHS in JPDC, find C-> R JPD + = CSJDPQV, all attributes, so done C P Attribute Closure : second example X + = X Repeat Y=X + (remember starting X + value) Find FDs in F with LHSs completely included in X + Add RHS of each such FD to X + Until Y=X + (same as at start of this pass) Relation Hourly(ssn,rating,wage) Given (1) S->R, (2) R->W S+ = S, look for FDs with LHS in S, find S-> R S+ = SR, look for FDs with LHS in SR, find R->W S+ = SRW = Relation, so S is a superkey, and a key R+ = R, look for FDs with LHS in R, find R->W R+ = RW, done W+ = W, done Verifying if a given FD is in FD-set closure Computing the closure of a set of FDs (F + ) can be expensive Size of closure is exponential in number of attributes! But if we just want to check if a given FD X Y is in the closure of a set of FDs F: Can be done efficiently without need to know F + Compute X wrt F Check if Y is in X Example: is JDP -> V in F+? Compute JDP + = JDPC by JP -> C JDP + = R by C -> R, so contains V Answer: Yes

Verifying if a given FD is in FD-set closure (another example) But if we just want to check if a given FD X Y is in the closure of a set of FDs F: Can be done efficiently without need to know F + Compute X wrt F Check if Y is in X Example:Relation Hourly(ssn,rating,wage) Given F (1) S->R, (2) R->W Is S->W in F+? Compute S + = SR by (1), = SRW by (2) We see W in S+ Answer: Yes Verifying if attribute set is a key Key verification can also be done with attribute closure To verify if X is a key, two conditions needed: X + = R X is minimal How to test minimality Removing an attribute from X results in X such that X + <> R Normal Forms: next time If a relation is in a certain normal form (BCNF, 3NF etc.), it is known that certain kinds of problems are avoided/minimized. Simple cases: Consider a relation R with attributes AB No FDs hold: There is no (provable) redundancy, so is BCNF and 3NF Given A B: Several tuples could have the same A value If so, they ll all have the same B value! But a relation is a set of rows, so this can t happen So there are no duplicates in A, and A must be a key Sure A -> B, so AA -> AB, AA = A, so A ->AB, key definition This will turn out to be BCNF and 3NF