CHAPTER 11 Data Normalization
CHAPTER OBJECTIVES
- How the relational model works
- How to build use-case models for predicting data usage
- How to construct entity-relationship diagrams to model your data
- How to build multi-table databases
- How joins are used to connect tables
- How to build a link table to model many-to-many relationships
- How to optimize your table design for later programming
DESIGNING A DATABASE
A relational database stores data in a set of tables, each of which stores several pieces of data about a single entity, such as books, authors, or customers. For example, the bookstore database might have tables for books, authors, customers, vendors, orders, and orderlines. The books and authors table schemas might be:

    books(isbn, title, authorid, yearwritten, price, numberonhand)
    authors(id, lastname, firstname, address, phone, email, website)

An example books record:

    ('978-0684803357', 'For Whom the Bell Tolls', 15, 1940, 19.80, 3)

A (fictional) example authors record:

    (15, 'Hemingway', 'Ernest', '123 Elm Street, Oak Park, Illinois', '800-555-1212', 'ernest@hemingway.com', 'www.hemingway.com')

The beauty of each table containing data about only one entity is that it eliminates redundancy. Each time we add a new book by Hemingway, we need to specify that he is the author, but we do not have to re-enter his address, phone, and so on, as we would if the author data were stored in the books table.
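As a minimal sketch, the two schemas above could be declared in MySQL roughly as follows. The column types and lengths, and the choice of isbn as the primary key of books, are assumptions; the text gives only the field names. The sample rows are the fictional records shown above.

```sql
-- Types, lengths, and the choice of isbn as key are assumptions;
-- only the field names come from the text.
CREATE TABLE authors (
  id        INT NOT NULL,
  lastname  VARCHAR(50),
  firstname VARCHAR(50),
  address   VARCHAR(100),
  phone     VARCHAR(20),
  email     VARCHAR(50),
  website   VARCHAR(50),
  PRIMARY KEY (id)
);

CREATE TABLE books (
  isbn         VARCHAR(20) NOT NULL,
  title        VARCHAR(100),
  authorid     INT,            -- refers to authors.id
  yearwritten  INT,
  price        DECIMAL(6,2),
  numberonhand INT,
  PRIMARY KEY (isbn),
  FOREIGN KEY (authorid) REFERENCES authors (id)
);

-- The author's details are entered once ...
INSERT INTO authors VALUES
  (15, 'Hemingway', 'Ernest', '123 Elm Street, Oak Park, Illinois',
   '800-555-1212', 'ernest@hemingway.com', 'www.hemingway.com');

-- ... and each book row carries only the author's id.
INSERT INTO books VALUES
  ('978-0684803357', 'For Whom the Bell Tolls', 15, 1940, 19.80, 3);
```

Adding a second Hemingway title would then need only one more books row with authorid 15.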
DESIGNING A DATABASE (CONT.)
However, with the data stored in different tables, if we want to query the database for a particular title, its author, and the author's phone number, we need a way to merge the data from the books and authors tables. We can do this in the SELECT statement by connecting each book to its author through the authorid field:

    SELECT title, firstname, lastname, phone
    FROM books, authors
    WHERE books.authorid = authors.id
      AND title = 'desiredtitle'

The books.authorid field, a foreign key, links each book to its author through the authors.id field, the primary key of the authors table. We say the tables have a relationship. Connecting two tables on a common field like this effectively extends each books record with its matching authors record, so the query can access fields in both tables. By similar means we can merge more than two tables.
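The same connection can also be written with the explicit JOIN ... ON syntax, which keeps the join condition apart from the row filter. This is a sketch that assumes the books and authors tables exist with the fields listed in the schemas above; the title literal is taken from the example record.

```sql
-- Equivalent to the comma-and-WHERE form: the ON clause holds the
-- foreign key / primary key match, the WHERE clause holds the filter.
SELECT title, firstname, lastname, phone
FROM books
JOIN authors ON books.authorid = authors.id
WHERE title = 'For Whom the Bell Tolls';
```

Both forms should return the same rows; the explicit form tends to be easier to read once several tables are joined.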
DEFINING RULES FOR A GOOD DATA DESIGN
Data developers have come up with a list of rules for creating well-behaved databases:
- Break your data into multiple tables.
- Make no field with a list of entries.
- Do not duplicate data.
- Make each table describe only one entity.
- Don't store information that should be calculated instead.
- Create a single primary key field for each table.
NORMALIZING YOUR DATA
The basic concept of normalization is to break down a database into a series of tables. If each of these tables is designed correctly, the database is less likely to have the sorts of problems described so far.
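FIRST NORMAL FORM: ELIMINATE LISTED FIELDS
A table is in first normal form (1NF) if and only if it represents a relation: it does not allow nulls or duplicate rows, and no field holds a list of values. To reach 1NF, eliminate listed fields, for example a specialty field that holds several specialties in one record.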
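One common way to eliminate a listed field is to move the list items into their own table and connect them through a link table. The sketch below is illustrative only: the specialty and agent_specialty table names, their fields, and the sample values are assumptions, not taken from the chapter's script.

```sql
-- Not in 1NF (one field holds a list):
--   agent(agentid, name, specialty)
--   (1, 'Bond', 'sabotage, electronics')

-- Hypothetical fix: one row per specialty ...
CREATE TABLE specialty (
  specialtyid INT NOT NULL AUTO_INCREMENT,
  name        VARCHAR(50),
  PRIMARY KEY (specialtyid)
);

-- ... and a link table with one row per (agent, specialty) pair.
CREATE TABLE agent_specialty (
  agent_specialtyid INT NOT NULL AUTO_INCREMENT,
  agentid           INT NOT NULL,
  specialtyid       INT NOT NULL,
  PRIMARY KEY (agent_specialtyid)
);

INSERT INTO specialty (name) VALUES ('sabotage'), ('electronics');

-- Assumes the two specialties above received ids 1 and 2,
-- and that the agent in question has agentid 1.
INSERT INTO agent_specialty (agentid, specialtyid) VALUES (1, 1), (1, 2);

-- List the specialties of agent 1.
SELECT s.name
FROM agent_specialty AS l
JOIN specialty AS s ON l.specialtyid = s.specialtyid
WHERE l.agentid = 1;
```

Because each pair is its own row, an agent can have any number of specialties and a specialty can belong to any number of agents, which is the many-to-many relationship named in the chapter objectives.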
SECOND NORMAL FORM: ELIMINATE REDUNDANCIES
A table is in second normal form (2NF) only if it is in 1NF and all nonkey fields are dependent entirely on the candidate key, not just part of it. The next step is to deal with all the potential redundancy issues. These mainly occur because data is entered more than one time. To fix this, you need to build new tables. The agent table could be further improved by moving all data about operations to another table.
THIRD NORMAL FORM: ENSURE FUNCTIONAL DEPENDENCY
A table is in third normal form (3NF) if it is in 2NF and has no transitive dependencies on the candidate key. For a table to be in 3NF, it must have a single primary key, and every field in the table must relate only to that key. For example, the description field is a description of the operation, not the agent, so it belongs in the operation table. In the third phase of normalization, you look through each piece of table data and ensure that it directly relates to the table in which it's placed. If not, either move it to a more appropriate table or build a new table for it.
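The move of operation data out of the agent table can be sketched as follows. Everything here is illustrative: agent_flat is a hypothetical unnormalized starting point, and the field names, types, and sample rows are assumptions made for the example.

```sql
-- Hypothetical unnormalized table: the operation's name and description
-- are repeated on every agent assigned to it.
CREATE TABLE agent_flat (
  agentid               INT NOT NULL,
  name                  VARCHAR(50),
  operation_name        VARCHAR(50),
  operation_description VARCHAR(100),
  PRIMARY KEY (agentid)
);

INSERT INTO agent_flat VALUES
  (1, 'Bond',   'Operation A', 'Description of operation A'),
  (2, 'Falcon', 'Operation A', 'Description of operation A'),
  (3, 'Cardinal', 'Operation B', 'Description of operation B');

-- The description depends on the operation, not the agent,
-- so operation data gets its own table with its own key.
CREATE TABLE operation (
  operationid INT NOT NULL AUTO_INCREMENT,
  name        VARCHAR(50),
  description VARCHAR(100),
  PRIMARY KEY (operationid)
);

-- Each distinct operation is stored exactly once.
INSERT INTO operation (name, description)
SELECT DISTINCT operation_name, operation_description
FROM agent_flat;

-- An agent record now needs only the key of its operation.
SELECT f.agentid, f.name, o.operationid
FROM agent_flat AS f
JOIN operation AS o ON o.name = f.operation_name;
```

After the split, correcting an operation's description means changing one row in operation instead of every agent row that mentions it.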
BUILDING YOUR DATA TABLES
After designing the data according to the rules of normalization, you are ready to build sample data tables in SQL. It pays to build your tables carefully to avoid problems.
Tip: Build all your tables in an SQL script so you can easily rebuild your database if your programs mess up the data structure, and add plenty of sample data in the script.
SQL SCRIPT: HOUSEKEEPING
The code specifies the database and deletes all the tables if they already exist. This ensures that the script starts with a fresh version of the data. It is also ideal for testing, since you can begin each test with a database in a known state.
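The housekeeping section of such a script might look like the sketch below. The database name spy is a placeholder assumption, and only the agent and operation tables mentioned in this chapter are listed; the actual script may drop more tables.

```sql
-- 'spy' is a placeholder name; the database must already exist.
USE spy;

-- Drop the referencing table (agent) before the table it refers to
-- (operation), so a foreign key never points at a missing table.
DROP TABLE IF EXISTS agent;
DROP TABLE IF EXISTS operation;
```

IF EXISTS keeps the script from failing on its first run, when there is nothing to drop yet.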
SQL SCRIPT: CREATING THE AGENT TABLE
Recall that the first field in a table is usually the primary key. Primary keys must be unique, and each record must have one.
I named each primary key according to a special convention: primary key names always begin with the table name and end with ID. I added this convention because it makes things easier when I write programs to work with this data.
The NOT NULL modifier requires you to put a value in the field. In practice, this ensures that all records of this table have a primary key.
The AUTO_INCREMENT attribute is a special tool that allows MySQL to pick a new value for this field if no value is specified. Because MySQL picks a value that has not been used, the generated keys are unique. You can still supply an explicit value for an AUTO_INCREMENT field, but it is rarely necessary: passing null, or leaving the field out of the INSERT, lets MySQL choose.
I added an indicator at the end of the CREATE TABLE statement to indicate that agentid is the primary key of the agent table.
The FOREIGN KEY reference indicates that the operationid field acts as a reference to the operation table. Some databases use this information to enforce relationships. Even if the database does not use this information, it can be useful documentation of the purpose of the field.
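A CREATE TABLE statement using the modifiers described above might look like this sketch. The field types and lengths, and the fields of the operation table, are assumptions; the agent field order (agentid, name, operationid, birthday) follows the INSERT statement the chapter uses for the agent table. The actual statement in buildspy.sql may differ.

```sql
-- The referenced table must exist before agent can point at it.
-- Its fields here are assumed for illustration.
CREATE TABLE operation (
  operationid INT NOT NULL AUTO_INCREMENT,
  name        VARCHAR(50),
  description VARCHAR(100),
  PRIMARY KEY (operationid)
);

CREATE TABLE agent (
  agentid     INT NOT NULL AUTO_INCREMENT,  -- required; MySQL fills it in when none is given
  name        VARCHAR(50),
  operationid INT,                           -- which operation the agent is assigned to
  birthday    DATE,
  PRIMARY KEY (agentid),                     -- the indicator naming the primary key
  FOREIGN KEY (operationid) REFERENCES operation (operationid)
);
```

Whether the FOREIGN KEY clause is enforced depends on the storage engine; InnoDB enforces it, while some other engines accept the clause and ignore it.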
INSERTING A VALUE INTO THE AGENT TABLE
The INSERT statements for the agent table have one new trick made possible by the primary key's AUTO_INCREMENT designation:

    INSERT INTO agent VALUES( null, 'Bond', 1, '1961-08-30' );

The primary key is initialized with the value null. This might be surprising, because primary keys are explicitly designed to never contain a null value. Since the agentid field is set to AUTO_INCREMENT, the null value is automatically replaced with an unused integer.
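To watch the replacement happen without touching the real data, the behavior can be tried on a scratch table. In this sketch agent_demo is a hypothetical temporary table with assumed field types, and the second agent's row is invented sample data.

```sql
-- Scratch copy of the agent layout; temporary tables vanish when the session ends.
CREATE TEMPORARY TABLE agent_demo (
  agentid     INT NOT NULL AUTO_INCREMENT,
  name        VARCHAR(50),
  operationid INT,
  birthday    DATE,
  PRIMARY KEY (agentid)
);

-- null in the key position asks MySQL to generate the next id.
INSERT INTO agent_demo VALUES (null, 'Bond', 1, '1961-08-30');

-- Naming the fields and leaving agentid out has the same effect.
INSERT INTO agent_demo (name, operationid, birthday)
VALUES ('Falcon', 1, '1970-01-15');

-- The id generated by the most recent INSERT in this session.
SELECT LAST_INSERT_ID();

-- Every row has an integer key; none of them is null.
SELECT agentid, name FROM agent_demo;
```

Naming the fields explicitly has the added benefit that the INSERT keeps working if fields are later added to the table.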
INTRODUCING SQL FUNCTIONS
SQL has a number of built-in functions that allow you to manipulate the data in various ways.
CONVERTING NUMBER OF DAYS TO A DATE
Most of the standard math operations work in SQL, but there's a better way. You can convert the number of days back to a date with the FROM_DAYS() function, as in Table 11.9:

    SELECT name,
           NOW(),
           birthday,
           DATEDIFF(NOW(), birthday) AS daysold,
           FROM_DAYS(DATEDIFF(NOW(), birthday))
    FROM agent;
CONCATENATING TO BUILD THE AGE FIELD
Use CONCAT() to combine the year and month values back into one field:

    SELECT name,
           birthday,
           CONCAT(
             YEAR(FROM_DAYS(DATEDIFF(NOW(), birthday))), ' years, ',
             MONTH(FROM_DAYS(DATEDIFF(NOW(), birthday))), ' months'
           ) AS age
    FROM agent;
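One caveat with the query above: MONTH() returns a calendar month number from 1 to 12 rather than a count of completed months, so the month figure can read higher than expected. As an alternative sketch (a different technique from the one in the chapter), MySQL's TIMESTAMPDIFF() counts whole years and months directly. The inline sample row reuses the Bond record from the INSERT example so the query runs without the agent table; in practice the FROM clause would simply name agent.

```sql
-- TIMESTAMPDIFF(unit, start, end) returns the number of whole units
-- between the two dates.
SELECT name,
       birthday,
       CONCAT(
         TIMESTAMPDIFF(YEAR, birthday, CURDATE()), ' years, ',
         TIMESTAMPDIFF(MONTH, birthday, CURDATE()) % 12, ' months'
       ) AS age
FROM (
  -- Stand-in for the agent table, for illustration only.
  SELECT 'Bond' AS name, DATE('1961-08-30') AS birthday
) AS agent_sample;
```

Taking the total months modulo 12 leaves the months elapsed since the last birthday.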
READING
Recommended reading: textbook example "Building a View," pages 407-419.
CODE EXAMPLES FOR THIS CHAPTER
The only file that we really need for this chapter is the buildspy.sql SQL script file. Since the ciswebs server will not serve a file with a .sql extension, a copy has been saved as buildspy.sql.txt. A few modifications were made to this file so that it also works on newer versions of MySQL. See the comments at the beginning of the file for more information.
ph11withmods.zip is a ZIP folder of the Chapter 11 examples, both original and modified versions.