ACS-2914 Normalization
March 2009
Ron McFadyen

NORMALIZATION

  Normalization
  De-normalization
  Functional Dependencies
  Generating functional dependency maps from database design maps
  Anomalies
  Partial Functional Dependencies
  Transitive Functional Dependencies
  1NF
  2NF
  3NF
  BCNF

Normalization

Normalization is concerned with the structure of relations in a relational database. There are several normal forms, of which 1NF, 2NF, 3NF and BCNF are the first and the most important for practical OLTP database design. OLTP databases are databases used in online transaction processing environments, which are used heavily by business. Transactions are typically the units of work that are the goals of system users; for instance, in a banking environment we would expect to find a deposit transaction, a withdrawal transaction, a transfer transaction, and a balance lookup transaction. A banking system could have thousands of users, and we expect transactions such as these to be executed efficiently. Generally speaking, normalized databases lead to the most efficient designs for these types of transactions.

1NF, 2NF, 3NF and BCNF are acronyms for first, second, third, and Boyce-Codd normal forms. There is a sequence to normal forms: 1NF is considered the weakest, 2NF is stronger than 1NF, 3NF is stronger than 2NF, and BCNF is considered the strongest of these four normal forms. Also, any relation that is in BCNF is in 3NF; any relation in 3NF is in 2NF; and any relation in 2NF is in 1NF. Sometimes this correspondence is shown as a nesting: BCNF ⊆ 3NF ⊆ 2NF ⊆ 1NF.

In these notes, as well as describing normal forms, we discuss two related processes: normalization and de-normalization. Normalization refers to a process that improves a database design by generating relations that are in higher normal forms. De-normalization is the opposite process: it combines relations in a higher normal form to produce relations in a lower normal form.

The objective of normalization is sometimes stated as: to create relations where every dependency is on the key, the whole key, and nothing but the key. A relation that is fully normalized is about a single concept such as a student entity set, a course entity set, and so on. We consider higher normal forms to be better.
The reason a relation in a higher normal form is better than one in a lower normal form is that update semantics for the affected data are simplified. This means that the applications required to maintain the database are simpler. In general, we consider fully normalized relations (BCNF) easier to maintain, but fully normalized relations require more work when retrieving data, because the information is spread over more relations. This means that retrieving information becomes more costly and more time-consuming.

To understand 2NF, 3NF, and BCNF we require a solid understanding of functional dependencies. In the following we discuss:

  Normalization
  De-normalization
  Functional Dependencies
  Anomalies
  Partial Dependencies
  Transitive Dependencies
  1NF
  2NF
  3NF
  BCNF

Normalization

Normalization is a process that changes a relation from a lower to a higher normal form. We decompose a relation into two or more relations in such a way as to preserve the original information and reduce redundancy of data. Reducing redundant data increases the number of relations, but makes the data easier to maintain.

De-normalization

De-normalization is a process that changes relations from higher to lower normal forms, and hence generates redundant data in the tuples of a relation. This is normally done for performance reasons, in order to reduce the cost of retrieving information from the database.

Functional Dependencies

To understand the normalization theory that applies to first, second, third and Boyce-Codd normal forms, we must understand what is meant by the term functional dependency. There is another type of dependency, called a multi-valued dependency, but it matters only for normal forms higher than those covered in these notes.

A functional dependency is an association between two attributes (or, more generally, between two sets of attributes). We say there is a functional dependency from attribute A to attribute B if and only if for each value of A there can be at most one value for B. We can express this by writing A functionally determines B, or B is functionally determined by A, or by a drawing such as:

  A -> B

When we have a functional dependency from A to B we refer to attribute A as the determinant.

We can demonstrate this property by example. Consider some attributes from a university environment. Universities assign each student a student number, and each student receives a number that is different from that assigned to any other student. Universities store information about their students such as first name, last name, and birth date. Let us assume that each student has exactly one first name, one last name, and one birth date. We can say, then, that there are three functional dependencies, which we depict below:

  student number -> first name
  student number -> last name
  student number -> birth date

Figure 1. Traditional drawing of functional dependencies

A drawing like the one above is also a concept map where every linking phrase from one concept to another is the phrase "functionally determines", as shown in Figure 2.

Figure 2. Functional dependencies as a concept map
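The definition above can be tested against sample data. The following is a minimal sketch, not part of the original notes: the function name fd_holds and the sample student records are hypothetical. Note that sample rows can only show that a functional dependency is violated; whether a dependency truly holds is a matter of business rules, not of the rows that happen to be stored.

```python
def fd_holds(rows, determinant, dependent):
    """Return True if, in these rows, each value of `determinant`
    is paired with at most one value of `dependent`."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in determinant)
        value = tuple(row[a] for a in dependent)
        # setdefault stores the first value seen for this key;
        # any later, different value contradicts the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical sample data (not from the notes).
students = [
    {"student_number": 1001, "first_name": "Ann", "last_name": "Lee", "birth_date": "1990-01-15"},
    {"student_number": 1002, "first_name": "Ann", "last_name": "Roy", "birth_date": "1991-07-02"},
]

print(fd_holds(students, ["student_number"], ["first_name"]))  # True
print(fd_holds(students, ["first_name"], ["last_name"]))       # False
```

The second call fails because the first name Ann appears with two different last names, so first name cannot be a determinant of last name.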

We shall refer to a concept map that depicts functional dependencies as a determinant map.

In earlier chapters we discussed concept maps and entity relationship modeling. In particular we have studied the use of the link phrases "is identified by" and "is described by". Consider a situation where we have the concepts student, student number, first name, last name, and birth date, and where a student is identified by student number and is described by first name, last name, and birth date. In this situation it must be that student number functionally determines first name, last name, and birth date. Note that it is possible to generate a determinant map from a database design map:

Figure 3. Design map transforms to a determinant map

So the concept of functional dependency is not really new to us; rather, we are just using a term to refer to a property derivable from a design map (concept map).

Generating functional dependency maps from database design maps

Given a design map, we can easily generate a determinant map. Let us assume at first that our design map is well-formed and does not have any composite attributes.

First, we distinguish entity concepts and attribute concepts. Recall that we considered these concept types when we discussed generating an entity-relationship diagram from a design map. Entity concepts are those concepts that either have no linking phrase leading to them, or have no linking phrase from the table (of Section 1.6) leading to them. Concepts that are not entity concepts are attribute concepts. We subdivide attribute concepts into two groups: identifying concepts and descriptive concepts. Identifying concepts are those concepts that have the link phrase "is identified by" leading to them; all other attribute concepts are descriptive concepts.

We also distinguish relationship links in the design map. Recall that these are the links where the linking phrase does not appear in the table of Section 1.6. Note that relationship links exist only between entity concepts in the design map.

Our determinant map will include only identifying and descriptive concepts. There are two transformation rules: one concerns attribute concepts, and the other concerns the entity concepts that are linked via a relationship link phrase.

1. For each entity concept:
   a) For each identifying concept that is not in the determinant map, place it in the map.
   b) For each descriptive concept that is not in the determinant map, place it in the map.
   c) For each identifying concept, draw a link with the phrase "functionally determines" to each descriptive attribute.
2. For each relationship link:
   a) Draw a link with the phrase "functionally determines" between the related identifying concepts, directed towards the attribute on the one side of the link.

Consider an example involving employees and departments. We have two entity concepts, employee and department, that are related through a relationship link "works in", which is interpreted to mean that an employee works in one department; i.e. the department concept is on the one side of the link. See Figure 4.

Figure 4. A design map

In Figure 5 we trace the steps of creating a determinant map from a design map:

  Rule 1 is applied twice, once for employee and once for department.
  Rule 2 is applied for the "works in" link.

Figure 5 shows three stages: the map after applying Rule 1 for employee, the map after applying Rule 1 for department, and finally the complete map after applying Rule 2.

Figure 5. Generating a determinant map
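The two transformation rules can be sketched in code. This is an illustrative sketch only: the dictionary layout, the function name determinant_map, and the attribute names (employee number, employee name, department number, department name) are assumptions, since the attributes shown in Figure 4 are not reproduced in this text.

```python
def determinant_map(entities, relationships):
    """Apply Rule 1 and Rule 2 and return a set of
    (determinant, dependent) links."""
    links = set()
    # Rule 1: each identifying concept functionally determines
    # each descriptive concept of the same entity.
    for concepts in entities.values():
        for ident in concepts["identified_by"]:
            for desc in concepts["described_by"]:
                links.add((ident, desc))
    # Rule 2: link the related identifying concepts, directed
    # towards the identifier on the one side of the relationship.
    for many_side, _phrase, one_side in relationships:
        for src in entities[many_side]["identified_by"]:
            for dst in entities[one_side]["identified_by"]:
                links.add((src, dst))
    return links


# Hypothetical design map for the employee/department example.
entities = {
    "employee": {"identified_by": ["employee number"], "described_by": ["employee name"]},
    "department": {"identified_by": ["department number"], "described_by": ["department name"]},
}
# (entity on the many side, link phrase, entity on the one side)
relationships = [("employee", "works in", "department")]

for src, dst in sorted(determinant_map(entities, relationships)):
    print(f"{src} functionally determines {dst}")
```

Rule 2 produces the link from employee number to department number, and not the reverse, because a department can have many employees.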

Anomalies

An anomaly is a variation that differs in some way from what is considered normal. With respect to maintaining a database, we consider what must occur when a database record is updated, inserted, or deleted. In database applications where these update, insert, and delete operations are common (e.g. OLTP databases), it is desirable for these operations to be as straightforward and as efficient as possible.

When relations are not fully normalized they exhibit update anomalies: the basic operations are not as efficient as possible, and some aspect of the relation will be awkward to maintain.

Usually, the design goal for an OLTP database is that it be easy to understand and to maintain. In particular, if the value of one attribute for an entity must be changed, then ideally that change requires only one record to be updated. If only one record changes, then the cost or time of performing the update is predictable and minimal.

Consider the relation structure and sample records:

  deptnum  coursenum  studnum  grade  studgpa
  92       101        3344     A      3.50
  92       115        7654     A      3.00
  81       101        7654     C      3.00
  92       226        3344     B      3.50

This relation is used for keeping track of the students enrolled in courses, the grade assigned to the student for the course, and (oddly) the student's overall grade point average.

What must happen if a student's gpa changes? We always want our databases to have correct information, and so the gpa must change in several records, not just one. We refer to this type of difficulty as an update anomaly: the simple change of a student's gpa affects not just one record, but potentially several records in the database. The update operation is more complex than necessary, and this means it is more expensive to do, resulting in slower performance.

In this case, which attributes constitute the primary key? The primary key is {deptnum, coursenum, studnum}.

Now we'll consider delete and insert anomalies. For these examples, assume that a student's gpa is stored only in this relation.

Suppose we happen to delete all rows relating to student 3344. What happens to the student's gpa information? We lose it! As you probably know, this design is poor; we should not mix concepts, storing student information with enrolment information. Because we assumed that the gpa is stored only in this relation, this is an example of a deletion anomaly.

Next, we consider an insertion anomaly.

Suppose we add a new student (and assume a new student's gpa is 0). How do we add this information (i.e. insert a new record) with the database structure we have? We can't! Before we can add a row to this relation, we need a department number and a course number too, because they are part of the primary key. As you can tell, we have made the management of data more difficult with this design. If a database were to exist with a table like this, the designers may have used a special course number (say course 0) to represent the situation we have just considered. That type of rule is something we do not recommend.

The previous discussion concerning anomalies highlights some of the data management issues that arise when a relation is not fully normalized. Another way of describing the general problem, as far as updating a database is concerned, is that redundant data makes it more complicated for us to keep the data consistent. In the example we have used, the gpa for a student is stored redundantly (repeatedly): the same value for the same student appears in several rows.
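The update and deletion anomalies can be seen directly on the sample records. The sketch below is only an illustration: the in-memory list of dictionaries stands in for the relation, the helper names are hypothetical, and the new gpa value 3.60 is invented for the example.

```python
enrolment = [
    {"deptnum": 92, "coursenum": 101, "studnum": 3344, "grade": "A", "studgpa": 3.50},
    {"deptnum": 92, "coursenum": 115, "studnum": 7654, "grade": "A", "studgpa": 3.00},
    {"deptnum": 81, "coursenum": 101, "studnum": 7654, "grade": "C", "studgpa": 3.00},
    {"deptnum": 92, "coursenum": 226, "studnum": 3344, "grade": "B", "studgpa": 3.50},
]


def update_gpa(rows, studnum, new_gpa):
    """Change the gpa in every row for the student; return rows touched."""
    touched = 0
    for row in rows:
        if row["studnum"] == studnum:
            row["studgpa"] = new_gpa
            touched += 1
    return touched


def delete_student_rows(rows, studnum):
    """Return the relation without any row for the given student."""
    return [row for row in rows if row["studnum"] != studnum]


# Update anomaly: one logical change touches more than one record.
touched = update_gpa(enrolment, 3344, 3.60)  # 3.60 is a made-up value
print(touched)  # 2

# Deletion anomaly: removing the enrolment rows also removes the gpa.
remaining = delete_student_rows(enrolment, 3344)
gpa_known = any(row["studnum"] == 3344 for row in remaining)
print(gpa_known)  # False
```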

Partial Functional Dependencies

Consider a relation with department number, department name, course number and course title attributes. The FDs are shown below.

  department number -> department name
  {department number, course number} -> course title

In order to identify any row in the corresponding relation, it is necessary to know both the department number and the course number; department number and course number form the PK.

Note the functional dependency of department name on department number. If two or more rows in the relation have the same value for department number, they must have the same value for department name. Observe that this redundancy is due to the FD of department name on department number.

Because department name is dependent on department number, it is also dependent on the combination of department number and course number. Because department name is dependent on only part of the PK, we call this a partial dependency.

In general, if we have the dependencies

  A -> C
  {A, B} -> D

where {A, B} is the PK, we say that C is partially dependent on {A, B}, i.e. on the PK.

Exercise: Consider a relation that describes a section of a course. Suppose the PK is {dept number, course number, section number} and we also have the attributes: room, time, instructor number, course title. What FDs would exist here? Is there a partial dependency?
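The test for a partial dependency described in this section can be sketched as a small check over a list of stated FDs. The function name partial_dependencies and the representation of an FD as a (determinant set, dependent attribute) pair are assumptions made for this illustration; the sketch inspects only the FDs it is given and does not derive implied ones.

```python
def partial_dependencies(primary_key, fds):
    """Return the FDs whose determinant is a proper subset of the PK
    and whose dependent attribute is a non-key attribute."""
    pk = frozenset(primary_key)
    found = []
    for determinant, dependent in fds:
        det = frozenset(determinant)
        # `det < pk` is a proper-subset test: part of the key, not all of it.
        if det < pk and dependent not in pk:
            found.append((sorted(det), dependent))
    return found


pk = {"department number", "course number"}
fds = [
    ({"department number"}, "department name"),
    ({"department number", "course number"}, "course title"),
]

print(partial_dependencies(pk, fds))
# [(['department number'], 'department name')]
```

Course title is not reported because its determinant is the whole PK, so it is fully dependent on the key.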

Transitive Functional Dependencies

Consider a relation that describes a couple of concepts, say an instructor and a department:

  Instructor number  Instructor name  Office  Department number  Department name
  33                 Joe              3D15    81                 B&A
  44                 Joe              3D16    92                 ACS
  45                 April            3D17    92                 ACS
  50                 Susan            3D17    92                 ACS
  21                 Peter            3D18    81                 B&A
  22                 Peter            3D18    32                 MATH

As instructor number is the primary key, we have the following FDs:

  instructor number -> instructor name
  instructor number -> office
  instructor number -> department number
  instructor number -> department name

Suppose we also have the FD: department number determines department name. Now our FD diagram becomes:

  instructor number -> instructor name
  instructor number -> office
  instructor number -> department number
  department number -> department name

and we say the FD from instructor number to department name is transitive via department number.

In general, if we have A determines B and B determines C, then we say that A transitively determines C.

  A -> B -> C

Note: In the above, B and C are non-key attributes. If the diagram above also had the functional dependency "B determines A" (and so A and B are both candidate keys) we would not say that A transitively determines C.

Exercise: Consider a relation that describes a section of a course. Suppose the PK is {dept number, course number, section number} and we also have the attributes: room, time, instructor number, instructor name. Suppose instructor name is the name of the instructor identified by instructor number. What FDs would exist here? Is there a transitive dependency?
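The rule "A determines B and B determines C, where B does not determine A" can be sketched with an attribute closure. This is an illustration under assumptions: the helper names closure and transitive_dependencies are hypothetical, FDs are written as (determinant set, dependent set) pairs, and the relation is assumed to have the single key passed in.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by `attrs`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def transitive_dependencies(key, fds):
    """List (A, B, C) where A -> B -> C, B is made of non-key
    attributes, and B does not determine A."""
    key = frozenset(key)
    reachable = closure(key, fds)
    found = []
    for lhs, rhs in fds:
        lhs = frozenset(lhs)
        if not lhs.isdisjoint(key):
            continue  # B must consist of non-key attributes
        if lhs <= reachable and not key <= closure(lhs, fds):
            for attr in sorted(set(rhs) - lhs - key):
                found.append((sorted(key), sorted(lhs), attr))
    return found


fds = [
    ({"instructor number"}, {"instructor name", "office", "department number"}),
    ({"department number"}, {"department name"}),
]
result = transitive_dependencies({"instructor number"}, fds)
print(result)
# [(['instructor number'], ['department number'], 'department name')]
```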

1NF

We say a relation is in 1NF if all values stored in the relation are single-valued and atomic. With this rule, we are simplifying the structure of a relation; we are simplifying the kinds of values that are stored in the relation.

In fact, some definitions of relation that you may encounter state or imply that for something to be a relation it must be in first normal form; first normal form is built into the definition of relation.

Consider the following EmployeeDegrees relation. Since the degrees attribute is a multi-valued attribute (degrees holds all the degrees that an employee has earned), the relation is not in 1NF.

EmployeeDegrees

  empno  name   salary  degrees
  111    Joe    29,000  BSc, MSc
  200    April  41,000  BA, MA
  205    Peter  33,000  BEng
  210    Joe    20,000

  empno is the PK
  each employee has one name and one salary
  each employee has zero or more degrees stored as one attribute value

Note that the two relations below are each in 1NF:

Employee

  empno  name   salary
  111    Joe    29,000
  200    April  41,000
  205    Peter  33,000
  210    Joe    20,000

  empno is the PK
  each employee has one name and one salary

Degree

  empno  degree
  111    BSc
  111    MSc
  200    BA
  200    MA
  205    BEng

  {empno, degree} is the PK
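A relation whose PK is {empno, degree} holds one degree per row. The sketch below shows one way to derive such rows from EmployeeDegrees; it assumes, purely for illustration, that the multi-valued degrees attribute is stored as a comma-separated string.

```python
employee_degrees = [
    {"empno": 111, "name": "Joe", "salary": 29000, "degrees": "BSc, MSc"},
    {"empno": 200, "name": "April", "salary": 41000, "degrees": "BA, MA"},
    {"empno": 205, "name": "Peter", "salary": 33000, "degrees": "BEng"},
    {"empno": 210, "name": "Joe", "salary": 20000, "degrees": ""},
]

# Employee keeps the single-valued attributes, one row per employee.
employee = [
    {"empno": r["empno"], "name": r["name"], "salary": r["salary"]}
    for r in employee_degrees
]

# Degree holds one atomic degree per row; an employee with no
# degrees contributes no rows.
degree = [
    {"empno": r["empno"], "degree": d.strip()}
    for r in employee_degrees
    for d in r["degrees"].split(",")
    if d.strip()
]

for row in degree:
    print(row["empno"], row["degree"])
```

Employee 210 appears in Employee but not in Degree, which is how "zero degrees" is represented once the multi-valued attribute is removed.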

Exercises:

1. Given the above EmployeeDegrees relation, what must be done to create 1NF relations?
2. What ERD would you create where you have employees with attributes employee number, name, salary, and degrees, where:
   the degrees attribute can have more than one value for a single employee
   employee number uniquely identifies an employee

2NF

2NF and 3NF both involve the concepts of key and non-key attributes. 2NF is where partial dependencies play an important role.

A key attribute is any attribute that is part of a key; any attribute that is not a key attribute is a non-key attribute.

Our first statement of 2NF is: A relation is in 2NF if it is in 1NF, and every non-key attribute is fully dependent on the primary key. We'll revisit our definition at the end of this section.

Consider the following relation and FDs. There are 2 key attributes and 2 non-key attributes. One of these non-key attributes, gpa, is dependent on stunum alone. In this case we have a partial dependency: gpa is partially dependent on the primary key. Note that a relation such as this will have redundant data: a student's gpa is repeated in every row for the student.

  stunum  course  grade  gpa
  111     2914    A      2.6
  113     2914    B      3.5
  113     3902    A      3.5

  {stunum, course} is the PK
  For each course a student receives a grade
  Each student has a gpa that is calculated using all the grades received by the student

When we have a relation such as the above, we can easily split the relation into two (in general, two or more) relations that will both be in 2NF, and where (importantly) we have not lost any information. Consider the following two relations, which can be joined on stunum to present the original information to an end-user. We say that we are decomposing the original relation losslessly into two other relations.

Student

  stunum  course  grade
  111     2914    A
  113     2914    B
  113     3902    A

StudentGPA

  stunum  gpa
  111     2.6
  113     3.5
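The claim that the decomposition loses no information can be checked on the sample rows by projecting the original relation onto the two new relations and joining them back on stunum. This is a minimal sketch; the helper names project and natural_join are hypothetical, and the check covers only this sample instance.

```python
original = [
    {"stunum": 111, "course": 2914, "grade": "A", "gpa": 2.6},
    {"stunum": 113, "course": 2914, "grade": "B", "gpa": 3.5},
    {"stunum": 113, "course": 3902, "grade": "A", "gpa": 3.5},
]


def project(rows, attrs):
    """Project rows onto attrs, dropping duplicate rows."""
    result = []
    for row in rows:
        projected = {a: row[a] for a in attrs}
        if projected not in result:
            result.append(projected)
    return result


def natural_join(left, right, on):
    """Join two relations on a single shared attribute."""
    return [{**l, **r} for l in left for r in right if l[on] == r[on]]


student = project(original, ["stunum", "course", "grade"])
student_gpa = project(original, ["stunum", "gpa"])
rejoined = natural_join(student, student_gpa, "stunum")

# Lossless: the join gives back exactly the original rows.
lossless = len(rejoined) == len(original) and all(r in original for r in rejoined)
print(len(student_gpa))  # 2: each gpa is now stored once
print(lossless)          # True
```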

When we recognize that a relation is not in 2NF, it is because of one or more partial dependencies. When we decompose to form 2NF relations, we remove partial dependencies and eliminate a source of redundant data. We have ensured that the non-key attributes describe the whole key.

To make our definition of 2NF more precise we relate full dependence to candidate keys: A relation is in 2NF if it is in 1NF, and every non-key attribute is fully dependent on each candidate key.

Exercise: Consider a relation that describes a section of a course. Suppose the PK is {dept number, course number, section number} and we also have the attributes: room, time, instructor number, course title. Suppose course title is the title of the course identified by dept number and course number. What FDs would exist here? Is there a partial dependency?

3NF

Third normal form involves the concepts of candidate key, non-key attribute and transitive dependency.

We say a relation is in 3NF if the relation is in 1NF and all determinants of non-key attributes are candidate keys.

A candidate key is a minimal collection of attributes that uniquely identifies tuples in the relation. Recall that we choose a PK from the collection of candidate keys.

For example, suppose we have an employee relation with empnum, empname, deptnum, deptname, and with the functional dependencies shown below. We are assuming each employee has one name and works in one department, and each department has one name.

  empnum -> empname
  empnum -> deptnum
  deptnum -> deptname

We shall assume that the relation is in 1NF. The only candidate key is empnum and so it is the primary key too. The relation satisfies the requirements for 2NF.

Is the relation in 3NF? No, it is not in 3NF because of the transitive dependency of deptname on empnum via deptnum; deptname is dependent on deptnum, and deptnum is not a candidate key. Note that any instance of this relation will have redundant data: the department name will be the same in every row that has the same department number.

To achieve 3NF, we decompose the relation: we replace the given relation by two or more other relations in such a way that each new relation is in 3NF and there is no loss of information. Consider the following decomposition:

  (empnum, empname, deptnum)
  (deptnum, deptname)

When we decompose a relation that is not in 3NF into 3NF relations, we are removing any unwanted transitive dependencies. We are ensuring that our relations only have non-key attributes that describe an entity represented by the primary key, and nothing but the primary key.
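The 3NF test as stated in these notes (every determinant of a non-key attribute is a candidate key) can be sketched as follows. The helper names closure, candidate_keys and third_nf_violations are hypothetical, FDs are written as (determinant set, dependent set) pairs, and candidate keys are found by brute force, which is only practical for small relations such as the examples here.

```python
from itertools import combinations


def closure(attrs, fds):
    """Return every attribute functionally determined by `attrs`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(attributes, fds):
    """Return the minimal attribute sets that determine all attributes."""
    attributes = frozenset(attributes)
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(sorted(attributes), size):
            combo = frozenset(combo)
            if any(key <= combo for key in keys):
                continue  # contains a smaller key, so not minimal
            if closure(combo, fds) >= attributes:
                keys.append(combo)
    return keys


def third_nf_violations(attributes, fds):
    """List (determinant, non-key attribute) pairs where the
    determinant is not a candidate key."""
    keys = candidate_keys(attributes, fds)
    key_attrs = set().union(*keys)
    violations = []
    for lhs, rhs in fds:
        lhs = frozenset(lhs)
        for attr in sorted(set(rhs) - lhs):
            if attr not in key_attrs and lhs not in keys:
                violations.append((sorted(lhs), attr))
    return violations


employee_fds = [({"empnum"}, {"empname", "deptnum"}), ({"deptnum"}, {"deptname"})]
before = third_nf_violations({"empnum", "empname", "deptnum", "deptname"}, employee_fds)
print(before)  # [(['deptnum'], 'deptname')]

# The two relations of the decomposition have no violations.
after_emp = third_nf_violations({"empnum", "empname", "deptnum"}, employee_fds[:1])
after_dept = third_nf_violations({"deptnum", "deptname"}, employee_fds[1:])
print(after_emp, after_dept)  # [] []
```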

Exercise: Consider a relation that describes a section of a course. Suppose the PK is {dept number, course number, section number} and we also have the attributes: room, time, instructor number, instructor name. Suppose instructor name is the name of the instructor identified by instructor number. What FDs would exist here? Is there a transitive dependency?

BCNF

BCNF can be defined very simply: a relation is in BCNF if it is in 1NF and every determinant is a candidate key.

First, notice the slight difference between BCNF and 3NF. 3NF limits us to considering only determinants of non-key attributes, whereas BCNF considers the determinants of all attributes. Note that BCNF is a stronger normal form: if a relation is in BCNF it must also be in 3NF.

Consider an example of a relation that is in 3NF, but not BCNF:

  {stunum, deptnum} -> datefirstvisit
  {stunum, deptnum} -> datelastvisit
  {stunum, deptnum} -> staffid
  staffid -> deptnum

The business rules leading to the FDs are:

  For each department in which a student is enrolled in courses, the student must choose an advisor who is a staff member (identified by staffid).
  For each combination of student and department there is one advisor (staff member), a date of first visit, and a date of the last visit of the student to the advisor.
  Each advisor is identified by their staff id, and each advisor works in exactly one department.
  A department will have many staff members/advisors.

From the above statements, you can deduce another set of FDs. Because staffid determines deptnum, the following FDs must exist too:

  {stunum, staffid} -> datefirstvisit
  {stunum, staffid} -> datelastvisit

There are two candidate keys:

  {stunum, deptnum}
  {stunum, staffid}

When you analyze for 2NF and 3NF, you focus on non-key attributes. These are datefirstvisit and datelastvisit; all other attributes are part of a candidate key. There are no partial dependencies and there are no transitive dependencies, and so this relation is in 2NF and 3NF.

For a relation to be in BCNF, all determinants must be candidate keys. Since staffid is a determinant but not a candidate key, the relation is not in BCNF. We must generate a decomposition to create BCNF relations where no information is lost.

The FD "staffid determines deptnum" is important here, as it indicates that we are mixing staff information and advising information in the same relation. Consider the following decomposition:

  (stunum, staffid, datefirstvisit, datelastvisit)
  (deptnum, staffid)

A query could join these two tables and produce the same information as seen in the original table.
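The BCNF test (every determinant is a candidate key) can be sketched for the advising relation as follows. The helper names closure, candidate_keys and bcnf_violations are hypothetical, FDs are written as (determinant set, dependent set) pairs, and candidate keys are found by brute force, which suits only small examples like this one.

```python
from itertools import combinations


def closure(attrs, fds):
    """Return every attribute functionally determined by `attrs`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(attributes, fds):
    """Return the minimal attribute sets that determine all attributes."""
    attributes = frozenset(attributes)
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(sorted(attributes), size):
            combo = frozenset(combo)
            if any(key <= combo for key in keys):
                continue  # contains a smaller key, so not minimal
            if closure(combo, fds) >= attributes:
                keys.append(combo)
    return keys


def bcnf_violations(attributes, fds):
    """List the stated FDs whose determinant is not a candidate key."""
    keys = candidate_keys(attributes, fds)
    violations = []
    for lhs, rhs in fds:
        lhs = frozenset(lhs)
        if set(rhs) - lhs and lhs not in keys:
            violations.append((sorted(lhs), sorted(set(rhs) - lhs)))
    return violations


attrs = {"stunum", "deptnum", "datefirstvisit", "datelastvisit", "staffid"}
fds = [
    ({"stunum", "deptnum"}, {"staffid", "datefirstvisit", "datelastvisit"}),
    ({"staffid"}, {"deptnum"}),
]

keys = [sorted(key) for key in candidate_keys(attrs, fds)]
print(keys)  # [['deptnum', 'stunum'], ['staffid', 'stunum']]

violations = bcnf_violations(attrs, fds)
print(violations)  # [(['staffid'], ['deptnum'])]

# The two relations of the decomposition pass the same test.
advising = bcnf_violations(
    {"stunum", "staffid", "datefirstvisit", "datelastvisit"},
    [({"stunum", "staffid"}, {"datefirstvisit", "datelastvisit"})],
)
staff = bcnf_violations({"deptnum", "staffid"}, [({"staffid"}, {"deptnum"})])
print(advising, staff)  # [] []
```

The brute-force search finds the same two candidate keys named in the text, and the only determinant that is not a candidate key is staffid.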