FIT1004 Database
Topic 6: Normalisation

Learning Objectives:
  Understand the purpose of normalisation
  Understand the problems associated with redundant data
  Identify various types of update anomalies such as insertion, deletion, and modification anomalies
  Recognise the appropriateness or quality of the design of relations
  Identify various types of functional dependencies between attributes
  Understand how functional dependencies can be used to group attributes into relations that are in a known normal form
  Identify the most commonly used normal forms, namely 1NF, 2NF and 3NF
  Perform normalisation
  Understand various ways to refine 3NF relations to achieve better database design
  Produce an ER diagram from the derived set of 3NF relations

References:
  Rob, P. & Coronel, C., Database Systems, 6th Edition, Chapter 5, pp. 182-221; 7th Edition, Chapter 5, pp. 147-174

www.infotech.monash.edu.au/fit1004/

Where are we?
  Introduction to Database Systems
  The Relational Model
  Database Lifecycle
  Conceptual Design
  Logical Design
  Normalisation
  Physical Design
  Implementation
  SQL (DML)
  SQL (DDL & DCL)
  Transaction Management
  Database Administration
  Data Warehousing & Data Mining

Normalisation
  Normalisation is a technique for producing a set of relations with desirable properties, given the data requirements of an enterprise
    > Developed by E.F. Codd (1972)
    > Often performed as a series of tests on a relation to determine whether it satisfies or violates the requirements of a given normal form
  The four most commonly used normal forms are:
    > First (1NF), Second (2NF), Third (3NF), which is normally a sufficient point, and Boyce-Codd (BCNF)
    > 4NF, etc. are required only by some very specialised applications
  Based on functional dependencies among the attributes of a relation
  A major aim of relational database design is to group attributes into relations so as to minimise data redundancy and reduce the file storage space required by base relations

Why Normalisation is required
  Note: * signifies Project Leader

Problems with the table in Figure 5.1
  PROJ_NUM is intended to be the primary key, but it contains nulls
  JOB_CLASS invites entry errors, e.g. Elec. Eng. vs Elect. Engineer vs E.E.
  The PROJECT relation has redundant data
    > the charge per hour is repeated for every occurrence of a job class
    > the employee name is repeated every time an employee is assigned to a project
  Relations that contain redundant information may suffer from update anomalies
  Types of update anomalies include:
    > Insertion: a new employee can be inserted only if they are assigned to a project
    > Deletion: deleting the last employee assigned to a project also removes the project's details; deleting the last employee of a particular job class also removes that job class and its hourly rate
    > Modification: updating a job class hourly rate requires updating multiple rows
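
The modification anomaly can be illustrated with a small sketch. The rows below are hypothetical sample values, not the contents of Figure 5.1; only the attribute names come from the slides.

```python
# Hypothetical rows of the flat (unnormalised) table; values are made up.
flat = [
    {"proj_num": 15, "emp_num": 103, "job_class": "Engineer", "chg_hour": 80.0},
    {"proj_num": 18, "emp_num": 103, "job_class": "Engineer", "chg_hour": 80.0},
    {"proj_num": 18, "emp_num": 105, "job_class": "Engineer", "chg_hour": 80.0},
    {"proj_num": 18, "emp_num": 101, "job_class": "Analyst", "chg_hour": 95.0},
]

# Every row of the job class stores its own copy of the hourly rate.
rows_to_touch = [row for row in flat if row["job_class"] == "Engineer"]

# A careless update changes only one of those copies...
rows_to_touch[0]["chg_hour"] = 85.0

# ...and the table now holds two different rates for the same job class.
rates = {row["chg_hour"] for row in flat if row["job_class"] == "Engineer"}
inconsistent = len(rates) > 1
```

Because the rate is stored once per assignment rather than once per job class, a correct update has to find and change every copy.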

Functional Dependence
  An attribute B is FUNCTIONALLY DEPENDENT on another attribute A if a value of A determines a single value of B at any one time
    A → B
    EMP# → EMP_NAME
    CUSTNUMB → CUSTNAME
    ORDER-NUMBER → ORDER-DATE
    > ORDER-NUMBER: independent variable, also known as the DETERMINANT
    > ORDER-DATE: dependent variable
  TOTAL DEPENDENCY: attribute A determines B AND attribute B determines A
    > EMPLOYEE-NUMBER ↔ TAX-FILE-NUMBER
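
The definition of functional dependence can be checked mechanically against sample data. The following is a minimal sketch; the function name `holds` and the sample rows are hypothetical and chosen only for illustration.

```python
def holds(rows, determinant, dependent):
    """Return True if no two rows agree on `determinant` but differ on `dependent`."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in determinant)
        value = tuple(row[a] for a in dependent)
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical sample rows; values are made up.
rows = [
    {"proj_num": 15, "emp_num": 103, "emp_name": "A. Lee"},
    {"proj_num": 15, "emp_num": 101, "emp_name": "B. Kim"},
    {"proj_num": 18, "emp_num": 103, "emp_name": "A. Lee"},
]

emp_determines_name = holds(rows, ["emp_num"], ["emp_name"])
proj_determines_emp = holds(rows, ["proj_num"], ["emp_num"])
```

A sample can only refute a dependency, never prove it: a functional dependency is a rule about all valid data, so it comes from the business rules rather than from whatever rows happen to be present.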

Functional Dependence continued
  FULL DEPENDENCY occurs when an attribute is dependent on the whole of a composite determinant (AT LEAST TWO attributes) and not on any subset of it
    ORDER-NUMBER, PART-NUMBER → QTY-ORDERED
    > lack of full dependence on a multiple attribute key = partial dependence
  TRANSITIVE DEPENDENCY occurs when Y depends on X, and Z depends on Y; thus Z also depends on X
    > X → Y → Z
      INVOICE-NUMB → CUSTOMER-NUMB → CUSTOMER-NAME
  Dependencies are depicted with the help of a Dependency Diagram
  NORMALISATION - SIMPLY 'COMMON SENSE'
    > Converts a table into tables of progressively smaller degree and cardinality until an optimum level of decomposition is reached, where little or no data redundancy exists
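
Transitive dependencies can be found by computing an attribute closure, a standard technique that is not named on the slide. The sketch below assumes dependencies are written as (determinant, dependent) pairs of attribute tuples; the lowercase attribute names and the function name `closure` are illustrative.

```python
def closure(attributes, fds):
    """Return every attribute determined, directly or transitively, by `attributes`."""
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply a dependency once its whole determinant is already reachable.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

fds = [
    (("invoice_numb",), ("customer_numb",)),
    (("customer_numb",), ("customer_name",)),
]

reachable = closure({"invoice_numb"}, fds)
```

Here `customer_name` ends up in the closure of `invoice_numb` only by way of `customer_numb`, which is the X → Y → Z pattern.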

First Normal Form
  Positive results from normalisation
    > amount of space needed to store data will be lower
    > table can be updated with greater efficiency
    > description of database will be straightforward
  Unnormalised form (UNF): raw data from a table/form/grid
    UNF: PROJECT (proj_num, proj_name (emp_num, emp_name, ...))
    Figure 5.1 consists of a set of projects, with each project having a set of project-employee details (model 1)
  FIRST NORMAL FORM (part of the formal definition of a relation)
  A TABLE IS IN FIRST NORMAL FORM (1NF) IF -
    > it is a valid table (in particular, no repeating groups)
    > a unique key has been identified for each row
    > all attributes are functionally dependent on all or part of the key
    1NF: PROJECT (proj_num, proj_name)
    1NF: ASSIGN (proj_num, emp_num, emp_name, job_class, chg_hour, assign_hours)

UNF to 1NF transformation
  Identify the repeating group(s), if any, in the unnormalised relation
  Move from UNF to 1NF by removing the repeating group along with the PK of the main relation
  Important property of a normalisation decomposition
    > Lossless-join property: enables us to find any instance of the original relation from the corresponding instances in the smaller relations, hence the PK of the main relation must be extracted with the repeating group
  Determine the PK of the new relations created
    > the extracted repeating group will normally have a composite PK that includes the main relation's PK
    > but NOT always; the PK of the main relation may simply act as a FK
      INSURED (comp_code, comp_name (insured_id, insured_name, ...))
      » COMPANY (comp_code, comp_name)
      » INSURED (insured_id, comp_code, insured_name, ...)
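
The UNF to 1NF step for the PROJECT example can be sketched as a flattening of nested records. The nested-dictionary layout, the function name `to_1nf`, and all data values are assumptions made for illustration.

```python
# Hypothetical unnormalised data: each project carries a repeating group of employees.
unf_projects = [
    {"proj_num": 15, "proj_name": "Alpha", "employees": [
        {"emp_num": 103, "emp_name": "A. Lee", "job_class": "Engineer",
         "chg_hour": 80.0, "assign_hours": 12.0},
        {"emp_num": 101, "emp_name": "B. Kim", "job_class": "Analyst",
         "chg_hour": 95.0, "assign_hours": 20.0},
    ]},
    {"proj_num": 18, "proj_name": "Beta", "employees": [
        {"emp_num": 103, "emp_name": "A. Lee", "job_class": "Engineer",
         "chg_hour": 80.0, "assign_hours": 7.0},
    ]},
]

def to_1nf(projects):
    """Split out the repeating group, carrying the main relation's PK with it."""
    project, assign = [], []
    for p in projects:
        project.append((p["proj_num"], p["proj_name"]))
        for e in p["employees"]:
            # proj_num travels with each extracted row, so the two relations can be rejoined.
            assign.append((p["proj_num"], e["emp_num"], e["emp_name"],
                           e["job_class"], e["chg_hour"], e["assign_hours"]))
    return project, assign

project, assign = to_1nf(unf_projects)
```

Copying `proj_num` into every ASSIGN row is what preserves the lossless-join property: without it there would be no way to tell which project an assignment belonged to.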

First Normal Form continued
  An alternative way (model 2) of looking at this scenario
    > Present the data in tabular format, where each cell has a single value and there are no repeating groups
    > Eliminate repeating groups and nulls by making sure that each repeating group attribute contains an appropriate data value

Model 2: Dependency Diagram (1NF)

1NF to 2NF
  A RELATION IS IN 2NF IF -
    all non key attributes are functionally dependent on the entire key
    > i.e. no partial dependencies exist
  Model 1: Move from 1NF to 2NF by removing partial dependencies
    1NF: PROJECT (proj_num, proj_name)
    1NF: ASSIGN (proj_num, emp_num, emp_name, job_class, chg_hour, assign_hours)
  1NF: PROJECT (proj_num, proj_name) is already in 2NF: there is only one attribute in the PK, thus there CANNOT be any partial dependencies
    > 2NF: PROJECT (proj_num, proj_name)
  1NF: ASSIGN (proj_num, emp_num, emp_name, job_class, chg_hour, assign_hours) becomes
    > 2NF: EMPLOYEE (emp_num, emp_name, job_class, chg_hour)
    > 2NF: ASSIGN (proj_num, emp_num, assign_hours)
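
Removing the partial dependencies amounts to projecting the 1NF ASSIGN relation onto two attribute sets and dropping duplicate rows. In this sketch the helper name `project` and the sample rows are hypothetical.

```python
def project(rows, attributes):
    """Relational projection: keep the named attributes and drop duplicate rows."""
    result = []
    for row in rows:
        reduced = {a: row[a] for a in attributes}
        if reduced not in result:
            result.append(reduced)
    return result

# Hypothetical 1NF ASSIGN rows; values are made up.
assign_1nf = [
    {"proj_num": 15, "emp_num": 103, "emp_name": "A. Lee",
     "job_class": "Engineer", "chg_hour": 80.0, "assign_hours": 12.0},
    {"proj_num": 15, "emp_num": 101, "emp_name": "B. Kim",
     "job_class": "Analyst", "chg_hour": 95.0, "assign_hours": 20.0},
    {"proj_num": 18, "emp_num": 103, "emp_name": "A. Lee",
     "job_class": "Engineer", "chg_hour": 80.0, "assign_hours": 7.0},
]

# Attributes that depend on emp_num alone move to EMPLOYEE.
employee_2nf = project(assign_1nf, ["emp_num", "emp_name", "job_class", "chg_hour"])
# Only assign_hours depends on the whole key (proj_num, emp_num).
assign_2nf = project(assign_1nf, ["proj_num", "emp_num", "assign_hours"])
```

Employee 103 appears twice in the 1NF rows but only once in `employee_2nf`, which is the redundancy the 2NF step removes.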

2NF Conversion Results (Model 1 & 2)
  Note: Model 1 & 2 are now equivalent

2NF to 3NF
  A RELATION IS IN 3NF IF -
    all transitive dependencies have been removed
    > check for a non key attribute dependent on another non key attribute
  Move from 2NF to 3NF by removing transitive dependencies
    2NF: PROJECT (proj_num, proj_name)
    2NF: EMPLOYEE (emp_num, emp_name, job_class, chg_hour)
    2NF: ASSIGN (proj_num, emp_num, assign_hours)
  PROJECT and ASSIGN are already in 3NF
    3NF: PROJECT (proj_num, proj_name)
    3NF: ASSIGN (proj_num, emp_num, assign_hours)
  2NF: EMPLOYEE (emp_num, emp_name, job_class, chg_hour) becomes
    3NF: EMPLOYEE (emp_num, emp_name, job_class)
    3NF: JOB (job_class, chg_hour)
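
The split of EMPLOYEE into EMPLOYEE and JOB can be sketched with plain tuples, together with a check that joining the two results on job_class gives back the original rows. The sample values are hypothetical.

```python
# Hypothetical 2NF EMPLOYEE rows: (emp_num, emp_name, job_class, chg_hour).
employee_2nf = [
    (103, "A. Lee", "Engineer", 80.0),
    (101, "B. Kim", "Analyst", 95.0),
    (105, "C. Roy", "Engineer", 80.0),
]

# chg_hour depends on job_class, a non key attribute, so it moves to JOB.
employee_3nf = sorted({(num, name, job) for num, name, job, _ in employee_2nf})
job_3nf = sorted({(job, chg) for _, _, job, chg in employee_2nf})

# Rejoin on job_class to confirm that the decomposition lost nothing.
rate_for = dict(job_3nf)
rejoined = sorted((num, name, job, rate_for[job]) for num, name, job in employee_3nf)
lossless = rejoined == sorted(employee_2nf)
```

Each hourly rate is now stored once in JOB, so a rate change touches a single row.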

3NF Conversion Results

Improving the Design
  To improve the design of the database, the following changes could be made:
    > PK assignment
    > Naming conventions
    > Attribute atomicity
    > Adding attributes
    > Adding relationships
    > Refining PKs
    > Maintaining historical accuracy
    > Using derived attributes

Improving the Design continued
  Returning to the table in Figure 5.1 (the "Why Normalisation is required" slide)
  Data loss: who is the project leader?
    > Modify PROJECT (R&C approach)
      3NF: PROJECT (proj_num, proj_name, emp_num)
    > Alternative: add emp_num at UNF
    > Do not use synonyms when naming attributes; always use the same name for the same attribute, e.g. do not rename emp_num in PROJECT to leader_num
  JOB (job_class, chg_hour)
    > job_class is a string, e.g. Systems Analyst
      Redundant data with its associated issues; poor PK
      Better to create a job code
    > Modify JOB (R&C approach)
      3NF: JOB (job_code, job_description, job_chg_hour)
    > Alternative: make the changes at UNF

Completed Database

Completed Database continued

Entire Process: UNF to 3NF
  UNF
    PROJECT (proj_num, proj_name, emp_num (emp_num, emp_name, job_code, job_description, job_chg_hour, assign_hours))
  1NF: remove repeating group and identify PK
    PROJECT (proj_num, proj_name, emp_num)
    ASSIGN (proj_num, emp_num, emp_name, job_code, job_description, job_chg_hour, assign_hours)
  2NF: remove partial dependencies
    PROJECT (proj_num, proj_name, emp_num)
    EMPLOYEE (emp_num, emp_name, job_code, job_description, job_chg_hour)
    ASSIGN (proj_num, emp_num, assign_hours)
  3NF: remove transitive dependencies
    PROJECT (proj_num, proj_name, emp_num)
    EMPLOYEE (emp_num, emp_name, job_code)
    ASSIGN (proj_num, emp_num, assign_hours)
    JOB (job_code, job_description, job_chg_hour)
  Note: R&C show some further 'suggested' improvements
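
One way to try out the final 3NF relations is to build them in an in-memory SQLite database from Python. This is a sketch only: the relation and attribute names come from the slide, but the column types, NOT NULL constraints, and sample values are assumptions.

```python
import sqlite3

# Column types and constraints are assumed; the slides give only attribute names.
schema = """
CREATE TABLE job (
    job_code        TEXT PRIMARY KEY,
    job_description TEXT NOT NULL,
    job_chg_hour    REAL NOT NULL
);
CREATE TABLE employee (
    emp_num  INTEGER PRIMARY KEY,
    emp_name TEXT NOT NULL,
    job_code TEXT NOT NULL REFERENCES job (job_code)
);
CREATE TABLE project (
    proj_num  INTEGER PRIMARY KEY,
    proj_name TEXT NOT NULL,
    emp_num   INTEGER REFERENCES employee (emp_num)  -- the project leader
);
CREATE TABLE assign (
    proj_num     INTEGER NOT NULL REFERENCES project (proj_num),
    emp_num      INTEGER NOT NULL REFERENCES employee (emp_num),
    assign_hours REAL NOT NULL,
    PRIMARY KEY (proj_num, emp_num)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript(schema)

# Hypothetical sample rows.
conn.execute("INSERT INTO job VALUES ('501', 'Engineer', 80.0)")
conn.execute("INSERT INTO employee VALUES (101, 'B. Kim', '501')")
conn.execute("INSERT INTO employee VALUES (103, 'A. Lee', '501')")

# A rate change is now a single-row update, with no modification anomaly.
conn.execute("UPDATE job SET job_chg_hour = 85.0 WHERE job_code = '501'")

rates = conn.execute(
    "SELECT e.emp_name, j.job_chg_hour "
    "FROM employee e JOIN job j ON j.job_code = e.job_code "
    "ORDER BY e.emp_num"
).fetchall()
```

An employee can also be inserted before being assigned to any project, which the flat table did not allow.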

Normalisation presented as a Conceptual ERD

Normalisation presented as a Logical ERD

Normalisation and Database Design
  Normalisation should be part of the design process
  Make sure that the proposed entities meet the required normal form before the table structures are created
  ER diagram
    > Provides the big picture, or macro view, of an organization's data requirements and operations
    > Created through an iterative process
      » Identify the relevant entities, their attributes and their relationships
      » Use the results to identify additional entities and attributes
  Normalisation procedures
    > Focus on the characteristics of specific entities
    > A micro view of the entities within the ER diagram
  It is difficult to separate the normalisation process from the ER modelling process
    > The two techniques should be used concurrently

Normalisation and ER Diagrams
  ER Diagramming           Normalisation
  Top down approach        Bottom up approach
  Fast                     Very slow
  Examine requirements     Examine existing data
  Business knowledge       Mathematically based

  Top down create, bottom up checking
    > Accuracy
    > Greater understanding of the data

Summary
  This lecture
    > Understand the purpose of normalisation
    > Understand the problems associated with redundant data
    > Identify various types of update anomalies such as insertion, deletion, and modification anomalies
    > Recognise the appropriateness or quality of the design of relations
    > Identify various types of functional dependencies between attributes
    > Understand how functional dependencies can be used to group attributes into relations that are in a known normal form
    > Identify the most commonly used normal forms, namely 1NF, 2NF and 3NF
    > Perform normalisation
    > Understand various ways to refine 3NF relations to achieve better database design
    > Produce an ER diagram from the derived set of 3NF relations
  Next lecture
    > Structured Query Language (SQL) - DML