Database Normalization: A Comprehensive Guide to Minimizing Data Redundancy, Study notes of Database Management Systems (DBMS)

Normalization Process in 1NF, 2NF and 3NF

Typology: Study notes

2017/2018

Uploaded on 04/13/2018

jeremy-kabaya
jeremy-kabaya 🇰🇪

5

(1)

4 documents

1 / 6

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
NORMALIZATION
Normalization is a process for evaluating and correcting table
structures to minimize data redundancies, thereby reducing the
likelihood of data anomalies. The normalization process involves
assigning attributes to tables based on the concept of determination.
Normalization works through a series of stages called normal forms.
The rst three stages are described as rst normal form (1NF), second
normal form (2NF), and third normal form (3NF). From a structural point
of view, 2NF is better than 1NF, and 3NF is better than 2NF. For most
purposes in business database design, 3NF is as high as you need to
go in the normalization process.
THE NEED FOR NORMALIZATION
To get a better idea of the normalization process, consider the
simplied database activities of a construction company that manages
several building projects. Each project has its own project number,
name, employees assigned to it, and so on. Each employee has an
employee number, name, and job classication, such as engineer or
computer technician.
The company charges its clients by billing the hours spent on each
contract. The hourly billing rate is dependent on the employee’s
position. For example, one hour of computer technician time is billed at
a dierent rate than one hour of engineer time.
FIGURE 5.1 Tabular representation of the report format
Note that the data in Figure 5.1 reects the assignment of employees
to projects. Apparently, an employee can be assigned to more than one
project. For example, Darlene Smithson (EMP_NUM = 112) has been
assigned to two projects: Amber Wave and Star ight. Given the
structure of the data set, each project includes only a single
occurrence of any one employee.
Therefore, knowing the PROJ_NUM and EMP_NUM value will let you nd
the job classication and its hourly charge. In addition, you will know
the total number of hours each employee worked on each project.
Unfortunately, the structure of the data set in Figure 5.1 does not
conform to the requirements discussed in Chapter 3, nor does it handle
data very well. Consider the following deciencies:
1. The project number (PROJ_NUM) is apparently intended to be a
primary key or at least a part of a PK, but it contains nulls. (Given the
preceding discussion, you know that PROJ_NUM + EMP_NUM will dene
each row.)
pf3
pf4
pf5

Partial preview of the text

Download Database Normalization: A Comprehensive Guide to Minimizing Data Redundancy and more Study notes Database Management Systems (DBMS) in PDF only on Docsity!

NORMALIZATION

Normalization is a process for evaluating and correcting table structures to minimize data redundancies, thereby reducing the likelihood of data anomalies. The normalization process involves assigning attributes to tables based on the concept of determination. Normalization works through a series of stages called normal forms. The first three stages are described as first normal form (1NF), second normal form (2NF), and third normal form (3NF). From a structural point of view, 2NF is better than 1NF, and 3NF is better than 2NF. For most purposes in business database design, 3NF is as high as you need to go in the normalization process. THE NEED FOR NORMALIZATION To get a better idea of the normalization process, consider the simplified database activities of a construction company that manages several building projects. Each project has its own project number, name, employees assigned to it, and so on. Each employee has an employee number, name, and job classification, such as engineer or computer technician. The company charges its clients by billing the hours spent on each contract. The hourly billing rate is dependent on the employee’s position. For example, one hour of computer technician time is billed at a different rate than one hour of engineer time.

FIGURE 5.1 Tabular representation of the report format Note that the data in Figure 5.1 reflects the assignment of employees to projects. Apparently, an employee can be assigned to more than one project. For example, Darlene Smithson (EMP_NUM = 112) has been assigned to two projects: Amber Wave and Star flight. Given the structure of the data set, each project includes only a single occurrence of any one employee. Therefore, knowing the PROJ_NUM and EMP_NUM value will let you find the job classification and its hourly charge. In addition, you will know the total number of hours each employee worked on each project.

Unfortunately, the structure of the data set in Figure 5.1 does not conform to the requirements discussed in Chapter 3, nor does it handle data very well. Consider the following deficiencies:

  1. The project number (PROJ_NUM) is apparently intended to be a primary key or at least a part of a PK, but it contains nulls. (Given the preceding discussion, you know that PROJ_NUM + EMP_NUM will define each row.)
  1. The table entries invite data inconsistencies. For example, the JOB_CLASS value “Elect. Engineer” might be entered as “Elect.Eng.” in some cases, “El. Eng.” in others, and “EE” in still others.
  2. The table displays data redundancies. Those data redundancies yield the following anomalies: a. Update anomalies. Modifying the JOB_CLASS for employee number 105 requires (potentially) many alterations, one for each EMP_NUM = 105. b. Insertion anomalies. Just to complete a row definition, a new employee must be assigned to a project. If the employee is not yet assigned, a phantom project must be created to complete the employee data entry. c. Deletion anomalies. Suppose that only one employee is associated with a given project. If that employee leaves the company and the employee data are deleted, the project information will also be deleted. To prevent the loss of the project information, a fictitious employee must be created just to save the project information.

In spite of those structural deficiencies, the table structure appears to work; the report is generated with ease. Unfortunately, the report might yield varying results depending on what data anomaly has occurred. For example, if you want to print a report to show the total “hours worked” value by the job classification “Database Designer,” that report will not include data for “DB Design” and “Database Design” data entries. Such reporting anomalies cause a multitude of problems for managers—and cannot be fixed through applications programming. Even if very careful data entry auditing can eliminate most of the reporting problems (at a high cost), it is easy to demonstrate that even a simple data entry becomes inefficient. Given the existence of update anomalies, suppose Darlene M. Smithson is assigned to work on the Evergreen project. The data entry clerk must update the PROJECT file with the entry:

15 Evergreen 112 Darlene M. Smithson DSS Analyst $45.95 0.

to match the attributes PROJ_NUM, PROJ_NAME, EMP_NUM, EMP_NAME, JOB_CLASS, CHG_HOUR, and HOURS. (When Ms. Smithson has just been assigned to the project, she has not yet worked, so the total number of hours worked is 0.0.)

Each time another employee is assigned to a project, some data entries (such as PROJ_NAME, EMP_NAME, and CHG_HOUR) are unnecessarily repeated. Imagine the data entry chore when 200 or 300

Before outlining the normalization process, it’s a good idea to review the concepts of determination and functional dependency. Functional Dependency Concepts

CONCEPT DEFINITION

Functional dependency The attribute B is fully functionally dependent on the attribute A if each value of A determines one and only one value of B. Example: PROJ_NUM → PROJ_NAME (read as PROJ_NUM functionally determines PROJ_NAME) In this case, the attribute PROJ_NUM is known as the determinant attribute and the attribute PROJ_NAME is known as the dependent attribute. Functional dependency (generalized definition)

Attribute A determines attribute B (that is, B is functionally dependent on A ) if all of the rows in the table that agree in value for attribute A also agree in value for attribute B. Fully functional dependency (composite key)

If attribute B is functionally dependent on a composite key A but not on any subset of that composite key, the attribute B is fully functionally dependent on A.

It is crucial to understand these concepts because they are used to derive the set of functional dependencies for a given relation. The normalization process works one relation at a time, identifying the dependencies on that relation and normalizing the relation. As you will see in the following sections, normalization starts by identifying the dependencies of a given relation and progressively breaking up the relation (table) into a set of new relations (tables) based on the identified dependencies.

Conversion to First Normal Form Because the relational model views data as part of a table or a collection of tables in which all key values must be identified, the data depicted in Figure 5.1 might not be stored as shown. Note that Figure 5.1 contains what is known as repeating groups. A repeating group derives its name from the fact that a group of multiple entries of the same type can exist for any single key attribute occurrence. In Figure 5.1, note that each single project number (PROJ_NUM) occurrence can reference a group of related data entries. For example, the Evergreen

project (PROJ_NUM = 15) shows five entries at this point—and those entries are related because they each share the PROJ_NUM = 15 characteristic. Each time a new record is entered for the Evergreen project, the number of entries in the group grows by one. A relational table must not contain repeating groups. The existence of repeating groups provides evidence that the RPT_FORMAT table in Figure 5.1 fails to meet even the lowest normal form requirements, thus reflecting data redundancies.

Normalizing the table structure will reduce the data redundancies. If repeating groups do exist, they must be eliminated by making sure that each row defines a single entity. In addition, the dependencies must be identified to diagnose the normal form. Identification of the normal form will let you know where you are in the normalization process. The normalization process starts with a simple three-step procedure. Step 1: Eliminate the Repeating Groups Start by presenting the data in a tabular format, where each cell has a single value and there are no repeating groups. To eliminate the repeating groups, eliminate the nulls by making sure that each repeating group attribute contains an appropriate data value. That change converts the table in Figure 5.1 to 1NF in Figure 5.2.

Figure 5.2 A table in first normal form

Step 2: Identify the Primary Key The layout in Figure 5.2 represents more than a mere cosmetic change. Even a casual observer will note that PROJ_NUM is not an adequate primary key because the project number does not uniquely identify all of the remaining entity (row) attributes. For example, the PROJ_NUM value 15 can identify any one of five employees. To maintain a proper primary key that will uniquely identify any attribute value, the new key must be composed of a combination of PROJ_NUM and EMP_NUM. For example, using the data shown in Figure 5.2, if you know that PROJ_NUM = 15 and EMP_NUM = 103, the entries for the attributes PROJ_NAME, EMP_NAME, JOB_CLASS, CHG_HOUR, and HOURS must be Evergreen, June E. Arbough, Elect. Engineer, $84.50, and 23.8, respectively. Step 3: Identify All Dependencies The identification of the PK in Step 2 means that you have already identified the following dependency:

PROJ_NUM, EMP_NUM PROJ_NAME, EMP_NAME, JOB_CLASS, CHG_HOUR, HOURS