






Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
The concept of database normalization and discusses the second, third, fourth and fifth normal forms. It covers the importance of normalization in eliminating redundancy and ensuring data integrity, and provides examples of how to decompose records to satisfy each normal form.
Typology: Study notes
1 / 12
This page cannot be seen from the preview
Don't miss anything!







3.1 Second Normal Form 3.2 Third Normal Form 3.3 Functional Dependencies 4 FOURTH AND FIFTH NORMAL FORMS 4.1 Fourth Normal Form 4.1.1 Independence 4.1.2 Multivalued Dependencies 4.2 Fifth Normal Form 5 UNAVOIDABLE REDUNDANCIES 6 INTER-RECORD REDUNDANCY 7 CONCLUSION 8 REFERENCES
The normal forms defined in relational database theory represent guidelines for record design. The guidelines corresponding to first through fifth normal forms are presented here, in terms that do not require an understanding of relational theory. The design guidelines are meaningful even if one is not using a relational database system. We present the guidelines without referring to the concepts of the relational model in order to emphasize their generality, and also to make them easier to understand. Our presentation conveys an intuitive sense of the intended constraints on record design, although in its informality it may be imprecise in some technical details. A comprehensive treatment of the subject is provided by Date [4].
The normalization rules are designed to prevent update anomalies and data inconsistencies. With respect to performance tradeoffs, these guidelines are biased toward the assumption that all non-key fields will be updated frequently. They tend to penalize retrieval, since data which may have been retrievable from one record in an unnormalized design may have to be retrieved from several records in the normalized form. There is no obligation to fully normalize all records when actual performance requirements are taken into account.
First normal form [1] deals with the "shape" of a record type.
Under first normal form, all occurrences of a record type must contain the same number of fields.
First normal form excludes variable repeating fields and groups. This is not so much a design guideline as a matter of definition. Relational database theory doesn't deal with records having a variable number of fields.
Second and third normal forms [2, 3, 7] deal with the relationship between non-key and key fields.
Under second and third normal forms, a non-key field must provide a fact about the key, us the whole key, and nothing but the key. In addition, the record must satisfy first normal form.
We deal now only with "single-valued" facts. The fact could be a one-to-many relationship, such as the department of an employee, or a one-to-one relationship, such as the spouse of an employee. Thus the phrase "Y is a fact about X" signifies a one-to- one or one-to-many relationship between Y and X. In the general case, Y might consist of one or more fields, and so might X. In the following example, QUANTITY is a fact about the combination of PART and WAREHOUSE.
Second normal form is violated when a non-key field is a fact about a subset of a key. It is only relevant when the key is composite, i.e., consists of several fields. Consider the following inventory record:
| PART | WAREHOUSE | QUANTITY | WAREHOUSE-ADDRESS | ====================------------------------------- The key here consists of the PART and WAREHOUSE fields together, but WAREHOUSE-ADDRESS is a fact about the WAREHOUSE alone. The basic problems with this design are:
| PART | WAREHOUSE | QUANTITY | | WAREHOUSE | WAREHOUSE-ADDRESS | ====================----------- =============-------------------- When a data design is changed in this way, replacing unnormalized records with normalized records, the process is referred to as normalization. The term
suppose a person's address is a single-valued fact, i.e., a person has only one address. If we don't provide unique identifiers for people, then there will not be a functional dependency in the data:
| PERSON | ADDRESS | -------------+-------------------------------- | John Smith | 123 Main St., New York | | John Smith | 321 Center St., San Francisco |
Although each person has a unique address, a given name can appear with several different addresses. Hence we do not have a functional dependency corresponding to our single-valued fact.
Similarly, the address has to be spelled identically in each occurrence in order to have a functional dependency. In the following case the same person appears to be living at two different addresses, again precluding a functional dependency.
| PERSON | ADDRESS | -------------+------------------------- | John Smith | 123 Main St., New York | | John Smith | 123 Main Street, NYC |
We are not defending the use of non-unique or non-singular representations. Such practices often lead to data maintenance problems of their own. We do wish to point out, however, that functional dependencies and the various normal forms are really only defined for situations in which there are unique and singular identifiers. Thus the design guidelines as we present them are a bit stronger than those implied by the formal definitions of the normal forms.
For instance, we as designers know that in the following example there is a single- valued fact about a non-key field, and hence the design is susceptible to all the update anomalies mentioned earlier.
| EMPLOYEE | FATHER | FATHER'S-ADDRESS | |============------------+-------------------------------| | Art Smith | John Smith | 123 Main St., New York | | Bob Smith | John Smith | 123 Main Street, NYC | | Cal Smith | John Smith | 321 Center St., San Francisco |
However, in formal terms, there is no functional dependency here between FATHER'S- ADDRESS and FATHER, and hence no violation of third normal form.
Fourth [5] and fifth [6] normal forms deal with multi-valued facts. The multi-valued fact may correspond to a many-to-many relationship, as with employees and skills, or to a many-to-one relationship, as with the children of an employee (assuming only one parent is an employee). By "many-to-many" we mean that an employee may have several skills, and a skill may belong to several employees.
Note that we look at the many-to-one relationship between children and fathers as a single-valued fact about a child but a multi-valued fact about a father.
In a sense, fourth and fifth normal forms are also about composite keys. These normal forms attempt to minimize the number of fields involved in a composite key, as suggested by the examples to follow.
Under fourth normal form, a record type should not contain two or more independent multi-valued facts about an entity. In addition, the record must satisfy third normal form. The term "independent" will be discussed after considering an example.
Consider employees, skills, and languages, where an employee may have several skills and several languages. We have here two many-to-many relationships, one between employees and skills, and one between employees and languages. Under fourth normal form, these two relationships should not be represented in a single record such as
Instead, they should be represented in the two records
| EMPLOYEE | SKILL | | EMPLOYEE | LANGUAGE | ==================== ======================= Note that other fields, not involving multi-valued facts, are permitted to occur in the record, as in the case of the QUANTITY field in the earlier PART/WAREHOUSE example.
The main problem with violating fourth normal form is that it leads to uncertainties in the maintenance policies. Several policies are possible for maintaining two independent multi-valued facts in one record:
| EMPLOYEE | SKILL | LANGUAGE | |----------+-------+----------| | Smith | cook | | | Smith | type | | | Smith | | French | | Smith | | German | | Smith | | Greek |
This is not much different from maintaining two separate record types. (We note in passing that such a format also leads to ambiguities regarding the meanings of blank fields. A blank SKILL could mean the person has no skill, or the field is not applicable to this employee, or the data is unknown, or, as in this case, the data may be found in another record.) (2) A random mix, with three variations: (a) Minimal number of records, with repetitions:
| EMPLOYEE | SKILL | LANGUAGE |
skills and languages. There is only an indirect connection because they belong to some common employee. That is, it does not matter which skill is paired with which language in a record; the pairing does not convey any information. That's precisely why all the maintenance policies mentioned earlier can be allowed.
In contrast, suppose that an employee could only exercise certain skills in certain languages. Perhaps Smith can cook French cuisine only, but can type in French, German, and Greek. Then the pairings of skills and languages becomes meaningful, and there is no longer an ambiguity of maintenance policies. In the present case, only the following form is correct:
| EMPLOYEE | SKILL | LANGUAGE | |----------+-------+----------| | Smith | cook | French | | Smith | type | French | | Smith | type | German | | Smith | type | Greek |
Thus the employee:skill and employee:language relationships are no longer independent. These records do not violate fourth normal form. When there is an interdependence among the relationships, then it is acceptable to represent them in a single record.
4.1.2 Multivalued Dependencies
For readers interested in pursuing the technical background of fourth normal form a bit further, we mention that fourth normal form is defined in terms of multivalued dependencies, which correspond to our independent multi-valued facts. Multivalued dependencies, in turn, are defined essentially as relationships which accept the "cross- product" maintenance policy mentioned above. That is, for our example, every one of an employee's skills must appear paired with every one of his languages. It may or may not be obvious to the reader that this is equivalent to our notion of independence: since every possible pairing must be present, there is no "information" in the pairings. Such pairings convey information only if some of them can be absent, that is, only if it is possible that some employee cannot perform some skill in some language. If all pairings are always present, then the relationships are really independent.
We should also point out that multivalued dependencies and fourth normal form apply as well to relationships involving more than two fields. For example, suppose we extend the earlier example to include projects, in the following sense:
Fifth normal form deals with cases where information can be reconstructed from smaller pieces of information that can be maintained with less redundancy. Second, third, and fourth normal forms also serve this purpose, but fifth normal form generalizes to cases not covered by the others.
We will not attempt a comprehensive exposition of fifth normal form, but illustrate the central concept with a commonly used example, namely one involving agents, companies, and products. If agents represent companies, companies make products, and agents sell products, then we might want to keep a record of which agent sells which product for which company. This information could be kept in one record type with three fields:
| AGENT | COMPANY | PRODUCT | |-------+---------+---------| | Smith | Ford | car | | Smith | GM | truck |
This form is necessary in the general case. For example, although agent Smith sells cars made by Ford and trucks made by GM, he does not sell Ford trucks or GM cars. Thus we need the combination of three fields to know which combinations are valid and which are not.
But suppose that a certain rule was in effect: if an agent sells a certain product, and he represents a company making that product, then he sells that product for that company.
| AGENT | COMPANY | PRODUCT | |-------+---------+---------| | Smith | Ford | car | | Smith | Ford | truck | | Smith | GM | car | | Smith | GM | truck | | Jones | Ford | car |
In this case, it turns out that we can reconstruct all the true facts from a normalized form consisting of three separate record types, each containing two fields:
| AGENT | COMPANY | | COMPANY | PRODUCT | | AGENT | PRODUCT | |-------+---------| |---------+---------| |-------+---------| | Smith | Ford | | Ford | car | | Smith | car | | Smith | GM | | Ford | truck | | Smith | truck | | Jones | Ford | | GM | car | | Jones | car | ------------------- | GM | truck | -------------------
These three record types are in fifth normal form, whereas the corresponding three-field record shown previously is not.
Roughly speaking, we may say that a record type is in fifth normal form when its information content cannot be reconstructed from several smaller record types, i.e., from record types each having fewer fields than the original record. The case where all
| Smith | GM | | Ford | truck | | Smith | truck | Normal | Jones | Ford | | GM | car | | Jones | car | Form | Brown | Ford | | GM | truck | | Jones | truck | | Brown | GM | | Toyota | car | | Brown | car | | Brown | Toyota | | Toyota | bus | | Brown | bus |
Observe that:
In the present example, no pairwise decomposition is possible. There is no combination of two smaller records which contains the same total information as the original record. All three of the smaller records are needed. Hence an information-preserving pairwise decomposition is not possible, and the original record is not in violation of fourth normal form. Fifth normal form is needed in order to deal with the redundancies in this case.
Normalization certainly doesn't remove all redundancies. Certain redundancies seem to be unavoidable, particularly when several multivalued facts are dependent rather than independent. In the example shown Section 4.1.1, it seems unavoidable that we record the fact that "Smith can type" several times. Also, when the rule about agents, companies, and products is not in effect, it seems unavoidable that we record the fact that "Smith sells cars" several times.
The normal forms discussed here deal only with redundancies occurring within a single record type. Fifth normal form is considered to be the "ultimate" normal form with respect to such redundanciesæ.
Other redundancies can occur across multiple record types. For the example concerning employees, departments, and locations, the following records are in third normal form in spite of the obvious redundancy:
| EMPLOYEE | DEPARTMENT | | DEPARTMENT | LOCATION | ============------------- ==============-----------
| EMPLOYEE | LOCATION | ============----------- In fact, two copies of the same record type would constitute the ultimate in this kind of undetected redundancy.
Inter-record redundancy has been recognized for some time [1], and has recently been addressed in terms of normal forms and normalization [8].
While we have tried to present the normal forms in a simple and understandable way, we are by no means suggesting that the data design process is correspondingly simple. The design process involves many complexities which are quite beyond the scope of this paper. In the first place, an initial set of data elements and records has to be developed, as candidates for normalization. Then the factors affecting normalization have to be assessed: