Normalization - Data Warehousing - Lecture Handouts, Lecture notes of Data Warehousing

Normalization, Goals of normalization, Result of normalization, Levels of normalization, Purist approach of normalization, Student database system, Update Anomalies are discusses points of document. This handout is for data warehousing course from Virtual University of Pakistan.

Typology: Lecture notes

2011/2012

Uploaded on 11/06/2012

ahsen
ahsen 🇵🇰

4.6

(88)

84 documents

1 / 12

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Data Warehousing Course Code: CS614
Virtual University of Pakistan
50
Lecture Handout
Data Warehousing
Lecture No. 06
Normalization
Before we begin our discussion about the normal forms, it's important to understand that
what we will discuss are guidelines and guidelines only i.e. not a dejure standard.
Sometimes (especially in a DSS environment), it becomes necessary to drift from this
purist approach to meet practical business requirements of performance. However, when
deviation take place, it's extremely important to evaluate any possible consequences these
deviations would cause and the corresponding inconsistencies and anomalies this is what
de-normalization is about.
What is normalization?
What are the goals of normalization?
Eliminate redundant data.
Ensure data dependencies make sense.
What is the result of normalization?
What are the levels of normalization?
Always follow purist’s approach of normalization?
NO
pf3
pf4
pf5
pf8
pf9
pfa

Partial preview of the text

Download Normalization - Data Warehousing - Lecture Handouts and more Lecture notes Data Warehousing in PDF only on Docsity!

[email protected]

Virtual University of Pakistan

Lecture Handout

Data Warehousing

Lecture No. 06

Normalization

Before we begin our discussion about the normal forms, it's important to understand that what we will discuss are guidelines and guidelines only i.e. not a dejure standard. Sometimes (especially in a DSS environment), it becomes necessary to drift from this purist approach to meet practical business requirements of performance. However, when deviation take place, it's extremely important to evaluate any possible consequences these deviations would cause and the corresponding inconsistencies and anomalies this is what de-normalization is about.

What is normalization?

What are the goals of normalization?

ƒ Eliminate redundant data. ƒ Ensure data dependencies make sense.

What is the result of normalization?

What are the levels of normalization?

Always follow purist’s approach of normalization? NO

[email protected]

Virtual University of Pakistan

Normalization is the process of efficiently organizing data in a database by decomposing (splitting) a relational table into smaller tables by projection. There are basically two goals of normalization as follows:

  1. Eliminate redundant data (for example, storing the same data in more than one table)
  2. Ensuring data dependencies make sense (only storing related data in a table).

Both of these are worthy goals, as they reduce the amount of space a database consumes, and ensure that data is logically stored and is in third normal form (3NF). The database community has developed a series of guidelines or benchmarks for ensuring that databases are indeed normalized. These goals are referred to as normal forms and are numbered from one (the lowest form of normalization, referred to as first normal form or 1NF) through five (fifth normal form or 5NF). The first two normal forms are intermediate steps to achieve the goal of having all tables in 3NF. Note that in order to be correct, decomposition must be lossless i.e. the new tables can be recombined by a natural join to recreate the original table without creating any false or redundant data and without any loss of any data/information.

Consider a student database system to be developed for a multi-campus university, such that it specializes in one degree program at a campus i.e. BS, MS or PhD.

[email protected]

Virtual University of Pakistan

Table FIRST Figure-6.2: Part of example student database system in 1NF

Although the table is shown in 1NF i.e. it contains atomic values, there are no repeating values, there is no aggregation, yet it contains redundant data. For example, information about the student’s degree, and campus location is repeated for every course taken. Redundancy causes what are called update anomalies. Update anomalies are problems that arise when records are inserted, deleted, or updated in the database. For example, the following anomalies can occur:

Normalization: 1NF Update Anomalies

INSERT. Certain student with SID 5 got admission in a different campus (say) Karachi cannot be added until the student registers for a course.

DELETE. If student graduates and his/her corresponding record is deleted, then all information about that student is lost.

UPDATE. If student migrates from Islamabad campus to Lahore campus (say) SID = 1, then six rows would have to be updated with this new information.

[email protected]

Virtual University of Pakistan

INSERT. The fact that a certain student with SID 5 got admission in a different campus (say) Karachi cannot be added until the student registers for a course.

DELETE. If student graduates and his/her corresponding record is deleted, then all information about that student is lost.

UPDATE. If student migrates from Islamabad campus to Lahore campus (say) SID = 1, then six rows would have to be updated with this new information.

Normalization: 2NF

Every non-key column is fully dependent on the PK.

FIRST is in 1NF but not in 2NF because degree and campus are functionally dependent upon only on the column SID of the composite key (SID, course). This can be illustrated by listing the functional dependencies in the table:

SID —> campus, degree

campus —> degree

(SID, Course) —> Marks

To transform the table FIRST into 2NF we move the columns SID, Degree and Campus to a new table called REGISTRATION. The column SID becomes the primary key of this new table.

The definition of second normal form states that only tables with composite primary keys can be in 1NF but not in 2NF.

SID and Campus are NOT unique

[email protected]

Virtual University of Pakistan

To transform the table FIRST into 2NF we move the columns SID, Degree and city to a new table called REGISTRATION. The column SID becomes the primary key of this new table.

Normalization: 2NF

Presence of modification anomalies for tables in 2NF. For the table REGISTRATION, they are:

ƒ INSERT: Until a student gets registered in a degree program, that program cannot be offered!

ƒ DELETE: Deleting any row from REGISTRATION destroys all other facts in the table.

Why there are anomalies?

The table is in 2NF but NOT in 3NF

Tables in 2NF but not in 3NF still contain modification anomalies. In the example of REGISTRATION, they are:

  • INSERT. The fact that a particular campus has a certain degree (Peshawar campus decides to run a PhD program) cannot be inserted until there is a student registered for PhD in the campus.
  • DELETE. Deleting any row from REGISTRATION table destroys the degree information about the campus as well as the association between student and campus.

[email protected]

Virtual University of Pakistan

Normalization: 3NF

All columns must be dependent only on the primary key.

Table PERFORMANCE is already in 3NF. The non-key column, marks, is fully dependent upon the primary key (SID, degree).

REGISTRATION is in 2NF but not in 3NF because it contains a transitive dependency.

A transitive dependency occurs when a non-key column that is a determinant of the primary key is the determinate of other columns.

The concept of a transitive dependency can be illustrated by showing the functional dependencies:

REGISTRATION.SID —> REGISTRATION.Degree REGISTRATION.SID —> REGISTRATION.Campus REGISTRATION.Campus —> REGISTRATION.Degree

Note that REGISTRATION.Degree is determined both by the primary key SID and the non-key column campus.

For a relational table to be in third normal form (3NF) all columns must be dependent only upon the primary key. More formally, a relational table is in 3NF if it is already in 2NF and every non-key column is non transitively dependent upon its primary key. In other words, all non-key attributes are functionally dependent only upon the primary key. Or put another way, all attributes must be directly dependent on the primary key without implied dependencies through other attributes of the relation.

[email protected]

Virtual University of Pakistan

Figure-6.4: Part of example student database system in 3NF

Figure-6.4 shows that the table REGISTRATION is transformed into two tables i.e. Student_Campus and Campus_Degree. If we look at it in the context of memory, we observe that the storage space requirement has increased, for this particular example by about 7%.

Normalization: 3NF Removal of anomalies and improvement in queries as follows:

ƒ INSERT: Able to first offer a degree program, and then students registering in it.

ƒ UPDATE: Migrating students between campuses by changing a single row.

ƒ DELETE: Deleting information about a course, without deleting facts about all columns in the record.

[email protected]

Virtual University of Pakistan

Observe that by virtue of bringing a relational table into 3NF, storage of redundant data is further eliminated, that not only results in saving of space, but also reduces manipulation anomalies. For example, as a consequence following improvements have taken place:

INSERT: If Peshawar campus decides to offer a PhD program, this can be reflected even though there is no student currently registered in that campus. Similarly, facts about new students can be added even though they may not have registered for a course.

UPDATE: Changing the campus of a student (by migration) or a degree program of a campus requires modification of only a single row.

DELETE: Information about courses taken can be deleted without destroying information about a student or a campus.

Conclusions

ƒ Normalization guidelines are cumulative.

ƒ Generally a good idea to only ensure 2NF.

ƒ 3NF is at the cost of simplicity and performance.

ƒ There is a 4NF with no multi-valued dependencies.

ƒ There is also a 5NF.

Normalization guidelines are cumulative. For a database to be in 2NF, it must first fulfill all the criteria of a 1NF database.