

































Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Database Design, Relational Tables, Introduction to E-R, Entity, Transformation Rule, Relationship, Many-to-one, E-R Concepts, Generalization Hierarchies, Normalization, Anomalies, Functional Dependencies, Inclusion Rule, Armstrong's Axioms, Augmentation Rule, Pseudotransitivity Rule, Accumulation Rule, Lossy Decomposition, Normal Forms, Boyce-codd Normal Form, Third Normal Form, Minimal Cover Algorithm
Typology: Study notes
1 / 41
This page cannot be seen from the preview
Don't miss anything!


































Class 15
Chapter 6, Database Design. We now tackle the following:
Q: how do we analyze an enterprise and list the data items for a database, then decide how to place these data items columns in relational tables.
Up to now, we've had this done for us, but now we have to decide. For example should we just put them all in the same table?
Answer is no, because (as we'll see) the data doesn't behave well. Consider the CAP database. Could we just have a single table, CAPORDERS,
CAPORDERS := C X A X P X O where C.cid = O.cid and A.aid = O.aid and P.pid = O.pid
See any problems with that? One problem is redundency — Every row of CAPORDERS must duplicate all product information, and this happens lots of times. Ditto for customers and agents.
If you look at the number of columns and assume lots of orders for each C, A, P, see BIG waste of disk space compared to separate tables. Bad because cname, city, pname, etc., are long compared to ORDERS columns.
Further, note there is in PRODUCTS a quantity column, meaning quantity on hand. Assume needs updating every time a new order is placed.
But then in CAPORDERS with popular product, number or rows grows with number of orders (thousands!). ALL of them have to be ordered each time. BIG inefficiency compared to separate table for products.
And what do happens when we place a new order -- do we ask the application programmer to get the user (key entry clerk) to enter all the information about customer, agent, product in CAPORDERS every time?
Doesn't make sense -- waste of time -- clerk might get it wrong.
**** So the program looks up data (say for product) and enters it in CAPORDERS? Where does it look it up? Find another row with the same product and copy data?
Note that when have separate table for products, just enter pid (which can be looked up in products table if start with product name/description).
What if some product is out of stock and takes a long time to reorder so all the orders for that product have been filled from CAPORDERS (we don't keep these orders around forever).
But now where do we find product information? Have we forgotten about this product because we haven't had orders for awhile? Note solved if we have distinct products table.
There are also a number of "business rules" that it would be nice to have the database system guarantee -- simple things, but quite subject to error in data entry.
For example, we say that the pid column is a unique identifier for a product. That's a business rule. No two products have the same pid.
But review now the idea of CAPORDERS -- everything joined. For each orders row, have all info on customer, agent, and product. Put up picture.
It's a little hard to figure out what that means in the CAPORDERS table where pid will be duplicated on every row where a given product is ordered, but let's t r y.
In the database design where PRODUCTS is a separate table, we say that that pid is a "key" for the table, i.e. it is unique.
If we think in terms of rows in PRODUCTS being duplicated in CAPORDERS (for every order that deals with the same product), then how would we characterize pid?
Business rule: every unique pid value is associated with a unique quantity (it would be a bad thing if in CAPORDERS two rows with the same pid had different quantity values). Is the reverse true?
We write this: pid -> quantity, and say pid functionally determines quantity, or quantity is functionally determined by pid.
Class_rooms is a good example of an entity. Distinguishable (by location). Have common properties, such as seating capacity, which will turn into a column of a table (different values of these common properties).
Class_periods is an entity? An interval of time is a real-world object? A bit weird, but basically think of assigning a student to a classroom during a class period, so time interval is treated just like classroom.
Normally, an entity such as Class_rooms or Customers is mapped to a relational table, and each row is an entity occurrence, or entity instance, representing a particular object.
In the case of Products, an entity instance is a category of objects sold by our wholesale company.
It is unusual in the field to use a plural for entity and table, Customers instead of Customer. We do this to emphasize that the entity represents a Set of objects.
Def. 6.1.2 Attribute. An attribute is a data item that describes a property of an entity or a relationship (to follow).
Note that we have special terminology for special kinds of attributes, see pg.
An identifier is an attribute or set of attributes that uniquely identifies an entity instance. Like cid for Customers. Can have primary identifier.
A descriptor is a non-key attribute, descriptive. E.g., city a customer is in, color of a car, seating capacity of a classroom, qty of an order.
A multi-valued attribute is one which can take on several values simultaneously for an entity instance. E.g., keyword for Journal_articles or hobby for Employees. Disallowed by relational rule 1, but in ORDBMS.
In E-R, we can have a composite attribute (like a nested struct inside the row), e.g., cname can have fname, lname, midinit as three parts.
Note that the relational model, rule 1, which disallows multi-valued columns, also disallows composite columns. OK in ORDBMS, but must map composite attribute to multiple columns in relational model.
Note that term attribute is also used in relations, but ideas correspond in the mapping of entities and relations into relational tables.
Note that while entity instances within an entity are said to be distinct, but this is only a mathematical idea until we have identier attributes.
We write E is an entity, with entity instances {e 1 , e 2 ,.. ., e (^) n }. Need an identifier attribute defined, unique for each occurrence e (^) i.
Put up diagram from pg. 333, Figure 6.2.
Transforming Entities and Attributes to Relations is pretty obvious, as mentioned previously.
Transformation Rule 1. An entity is mapped to a single table. The single- valued attributes of the Entity are mapped to columns (composite attributes are mapped to multiple simple columns). Entity occurrences become rows of the table.
Transformation Rule 2. A multi-valued attribute must be mapped to its own table. See bottom of pg. 334. (No longer true in ORDBMS.)
Not too much power so far, but relationship adds real modeling power.
Def. 6.1.3. Relationship (pg. 335). Given an ordered list of m entities, E 1 , E 2 ,... , E (^) m , (where the same entity may occur more than once in the list), a relationship R defines a rule of correspondence between the instances of these entities. Specifically, R represents a set of m-tuples, a subset of the Cartesian product of entity instances.
Instructors teaches Course_sections Employees works_on Projects (attribute, percent (of time)) Employees manages Employees (ring, or recursive relationship)
See top of pg 336 for diagram. Note "roles" of labeled connecting lines in case of recursive relationship.
Example 6.1.3. The orders table in the CAP database does NOT represent a relationship. Reason is that orders rows do not correspond to a subset of the entities involved. Multiple orders can exist with same cid, aid, pid.
The orders table is really an entity, with identifier ordno.
Class 16.
Remember the E-R diagram on pg 336 with the relationship that says Employees works_on Projects, where works_on is a relationship. works_on has the connected attribute percent. Draw it.
Note: percent, associated with relationship, i.e., a value with each rela- tionship instance.
The relationship instance represents a specific pairing of an Employees instance with a Projects instance; percent represents the percent of time an employee instance works on that project.
Clearly have the business rule that an employee can work on more than one project. Also have rule that more than one employee can work on each project. This binary relationship is said to be Many-to-Many.
Now it's going to turn out that for relationships that are Many-to-Many, a table is needed in the relational model to represent the relationship. But this is NOT always true if the relationship is not Many-to-Many.
Consider the relationship (also on pg. 336): Instructors teaches Course_sections.
Say we have the rule that an Instructor can teach more than one course section (usually does, unfortunately for me), but we make the rule that only one instructor is associated with every course section.
This means that if there are two instructors teaching a class, one of the two is actually responsible for the course, and the team approach is unofficial.
Now we know from transformation rule 1 that both entities Instructors and Course_sections map to relational tables. (Draw this, with some attributes: iid for instructors, csid for course_sections -- assume no multi-valued attributes so these are only two tables).
Now the question is, do we need another table for the relationship teaches.
Answer: No. We can put a column in the course_sections table that uniquely identifies the instructor teaching each row (instance). This is done with an iid column.
Note that the iid column in the course_sections table is NOT an attribute of the Course_sections entity. The iid column instead represents the teaches relationship.
In relational terminology, this column is known as a foreign key in the course_sections table (not a key for course_sections but one for the for- eign table instructors).
OK, what's the difference? Why did one relationship, Employees works_on Projects require a table for works_on, and the other, Instructors teaches Course_sections, require no new table?
Because one relationship is Many-to-Many and the other is Many-to-One! The Many-to-One relationship can be done with a foreign key because we only need to identify (at most) ONE connecting instance on one side.
Note these ideas are all BUSINESS RULES. They are imposed by the DBA for all time by the definition of the tables. We shall see how shortly.
Look at Figure 6.6. Entities E and F, relationship R. Lines between dots. Dots are entity instances. Lines are relationship instances.
If all dots in the entity E have AT MOST one line coming out, we say: max-card(E, R) = 1.
If more than one line out is possible, we say max-card(E, R) = N.
If all dots in the entity E have AT LEAST one line coming out, we say: min-card(E, R) = 1.
If some dots might not have a line coming out, we say min-card(E, R) = 0.
We combine these, by saying card(E, R) = (x, y) if min-card(E, R) = x and max-card(E, R) = y. (x is either 0 or 1 and y is either 1 or N.)
Go over Figure 6.7 on pg 341. Note that for recursive relationship manages, include role: card(employees(reports_to), manages) = (0, 1)
Note that saying min-card(E, R) = 0 is really NOT MAKING A RESTRICTION. There might be no lines leaving a dot, there might be one, or more (min-card NEVER says anything about maximum number).
1-1. Is it optional on one side or is it mandatory on both sides?
Optional on one side. Postmen carry Mailbags. Every postman carries one and only one mailbag, and every mailbag is carried by at most one postman, but there might be some spares in stock that are carried by none.
Represent as two tables, foreign key column in one with mandatory partic- ipation: column defined to be NOT NULL. Can faithfully represent mandatory participation. Clearly representing single-valued participation.
(Idea of faithful representation: programmer can't break the rule even if writes program with bug. Note can NOT faithfully represent single-value participation for both mail-bags AND postmen.)
1-1 and Mandatory on both sides: never can break apart. It's appropriate to think of this as two entities in a single table. E.g. couples on a dance floor -- no-one EVER is considered to be without a partner. Avoids foreign keys.
(Really? But might change partners and some info might be specific to individuals of partners - his height, age, weight - her height, age, weight.
This logical design is more concerned with not being able to end up with a mistake than making a transformation easy.)
Section 6.3, Additional E-R concepts.
Attributes can use idea of cardinality as well. See Figure 6.10.
(0, y) means don't have to say not null, (1, y) means do.
(x, 1) most common, single valued attribute, (x, N) multi-valued.
Weak entities. An entity that can't exist unless another entity exists. Depends for its existence and identification on another entity.
E.g., Line-items on an Order. Customer places an order, orders several products at once. Order has multiple line items. Line_items is a weak entity dependent on the entity Orders. See Figure 6.11.
There is an N-1 relationship, has_item, that relates one Orders instance to many Line_items instances.
Therefore, by transformation rules, Line_items sent to table, Orders sent to table, foreign key in Many side, line_items table.
Note that the identifier for a Line_items, lineno, is enough in E-R model to identify the weak entity, since can go back through has_item relationship to find what order it belongs to.
In relational model, must be identified by column value. Note ordno is not an attribute of line_items but a foreign key, and lineno and ordno must be used together as a key for line_items!!!
OK, think. What's the difference between the two situations: Orders has_item Line_Items and Employees with multi-value attribute hobbies? Map the same way into relational tables!
Possibly very little difference. One man's attribute is another man's entity. If I cared about tracking all hobbies in the company so I could provide well thought out relaxation rooms, might say hobbies are entities.
In case of line number, there are usually several attributes involved, lineno, product ordered, quantity of product, cost, so seems reasonable to say Line_items is an entity, albeit a weak one.
Is this a ternary relationship, then, relating these three? We say it is not, because it is possible to set up two N-1 binary relationships that more faithfully represent the situation.
I wouldn't want to assign two gates to one flight, for example, or two flights to one passenger. Therefore:
See Figure 6.13, pg. 351. marshalls and travels on. Ternery would be N-N-N.
Now work out cardinalities of relationships.
A gate might not be used or it might be used for multiple flights, so have (0, N) participation in marshalls. A flight must have a gate and can have only one gate, so (1, 1).
A passenger must have a flight and only one, so (1, 1) participation in travels_on. A flight must have a passenger (or it will certainly be cancelled) and may (probably will) have many, so (1, N).
Others are clear. Each passenger must have one and only one seat, a seat may or may not be used. Each seat is on some flight, but each flight has multiple seats.
Now transform into relational tables. Map entities, see bottom of pg. 321. Draw on board.
Since seats is weak entity (single-valued participation in has_seat), add flightno to seats table. Now that relationship is taken care of.
Passengers has single-valued participation in two relationships, so add seatno and flightno to passengers (seatno and flightno are needed for seat_assignment, and flightno is needed for travels_on, so one counted twice). Now those relationships are taken care of.
Only one left is Gates marshalls Flights, and put gateno as foreign key in flights table. Done.
See how three (1, 1) participations can be faithfully represented with not- null for seatno and flightno in passengers, not null for flightno in seats, not null for gateno in flights.
Section 6.5. Preliminaries for Normalization.
Idea in normalization, start with data item names (to be columns in some tables) together with a list of rules of relatedness. Then start with all data items in one table (universal table).
Rules of relatedness and a desire to avoid certain types of bad behavior (anomalies) causes us to factor this big table into smaller tables, achieving more and more restrictive forms (Normal Forms). 1NF, 2NF, 3NF, BCNF.
Will not cover 4NF, 5NF (frequently not considered in commercial use).
The idea is that from the factored tables can always get back all the original data by joins; this is called a "lossless decomposition".
A second desire we have is that the database system will be able to check the rules of relatedness as simply as possible (generally by simply enforcing uniqueness of a column in a table).
Point of Normalization is that reach the same design as with E-R, but it is more cut and dried, uses intuition less (given the set of rules — a big given!). Both E-R and Normalization have their points.
To start. 1NF means no repeating fields in tables; no relational products allow such repeating rules (Montage, now Illustra, does, showever).
OK. Running Example: Employee Information. See pg. 354.
Explain each: emp_id, emp_name, emp_phone, dept_name, dept_phone, dept_mgrname skill_id, skill_name, skill_date, skill_lvl
Design from an E-R standpoint is easy. Entities Emps, Depts, Skills, and relationship between Emps and Skills, has_skill, N-N so own table.
But let's take normalization approach. Start with Universal table; see Fig 6.16, emp_info, on pg 355.
Bad design. Turns out key is emp_id skill_id. How do we know that? By rules of relatedness. What's the problem with that? See following.
Anomalies. There is replication of employee data on different rows of emp_info, and this seems unnatural. But why is that bad?
Class 18.
Section 6.6. Functional Dependencies.
Remember what it meant to say that the identifier pid is a key (identifier) for product information in the CAPORDERS table?
The pid value is repeated many times, so not unique as a key. But still, pid uniquely determines quantity and pname and pcity &c. say pid -> quantity.
Def 6.6.1. Given a table T with at least two attributes A and B, we say that A -> B (A functionally determines B, or B is functionally dependent on A) iff it is the intent of the designer that for any set of rows that might exist in the table, two rows in T cannot agree on A and disagree on B.
More formally, given two rows r 1 and r 2 in T, if r 1 (A) = r 2 (A) then r 1 (B) = r 2 (B).
Idea of function in calculus — graphs. See it's the same, domain -> range.
Ex. 6.6.1. in emp_info table: emp_id -> emp_name, emp_id -> emp_phone, emp_id -> dept_name.
Think what this means in E-R terms. Consider your intution - if two rows with same emp_id and different emp_phone, assume data corrupted, but if same emp_phone and different emp_id say "Oh, so employees can share a phone." Then we say, emp_phone -/-> emp_id
Def. 6.6.2. OK, have when one attribute A functionally determines another B. Now sets of attributes: X = A 1 A 2... A (^) k , and Y = B 1 B 2... B (^) m. Say X -
Y iff it is the intention of the designer that two rows cannot simultane- ously agree on X and disagree on Y.
Same wording as Def. 6.6.1, but note that for a set X, agrees if and only if agrees on ALL column value, disagrees if disagrees on ANY column value.
Ex. 6.6.3. We claim what follows is all FDs of emp_info. Interpret by E-R.
(1) emp_id -> emp_name emp_phone dept_name (2) dept_name -> dept_phone dept_mgrname (3) skill_id -> skill_name (4) emp_id skill_id -> skill_date skill_lvl
Note that if we know emp_id -> emp_name, emp_id -> emp_phone, and emp_id -> dept_name, then know emp_id -> emp_name emp_phone dept_name. Easy to see by definition, but DOES require thinking about definition. Three facts about singleton attribute FDs lead to X -> Y fact.
Note that we can conclude from above that designer does not intend that skill_name should be unique for a particular skill. skill_name -> skill_id is not there, nor is it implied by this set (no skill_name on left).
Note that a set of FDs has logical implications to derive OTHER FDs. E.g., emp_id -> emp_name emp_phone dept_name above. There are RULES for how to derive some FDs from others. The simplest one fillows.
Theorem 6.6.3. Inclusion rule. Given T with Head(T). If X and Y are sets in Head(T) and Y ⊆ X, then X -> Y (for ANY content for T). (Venn Diagram.)
Proof. By def, need only demonstrate that if two rows u and v agree on X they must agree on Y. But Y is a subset of X, so seems obvious.
Def. 6.6.4. A trivial dependency is an FD of the form X -> Y that holds for any table T where X UNION Y ⊆ Head(T). (Assume ANY content for T.)
Theorem 6.6.5. Given a trivial dependency X -> Y in T, it must be the case that Y ⊆ X. (Venn Diagram, X, Y disjoint, show A.)
Proof. Create a table T with Head(T) = X UNION Y and consider the attributes in Y - X. Assume it is non-empty (so it is false that Y ⊆ X) and we find a contradiction. Let A be an attribute in Y - X. A trivial dependency must hold for any possible content of T. But since A is not in X, it is possible to construct two rows u and v in T alike in all values for X but having different values in A. Then X -> Y does not hold for this constructed contents, in contradiction to its triviality.
Def. 6.6.6. Armstrong's Axioms. From the following small set of basic rules of implication among FDs, we can derive all others that are true.
[1] Inclusion rule: if Y ⊆ X, then X -> Y [2] Transitivity rule: if X -> Y and Y -> Z, then X -> Z [3] Augmentation rule: if X -> Y, then X Z -> Y Z ( LEAVE UP : Note X Z for sets is the same as X UNION Z.)
rewritten (d) X -> X Y. Now by (b) and augmentation, we have (e) X Y -> Y Z. And by (d) and (e) and transitivity, we have X -> Y Z, the desired result. [2] & [4] Proved in text.
The idea is that we can use these new Rules as if they were axioms, as facts to prove yet other rules or special cases. Like proving theorem in Geometry, can then use to prove other theorems.
In what follows, when we list a set of FDs, we normally try to list a MINIMAL set, so that a smaller set doesn't exist that will imply these. It will turn out that finding a minimal set of FDs is very important in finding the right relational design by Normalization.
Example 6.6.4. Pg. 338. Consider the table T below fixed in content for all time so the intended FDs can be read off (VERY UNUSUAL). Let's try to list a minimal set of FDs.
Table T row # A B C D 1 a1 b1 c 1 d 2 a1 b1 c 2 d 3 a2 b1 c 1 d 4 a2 b1 c 3 d
Analysis. Start by considering FDs with a single attribute on the left.
Always have the trivial FDs, A −> A, B −> B, C −> C, and D −> D, but don't list trivial FDs in a minimal set.
(a) All values of the B attribute are the same, so it can never happen f o r any other attribute P (i.e., where P represents A, C, or D) that r 1 (P) = r 2 (P) while r 1 (B) ≠ r 2 (B); thus we see that A −> B, C −> B, and D −> B.
At the same time no other attribute P is functionally dependent on B since they all have at least two distinct values, and so there are always two rows r 1 and r 2 such that r 1 (P) ≠ r 2 (P) while r 1 (B) = r 2 (B); thus: B - / -
A, B -/-> C, and B -/-> D.
(b) Because the D values are all different, in addition to D −> B of p a r t (a), we also have D −> A and D −> C. We state: a KEY (D) functionally determines everything else, which will turn out to be the point.
At the same time D is not functionally dependent on anything else since all other attributes have at least two duplicate values, so in addition to B - / -
D of part (a), we have A -/-> D, and C -/-> D.
List all below without A to C and C to A and show how we know.
(c) We have A -/-> C (because of rows 1 and 2) and C -/-> A (because o f rows 1 and 3). Therefore, we can list all FDs (and failed FDs) with a single attribute on the left (we provide a letter in parentheses keyed to t h e paragraph above that give us each fact).
(a) A −> B (a) B -/-> A (c) C -/-> A (b) D −> A (c) A -/-> C (a) B -/-> C (a) C −> B (a) D −> B (b) A -/-> D (a) B -/-> D (b) C -/-> D (b) D −> C
By the union rule, whenever a single attribute on the left functionally determines several other attributes, as with D above, we can combine t h e attributes on the right: D −> A B C. From the analysis so far, we have t h e following set of FDs (which we believe to be minimal):
(1) A −> B, (2) C −> B, (3) D −> A B C
(What is the key for this table? Note, really D -> A B C D)
Now consider FDs with pairs of attributes on the left. (d) Any pair containing D determines all other attributes, by FD (3) above and t h e augmentation rule, so there is no new FD with D on the left that is n o t already implied.
(e) The attribute B combined with any other attribute P on the left, still functionally determines only those attributes already determined by P, as we see by the following argument. If P -/-> Q this means there are rows r 1 and r 2 such that r 1 (Q) ≠ r 2 (Q) while r 1 (P) = r 2 (P). But because B has equal values on all rows, we know that r 1 (B P) = r 2 (B P) as well, so B P - / -
Q. Thus we get no new FDs with B on the left.
(f) Now the only pair of attributes that does not contain B or D is A C, and since A C has distinct values on each row (examine table T again!), we know that A C −> A B C D. This is new, but is it minimal?