Download Introduction to Database Systems-Lecture 13 Slides-Computer Science and more Slides Introduction to Database Management Systems in PDF only on Docsity!
Data Warehousing & Mining
CPS 116
Introduction to Database Systems
Data integration
Data resides in many distributed, heterogeneous
OLTP (On-Line Transaction Processing) sources
Sales, inventory, customer, …
NC branch, NY branch, CA branch, …
Need to support OLAP (On-Line Analytical
Processing) over an integrated view of the data
Possible approaches to integration
Eager: integrate in advance and store the integrated data
at a central repository called the data warehouse
Lazy: integrate on demand; process queries over
distributed sources—mediated or federated systems
3
OLTP versus OLAP
OLTP
Mostly updates
Short, simple transactions
Clerical users
Goal: ACID, transaction
throughput
OLAP
Mostly reads
Long, complex queries
Analysts, decision makers
Goal: fast queries
Implications on database design and optimization?
OLAP databases do not care much about redundancy
“Denormalize” tables
Many, many indexes
Precomputed query results
4
Eager versus lazy integration
Eager (warehousing)
In advance: before queries
Copy data from sources
Lazy
On demand: at query time
Leave data at sources
) Answer could be stale
) Need to maintain
consistency
) Query processing is local to
the warehouse
Faster
Can operate when sources
are unavailable
) Answer is more up-to-date
) No need to maintain
consistency
) Sources participate in
query processing
Slower
Interferes with local
processing
5
Maintaining a data warehouse
The “ETL” process
Extraction: extract relevant data and/or changes from sources
Transformation: transform data to match the warehouse schema
Loading: integrate data/changes into the warehouse
Approaches
Recomputation
- Easy to implement; just take periodic dumps of the sources, say, every night
- What if there is no “night,” e.g., a global organization?
- What if recomputation takes more than a day?
Incremental maintenance
- Compute and apply only incremental changes; fast if changes are small
- Not easy to do for complicated transformations
- Need to detect incremental changes at the sources
6
“Star” schema of a data warehouse
Big
Constantly growing
Stores measures (often
aggregated in queries)
OID date CID PID SID qty price 100 11/23/2001 c3 p1 s1 1 12 102 12/12/2001 c3 p2 s1 2 17 105 12/24/2001 c5 p1 s3 5 13 ... ... ... ... ... ... ...
PID name cost p1 beer 10 p2 diaper 16 ... ... ...
SID city s1 Durham s2 Chapel Hill s3 RTP ... ...
CID name address city c3 Amy 100 Main St. Durham c4 Ben 102 Main St. Durham c5 Coy 800 Eighth St. Durham ... ... ... ...
Dimension table
Dimension table
Dimension table
Fact table
Small
Updated infrequently
Product
Store
Sale
Customer
Data cube
Customer
Store
Product
ALL
p
p
s
s
s
c3 c4 c
(c3, p2, s1) = 2
(c5, p1, s3) = 5
Simplified schema: Sale ( CID , PID , SID , qty )
(c3, p1, s1) = 1 (c5, p1, s1) = 3
Customer
Store
Product
Completing the cube—plane
(ALL, p1, s3) = 5
(ALL, p2, s1) = 2
(ALL, p1, s1) = 4
Total quantity of sales for each product in each store
ALL
p
p
s
s
s
c3 c4 c
(c3, p2, s1) = 2
(c5, p1, s3) = 5
(c3, p1, s1) = 1 (c5, p1, s1) = 3
SELECT PID, SID, SUM(qty) FROM Sale
GROUP BY PID, SID;
Project all points onto Product - Store plane
9
Completing the cube—axis
(ALL, p2, ALL)
(ALL, p1, ALL)
(ALL, p1, s3) = 5
(ALL, p2, s1) = 2
(ALL, p1, s1) = 4
Total quantity of sales for each product
ALL
p
p
s
s
s
c3 c4 c
(c3, p2, s1) = 2
(c5, p1, s3) = 5
(c3, p1, s1) = 1 (c5, p1, s1) = 3
SELECT PID, SUM(qty) FROM Sale GROUP BY PID;
Further project points onto Product axis
Customer
Store
Product
10
Customer
Store
Product
Completing the cube—origin
(ALL, p1, s3) = 5
(ALL, p2, s1) = 2
(ALL, p1, s1) = 4
Total quantity of sales
ALL
p
p
s
s
s
c3 c4 c
(c3, p2, s1) = 2
(c5, p1, s3) = 5
(c3, p1, s1) = 1 (c5, p1, s1) = 3
SELECT SUM(qty) FROM Sale;
Further project points onto the origin
(ALL, p2, ALL)
(ALL, p1, ALL)
(ALL, ALL, ALL) = 11
11
CUBE operator
Sale ( CID , PID , SID , qty )
Proposed SQL extension:
SELECT SUM(qty) FROM Sale
GROUP BY CUBE CID, PID, SID;
Output contains:
Normal groups produced by GROUP BY
- (c1, p1, s1, sum), (c1, p2, s3, sum), etc.
Groups with one or more ALL’s
- (ALL, p1, s1, sum), (c2, ALL, ALL, sum), (ALL, ALL, ALL, sum), etc.
Can you write a CUBE query using only GROUP BY’s?
Gray et al., “Data Cube: A Relational Aggregation Operator
Generalizing Group-By, Cross-Tab, and Sub-Total.” ICDE 1996
12
Automatic summary tables
Computing GROUP BY and CUBE aggregates is
expensive
OLAP queries perform these operations over and
over again
) Idea: precompute and store the aggregates as
automatic summary tables (a DB2 term)
Maintained automatically as base data changes
Same as materialized views
The Apriori algorithm
Multiple passes over the transactions
Pass k finds all frequent k -itemsets (itemset of size k )
Use the set of frequent k -itemsets found in pass k to
construct candidate ( k +1)-itemsets to be counted in
pass ( k +1)
A ( k +1)-itemset is a candidate only if all its subsets of
size k are frequent
Example: pass 1
TID items T001 A, B, E T002 B, D T003 B, C T004 A, B, D T005 A, C T006 B, C T007 A, C T008 A, B, C, E T009 A, B, C T010 F
Transactions
s min % = 20%
itemset count {A} 6 {B} 7 {C} 6 {D} 2 {E} 2
Frequent 1-itemsets
(Itemset {F} is infrequent)
21
Example: pass 2
TID items T001 A, B, E T002 B, D T003 B, C T004 A, B, D T005 A, C T006 B, C T007 A, C T008 A, B, C, E T009 A, B, C T010 F
Transactions
s min % = 20%
itemset count {A} 6 {B} 7 {C} 6 {D} 2 {E} 2
Frequent
1-itemsets
Candidate
2-itemsets
itemset {A,B} {A,C} {A,D} {A,E} {B,C} {B,D} {B,E} {C,D} {C,E} {D,E}
Generate
candidates
itemset count {A,B} 4 {A,C} 4 {A,D} 1 {A,E} 2 {B,C} 4 {B,D} 2 {B,E} 2 {C,D} 0 {C,E} 1 {D,E} 0
Scan and
count
itemset count {A,B} 4 {A,C} 4 {A,E} 2 {B,C} 4 {B,D} 2 {B,E} 2
Frequent
2-itemsets
Check
min. support
22
Example: pass 3
TID items T001 A, B, E T002 B, D T003 B, C T004 A, B, D T005 A, C T006 B, C T007 A, C T008 A, B, C, E T009 A, B, C T010 F
Transactions
s min % = 20%
itemset count {A,B} 4 {A,C} 4 {A,E} 2 {B,C} 4 {B,D} 2 {B,E} 2
Frequent
2-itemsets
itemset {A,B,C} {A,B,E}
Candidate
3-itemsets
Generate
candidates
itemset count {A,B,C} 2 {A,B,E} 2
Scan and
count
Check
min. support
itemset count {A,B,C} 2 {A,B,E} 2
Frequent
3-itemsets
23
Example: pass 4
TID items T001 A, B, E T002 B, D T003 B, C T004 A, B, D T005 A, C T006 B, C T007 A, C T008 A, B, C, E T009 A, B, C T010 F
Transactions
s min % = 20%
itemset count {A,B,C} 2 {A,B,E} 2
Frequent
3-itemsets
Candidate
4-itemsets
itemset count
Generate
candidates
No more itemsets to count!
24
Example: final answer
itemset count {A} 6 {B} 7 {C} 6 {D} 2 {E} 2
Frequent
1-itemsets
itemset count {A,B} 4 {A,C} 4 {A,E} 2 {B,C} 4 {B,D} 2 {B,E} 2
Frequent
2-itemsets
itemset count {A,B,C} 2 {A,B,E} 2
Frequent
3-itemsets