Introduction to Database Systems-Lecture 13 Slides-Computer Science, Slides of Introduction to Database Management Systems

Data Warehousing

Typology: Slides

2011/2012

Uploaded on 01/29/2012

arold
arold 🇺🇸

4.7

(24)

372 documents

1 / 4

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
1
Data Warehousing & Mining
CPS 116
Introduction to Database Systems
2
Data integration
Data resides in many distributed, heterogeneous
OLTP (On-Line Transaction Processing) sources
Sales, inventory, customer, …
NC branch, NY branch, CA branch, …
Need to support OLAP (On-Line Analytical
Processing) over an integrated view of the data
Possible approaches to integration
Eager: integrate in advance and store the integrated data
at a central repository called the data warehouse
Lazy: integrate on demand; process queries over
distributed sources—mediated or federated systems
3
OLTP versus OLAP
OLTP
Mostly updates
Short, simple transactions
Clerical users
Goal: ACID, transaction
throughput
OLAP
Mostly reads
Long, complex queries
Analysts, decision makers
Goal: fast queries
Implications on database design and optimization?
OLAP databases do not care much about redundancy
“Denormalize” tables
Many, many indexes
Precomputed query results
4
Eager versus lazy integration
Eager (warehousing)
In advance: before queries
Copy data from sources
Lazy
On demand: at query time
Leave data at sources
)Answer could be stale
)Need to maintain
consistency
)Query processing is local to
the warehouse
Faster
Can operate when sources
are unavailable
)Answer is more up-to-date
)No need to maintain
consistency
)Sources participate in
query processing
Slower
Interferes with local
processing
5
Maintaining a data warehouse
The “ETL” process
Extraction: extract relevant data and/or changes from source s
Transformation: transform data to match the warehouse schema
Loading: integrate data/changes into the warehouse
Approaches
Recomputation
Easy to implement; just take periodic dumps of the sources, say, every night
What if there is no “night,” e.g., a global organization?
What if recomputation takes more than a day?
Incremental maintenance
Compute and apply only incremental changes; fast if changes are small
Not easy to do for complicated transformations
Need to detect incremental changes at the sources
6
“Star” schema of a data warehouse
Big
Constantly growing
Stores measures (often
aggregated in queries)
OID date CID PID SID qty price
100 11/23/2001 c3 p1 s1 1 12
102 12/12/2001 c3 p2 s1 2 17
105 12/24/2001 c5 p1 s3 5 13
... ... ... ... ... ... ...
PID name cost
p1 beer 10
p2 diaper 16
... ... ...
SID city
s1 Durham
s2 Chapel Hill
s3 RTP
... ...
CID name address city
c3 Amy 100 Main St. Durham
c4 Ben 102 Main St. Durham
c5 Coy 800 Eighth St. Durham
... ... ... ...
Dimension table Dimension table
Dimension table
Fact table
Small
Updated infrequently
Product Store
Sale
Customer
pf3
pf4

Partial preview of the text

Download Introduction to Database Systems-Lecture 13 Slides-Computer Science and more Slides Introduction to Database Management Systems in PDF only on Docsity!

Data Warehousing & Mining

CPS 116

Introduction to Database Systems

Data integration

™ Data resides in many distributed, heterogeneous

OLTP (On-Line Transaction Processing) sources

ƒ Sales, inventory, customer, …

ƒ NC branch, NY branch, CA branch, …

™ Need to support OLAP (On-Line Analytical

Processing) over an integrated view of the data

™ Possible approaches to integration

ƒ Eager: integrate in advance and store the integrated data

at a central repository called the data warehouse

ƒ Lazy: integrate on demand; process queries over

distributed sources—mediated or federated systems

3

OLTP versus OLAP

OLTP

™ Mostly updates

™ Short, simple transactions

™ Clerical users

™ Goal: ACID, transaction

throughput

OLAP

™ Mostly reads

™ Long, complex queries

™ Analysts, decision makers

™ Goal: fast queries

Implications on database design and optimization?

OLAP databases do not care much about redundancy

ƒ “Denormalize” tables

ƒ Many, many indexes

ƒ Precomputed query results

4

Eager versus lazy integration

Eager (warehousing)

™ In advance: before queries

™ Copy data from sources

Lazy

™ On demand: at query time

™ Leave data at sources

) Answer could be stale

) Need to maintain

consistency

) Query processing is local to

the warehouse

ƒ Faster

ƒ Can operate when sources

are unavailable

) Answer is more up-to-date

) No need to maintain

consistency

) Sources participate in

query processing

ƒ Slower

ƒ Interferes with local

processing

5

Maintaining a data warehouse

™ The “ETL” process

ƒ Extraction: extract relevant data and/or changes from sources

ƒ Transformation: transform data to match the warehouse schema

ƒ Loading: integrate data/changes into the warehouse

™ Approaches

ƒ Recomputation

  • Easy to implement; just take periodic dumps of the sources, say, every night
  • What if there is no “night,” e.g., a global organization?
  • What if recomputation takes more than a day?

ƒ Incremental maintenance

  • Compute and apply only incremental changes; fast if changes are small
  • Not easy to do for complicated transformations
  • Need to detect incremental changes at the sources

6

“Star” schema of a data warehouse

™ Big

™ Constantly growing

™ Stores measures (often

aggregated in queries)

OID date CID PID SID qty price 100 11/23/2001 c3 p1 s1 1 12 102 12/12/2001 c3 p2 s1 2 17 105 12/24/2001 c5 p1 s3 5 13 ... ... ... ... ... ... ...

PID name cost p1 beer 10 p2 diaper 16 ... ... ...

SID city s1 Durham s2 Chapel Hill s3 RTP ... ...

CID name address city c3 Amy 100 Main St. Durham c4 Ben 102 Main St. Durham c5 Coy 800 Eighth St. Durham ... ... ... ...

Dimension table

Dimension table

Dimension table

Fact table

™ Small

™ Updated infrequently

Product

Store

Sale

Customer

Data cube

Customer

Store

Product

ALL

p

p

s

s

s

c3 c4 c

(c3, p2, s1) = 2

(c5, p1, s3) = 5

Simplified schema: Sale ( CID , PID , SID , qty )

(c3, p1, s1) = 1 (c5, p1, s1) = 3

Customer

Store

Product

Completing the cube—plane

(ALL, p1, s3) = 5

(ALL, p2, s1) = 2

(ALL, p1, s1) = 4

Total quantity of sales for each product in each store

ALL

p

p

s

s

s

c3 c4 c

(c3, p2, s1) = 2

(c5, p1, s3) = 5

(c3, p1, s1) = 1 (c5, p1, s1) = 3

SELECT PID, SID, SUM(qty) FROM Sale

GROUP BY PID, SID;

Project all points onto Product - Store plane

9

Completing the cube—axis

(ALL, p2, ALL)

(ALL, p1, ALL)

(ALL, p1, s3) = 5

(ALL, p2, s1) = 2

(ALL, p1, s1) = 4

Total quantity of sales for each product

ALL

p

p

s

s

s

c3 c4 c

(c3, p2, s1) = 2

(c5, p1, s3) = 5

(c3, p1, s1) = 1 (c5, p1, s1) = 3

SELECT PID, SUM(qty) FROM Sale GROUP BY PID;

Further project points onto Product axis

Customer

Store

Product

10

Customer

Store

Product

Completing the cube—origin

(ALL, p1, s3) = 5

(ALL, p2, s1) = 2

(ALL, p1, s1) = 4

Total quantity of sales

ALL

p

p

s

s

s

c3 c4 c

(c3, p2, s1) = 2

(c5, p1, s3) = 5

(c3, p1, s1) = 1 (c5, p1, s1) = 3

SELECT SUM(qty) FROM Sale;

Further project points onto the origin

(ALL, p2, ALL)

(ALL, p1, ALL)

(ALL, ALL, ALL) = 11

11

CUBE operator

™ Sale ( CID , PID , SID , qty )

™ Proposed SQL extension:

SELECT SUM(qty) FROM Sale

GROUP BY CUBE CID, PID, SID;

™ Output contains:

ƒ Normal groups produced by GROUP BY

  • (c1, p1, s1, sum), (c1, p2, s3, sum), etc.

ƒ Groups with one or more ALL’s

  • (ALL, p1, s1, sum), (c2, ALL, ALL, sum), (ALL, ALL, ALL, sum), etc.

™ Can you write a CUBE query using only GROUP BY’s?

Gray et al., “Data Cube: A Relational Aggregation Operator

Generalizing Group-By, Cross-Tab, and Sub-Total.” ICDE 1996

12

Automatic summary tables

™ Computing GROUP BY and CUBE aggregates is

expensive

™ OLAP queries perform these operations over and

over again

) Idea: precompute and store the aggregates as

automatic summary tables (a DB2 term)

ƒ Maintained automatically as base data changes

ƒ Same as materialized views

The Apriori algorithm

Multiple passes over the transactions

™ Pass k finds all frequent k -itemsets (itemset of size k )

™ Use the set of frequent k -itemsets found in pass k to

construct candidate ( k +1)-itemsets to be counted in

pass ( k +1)

ƒ A ( k +1)-itemset is a candidate only if all its subsets of

size k are frequent

Example: pass 1

TID items T001 A, B, E T002 B, D T003 B, C T004 A, B, D T005 A, C T006 B, C T007 A, C T008 A, B, C, E T009 A, B, C T010 F

Transactions

s min % = 20%

itemset count {A} 6 {B} 7 {C} 6 {D} 2 {E} 2

Frequent 1-itemsets

(Itemset {F} is infrequent)

21

Example: pass 2

TID items T001 A, B, E T002 B, D T003 B, C T004 A, B, D T005 A, C T006 B, C T007 A, C T008 A, B, C, E T009 A, B, C T010 F

Transactions

s min % = 20%

itemset count {A} 6 {B} 7 {C} 6 {D} 2 {E} 2

Frequent

1-itemsets

Candidate

2-itemsets

itemset {A,B} {A,C} {A,D} {A,E} {B,C} {B,D} {B,E} {C,D} {C,E} {D,E}

Generate

candidates

itemset count {A,B} 4 {A,C} 4 {A,D} 1 {A,E} 2 {B,C} 4 {B,D} 2 {B,E} 2 {C,D} 0 {C,E} 1 {D,E} 0

Scan and

count

itemset count {A,B} 4 {A,C} 4 {A,E} 2 {B,C} 4 {B,D} 2 {B,E} 2

Frequent

2-itemsets

Check

min. support

22

Example: pass 3

TID items T001 A, B, E T002 B, D T003 B, C T004 A, B, D T005 A, C T006 B, C T007 A, C T008 A, B, C, E T009 A, B, C T010 F

Transactions

s min % = 20%

itemset count {A,B} 4 {A,C} 4 {A,E} 2 {B,C} 4 {B,D} 2 {B,E} 2

Frequent

2-itemsets

itemset {A,B,C} {A,B,E}

Candidate

3-itemsets

Generate

candidates

itemset count {A,B,C} 2 {A,B,E} 2

Scan and

count

Check

min. support

itemset count {A,B,C} 2 {A,B,E} 2

Frequent

3-itemsets

23

Example: pass 4

TID items T001 A, B, E T002 B, D T003 B, C T004 A, B, D T005 A, C T006 B, C T007 A, C T008 A, B, C, E T009 A, B, C T010 F

Transactions

s min % = 20%

itemset count {A,B,C} 2 {A,B,E} 2

Frequent

3-itemsets

Candidate

4-itemsets

itemset count

Generate

candidates

No more itemsets to count!

24

Example: final answer

itemset count {A} 6 {B} 7 {C} 6 {D} 2 {E} 2

Frequent

1-itemsets

itemset count {A,B} 4 {A,C} 4 {A,E} 2 {B,C} 4 {B,D} 2 {B,E} 2

Frequent

2-itemsets

itemset count {A,B,C} 2 {A,B,E} 2

Frequent

3-itemsets