Introduction to Database Systems-Lecture 13 Slides-Computer Science | Slides Introduction to Database Management Systems

Data Warehousing & Mining

CPS 116

Introduction to Database Systems

Data integration

Data resides in many distributed, heterogeneous

OLTP (On-Line Transaction Processing) sources

Sales, inventory, customer, …

NC branch, NY branch, CA branch, …

Need to support OLAP (On-Line Analytical

Processing) over an integrated view of the data

Possible approaches to integration

Eager: integrate in advance and store the integrated data

at a central repository called the data warehouse

Lazy: integrate on demand; process queries over

distributed sources—mediated or federated systems

OLTP versus OLAP

OLTP

Mostly updates

Short, simple transactions

Clerical users

Goal: ACID, transaction

throughput

OLAP

Mostly reads

Long, complex queries

Analysts, decision makers

Goal: fast queries

Implications on database design and optimization?

OLAP databases do not care much about redundancy

“Denormalize” tables

Many, many indexes

Precomputed query results

Eager versus lazy integration

Eager (warehousing)

In advance: before queries

Copy data from sources

Lazy

On demand: at query time

Leave data at sources

)Answer could be stale

)Need to maintain

consistency

)Query processing is local to

the warehouse

Faster

Can operate when sources

are unavailable

)Answer is more up-to-date

)No need to maintain

consistency

)Sources participate in

query processing

Slower

Interferes with local

processing

Maintaining a data warehouse

The “ETL” process

Extraction: extract relevant data and/or changes from source s

Transformation: transform data to match the warehouse schema

Loading: integrate data/changes into the warehouse

Approaches

Recomputation

• Easy to implement; just take periodic dumps of the sources, say, every night

• What if there is no “night,” e.g., a global organization?

• What if recomputation takes more than a day?

Incremental maintenance

• Compute and apply only incremental changes; fast if changes are small

• Not easy to do for complicated transformations

• Need to detect incremental changes at the sources

“Star” schema of a data warehouse

Big

Constantly growing

Stores measures (often

aggregated in queries)

OID date CID PID SID qty price

100 11/23/2001 c3 p1 s1 1 12

102 12/12/2001 c3 p2 s1 2 17

105 12/24/2001 c5 p1 s3 5 13

... ... ... ... ... ... ...

PID name cost

p1 beer 10

p2 diaper 16

... ... ...

SID city

s1 Durham

s2 Chapel Hill

s3 RTP

... ...

CID name address city

c3 Amy 100 Main St. Durham

c4 Ben 102 Main St. Durham

c5 Coy 800 Eighth St. Durham

... ... ... ...

Dimension table Dimension table

Dimension table

Fact table

Small

Updated infrequently

Product Store

Sale

Customer

Partial preview of the text

Download Introduction to Database Systems-Lecture 13 Slides-Computer Science and more Slides Introduction to Database Management Systems in PDF only on Docsity!

Data Warehousing & Mining

CPS 116

Introduction to Database Systems

Data integration

Data resides in many distributed, heterogeneous

OLTP (On-Line Transaction Processing) sources

Sales, inventory, customer, …

NC branch, NY branch, CA branch, …

Need to support OLAP (On-Line Analytical

Processing) over an integrated view of the data

Possible approaches to integration

Eager: integrate in advance and store the integrated data

at a central repository called the data warehouse

Lazy: integrate on demand; process queries over

distributed sources—mediated or federated systems

OLTP versus OLAP

OLTP

Mostly updates

Short, simple transactions

Clerical users

Goal: ACID, transaction

throughput

OLAP

Mostly reads

Long, complex queries

Analysts, decision makers

Goal: fast queries

Implications on database design and optimization?

OLAP databases do not care much about redundancy

“Denormalize” tables

Many, many indexes

Precomputed query results

Eager versus lazy integration

Eager (warehousing)

In advance: before queries

Copy data from sources

Lazy

On demand: at query time

Leave data at sources

) Answer could be stale

) Need to maintain

consistency

) Query processing is local to

the warehouse

Faster

Can operate when sources

are unavailable

) Answer is more up-to-date

) No need to maintain

consistency

) Sources participate in

query processing

Slower

Interferes with local

processing

Maintaining a data warehouse

The “ETL” process

Extraction: extract relevant data and/or changes from sources

Transformation: transform data to match the warehouse schema

Loading: integrate data/changes into the warehouse

Approaches

Recomputation

Easy to implement; just take periodic dumps of the sources, say, every night
What if there is no “night,” e.g., a global organization?
What if recomputation takes more than a day?

Incremental maintenance

Compute and apply only incremental changes; fast if changes are small
Not easy to do for complicated transformations
Need to detect incremental changes at the sources

“Star” schema of a data warehouse

Big

Constantly growing

Stores measures (often

aggregated in queries)

OID date CID PID SID qty price 100 11/23/2001 c3 p1 s1 1 12 102 12/12/2001 c3 p2 s1 2 17 105 12/24/2001 c5 p1 s3 5 13 ... ... ... ... ... ... ...

PID name cost p1 beer 10 p2 diaper 16 ... ... ...

SID city s1 Durham s2 Chapel Hill s3 RTP ... ...

CID name address city c3 Amy 100 Main St. Durham c4 Ben 102 Main St. Durham c5 Coy 800 Eighth St. Durham ... ... ... ...

Dimension table

Fact table

Small

Updated infrequently

Product

Store

Sale

Customer

Data cube

Customer

Store

Product

ALL

p

s

c3 c4 c

(c3, p2, s1) = 2

(c5, p1, s3) = 5

Simplified schema: Sale ( CID , PID , SID , qty )

(c3, p1, s1) = 1 (c5, p1, s1) = 3

Customer

Store

Product

Completing the cube—plane

(ALL, p1, s3) = 5

(ALL, p2, s1) = 2

(ALL, p1, s1) = 4

Total quantity of sales for each product in each store

ALL

p

s

c3 c4 c

(c3, p2, s1) = 2

(c5, p1, s3) = 5

(c3, p1, s1) = 1 (c5, p1, s1) = 3

SELECT PID, SID, SUM(qty) FROM Sale

GROUP BY PID, SID;

Project all points onto Product - Store plane

Completing the cube—axis

(ALL, p2, ALL)

(ALL, p1, ALL)

(ALL, p1, s3) = 5

(ALL, p2, s1) = 2

(ALL, p1, s1) = 4

Total quantity of sales for each product

ALL

p

s

c3 c4 c

(c3, p2, s1) = 2

(c5, p1, s3) = 5

(c3, p1, s1) = 1 (c5, p1, s1) = 3

SELECT PID, SUM(qty) FROM Sale GROUP BY PID;

Further project points onto Product axis

Customer

Store

Product

Customer

Store

Product

Completing the cube—origin

(ALL, p1, s3) = 5

(ALL, p2, s1) = 2

(ALL, p1, s1) = 4

Total quantity of sales

ALL

p

s

c3 c4 c

(c3, p2, s1) = 2

(c5, p1, s3) = 5

(c3, p1, s1) = 1 (c5, p1, s1) = 3

SELECT SUM(qty) FROM Sale;

Further project points onto the origin

(ALL, p2, ALL)

(ALL, p1, ALL)

(ALL, ALL, ALL) = 11

CUBE operator

Sale ( CID , PID , SID , qty )

Proposed SQL extension:

SELECT SUM(qty) FROM Sale

GROUP BY CUBE CID, PID, SID;

Output contains:

Normal groups produced by GROUP BY

(c1, p1, s1, sum), (c1, p2, s3, sum), etc.

Groups with one or more ALL’s

(ALL, p1, s1, sum), (c2, ALL, ALL, sum), (ALL, ALL, ALL, sum), etc.

Can you write a CUBE query using only GROUP BY’s?

Gray et al., “Data Cube: A Relational Aggregation Operator

Generalizing Group-By, Cross-Tab, and Sub-Total.” ICDE 1996

Automatic summary tables

Computing GROUP BY and CUBE aggregates is

expensive

OLAP queries perform these operations over and

over again

) Idea: precompute and store the aggregates as

automatic summary tables (a DB2 term)

Maintained automatically as base data changes

Same as materialized views

The Apriori algorithm

Multiple passes over the transactions

Pass k finds all frequent k -itemsets (itemset of size k )

Use the set of frequent k -itemsets found in pass k to

construct candidate ( k +1)-itemsets to be counted in

pass ( k +1)

A ( k +1)-itemset is a candidate only if all its subsets of

size k are frequent

Example: pass 1

TID items T001 A, B, E T002 B, D T003 B, C T004 A, B, D T005 A, C T006 B, C T007 A, C T008 A, B, C, E T009 A, B, C T010 F

Transactions

s min % = 20%

itemset count {A} 6 {B} 7 {C} 6 {D} 2 {E} 2

Frequent 1-itemsets

(Itemset {F} is infrequent)

Example: pass 2

TID items T001 A, B, E T002 B, D T003 B, C T004 A, B, D T005 A, C T006 B, C T007 A, C T008 A, B, C, E T009 A, B, C T010 F

Transactions

s min % = 20%

itemset count {A} 6 {B} 7 {C} 6 {D} 2 {E} 2

Frequent

1-itemsets

Candidate

2-itemsets

itemset {A,B} {A,C} {A,D} {A,E} {B,C} {B,D} {B,E} {C,D} {C,E} {D,E}

Generate

candidates

itemset count {A,B} 4 {A,C} 4 {A,D} 1 {A,E} 2 {B,C} 4 {B,D} 2 {B,E} 2 {C,D} 0 {C,E} 1 {D,E} 0

Scan and

count

itemset count {A,B} 4 {A,C} 4 {A,E} 2 {B,C} 4 {B,D} 2 {B,E} 2

Frequent

2-itemsets

Check

min. support

Example: pass 3

TID items T001 A, B, E T002 B, D T003 B, C T004 A, B, D T005 A, C T006 B, C T007 A, C T008 A, B, C, E T009 A, B, C T010 F

Transactions

s min % = 20%

itemset count {A,B} 4 {A,C} 4 {A,E} 2 {B,C} 4 {B,D} 2 {B,E} 2

Frequent

2-itemsets

itemset {A,B,C} {A,B,E}

Candidate

3-itemsets

Generate

candidates

itemset count {A,B,C} 2 {A,B,E} 2

Scan and

count

Check

min. support

itemset count {A,B,C} 2 {A,B,E} 2

Frequent

3-itemsets

Example: pass 4

TID items T001 A, B, E T002 B, D T003 B, C T004 A, B, D T005 A, C T006 B, C T007 A, C T008 A, B, C, E T009 A, B, C T010 F

Introduction to Database Systems-Lecture 13 Slides-Computer Science, Slides of Introduction to Database Management Systems

Related documents

Partial preview of the text

Download Introduction to Database Systems-Lecture 13 Slides-Computer Science and more Slides Introduction to Database Management Systems in PDF only on Docsity!

Data Warehousing & Mining

CPS 116

Introduction to Database Systems

Data integration

 Data resides in many distributed, heterogeneous

OLTP (On-Line Transaction Processing) sources

 Sales, inventory, customer, …

 NC branch, NY branch, CA branch, …

 Need to support OLAP (On-Line Analytical

Processing) over an integrated view of the data

 Possible approaches to integration

 Eager: integrate in advance and store the integrated data

at a central repository called the data warehouse

 Lazy: integrate on demand; process queries over

distributed sources—mediated or federated systems

OLTP versus OLAP

OLTP

 Mostly updates

 Short, simple transactions

 Clerical users

 Goal: ACID, transaction

throughput

OLAP

 Mostly reads

 Long, complex queries

 Analysts, decision makers

 Goal: fast queries

Implications on database design and optimization?

OLAP databases do not care much about redundancy

 “Denormalize” tables

 Many, many indexes

 Precomputed query results

Eager versus lazy integration

Eager (warehousing)

 In advance: before queries

 Copy data from sources

Lazy

 On demand: at query time

 Leave data at sources

) Answer could be stale

) Need to maintain

consistency

) Query processing is local to

the warehouse

 Faster

 Can operate when sources

are unavailable

) Answer is more up-to-date

) No need to maintain

consistency

) Sources participate in

query processing

 Slower

 Interferes with local

processing

Maintaining a data warehouse

 The “ETL” process

 Extraction: extract relevant data and/or changes from sources

 Transformation: transform data to match the warehouse schema

 Loading: integrate data/changes into the warehouse

 Approaches

 Recomputation

 Incremental maintenance

“Star” schema of a data warehouse

 Big

 Constantly growing

 Stores measures (often

aggregated in queries)

Dimension table

Dimension table

Dimension table

Fact table

 Small

 Updated infrequently

Product

Store

Data resides in many distributed, heterogeneous

Sales, inventory, customer, …

NC branch, NY branch, CA branch, …

Need to support OLAP (On-Line Analytical

Possible approaches to integration

Eager: integrate in advance and store the integrated data

Lazy: integrate on demand; process queries over

Mostly updates

Short, simple transactions

Clerical users

Goal: ACID, transaction

Mostly reads

Long, complex queries

Analysts, decision makers

Goal: fast queries

“Denormalize” tables

Many, many indexes

Precomputed query results

In advance: before queries

Copy data from sources

On demand: at query time

Leave data at sources

Faster

Can operate when sources

Slower

Interferes with local

The “ETL” process

Extraction: extract relevant data and/or changes from sources

Transformation: transform data to match the warehouse schema

Loading: integrate data/changes into the warehouse

Approaches

Recomputation

Incremental maintenance

Big

Constantly growing

Stores measures (often

Small

Updated infrequently