Dimensional Modeling and Data Warehousing, Study Guides, Projects, Research of Data Warehousing

Uploading Data warehousing topics in this docsity

Typology: Study Guides, Projects, Research

2017/2018

Uploaded on 01/14/2018

abdul-baseer
abdul-baseer 🇦🇫

4 documents

1 / 18

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
1
Lecture 2 and 3 - Dimensional Modelling
Reading Directions
L2 [K&R] chapters 2-8
L3 [K&R] chapters 9-13, 15
Keywords
facts, attributes, dimensions, granularity, dimensional modeling,
time, semi-additive facts, dense fact tables, sparsity, skinny
fact tables, keys, slowly changing dimension, rapidly changing
dimensions, large dimensions, demographic minidimension,
degenerate dimension, junk dimension, heterogeneous products,
many-to-many relationships, factless fact table, bridge table,
family of stars, stove pipe problem, data warehouse bus, value
chains, the design process, aggregates, sparcity failure,
aggregation navigator, bitmap indexing, extended SQL, ROLAP
and MOLAP servers
Some basic concepts
•Fact
“something not known in advance”,
–an observation
many facts (but not all) have numerical, continuously
values
e.g., the price of a product, quantity
Attribute
“describe a characteristic of a tangible thing”
–“we do not measure them, we usually know them”
–usually text fields, with discrete values
e.g., the flavour of a product, the size of a product
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12

Partial preview of the text

Download Dimensional Modeling and Data Warehousing and more Study Guides, Projects, Research Data Warehousing in PDF only on Docsity!

Lecture 2 and 3 - Dimensional Modelling

Reading Directions

L2 [K&R] chapters 2-

L3 [K&R] chapters 9-13, 15

Keywords

facts, attributes, dimensions, granularity, dimensional modeling,

time, semi-additive facts, dense fact tables, sparsity, skinny

fact tables, keys, slowly changing dimension, rapidly changing

dimensions, large dimensions, demographic minidimension,

degenerate dimension, junk dimension, heterogeneous products,

many-to-many relationships, factless fact table, bridge table,

family of stars, stove pipe problem, data warehouse bus, value

chains, the design process, aggregates, sparcity failure,

aggregation navigator, bitmap indexing, extended SQL, ROLAP

and MOLAP servers

Some basic concepts

• Fact

– “something not known in advance”,

– an observation

– many facts (but not all) have numerical, continuously

values

e.g., the price of a product, quantity

• Attribute

– “describe a characteristic of a tangible thing”

– “we do not measure them, we usually know them”

– usually text fields, with discrete values

e.g., the flavour of a product, the size of a product

Some basic concepts 2

• Dimension

  • a business perspective from which data is looked upon

– “a collection of text like attributes that are highly correlated”

e.g. Product, Store, Time

• Granularity

  • the level of detail of data contained in the data warehouse

e.g. Daily item totals by product, by store

Example of a Dimensional Model

Time Dimension time_key day_of_week month quarter year holiday_flag

Product Dimension product_key description brand category

Sales Fact time_key product_key store_key dollars_sold units_sold dollars_cost

Store Dimension store_key store_name address floor_plan_type

The Concept of Hierarchy

all

Europe North_America

Germany Spain Canada Mexico

Vancouver

L. Chan M. Wind

all

region

office

country

city Frankfurt Toronto

Multidimensional Data

• Sales volume as a function of product, month,

and region

Product

Store

Time

Dimensions: Product, Location, Time

Hierarchical summarization paths

Industry Region Year

Category Country Quarter

Product City Month Week

Office Day

A Sample Data Cube

Time

Product

Country

sum

TV Sum Year/Country

VCR

PC

1Qtr 2Qtr^ 3Qtr 4Qtr

U.S.A

Canada

Mexico

sum

Facts

• (Perfectly) Additive

– a fact is additive if it make sense to add it across all

the dimensions

e.g., discrete numerical measures of activity, i.e., quantity

sold, dollars soled

• Semiadditive

– a fact is semiadditive if it make sense to add it along

some of the dimensions only

e.g., numerical measures of intensity, i.e., account balance,

inventory level

• Non-additive

– facts that can not be added at all

e.g., measurement of room temperature

Numerical Measures of Intensity

• All measures that record a static level, such as account

balance and inventory level, are non-additive across

time.

• However, these measures may be usefully aggregated

across time by averaging over the number of time

periods.

• Note that, the SQL AVG can not be used for this.

  • What is the average daily inventory of a brand in a geographic region during a given week?
  • Let the brand cluster 3 products, the region has 4 stores, and we have 7 days/week.
  • Using the SQL AVG would divide the summed value into 347=
  • The correct answer is to divide the summed inventory value by 7

Skinny fact tables

• As the fact table contains the vast volume of

records it is important that it is memory space

efficient

• Foreign keys are usually represented in integer

form and do not require much memory space

• Facts too are often numeric properties and can

usually be represented as integers (contrast to

dimensional attributes which are usually long text

strings)

• Thisspace efficiency is critical to the memory

space consumption of the data warehouse

Keys

• Choice the data warehouse keys to be meaningless

surrogate keys

– Let a surrogate key be a simple integer

– 4-byte (--------,--------,--------,--------)

can contain 2^32 values (> 2 billion positive integers,

starting with 1)

Keys

• Use surrogate keys also for the Time dimension

– SQL-based date key, is typically 8 bytes, so 4 bytes

are wasted

– bypassing joins leads to embedding knowledge of the

calendar in the application, rather than reading it

from the time dimension

– it is not possible to encode a data stamp as “I do not

know”, “It has not happen yet”, etc

• Avoid smart keys

• Avoid production keys

– production may decide to reuse keys

– the company may acquire a competitor and thereby

change the key building rules

– changed record, but deliberately not changed key

Type 2

Create a new additional dimension record

  • A generalised (surrogate) key is required (which is a responsibility of the data warehouse team)

… 12334001 Mary Jones single … 12334002 Mary Jones married …

Fact table Dimension table

Type 3

Create a new field in the dimension record

Nr

First Name

Family Name

Original / Previous Marrital Status

Current Marrital Status

Effective Date 12334 Mary Jones single married 15/6 1987

Rapidly Changing Dimensions

From the previous slides: What is slow?

What if the changes are fast?

Must a different design technique be used?

• Small dimensions:

– the same technologies as for slowly changing

dimensions may be applied

• Large dimensions:

– the choice of indexing techniques and data design

approaches are important

– find suppress duplicate entries in the dimension

– do not create additional records to handle the

slowly changing dimension problem

Rapidly changing very large dimensions

• Break off some of the attributes into their own

separate dimension(s), a demographic dimension(s).

– force the attributes selected to the demographic

dimension to have relatively small number of discrete

values

– build upp the demographic dimension with all possible

discrete attributes combinations

– construct a surrogate demographic key for this

dimension

NB! The demographic attributes are the one of the

heavily used attributes. Their values are often

compared in order to identify interesting subsets.

Two Demographic Minidimensions

Customer dim customer_key relatively constant attributes …

Demographic dim demog_key demographic attributes ...

Purchase-Credit demographic dim purch_cred_dem_key purchase and credit attributes ...

Fact table … customer_key demog_key purc_cred_dem_key …

Demographic Minidimension

• Advantages

– frequent ‘snapshoting’ of customers profiles with no

increase in data storage or data complexity

• Drawbacks

– the demographic attributes are clumped into banded

ranges of discrete values (it is impractical to change

the set of value bands at a later time)

– the demographic dimension itself can not be allowed

to grow too large

– slower down the browsing

• What if the fact table (connecting the demographic

minidimension with the customer dimension) is

sparse?

Demographic Minidimension

• What to do if the fact table (connecting the

demographic minidimension with the customer

dimension) is sparse?

– Define a demographic transaction event, i.e.,

introduce a new fact table

or

– Add a current demographic key to the customer

dimension table

Degenerate Dimension

• A degenerate dimension is represented by a

dimension key attribute(s) with no corresponding

dimension table

• Occurs usually in line-item oriented fact table

design

Fact Table order_date product_key store_key … PO_number PO_line_nr

Time Dimension (^) Store Dimension

Product Dimension

Heterogeneous Products

Core Fact Table time_key account_key household_key balance checing facts … saving facts … credit card facts… safe deposit facts …

Time Dim Core Account Dim account_key type category checking attr … saving attr … credit card attr … safe deposit attr…

Household Dim

Heterogeneous Products

Custom Saving Dim account_key type category saving attr …

Cusom Checking Dim account_key type category checking attr …

Core Fact Table time_key account_key household_key balance

Custom Checking Fact time_key account_key household_key balance checking facts …

Custom Saving Fact time_key account_key household_key balance saving facts …

Core Account Dim account_key type category

Orders

Production

Dimensions Time Sales Rep Customer Promotion Product Plant Distr. Center

The Data Warehouse Bus

Dimensional modelling vs. ER-modelling

Entity-relationship modelling

  • a logical design technique to eliminate data redundancy to keep consistency and storage efficiency
  • makes transaction simple and deterministic
  • ER models for enterprise are usually complex, e.g. they often have hundreds, or even thousands, of entities/tables

Dimensional modelling

  • a logical design technique that present data in a intuitive way and that allow high-performance access
  • aims at model decision support data
  • easier to navigate for the user and high performance