Data Warehousing Lecture: Evolution, Architecture, and Benefits, Study notes of Database Management Systems (DBMS)

A lecture script on data warehousing (dw). It covers the evolution and concepts of dw, its benefits, differences between oltp and olap, dw system architecture, and data mart architecture. The lecture also discusses the importance of data cleaning, migration, and auditing in dw operations.

Typology: Study notes

2010/2011

Uploaded on 09/08/2011

rossi46
rossi46 🇬🇧

4.5

(10)

313 documents

1 / 11

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
DBDI 30/05/2007
Lecture-18 / Data Warehousing 1
DBDI/ Lecture 18
Data Warehousing
Dr. Ala Al-Zobaidie
The slides are mainly based on the textbook Database Systems by Thomas Connolly &
Carolyn Begg and others as cited
30/05/2007 DBDI / DW 2
Lecture’s Objectives /1
An overview of Data Warehousing (DW) – evolution &
concepts
Definitions and basic features of a DW
Market Growth & Applications
Benefits of a DW
Why Separate DW
On-Line Transaction Processing (OLTP) vs. On-Line
Analytical Processing (OLAP)
DW queries
DW System architecture
Information flows in a DW
Two-/Three-Tier DW Architecture
DW Data Marts Architecture
Data Marts
DW Operational Process
Data Warehouse DBMS Requirements
DW System Design
30/05/2007 DBDI / DW 3
The Evolution of Data Warehousing
Business process & support for decision
making using operational DBs.
Converting operational data into business
knowledge
Receiving data for multiple sources,
integrate them and use them for decision
support ÎDW
30/05/2007 DBDI / DW 4
What, Why & How
What?
An analytical database that is used as the foundation of a
Decision Support System (DSS).
Designed for large volumes of read-only data, providing
intuitive access to info used in making decisions.
•Why?
Make the decision making process faster & better (executive,
manager, analyst).
•How?
Provision of a centralised corporate data.
Loading operational data in consistent repeatable process.
Building an open and scaleable architecture.
Provision of tools to process the data into information.
30/05/2007 DBDI / DW 5
DW is analogous to Manufacturing Process
Profits
Sales
Expenses
1
23
4
Data
Requests
Data Transformation
Data Warehouse
Operational Systems
Capture operational
data
Transform the
operational
transaction data
Deliver business
intelligence upon
receipt of a user data
request or query
Store the transformed data
in a DW
30/05/2007 DBDI / DW 6
DW — Definition
A Decision Support DB that is maintained
separately from the organisation's
operational Databases.
•A DW is:
Subject-oriented,
Integrated,
Time-variant &
Non-volatile
collection of data that is used primarily in
organizational decision making.
pf3
pf4
pf5
pf8
pf9
pfa

Partial preview of the text

Download Data Warehousing Lecture: Evolution, Architecture, and Benefits and more Study notes Database Management Systems (DBMS) in PDF only on Docsity!

DBDI/ Lecture 18

Data Warehousing

Dr. Ala Al-Zobaidie

The slides are mainly based on the textbook Database Systems by Thomas Connolly &Carolyn Begg and others as cited

30/05/2007 DBDI / DW 2

Lecture’s Objectives /

  • An overview of Data Warehousing (DW) – evolution & concepts
  • Definitions and basic features of a DW
  • Market Growth & Applications
  • Benefits of a DW
  • Why Separate DW
  • On-Line Transaction Processing (OLTP) vs. On-Line Analytical Processing (OLAP)
  • DW queries
  • DW System architecture
  • Information flows in a DW
  • Two-/Three-Tier DW Architecture
  • DW Data Marts Architecture
  • Data Marts
  • DW Operational Process
  • Data Warehouse DBMS Requirements
  • DW System Design

30/05/2007 DBDI / DW 3

The Evolution of Data Warehousing

  • Business process & support for decision

making using operational DBs.

  • Converting operational data into business

knowledge

  • Receiving data for multiple sources,

integrate them and use them for decision

support Î DW

30/05/2007 DBDI / DW 4

What, Why & How

  • What?
    • An analytical database that is used as the foundation of a Decision Support System (DSS).
    • Designed for large volumes of read-only data, providing intuitive access to info used in making decisions.
  • Why?
    • Make the decision making process faster & better (executive, manager, analyst).
  • How?
    • Provision of a centralised corporate data.
    • Loading operational data in consistent repeatable process.
    • Building an open and scaleable architecture.
    • Provision of tools to process the data into information.

DW is analogous to Manufacturing Process

Sales Profits

Expenses

1

2 3

4

Data Requests

Data Transformation Data Warehouse

Operational Systems Capture operational data

Transform the operational transaction data

Deliver business intelligence upon receipt of a user data request or query

Store the transformed data in a DW

DW — Definition

  • A Decision Support DB that is maintained

separately from the organisation's

operational Databases.

  • A DW is:

Subject-oriented ,

Integrated ,

Time-variant &

Non-volatile

collection of data that is used primarily in

organizational decision making.

30/05/2007 DBDI / DW 7

Subject-oriented Data

  • The warehouse is organized around the

major subjects of the enterprise (e.g.

customers, products, and sales) rather than

the major application areas (e.g. customer

invoicing, stock control, and product sales).

  • This is reflected in the need to store

decision-support data rather than

application-oriented data.

30/05/2007 DBDI / DW 8

Integrated Data

  • The data warehouse integrates corporate

application-oriented data from different

source systems, which often includes data

that is inconsistent.

  • The integrated data source must be made

consistent to present a unified view of the

data to the users.

30/05/2007 DBDI / DW 9

Time-variant Data

  • Data in the warehouse is only accurate and valid at some point in time or over some time interval.
  • Time-variance is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots.
  • Warehouse data represent the time, and queries are based on time range.

30/05/2007 DBDI / DW 10

Non-volatile Data

  • Data in the warehouse is not updated in

real-time but is refreshed from operational

systems on a regular basis.

  • New data is always added as a

supplement to the database, rather than a

replacement.

Market growth

  • Prediction:
    • $2B in 1995 B $8B in 1998 [ Meta Group ]
    • $1.5B in 1996 B $6.9B in 1998 [ Gartner Group ]
    • @9% compound annual growth B $13.5 billion in 2009 [IDC]
  • Reality
    • During 1997, users spent nearly $15B on data warehousing worldwide [NewsEdge]
    • OLAP market continued to grow B 16.4% in 2006 [OLAP]

Applications

  • Manufacturing;
  • Retail;
  • Financial services;
  • Insurance;
  • Transportation;
  • Telecommunications;
  • Utilities;
  • Healthcare; etc.

30/05/2007 DBDI / DW 19

Architecture Components /

  • Operational Data Sources
    • Hierarchical, Network & relational databases
    • propriety file systems
  • Load Manager
    • Extraction & loading
    • Various data loading tools & custom-built programs
  • Warehouse Manager
    • Perform operations
    • Generate Query Profile 30/05/2007 DBDI / DW 20

Architecture Components /

  • Query Manager
    • Performs user queries
    • Interface with various tools
    • Directing & scheduling queries
    • generates query profiles
  • Detailed Data
    • Captured from DB Schema
    • Most cases data is aggregated
    • Add data to supplement the aggregated data

30/05/2007 DBDI / DW 21

Architecture Components /

  • Lightly & Highly Summarized Data
    • Stores pre-defined lightly & highly aggregated data.
    • Transient data
    • The purpose is performance.
    • updated continuously
  • Archive / Backup Data
    • archiving and backup.
    • backup online summary data
    • Permanent storage. 30/05/2007 DBDI / DW 22

Architecture Components /

Metadata

  • Types
    • End-user Metadata & Management Metadata
  • Usage
    • Extraction and loading, Warehouse management & Query management processes.
  • Structure of metadata
  • Mapping Rules for Copy management tools
  • End-user query builder tools
  • Complex operation

Architecture Components /

End-user Access Tools

  • strategic decision-making through end-user access tools.
  • DW support ad hoc and routine analysis.
  • Enhance performance
  • Groups of access tools
    • Data reporting and query tools
    • Application development tools
    • Executive information system (EIS) tools
    • Online analytical processing (OLAP) tools
    • Data mining tools

Data Warehouse Information Flows

30/05/2007 DBDI / DW 25

Two-/Three-Tier Architecture /

Data Warehouse

Legacy System Relational System

Purchased Data

Data Source (^) Data Marts Tools

Data Warehouse

Extract Transform Load Refresh

Metadata Repository

Serve

Data Mining

Query/ Reporting

OLAP Servers

Analysis

Monitoring & Administration

30/05/2007 DBDI / DW 26

Two-/Three-Tier Architecture /

Operational System Data Source

The Data Warehouse

User Query

DepartmentalData Marts

Transformation & Scrubbing of Data

Multidimensional Data Marts

Back EndBack End ProcessProcess

MidMid--TierTier ProcessProcess

Front EndFront End ProcessProcess

30/05/2007 DBDI / DW 27

Typical Data Warehouse & Data Mart

Architecture

30/05/2007 DBDI / DW 28

Data Mart

  • A subset of a DW for particular Dept or business function.
  • Characteristics include
    • Focuses on one dept or business functions requirements.
    • Summarised operational data
    • understood and navigated.
  • Users access data they need more often
  • Provides collection View
  • improve end-user response time

Reasons for Creating a Data Mart

  • Provide appropriately structured data
  • Building a data mart is simpler
  • The cost is normally less
  • users are more clearly defined

Data Marts Issues

  • Data mart functionality
  • Data mart size
  • Data mart load performance
  • Users access to data in multiple data

marts

  • Data mart Internet / Intranet access
  • Data mart administration
  • Data mart installation

30/05/2007 DBDI / DW 37

Data Warehouse DBMS Requirements

  • Load performance
  • Load processing
  • Data quality management
  • Query performance
  • Terabyte scalability
  • Mass user scalability
  • Networked data warehouse
  • Warehouse administration
  • Integrated dimensional analysis
  • Advanced query functionality

30/05/2007 DBDI / DW 38

Lecture’s Objectives /

  • Designing Data Warehouses -- Associated issues
  • Conceptual Model & OLAP
  • Dimensionality modelling (DM)
  • Schema Design
  • DM vs. ER modelling
  • Methodology for designing a DW
  • Problems of Data Warehousing

30/05/2007 DBDI / DW 39

Designing Data Warehouses

  • Which user requirements are most important?
  • Which data should be considered first?
  • Should the project be scaled down into something more manageable?
  • Should the infrastructure for a scaled down project be capable of ultimately delivering a full- scale enterprise-wide data warehouse?

30/05/2007 DBDI / DW 40

Designing Data Warehouses

  • Start with one or more data marts.
    • Simpler
    • Achievable for a specific group of users.
  • Ultimately, the DW should support the requirements at one time.
  • Enterprise-wide design must meet all user requirements

Designing Data Warehouses

  • Requirements collection & analysis

involves:

  • Interviewing various staff members to prioritized requirements
  • Identify data sources to provide clean, valid, and consistent data
  • Top-down (user requirements) vs. bottom-up view (identifying data source)

Conceptual Model for OLAP

  • Several numeric measures that are analysed
    • sales, budget, revenue, inventory
  • Dimensions
    • context in which a measure appears
    • example:
      • store, product, data information associated with a sale
    • each context is a dimension and the measure is a point in a multidimensional world
  • Nature of Analysis
    • aggregation
    • comparison
    • ranking
    • etc.

Region Month

Product

30/05/2007 DBDI / DW 43

Multidimensional Data

Geography S N^ WE

1 2 3 4 5 6 Month

Product

Juice Cola Milk Cream Soap

10 50 20 12 10

Sales volume as a function of product, time & Region Roll up: Summarisation over aggregate hierarchy E.g. total sales volume last year Æ by product category by region

Roll down: Go from higher level summary to lower level summary or detailed data E.g. total sales volume last year & by region Æ by month by product category

Slice & Dice (i.e. pivoting): refers to the ability to look at the data from different viewpoints) (Select & project) E.g. sales of Cola in the West over the last six months; E.g. top 5 products by total sales

30/05/2007 DBDI / DW 44

Dimensionality modelling

  • A logical design technique that aims to present the data in a standard, intuitive form that allows for high-performance access
  • Uses restricted Entity-Relationship modelling.
  • Every DM is composed of one fact table & a set of dimension tables.

30/05/2007 DBDI / DW 45

Schema Design: Dimension Tables

  • The data model could also be a graphical description of Facts & Dimensions tables B Star Schema

FACTS Sales Inventory Costs

Product

Time

Geography

Distribution Channel

Dimensions Year ¾ Quarter ¾ Month/week ¾ Day

Country ¾ Region ¾ City ¾ Office

30/05/2007 DBDI / DW 46

Schema Design: Star Schema

  • A single fact table, & for each dimension one single table
  • The fact table PK is a composite PK created from the dimensions PKs.
  • Does not capture hierarchies directly
  • Generated keys are used for performance & maintenancereasons
  • Star Schema or Star Join
  • Surrogate Keys vs natural keys

FACT Table ProdNo ProdName ProdDescr Category CategoryDescr UnitPrice QOH

Product OrderNo ProdNo SalespersonID CustomerNo Quantity TotalPrice

OrderNo OrderDate

Order

CustomerNo CustomerName CustomerAddre City

Customer

SalespersonID SalespersonName City Quota

Salesperson

Fact Table

  • Mostly row numeric items, relevant measures only.
  • Skinny rows, a few columns at most.
  • From millions to more than a billion rows.
  • Items are typically additive.
  • Access via dimensions.

Central & Dimensions tables

Dimension Tables

  • Define business dimensions.
  • Highly descriptive.
  • Joined through FK.
  • Occurs in certain clauses.
  • Heavily indexed.

FactsFactsFacts

Order Product

Customer

Salesperson

Region

Star schema for property sales of DreamHome

30/05/2007 DBDI / DW 55

ER model of property sales business

process of DreamHome

30/05/2007 DBDI / DW 56

Nine-Step Methodology / 3-

Step 3: Identifying & conforming dimensions

  • Check dimensions context & redundancy
  • Identify conformed dimensions

Step 4: Choosing the facts

  • Determines which facts can be used in data mart.
  • Facts should be numeric and additive.
  • Identify Unusable facts

30/05/2007 DBDI / DW 57

Star schemas for property sales and

property advertising

30/05/2007 DBDI / DW 58

Property rentals with a badly

structured fact table

Property rentals with fact table corrected

Nine-Step Methodology / 5-

Step 5: Storing pre-calculations in the fact table

  • re-examine facts for possible pre-calculations.

Step 6: Rounding out the dimension tables

  • Add intuitive & understandable text descriptions
  • Identify uses of data marts

Step 7: Choosing the duration of the database

  • Identify the starting time of fact table
  • Resolve problems of sourcing increasing old data & slowly changing dimension

30/05/2007 DBDI / DW 61

Nine-Step Methodology / 8-

Step 8: Tracking slowly changing dimensions

  • old dimension data must be used with the old fact data.
  • distinguish multiple snapshots of dimensions over a period of time.
  • The three basic types of slowly changing dimensions

Step 9: Deciding the query priorities and the query modes

  • physical design issues
  • pre-stored summaries or aggregations

30/05/2007 DBDI / DW 62

Database Design Methodology for Data

Warehouses

  • Forming the enterprise-wide data warehouse.
  • A dimensional model, which contains more than one fact table sharing one or more conformed dimension tables Î referred to as a fact constellation.
  • Types of DW schema:
    • Star Schema
      • Fact table (can be de-normalized) with fully de-normalized dimension tables
    • Snowflake Schema
      • Fact table with fully normalized dimension tables
    • Starflake Schema
      • Fact table with fully de-normalized dimension and (as needed) sub- dimension tables
    • Constellation Schema
      • Multiple fact tables with shared dimension tables

30/05/2007 DBDI / DW 63

Dimensional model (fact constellation) for

the DreamHome data warehouse

30/05/2007 DBDI / DW 64

Fact and dimension tables for each

business process of DreamHome

Problems of Data Warehousing

  • Underestimation of resources for data loading
  • Hidden problems with source systems
  • Required data not captured
  • Increased end-user demands
  • Data homogenization
  • High demand for resources
  • Data ownership
  • High maintenance
  • Long duration projects
  • Complexity of integration

Summary

  • DW solution must address all aspects loading, storing, managing & querying
  • Commercial products have been around for years — market is expanding rapidly
  • Rapid Market Growth & Many Applications
  • Impact on strategic decision making
  • OLTP vs. OLAP
  • DW System architecture (2/3 tier architecture)
  • Data Marts Architecture
  • Conceptual Model & OLAP
  • Dimensionality modelling (DM)
  • Schema Design (Snowflake, Star, Starflake & Constellation Schemas)
  • DM vs. ER modelling
  • Methodology for designing a DW
  • Problems of Data Warehousing