Download Data Warehousing Lecture: Evolution, Architecture, and Benefits and more Study notes Database Management Systems (DBMS) in PDF only on Docsity!
DBDI/ Lecture 18
Data Warehousing
Dr. Ala Al-Zobaidie
The slides are mainly based on the textbook Database Systems by Thomas Connolly &Carolyn Begg and others as cited
30/05/2007 DBDI / DW 2
Lecture’s Objectives /
- An overview of Data Warehousing (DW) – evolution & concepts
- Definitions and basic features of a DW
- Market Growth & Applications
- Benefits of a DW
- Why Separate DW
- On-Line Transaction Processing (OLTP) vs. On-Line Analytical Processing (OLAP)
- DW queries
- DW System architecture
- Information flows in a DW
- Two-/Three-Tier DW Architecture
- DW Data Marts Architecture
- Data Marts
- DW Operational Process
- Data Warehouse DBMS Requirements
- DW System Design
30/05/2007 DBDI / DW 3
The Evolution of Data Warehousing
- Business process & support for decision
making using operational DBs.
- Converting operational data into business
knowledge
- Receiving data for multiple sources,
integrate them and use them for decision
support Î DW
30/05/2007 DBDI / DW 4
What, Why & How
- What?
- An analytical database that is used as the foundation of a Decision Support System (DSS).
- Designed for large volumes of read-only data, providing intuitive access to info used in making decisions.
- Why?
- Make the decision making process faster & better (executive, manager, analyst).
- How?
- Provision of a centralised corporate data.
- Loading operational data in consistent repeatable process.
- Building an open and scaleable architecture.
- Provision of tools to process the data into information.
DW is analogous to Manufacturing Process
Sales Profits
Expenses
1
2 3
4
Data Requests
Data Transformation Data Warehouse
Operational Systems Capture operational data
Transform the operational transaction data
Deliver business intelligence upon receipt of a user data request or query
Store the transformed data in a DW
DW — Definition
- A Decision Support DB that is maintained
separately from the organisation's
operational Databases.
Subject-oriented ,
Integrated ,
Time-variant &
Non-volatile
collection of data that is used primarily in
organizational decision making.
30/05/2007 DBDI / DW 7
Subject-oriented Data
- The warehouse is organized around the
major subjects of the enterprise (e.g.
customers, products, and sales) rather than
the major application areas (e.g. customer
invoicing, stock control, and product sales).
- This is reflected in the need to store
decision-support data rather than
application-oriented data.
30/05/2007 DBDI / DW 8
Integrated Data
- The data warehouse integrates corporate
application-oriented data from different
source systems, which often includes data
that is inconsistent.
- The integrated data source must be made
consistent to present a unified view of the
data to the users.
30/05/2007 DBDI / DW 9
Time-variant Data
- Data in the warehouse is only accurate and valid at some point in time or over some time interval.
- Time-variance is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots.
- Warehouse data represent the time, and queries are based on time range.
30/05/2007 DBDI / DW 10
Non-volatile Data
- Data in the warehouse is not updated in
real-time but is refreshed from operational
systems on a regular basis.
- New data is always added as a
supplement to the database, rather than a
replacement.
Market growth
- Prediction:
- $2B in 1995 B $8B in 1998 [ Meta Group ]
- $1.5B in 1996 B $6.9B in 1998 [ Gartner Group ]
- @9% compound annual growth B $13.5 billion in 2009 [IDC]
- Reality
- During 1997, users spent nearly $15B on data warehousing worldwide [NewsEdge]
- OLAP market continued to grow B 16.4% in 2006 [OLAP]
Applications
- Manufacturing;
- Retail;
- Financial services;
- Insurance;
- Transportation;
- Telecommunications;
- Utilities;
- Healthcare; etc.
30/05/2007 DBDI / DW 19
Architecture Components /
- Operational Data Sources
- Hierarchical, Network & relational databases
- propriety file systems
- Load Manager
- Extraction & loading
- Various data loading tools & custom-built programs
- Warehouse Manager
- Perform operations
- Generate Query Profile 30/05/2007 DBDI / DW 20
Architecture Components /
- Query Manager
- Performs user queries
- Interface with various tools
- Directing & scheduling queries
- generates query profiles
- Detailed Data
- Captured from DB Schema
- Most cases data is aggregated
- Add data to supplement the aggregated data
30/05/2007 DBDI / DW 21
Architecture Components /
- Lightly & Highly Summarized Data
- Stores pre-defined lightly & highly aggregated data.
- Transient data
- The purpose is performance.
- updated continuously
- Archive / Backup Data
- archiving and backup.
- backup online summary data
- Permanent storage. 30/05/2007 DBDI / DW 22
Architecture Components /
Metadata
- Types
- End-user Metadata & Management Metadata
- Usage
- Extraction and loading, Warehouse management & Query management processes.
- Structure of metadata
- Mapping Rules for Copy management tools
- End-user query builder tools
- Complex operation
Architecture Components /
End-user Access Tools
- strategic decision-making through end-user access tools.
- DW support ad hoc and routine analysis.
- Enhance performance
- Groups of access tools
- Data reporting and query tools
- Application development tools
- Executive information system (EIS) tools
- Online analytical processing (OLAP) tools
- Data mining tools
Data Warehouse Information Flows
30/05/2007 DBDI / DW 25
Two-/Three-Tier Architecture /
Data Warehouse
Legacy System Relational System
Purchased Data
Data Source (^) Data Marts Tools
Data Warehouse
Extract Transform Load Refresh
Metadata Repository
Serve
Data Mining
Query/ Reporting
OLAP Servers
Analysis
Monitoring & Administration
30/05/2007 DBDI / DW 26
Two-/Three-Tier Architecture /
Operational System Data Source
The Data Warehouse
User Query
DepartmentalData Marts
Transformation & Scrubbing of Data
Multidimensional Data Marts
Back EndBack End ProcessProcess
MidMid--TierTier ProcessProcess
Front EndFront End ProcessProcess
30/05/2007 DBDI / DW 27
Typical Data Warehouse & Data Mart
Architecture
30/05/2007 DBDI / DW 28
Data Mart
- A subset of a DW for particular Dept or business function.
- Characteristics include
- Focuses on one dept or business functions requirements.
- Summarised operational data
- understood and navigated.
- Users access data they need more often
- Provides collection View
- improve end-user response time
Reasons for Creating a Data Mart
- Provide appropriately structured data
- Building a data mart is simpler
- The cost is normally less
- users are more clearly defined
Data Marts Issues
- Data mart functionality
- Data mart size
- Data mart load performance
- Users access to data in multiple data
marts
- Data mart Internet / Intranet access
- Data mart administration
- Data mart installation
30/05/2007 DBDI / DW 37
Data Warehouse DBMS Requirements
- Load performance
- Load processing
- Data quality management
- Query performance
- Terabyte scalability
- Mass user scalability
- Networked data warehouse
- Warehouse administration
- Integrated dimensional analysis
- Advanced query functionality
30/05/2007 DBDI / DW 38
Lecture’s Objectives /
- Designing Data Warehouses -- Associated issues
- Conceptual Model & OLAP
- Dimensionality modelling (DM)
- Schema Design
- DM vs. ER modelling
- Methodology for designing a DW
- Problems of Data Warehousing
30/05/2007 DBDI / DW 39
Designing Data Warehouses
- Which user requirements are most important?
- Which data should be considered first?
- Should the project be scaled down into something more manageable?
- Should the infrastructure for a scaled down project be capable of ultimately delivering a full- scale enterprise-wide data warehouse?
30/05/2007 DBDI / DW 40
Designing Data Warehouses
- Start with one or more data marts.
- Simpler
- Achievable for a specific group of users.
- Ultimately, the DW should support the requirements at one time.
- Enterprise-wide design must meet all user requirements
Designing Data Warehouses
- Requirements collection & analysis
involves:
- Interviewing various staff members to prioritized requirements
- Identify data sources to provide clean, valid, and consistent data
- Top-down (user requirements) vs. bottom-up view (identifying data source)
Conceptual Model for OLAP
- Several numeric measures that are analysed
- sales, budget, revenue, inventory
- Dimensions
- context in which a measure appears
- example:
- store, product, data information associated with a sale
- each context is a dimension and the measure is a point in a multidimensional world
- Nature of Analysis
- aggregation
- comparison
- ranking
- etc.
Region Month
Product
30/05/2007 DBDI / DW 43
Multidimensional Data
Geography S N^ WE
1 2 3 4 5 6 Month
Product
Juice Cola Milk Cream Soap
10 50 20 12 10
Sales volume as a function of product, time & Region Roll up: Summarisation over aggregate hierarchy E.g. total sales volume last year Æ by product category by region
Roll down: Go from higher level summary to lower level summary or detailed data E.g. total sales volume last year & by region Æ by month by product category
Slice & Dice (i.e. pivoting): refers to the ability to look at the data from different viewpoints) (Select & project) E.g. sales of Cola in the West over the last six months; E.g. top 5 products by total sales
30/05/2007 DBDI / DW 44
Dimensionality modelling
- A logical design technique that aims to present the data in a standard, intuitive form that allows for high-performance access
- Uses restricted Entity-Relationship modelling.
- Every DM is composed of one fact table & a set of dimension tables.
30/05/2007 DBDI / DW 45
Schema Design: Dimension Tables
- The data model could also be a graphical description of Facts & Dimensions tables B Star Schema
FACTS Sales Inventory Costs
Product
Time
Geography
Distribution Channel
Dimensions Year ¾ Quarter ¾ Month/week ¾ Day
Country ¾ Region ¾ City ¾ Office
30/05/2007 DBDI / DW 46
Schema Design: Star Schema
- A single fact table, & for each dimension one single table
- The fact table PK is a composite PK created from the dimensions PKs.
- Does not capture hierarchies directly
- Generated keys are used for performance & maintenancereasons
- Star Schema or Star Join
- Surrogate Keys vs natural keys
FACT Table ProdNo ProdName ProdDescr Category CategoryDescr UnitPrice QOH
Product OrderNo ProdNo SalespersonID CustomerNo Quantity TotalPrice
OrderNo OrderDate
Order
CustomerNo CustomerName CustomerAddre City
Customer
SalespersonID SalespersonName City Quota
Salesperson
Fact Table
- Mostly row numeric items, relevant measures only.
- Skinny rows, a few columns at most.
- From millions to more than a billion rows.
- Items are typically additive.
- Access via dimensions.
Central & Dimensions tables
Dimension Tables
- Define business dimensions.
- Highly descriptive.
- Joined through FK.
- Occurs in certain clauses.
- Heavily indexed.
FactsFactsFacts
Order Product
Customer
Salesperson
Region
Star schema for property sales of DreamHome
30/05/2007 DBDI / DW 55
ER model of property sales business
process of DreamHome
30/05/2007 DBDI / DW 56
Nine-Step Methodology / 3-
Step 3: Identifying & conforming dimensions
- Check dimensions context & redundancy
- Identify conformed dimensions
Step 4: Choosing the facts
- Determines which facts can be used in data mart.
- Facts should be numeric and additive.
- Identify Unusable facts
30/05/2007 DBDI / DW 57
Star schemas for property sales and
property advertising
30/05/2007 DBDI / DW 58
Property rentals with a badly
structured fact table
Property rentals with fact table corrected
Nine-Step Methodology / 5-
Step 5: Storing pre-calculations in the fact table
- re-examine facts for possible pre-calculations.
Step 6: Rounding out the dimension tables
- Add intuitive & understandable text descriptions
- Identify uses of data marts
Step 7: Choosing the duration of the database
- Identify the starting time of fact table
- Resolve problems of sourcing increasing old data & slowly changing dimension
30/05/2007 DBDI / DW 61
Nine-Step Methodology / 8-
Step 8: Tracking slowly changing dimensions
- old dimension data must be used with the old fact data.
- distinguish multiple snapshots of dimensions over a period of time.
- The three basic types of slowly changing dimensions
Step 9: Deciding the query priorities and the query modes
- physical design issues
- pre-stored summaries or aggregations
30/05/2007 DBDI / DW 62
Database Design Methodology for Data
Warehouses
- Forming the enterprise-wide data warehouse.
- A dimensional model, which contains more than one fact table sharing one or more conformed dimension tables Î referred to as a fact constellation.
- Types of DW schema:
- Star Schema
- Fact table (can be de-normalized) with fully de-normalized dimension tables
- Snowflake Schema
- Fact table with fully normalized dimension tables
- Starflake Schema
- Fact table with fully de-normalized dimension and (as needed) sub- dimension tables
- Constellation Schema
- Multiple fact tables with shared dimension tables
30/05/2007 DBDI / DW 63
Dimensional model (fact constellation) for
the DreamHome data warehouse
30/05/2007 DBDI / DW 64
Fact and dimension tables for each
business process of DreamHome
Problems of Data Warehousing
- Underestimation of resources for data loading
- Hidden problems with source systems
- Required data not captured
- Increased end-user demands
- Data homogenization
- High demand for resources
- Data ownership
- High maintenance
- Long duration projects
- Complexity of integration
Summary
- DW solution must address all aspects loading, storing, managing & querying
- Commercial products have been around for years — market is expanding rapidly
- Rapid Market Growth & Many Applications
- Impact on strategic decision making
- OLTP vs. OLAP
- DW System architecture (2/3 tier architecture)
- Data Marts Architecture
- Conceptual Model & OLAP
- Dimensionality modelling (DM)
- Schema Design (Snowflake, Star, Starflake & Constellation Schemas)
- DM vs. ER modelling
- Methodology for designing a DW
- Problems of Data Warehousing