ETL and Advanced Modeling - Data Warehousing and Data Mining - Lecture 07 Slides (Computer Science, Database lecture notes)

J. Gamper, Free University of Bolzano, DWDM 2011-12
ETL and Advanced Modeling

  1. Extract-Transform-Load (ETL)
    • The ETL process
    • Building dimension and fact tables
    • Extract, transform, load
  2. Advanced Multidimensional Modeling
    • Handling changes in dimensions
    • Large-scale dimensional modeling

Acknowledgements: I am indebted to Michael Böhlen and Stefano Rizzi for providing me their slides, upon which these lecture notes are based.


The ETL Process

  • The most underestimated and time-consuming process in DW development
    - Often, 80% of development time is spent on ETL
  • The ETL system is the foundation of the DW/BI project
    - Its success makes or breaks the data warehouse
  • Extract
    - Extract relevant data (a minimal pipeline sketch follows this list)
  • Transform
    - Transform data to DW format
    - Build keys, etc.
    - Cleansing of data
  • Load
    - Load data into the DW
    - Build aggregates, etc.
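
A minimal sketch of the three steps as one Python pipeline (the CSV source file, table name, and column names are hypothetical, for illustration only):

    import csv
    import sqlite3

    def extract(path):
        # Extract: read the relevant rows from a source system (here: a CSV file)
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    def transform(rows):
        # Transform: convert to the DW format and cleanse
        for row in rows:
            yield (row["product_id"].strip(), row["city"].strip().title(), float(row["amount"]))

    def load(rows, con):
        # Load: bulk-insert into the warehouse
        con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
        con.commit()

    con = sqlite3.connect("dw.db")
    con.execute("CREATE TABLE IF NOT EXISTS sales_fact (product_id TEXT, city TEXT, amount REAL)")
    load(transform(extract("source_sales.csv")), con)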

Data Staging Area

  • Transit storage for data underway in the ETL process
    - Transformations/cleansing are done here
  • No user queries (although some allow them)
  • Few, sequential operations on large data volumes
    - Performed by central ETL logic
    - Easily restarted
    - No need for locking, logging, etc.
    - RDBMS or flat files? (DBMSs have become better at this)
  • Finished dimensions are copied from the DSA to the relevant marts

The 34 ETL Subsystems

  • Kimball et al. describe 34 subsystems that compose the ETL system:
    - Extracting: gathering raw data from the source systems and usually writing it to disk in the ETL environment (3 subsystems)
    - Cleaning and conforming: sending source data through a series of processing steps in the ETL system to improve the quality of the data, and merging data (5 subsystems)
    - Delivering: physically structuring and loading the data into the dimensional model (13 subsystems)
    - Managing (usually considered a separate component): managing the related systems and processes of the ETL environment (13 subsystems)

Building Dimensions

  • Static dimension table
    - Relatively easy?
    - Assignment of keys: production keys to DW keys using a mapping table (sketched after this list)
    - Combination of data sources: find a common key?
    - Check one-to-one and one-to-many relationships using sorting
  • Handling dimension changes
    - Find the newest DW key for a given production key
    - The table mapping production keys to DW keys must be updated
  • Load of dimensions
    - Small dimensions: replace
    - Large dimensions: load only changes
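
A minimal sketch of surrogate key assignment through an in-memory mapping table (the production keys and the helper name to_surrogate are made up for illustration):

    import itertools

    key_map = {}                    # mapping table: production key -> surrogate DW key
    _next_key = itertools.count(1)  # surrogate key generator

    def to_surrogate(prod_key):
        # Return the newest DW key for a production key,
        # assigning a fresh surrogate key on first sight
        if prod_key not in key_map:
            key_map[prod_key] = next(_next_key)
        return key_map[prod_key]

    print(to_surrogate("P-4711"))   # 1 (new dimension member)
    print(to_surrogate("P-4711"))   # 1 (stable on repeated lookups)
    print(to_surrogate("P-0815"))   # 2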

Building Fact Tables

  • Two types of load
  • Initial load
    - ETL for all data up till now
    - Done when the DW is started for the first time
    - Often problematic to get correct historical data
    - Very heavy: large data volumes
  • Incremental update (sketched after this list)
    - Move only the changes since the last load
    - Done periodically (monthly/weekly/daily/hourly/...) after DW start
    - Less heavy: smaller data volumes
  • Dimensions must be updated before facts
    - The relevant dimension rows for new facts must be in place
    - Special key considerations if the initial load must be performed again
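
A sketch of an incremental fact load that updates dimensions before facts (it reuses the hypothetical to_surrogate mapping from the dimension sketch above; the record layout is made up):

    fact_table = []

    def load_facts(changed_rows):
        for row in changed_rows:
            # Dimension first: make sure the dimension key is in place,
            # then insert the fact referencing the surrogate key
            dim_key = to_surrogate(row["product_id"])
            fact_table.append((dim_key, row["date"], row["amount"]))

    load_facts([{"product_id": "P-4711", "date": "2012-03-01", "amount": 9.50}])
    print(fact_table)   # [(1, '2012-03-01', 9.5)]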

E - Computing Deltas

  • Much faster to "ETL" only the changes since the last load
    - A number of methods can be used; two follow
  • Store sorted total extracts in the DSA (see the sketch after this list)
    - The delta can easily be computed from the current and the last extract
    - (+) Always possible
    - (+) Handles deletions
    - (-) Does not reduce extract time
  • Put an update timestamp on all rows
    - Updated by a DB trigger
    - Extract only rows where timestamp > time of last extract
    - (+) Reduces extract time
    - (+/-) Less operational overhead
    - (-) Cannot (alone) handle deletions
    - (-) The source system must be changed
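
A minimal sketch of computing the delta from the last and the current total extract (rows are keyed by a hypothetical production key; the data is made up):

    def compute_delta(last, current):
        # last/current: production key -> row contents, from two total extracts
        inserts = {k: v for k, v in current.items() if k not in last}
        deletes = {k: v for k, v in last.items() if k not in current}
        updates = {k: v for k, v in current.items()
                   if k in last and last[k] != v}
        return inserts, updates, deletes

    last    = {"P-1": ("Milk", 1.20), "P-2": ("Bread", 2.00)}
    current = {"P-2": ("Bread", 2.10), "P-3": ("Butter", 3.40)}
    print(compute_delta(last, current))
    # inserts P-3, updates P-2, deletes P-1 -- note that deletions are handled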

E - Capturing Changed Data

  • Messages (a toy sketch follows this list)
    - Applications insert messages into a "queue" at updates
    - (+) Works for all types of updates and systems
    - (-) Operational applications must be changed + operational overhead
  • DB triggers
    - Triggers execute actions on INSERT/UPDATE/DELETE
    - (+) Operational applications need not be changed
    - (+) Enables real-time update of the DW
    - (-) Operational overhead
  • Replication based on the DB log
    - Find changes directly in the DB log, which is written anyway
    - (+) Operational applications need not be changed
    - (+) No operational overhead
    - (-) Not possible in some DBMSs (SQL Server, Oracle, and DB2 can do it)
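
A toy sketch of the message-based approach; a single in-process queue.Queue stands in for real messaging middleware, and all names are hypothetical:

    from queue import Empty, Queue

    change_queue = Queue()

    def on_operational_update(table, key, new_row):
        # The operational application publishes every change as a message
        change_queue.put({"table": table, "key": key, "row": new_row})

    def drain_changes():
        # The ETL side consumes the queued changes at load time
        changes = []
        while True:
            try:
                changes.append(change_queue.get_nowait())
            except Empty:
                return changes

    on_operational_update("product", "P-1", {"name": "Milk", "price": 1.25})
    print(drain_changes())   # [{'table': 'product', 'key': 'P-1', ...}]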

T – Data Transformations

  • Data type conversions (examples after this list)
    - EBCDIC -> ASCII/Unicode
    - String manipulations
    - Date/time format conversions
  • Normalization/denormalization
    - To the desired DW format
    - Depending on the source format
  • Building keys
    - A table maps production keys to surrogate DW keys
    - Observe correct handling of history, especially for a total reload
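
Two small conversion examples (Python ships the cp037 EBCDIC codec; the sample bytes and date formats are made up):

    from datetime import datetime

    # EBCDIC -> Unicode: decode with an EBCDIC code page
    raw = b"\xd4\x89\x93\x92"           # "Milk" encoded in EBCDIC (cp037)
    print(raw.decode("cp037"))          # Milk

    # Date/time format conversion: source format -> DW format
    src = "03/12/2012"
    dw_date = datetime.strptime(src, "%d/%m/%Y").strftime("%Y-%m-%d")
    print(dw_date)                      # 2012-12-03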

T – Cleansing

  • BI does not work on "raw" data
    - Pre-processing is necessary for good results
    - Unhandled problems (e.g., duplicates) can "disturb" BI analyses
  • Handle inconsistent data formats (see the sketch after this list)
    - Spellings, codings, ...
  • Remove unneeded attributes
    - Production keys, comments, ...
  • Replace codes with text
    - City name instead of ZIP code, 0/1 by Yes/No, ...
  • Combine data from multiple sources with a common key
    - Customer data, ...
  • Find attributes with several uses
    - Extra values used for "programmer hacks"
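
A minimal cleansing sketch covering three of the cases above: normalizing spellings, replacing codes with text, and dropping duplicates (the rows and mappings are made up):

    rows = [
        {"city": "BOLZANO ", "active": "1"},
        {"city": "bolzano",  "active": "1"},   # duplicate after normalization
        {"city": "Trento",   "active": "0"},
    ]

    code_to_text = {"0": "No", "1": "Yes"}

    seen, clean = set(), []
    for r in rows:
        city = r["city"].strip().title()          # consistent spelling
        rec = (city, code_to_text[r["active"]])   # replace code with text
        if rec not in seen:                       # remove duplicates
            seen.add(rec)
            clean.append(rec)
    print(clean)   # [('Bolzano', 'Yes'), ('Trento', 'No')]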

T – Data Mining Transformations

  • Often cannot be done generally for the whole DW
  • Divide data into training, test, and evaluation sets
    - Training: used to train the model
    - Test: used to check the model's generality (overfitting)
    - Evaluation: the model uses the evaluation set to find the "real" clusters, ...
  • Add computed attributes as inputs or "targets"
    - Derived attributes (profit, ...) are often more interesting for mining
    - Means more possibilities for data mining
  • Mapping
    - Continuous values to intervals, textual values to numerical ones
  • Normalization of values between 0 and 1 (sketched after this list)
    - Demanded by some tools (e.g., neural nets)
  • Emphasizing the rare case
    - Duplicate rare elements in the training set
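
A minimal sketch of two of these transformations, min-max normalization to [0, 1] and duplicating the rare case in the training set (the data is made up):

    values = [3.0, 7.5, 12.0]
    lo, hi = min(values), max(values)
    normalized = [(v - lo) / (hi - lo) for v in values]
    print(normalized)   # [0.0, 0.5, 1.0]

    # Emphasize the rare case: duplicate rare elements in the training set
    training = [("common", 1), ("common", 2), ("rare", 3)]
    oversampled = training + [row for row in training if row[0] == "rare"] * 3
    print(oversampled.count(("rare", 3)))   # 4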

T – Improving Data Quality

  • Appoint "data stewards" responsible for data quality
    - A given steward has the responsibility for certain tables
    - Includes manual inspections and corrections!
  • DW-controlled improvement
    - Default values
    - "Not yet assigned 157" as a note to the data steward
  • Source-controlled improvements
    - The optimal approach?
  • Construct programs that check data quality (see the sketch after this list)
    - Are totals as expected?
    - Do results agree with an alternative source?
  • Do not fix all data quality problems
    - Allow management to see "weird" data in their reports?
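
A sketch of such a quality-check program: compare a DW total against an alternative source within a tolerance (the numbers and the 1% tolerance are made up):

    def totals_agree(dw_total, source_total, tolerance=0.01):
        # Are totals as expected? Do results agree with an alternative source?
        diff = abs(dw_total - source_total)
        return diff <= tolerance * abs(source_total), diff

    ok, diff = totals_agree(dw_total=1_020_000.0, source_total=1_000_000.0)
    print(ok, diff)   # False 20000.0 -> report to the data steward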

Load

  • Relationships in the data
    - Referential integrity must be ensured
    - Can be done by the loader
  • Aggregates
    - Must be built and loaded at the same time as the detail data
    - Today, RDBMSs can often do this
  • Load tuning
    - Load without log
    - Sort the load file first
    - Make only simple transformations in the loader
    - Use loader facilities for building aggregates
    - Use the loader within the same database
  • Should the DW be online 24x7?
    - Use partitions or several sets of tables

ETL Tools

  • ETL tools from the big vendors
    - Oracle Warehouse Builder / Oracle Data Integrator
    - IBM DB2 Warehouse Manager
    - Microsoft Data Transformation Services
  • They offer much functionality at a reasonable price (included ...)
    - Data modeling
    - ETL code generation
    - Scheduling of DW jobs
    - ...
  • Many others
    - Hundreds of tools
    - Often specialized tools for certain jobs (insurance cleansing, ...)
  • The "best" tool does not exist
    - Choose based on your own needs
    - Check first whether the "standard tools" from the big vendors are sufficient