Data warehousing & data mining (sjbit) [10is74], Study notes of Data Structures and Algorithms

computer science 7th sem

Typology: Study notes

2014/2015

Uploaded on 07/12/2015

Mariyappan
Mariyappan 🇮🇳

4

(1)

1 document

1 / 143

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Data Warehousing & DataMinig
10IS74
Dept. of ISE, SJBIT
Page 1
DATA WAREHOUSING AND DATA MINING
PART A
UNIT 1
Data Warehousing:
6 Hours
I
n
t
r
od
u
c
t
i
o
n
,
O
pe
r
a
t
i
o
na
l
D
a
t
a
S
to
r
e
s
(
O
D
S
)
,
E
x
t
r
a
c
t
i
o
n
T
r
an
s
f
o
r
m
a
t
i
o
n
L
o
a
d
i
n
g
(
E
T
L
)
,
D
a
t
a
W
a
r
e
h
o
u
s
e
s
.
Design
I
ss
u
e
s
,
Guidelines
for
Data Warehouse
I
m
p
l
e
m
e
n
t
a
t
i
o
n
,
Data Warehouse
M
et
a
d
a
t
a
.
UNIT 2 6 Hours
Online
Analytical
Processing
(OLAP):
I
n
t
r
od
u
c
t
i
o
n
,
C
ha
r
a
c
te
r
i
s
t
i
c
s
of
O
L
A
P
s
y
s
te
ms
,
M
u
l
t
i
d
i
m
e
n
s
i
o
na
l
view
and
D
a
t
a
cube,
D
a
t
a
Cube
I
m
p
l
e
m
e
n
t
a
t
i
o
n
s
,
D
a
t
a
Cube
ope
r
a
t
i
o
n
s
,
I
m
p
l
e
m
e
n
t
a
t
i
o
n
of
OLAP
and
overview on
OLAP S
o
f
t
w
a
r
e
s
.
UNIT 3 6 Hours
Data
Mining:
I
n
t
r
od
u
c
t
i
o
n
,
C
ha
ll
e
n
ge
s
,
D
a
t
a
Mining
Tasks,
Types
of D
a
t
a
,
D
a
t
a
Preprocessing,
Measures
of
Similarity
and
Dissimilarity, Data Mining
A
pp
li
c
a
t
i
o
n
s
UNIT 4 8 Hours
Association Analysis: Basic Concepts and Algorithms:
F
r
eq
u
e
n
t
I
te
ms
et
G
e
n
e
r
a
t
i
o
n
,
Rule
G
e
n
e
r
a
t
i
o
n
,
Compact
R
ep
r
e
s
e
n
t
a
t
i
o
n
of
F
r
eq
u
e
n
t
I
te
ms
et
s
,
Alternative
methods
for
ge
n
e
r
a
t
i
n
g
F
r
eq
u
e
n
t
I
te
ms
et
s
,
F
P
Growth
A
l
go
r
i
t
h
m
,
E
v
a
l
ua
t
i
o
n
of
Association
P
a
tte
r
n
s
UNIT 5
PART - B
6 Hours
Classification -1 : Basics,
General
approach to
solve classification
p
r
ob
l
e
m
,
Decision
Trees, Rule Based
Classifiers,
Nearest
N
e
i
g
h
bo
r
Classifiers.
UNIT 6
Classification - 2
: Bayesian
Classifiers,
E
s
t
i
m
a
t
i
n
g
P
r
ed
i
c
t
i
v
e
accuracy of
c
l
a
ss
i
f
i
c
a
t
i
o
n
6 Hours
m
et
h
od
s
,
I
m
p
r
o
v
i
n
g
accuracy
of
clarification
m
et
h
od
s
,
E
v
a
l
ua
t
i
o
n
criteria
for
classification
m
et
h
od
s
,
M
u
l
t
i
c
l
a
ss
Problem.
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f
pf20
pf21
pf22
pf23
pf24
pf25
pf26
pf27
pf28
pf29
pf2a
pf2b
pf2c
pf2d
pf2e
pf2f
pf30
pf31
pf32
pf33
pf34
pf35
pf36
pf37
pf38
pf39
pf3a
pf3b
pf3c
pf3d
pf3e
pf3f
pf40
pf41
pf42
pf43
pf44
pf45
pf46
pf47
pf48
pf49
pf4a
pf4b
pf4c
pf4d
pf4e
pf4f
pf50
pf51
pf52
pf53
pf54
pf55
pf56
pf57
pf58
pf59
pf5a
pf5b
pf5c
pf5d
pf5e
pf5f
pf60
pf61
pf62
pf63
pf64

Partial preview of the text

Download Data warehousing & data mining (sjbit) [10is74] and more Study notes Data Structures and Algorithms in PDF only on Docsity!

DATA WAREHOUSING AND DATA MINING

PART – A

UNIT – 1

Data Warehousing:

6 Hours

Int r od u ct i o n, O pe ra t i o na l Da t a S to r es ( O DS ), E x t ra ct i o n Tran s f o rm a t io n Lo a d i n g (E T L) , Da t a W a r e ho u s es. Design Iss u es , Guidelines for Data Warehouse Im p l e m ent a t i o n , Data Warehouse M et a d a t a.

UNIT – 2 6 Hours Online Analytical Processing (OLAP): Int r od u ct i o n, C ha ra cte r i s t i cs of O L A P s ys tems , M u lt i d im e ns i o na l view and Da t a cube, Da t a Cube Im p l e m e nt a t i o ns , Da t a Cube ope ra t i o ns , Im p l e m e nt a t i o n of OLAP and overview on OLAP S o ft wa r e s.

UNIT – 3 6 Hours Data Mining: Int^ rod^ u^ c^ t^ i^ o^ n^ ,^ C^ ha^ ll^ e^ nges^ ,^ Da^ t^ a^ Mining^ Tasks,^ Types^ of Da^ t^ a^ ,^ Da^ t^ a^ Preprocessing, Measures of Similarity and Dissimilarity, Data Mining A pp li ca t i o ns

UNIT – 4 8 Hours Association Analysis: Basic Concepts and Algorithms: Fr eq u e nt Items et G e n e ra t i o n , Rule G e n e ra t i o n, Compact R ep r es e nt a t i o n of F r eq u e nt Ite ms et s , Alternative methods for ge n e ra t i n g Fr eq u e nt Ite ms et s , FP Growth A l go r it hm ,E v a l ua t i o n of Association P a tte rn s

UNIT – 5

PART - B

6 Hours Classification -1 : Basics, General approach to solve classification p r ob l e m, Decision Trees, Rule Based Classifiers, Nearest N ei g h bo r Classifiers.

UNIT – 6 Classification - 2 : Bayesian Classifiers, E s t ima t i n g P r ed i ct iv e accuracy of c la ss if i ca t i o n

6 Hours m et h od s , Im p r o v i n g accuracy of clarification m et h od s , E v a l ua t i o n criteria for classification m et hod s , M u lt i c l a ss Problem.

UNIT – 7 8 Hours Clustering Techniques: Overview, Fea t u r es of c l us te r analysis, Types of Data and C o m p ut i n g D is t an c e, Types of Cluster Analysis M et h od s , Partitional M et h od s , Hierarchical M et h od s , Density Based M et h od s , Quality and V a li d it y of Cluster Analysis.

UNIT – 8 6 Hours Web Mining: Int rod u ct i o n, Web c o ntent mining, T ex t Mining, U ns t r u ct ur ed T ex t , Te x t c l us ter in g , Mining Spatial and Te m po ra l Da t a b a s es.

Text Books:

  1. P an g- N i n g Tan, Michael Steinbach, Vipin K u ma r: Int r od u ct i o n to Data Mining, Pearson E d u ca t i o n ,
  2. G. K. G upt a : Int r od uct i o n to Data Mining with Case Studies, 3 r d E d it i o n , PHI, New Delhi, 2009. Reference Books:
  3. Arun K P uja r i: Data Mining Techniques, 2nd Edition, UniversitiesPress, 2009.
  4. Jiawei Han and Micheline K a m be r: Data Mining - Concepts and Techniques, 2nd Edition, Morgan K auf mann Publisher, 2006.
  5. Alex Berson and Stephen J. S mit h: Data Warehousing,

UNIT – 7 Clustering Techniques:

  • 4.3Rule Generation,
  • 4.4 Compact Representation of Frequent Itemsets,
  • 4.5 Alternative methods for generating Frequent Itemsets,
  • 4.6 FP Growth Algorithm,Evaluation of Association Patterns
  • UNIT – 5 & UNIT –
  • 5.1Classification - 1: Basics,
  • 5.2 General approach to solve classification problem,
  • 5.3 Decision Trees,
  • 5.4 Rule Based Classifiers,
  • 5.5 Nearest Neighbor Classifiers.
  • 5 .6 Classification - 2 : Bayesian Classifiers,
  • 7.1Overview,
  • 7.2 Features of cluster analysis,
  • 7.3 Types of Data and Computing Distance,
    • Based Methods,
  • 7.5 Quality and Validity of Cluster Analysis.
  • 8 1 Introduction, UNIT – 8 Web Mining:
  • 8.2 Web content mining,
  • 8.3 Text Mining,
  • 8 4 Unstructured Text,
  • 8.5 Text clustering,
  • 8.6 Mining Spatial and Temporal Databases.

UNIT 1

DATA WAREHOUSING

11 INTRODUCTION

Major e nter p r is e s have many computers that run a variety of e nter p r is e a pp li ca t i o ns. For an e nter p ris e wit h branches in many l o ca t io n s , t h e branches may have t h e ir own systems. For example, in a university with only one ca m p u s , the library may run its own ca t a l og and bo rr o w i n g database system while the student a d m in i s t ra t io n may have own s ys te ms running on an ot h e r ma ch i n e. There m ig ht be a separate finance system, a p r opert y and facilities mana ge m e nt s ys te m a n d a n o th er fo r hu man res o u rces mana gem e nt. A l a rge company mi g ht have the following s ys te m. · Human Resources · Financials · Billi ng · Sales leads · Web s a l es · Customer s u ppo rt Such s ys te ms are called online t rans a ct i o n p r o c e ss i n g ( O LT P ) s ys te ms. The O LT P systems are mostly r e la t i o na l database systems de s i g n ed for t rans a ct i o n p ro c ess i n g. Th e performance of OLTP systems is usually very i m po rt ant since such systems are used to support the users ( i. e. staff) that p ro v i de service to the c u s to m er s. The s ys te ms

P op ula t i n g an ODS involves an a c q u is it i o n process of ex t r a ct i n g , t rans f o r m i n g an d l o a d i n g d a t a from O LT P s o urc e s ys te ms. Th is process is E TL. C o mp l et in g pop ula t i n g t h e d a t a b a s e , c h e ck i n g for an o ma li es and tes t i n g for pe rf o rman c e are necessary before an ODS system can go online.

Source S ys te ms E TL O DS End Users

E x t ra c t io n T ra n s f o rma t i o n L oad i n g

M a n age m en t repo rt s

Ora c l e

O pera t i on a l Data Source (^) W efsba-fbda s ed A pp li ca t i o ns

IMS

S A P

In i t i a l l oad i n g + ref r es h i n g

Ot h e r A pp li ca t i o ns

CI CS E T L

Flat Fil es (^) Da t a W a r e ho u s e

Da t a M a rts

Fig : 1. 1 A po ss i b l e O pe ra t i o na l Data Store s t r u ct ur e

Zero Latency Enterprise (ZLE)

The Gantner Group has used a term Zero Latency Enterprise (ZLE) for near r ea l -t i m e i nteg ra t i o n of opera t i o na l data so that there is no significant delay in gett i n g in f o r ma t i o n from one part or one system of an e nter p r is e to another system that needs the i nf o r ma t io n. The heart of a ZLE system is an opera t i o na l data s to r e. A ZLE data store is s o m et h i n g like an ODS that is i nteg ra ted and u p-to-d a te. Th e aim of a ZLE data store is to allow management a single view of e n ter p ri s e i nf o r ma t i o n

by b r i n g i n g together r el e v ant data in r e a l -t i m e and p r o v i d i n g management a ― 360 -deg r ee‖ view of the c u s to m er. A Z LE usually has t h e following characteristics. It has a unified view of t h e e n ter p ri s e ope ra t io na l d a t a. It has a high level of a v a ila b ilit y and it involves online re f r es h i n g of i nf o r m a t i o n. The a c h i ev e t h es e , a Z LE r eq u ir e s i nf o r ma t i o n t ha t is as current as po ss i b l e. Since a ZLE needs to support a la r ge number of concurrent u s e rs , for ex a m p l e call centre us e rs , a fast turnaround t i m e for t rans a ct i o ns and 24 / 7 availability is required.

1.3 ETL

An ODS or a data warehouse is based on a single global s c h ema t ha t i nteg ra tes and c o ns o li d a te s e nter p r is e in f o r ma t i o n from many s o urc es. Building such a system requires data a c q u is it i o n from OLTP and legacy systems. The ETL process involves e x t ra ct i ng , t rans f o r m i n g and lo a d i n g data from source systems. The process may sound very simple since it only involves r e a d in g i n fo r ma t i o n from source d a t a b a s es , t ran s f o r m i n g it to f it t h e ODS database mode l and l o a d i n g it in the O DS. As d i ff e r e nt d a t a s o u rc es te n d to have d iff e r e nt c o nv ent i o ns for coding i nf o r ma t i o n and d iff er e n t s t an d a r d s for t h e quality of i nf o r ma t i o n , b u il d i n g an O DS req u i r es data filtering, data cleaning, and i nteg ra t i o n. The following ex a m p l es show the i m po rt an c e of data c l ean i ng : · If an e nte r p r is e wishes to contact its customers or its suppliers, it is e ss e nt ia l that a c o mp l ete , accurate and up-to-date list of contact a dd r ess es , email addresses and te l ep ho n e numbers be available. Correspondence sent to a wrong address that is then r ed ir e cted does not create a very good im p r ess i o n about the e nter p ri s e. · If a customer or s u pp li er calls, the staff r e s po n d i n g s h o ul d be quickly ale to find the person in the e n te r p r is e database but t h i s r eq u ir es that the ca ll er‘s name or h is / h er company name is accurately lis ted in the d a t a b a s e. · If a customer appears in the databases with two or more slightly d i ff er e nt names or d iff e r e nt a cc o unt nu m ber s , it be c o m e s d i ff i c u lt to u pd a te t h e c us to m er‘s i nf o r ma t i o n.

· Coding of some values in one database may not ma t c h with c od i n g in other databases (i.e. d iff er e nt codes with the same m eani n g or same code for d i ff e r e nt meanings) · M ean i n g s of some code values may not be known. · There may be d u p li ca te records. · There may be wrong a gg r eg a t i o ns. · There may be i n c o ns is te nt use of nulls, spaces and empty values. · Some a tt r i b ute values may be i n c o ns is te nt (i.e. o ut s i de t h ei r domain) · Some data may be wrong because of i n p ut errors. · There may be ina pp r op ria te use of address lines. · There may be n o n - un i q u e identifiers. The ETL process needs to ensure that all these types of errors and others are re s o lv ed us i n g a sound Technology.

3. Record linkage problem: Record linkage re la tes to t h e p r ob l e m of linking i nf o r ma t i o n from d iff e r e nt databases that r e la te to the same customer or cli e nt. The p r ob l e m can arise if a uni q u e identifier is not available in all databases t ha t are bei n g linked. Perhaps records from a database are bei n g linked to records from a legacy system or to i nf o r ma t i o n from a s p rea d s h eet. Record linkage can involve a la r ge number of record c o m p a r is o ns to ensure linkages that have a high level of accuracy. 4. Semantic integration problem: This deals with the i nteg ra t i o n of i nf o rma t i o n found in heterogeneous OLTP and legacy s o u rc e s. Some of the sources may be relational, some may not be. Some may be even in text do c u m e nt s. Some d a t a may be character s t ri n g s while others may be i nteger s. 5. Data integrity problem: Th is deals wit h issues like re f er e nt ia l i nteg r it y, null values, do ma i n of values, et c.

Overcoming all these p r ob l e ms is often a very ted i o us work. Many errors can be d iff i cu lt to identify. In some cases one may be forced to ask the q u es t i o n how accurate the d a t a ought to be since i m p r o v i n g t h e accuracy is always go i n g to req u i r e m o r e and more resources and c o m p l ete l y fixing all p rob l e ms may be un r ea li s t i c. Checking for d u p li ca tes is not always easy. The data can be sorted and d up li ca tes re m o v ed a lt h o u g h for large files t h is can be expensive. In s o m e cases t h e d u p li ca te records are not identical. In these cases checks for primary key may be r eq u i r ed. If more than one record has the same primary key then it is likely to be because of d u p li ca tes.

A sound t h eo r et i ca l background is be i n g dev e l oped for data cl ean i n g te c hn i q u es. I t has been suggested that data c l ean i n g s h o u l d be based on the following five s tep s :

1. Parsing: Parsing identifies various components of the source data files and t he n es t a b lis h es re la t i o ns h i p s bet w ee n t h o s e and t h e fields in t h e t a r get fil es. Th e classical ex a m p l e of p a rs i n g is identifying the various components of a pe rs o n‘s name and add ress. 2. Correcting: Correcting the i de nt i f i ed components is usually based on a v a ri et y of s op h is t i ca ted te c hn i q u es including ma t h e ma t i c a l a l go r it h ms. Correcting may involve use of other r e la ted i nf o r ma t i o n that may be available in the e nte r p ri s e. 3. Standardizing: Business rules of the e nte r p ri s e may now be used to t ran s f o r m the data to standard form. For example, in some c o m p an i es there mi g ht be rules on how name and address are to be r ep r es e nted. 4. Matching: Much of the data extracted from a number of source systems is lik el y to be r ela ted. Such data needs to be ma t c h ed. 5. Consolidating: A^ ll^ c^ o^ rr^ e^ cted^ ,^ s^ t^ an^ d^ a^ r^ d^ iz^ ed^ and^ ma^ t^ c^ h^ ed^ d^ a^ t^ a^ can^ now be c o ns o li d a ted to build a single version of the e nte r p ri s e d a t a.

Selecting an ETL Tool

Selection of an a pp rop r ia te ETL Tool is an i mpo rt ant de c is i o n that has to be made in c h oo s i n g c o mpo n e nt s of an ODS or d a t a wa r e h o us i n g a pp li ca t i o n. The E TL too l is req u i r ed to p r o v i de c oo r d i na ted access to multiple data sources so that r e l ev ant data may

i nf o r ma t i o n managers.

they require, it is likely to red u c e t h e b ur eau cra c y a ro un d t h e

· To p r o v i de a d a t a b a s e that s to r e s r e la t iv e l y clean d a t a. By using a good E TL p r o c e ss , the data warehouse s h o ul d have data of high q ua lit y. When errors are d is c o v er ed it may be des ira b l e to correct them directly in the data warehouse and then propagate the c o rr e ct i o ns to the OLTP s ys te ms. · To p r o v i de a database that stores historical data that may have been de l eted from t h e O LT P systems. To i mp r o v e re s po n s e t im e , h is to r i ca l d a t a is usually n ot ret a i n ed in O L TP s ys te ms ot h er t han t ha t which is req u i r ed to respond to customer q u er i es. The data warehouse can then store the data that is purged from the OLTP systems.

A useful way of s h o w i n g the r e la t io n s h i p between OLTP systems, a data warehouse and an ODS is given in Figure 7.2. The data warehouse is more like lo n g term memory of an en t erp r is e. The ob j e ct iv es in b u il d i n g the two systems, O D S and data w a r e ho u s e , are s o m ew ha t s ch em es.

c o nf li ct i n g and t h er e fo r e t h e t wo d a t a b a s es are likely to have d iff er e nt

ODS

Data w a r eh o u s e (^) OLTP s ys te m

Figure 7.2 Relationship between OLTP, ODS and DW systems.

In b u il d i n g and ODS, data w a r e h o us i n g is a process of integ ra t in g e nter p r is e-w i de d a t a , o r i g ina t i n g from a variety of s o u rc es , i nto a single repo s ito r y. As shown in Figure 7.3, t h e data warehouse may be a c e n t ra l e nte r p r is e-w i de data warehouse for use by all t h e de c is i o n makers in the e nter p r is e or it may c o ns is t of a number of smaller data w a r e h o u s e (often called data marts or local data warehouses) A d a t a ma rt s to r es i n f o r ma t i o n for a lim ited num be r of s u b j e ct areas. Fo r example, a company m i g ht have a data mart about ma rk et i n g that supports ma rk et i n g and s a les. The data ma rt approach is a tt ra ct iv e since beg i nn i n g with a single data mart is relatively inexpensive and ea s i e r to i m p l e m e nt. A data mart may be used as a proof of data warehouse c o n c ept. Data marts can a ls o cr ea te difficulties by s ett i n g up ―silos of i nf o r ma t i o n‖ a lt h o u g h one may build dependent data ma rt s , which are pop u la ted form the c ent ra l data warehouse. Data marts are often the common approach for b u il d i n g a data warehouse since t h e c o s t curve for d a t a ma r t s tends to be m o r e linear. A c e nt ra liz ed d a t a warehouse p r o j e ct can be very r e s o u rc e i nten s iv e and r eq u ir e s s i g n if i cant i nv es t m e nt at t h e beg i nn i n g a lt h o u g h overall costs over a number of years for a c ent ra li z ed data w a r e ho u s e and for de c e nt ra liz ed data marts are likely to be similar. A c e nt ra liz ed wa r eh o us e can p r o v i de bette r q ua lit y d a t a and minimize d a t a i n co n s i s ten c i es since the data quality is c o nt r o ll ed centrally. The too ls and procedures for p u tt i n g data in the warehouse can then be better c o nt r o ll ed. Controlling data quality w it h a de c e nt ra liz ed approach is obviously more difficult. As with any c e nt ra liz ed f un ct i o n , t h o ug h, the unit s or branches of an e nter p ris e may feel no o wn e r s h i p of the c e nt ra liz ed wa r eh o us e may in s o m e cases n ot fu ll y c oope ra te wit h t h e a d mi n i s t ra t i o n of t h e w a re ho us e. A ls o , ma i nt a in i n g a ce nt ra liz ed wa r e h o us e would r eq u i r e considerable c oo r d i na t i o n among the various un it s if the e nte r p ri s e is la r ge and t h is c oo r d i na t i o n may incur significant costs for the en t erp r is e. A s a n ex a m p l e of a d a t a w a reh o u s e a pp li ca t i o n we co n s ider th e te l e co mm uni ca t i o ns i n d us t r y which in m o s t c o unt r i es has be c o m e very c o m pet it iv e d u r i n g t h e la s t few years. If a company is able to identify a ma r k et t r e n d bef o r e it s c o mpet ito r s do , then that can l ea d to a c o m pet it iv e a d v ant a ge. What is therefore needed is to ana l ys e customer needs and be ha v i o u r in an attempt to bette r understand wha t t h e

B us i n e ss In teT lli oo ge l s n ce

E T L p r o c ess E xtr a c t T r ansf o r m and L o a d

E T L p r o c ess

E T L p r o c ess (^) p rE o^ T c^ L ess

DM a a t r at

DM a a t a rt

DM a a t r at

BI Tool

BI Tool

BI Tool

Da il y Change P r o c ess

(S tagi ng A r ea )

Da il y Change P r o c ess OData pe r a S t i ot o n r a el ( O D S)

W a^ D r e^ a h^ t o^ a use ( DW)

Figure 7.4 Another structure for ODS and DW

The a rc h ite ct ur e of a system that i n cl u de s an ODS and a data warehouse shown in Figure 7.4 is more complex. It involves e x t ra ct i n g i nf o r m a t io n from source systems by us i n g an ETL process and then s to r i n g the i nf o rm a t io n in a s t a g i n g d a t a b a s e. The daily changes also come to the s t a g i n g area. Another ETL process is used to transform i nf o r ma t i o n from t h e s t a g i n g area to pop ula te t h e O DS. The O DS is t he n used for supplying i nf o r ma t i o n via another ETL process to the area warehouse which in turn feeds a num be r of data marts that generate the reports r eq u ir ed by m ana ge m e nt. It s h o u l d be noted t ha t not all ETL processes in t h is a rc h ite ct u r e involve data cleaning, some may only involve data e x t ra ct io n and t rans fo r ma t i o n to s u it the target systems.

1.5 DATA WAREHOUSE DESIGN

There are a number of ways of c o n c ept ua liz i n g a data w a r e h o us e. One approach is to view it as a t h r ee- l ev e l s t r u ct ur e. The l o w es t level c o ns is t s of t h e O L T P and legacy systems, the m i dd l e level c o ns is t s of the g l ob a l or c e nt ra l data warehouse while the top level c o ns i s t s of local data warehouses or data ma rt s. Another approach is po ss i b l e if t h e e n ter p ri s e has an ODS. The three levels then m ig ht c o ns is t of OLTP and legacy s ys te ms at the botto m , the ODS in the mi dd l e and the data warehouse at the top. Whatever the a rc h itect u r e , a data warehouse needs to have a data m ode l that can form the b a s is for i m p l e m e nt in g it. To de v e lop a data m odel we view a data warehouse as a m u lt i d i m en s io na l structure c o ns is t i n g of d i m e n s i o ns , since that is an intuitive model t ha t ma t c h e s t h e t ypes of O LA P q u e r i es posed by mana ge m e nt. A d i m e n s i o n is an o r d i na te wit hi n a m ult i d i m en s io na l s t r uct u r e c o ns is t i n g of a lis t of o r de r ed values (s o m et i m es called members) j u s t like the x-axis and y-axis values on a t w o-d i m e ns i o na l g r ap h.

A characteristic of a star schema is that all the d i m e ns i o ns directly link to the fact t a b l e. The fact t a b l e may look like t a b l e 7.1 and the d i m e ns i o n t a b l es may look Tables 7.2 to 7.5.

Table 7.1 An^ e^ x^ a^ m^ p^ l^ e^ of^ the fact^ t^ a^ b^ l^ e

_ Year Degree name Country name Scholarship name Number 200301 BSc Australia Govt 35 199902 MBBS Canada None 50 200002 LLB USA ABC 22 199901 BCom UK Commonwealth 7 200102 LLB Australia Equity 2

The first d i m e ns i o n is the degree d im e n s i o n. An ex a m p l e of t hi s d im e n s i o n t a b l e is Table 7.2.

Table 7.2 An e x a m p l e of the degree d i m e ns i o n t a b l e

_ Name Faculty Scholarship eligibility Number of semesters B S c S c i en c e Y es 6 M BB S Medicine No 10 LLB La w Y es 8 B Co m LLB

Business No 6 Arts No 6

We now present the second d im e n s i o n , the country d i m e ns i o n. An e x a m p l e of t h is d i m en s i o n t a b l e is Table 7.3. Table 7.3 An e x a m p l e of the country d im e ns i o n t a b l e

_

N a m e religion N epa l I n do n es ia

Continent E du c at io n Level

Asia Lo w Asia Lo w

Major

Hinduism Islam N o rw a y S in g a po r e C o l o mb ia

Eu rope Asia South America

H ig h H ig h Lo w

Ch ris t i a n it y N U LL Ch ris t i a n it y

The t h ir d d i m e ns i o n is the s c h o la r s h i p d im e ns i o n. The d im e n s i o n t a b l e is given in Table 7.4.

Table 7.4 An e x a m p l e of the s c ho la rs h i p d i m e ns i o n t a b l e

_ Name Amount (%) Scholarship eligibility Number

Colombo 100 All 6 Equity 100 Low income 10 Asia 50 Top 5 % 8 Merit 75 Top 5 % 5 Bursary 25 Low income 12

The fourth d i m en s i o n is the year d i m e ns i o n. The d i m e ns i o n t a b l e is given in Table 7.5.

Table 7.5 An^ ex^ a^ m^ p^ l^ e^ of^ the year^ d^ im^ e^ ns^ i^ o^ n^ t^ a^ b^ l^ e

N a m e New programs 2001 2002

J o u rn a lis m M u l t im ed i a