Prepare for your exams
Get points
Guidelines and tips
Sell on Docsity
Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Docsity AI

Log in Sign up

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search for your university

Find the specific documents for your university's exams

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

Data warehousing & data mining (sjbit) [10is74], Study notes of Data Structures and Algorithms

Visvesvaraya Technological University Data Structures and Algorithms

computer science 7th sem

Typology: Study notes

2014/2015

Uploaded on 07/12/2015

Mariyappan 🇮🇳

4

(1)

1 document

1 / 143

This page cannot be seen from the preview

Don't miss anything!

Data Warehousing & DataMinig

10IS74

Dept. of ISE, SJBIT

Page 1

DATA WAREHOUSING AND DATA MINING

PART – A

UNIT – 1

Data Warehousing:

6 Hours

I

n

t

r

od

u

c

t

i

o

n

,

O

pe

r

a

t

i

o

na

l

D

a

t

a

S

to

r

e

s

(

O

D

S

)

,

E

x

t

r

a

c

t

i

o

n

T

r

an

s

f

o

r

m

a

t

i

o

n

L

o

a

d

i

n

g

(

E

T

L

)

,

D

a

t

a

W

a

r

e

h

o

u

s

e

s

.

Design

I

ss

u

e

s

,

Guidelines

for

Data Warehouse

I

m

p

l

e

m

e

n

t

a

t

i

o

n

,

Data Warehouse

M

et

a

d

a

t

a

.

UNIT – 2 6 Hours

Online

Analytical

Processing

(OLAP):

I

n

t

r

od

u

c

t

i

o

n

,

C

ha

r

a

c

te

r

i

s

t

i

c

s

of

O

L

A

P

s

y

s

te

ms

,

M

u

l

t

i

d

i

m

e

n

s

i

o

na

l

view

and

D

a

t

a

cube,

D

a

t

a

Cube

I

m

p

l

e

m

e

n

t

a

t

i

o

n

s

,

D

a

t

a

Cube

ope

r

a

t

i

o

n

s

,

I

m

p

l

e

m

e

n

t

a

t

i

o

n

of

OLAP

and

overview on

OLAP S

o

f

t

w

a

r

e

s

.

UNIT – 3 6 Hours

Data

Mining:

I

n

t

r

od

u

c

t

i

o

n

,

C

ha

ll

e

n

ge

s

,

D

a

t

a

Mining

Tasks,

Types

of D

a

t

a

,

D

a

t

a

Preprocessing,

Measures

of

Similarity

and

Dissimilarity, Data Mining

A

pp

li

c

a

t

i

o

n

s

UNIT – 4 8 Hours

Association Analysis: Basic Concepts and Algorithms:

F

r

eq

u

e

n

t

I

te

ms

et

G

e

n

e

r

a

t

i

o

n

,

Rule

G

e

n

e

r

a

t

i

o

n

,

Compact

R

ep

r

e

s

e

n

t

a

t

i

o

n

of

F

r

eq

u

e

n

t

I

te

ms

et

s

,

Alternative

methods

for

ge

n

e

r

a

t

i

n

g

F

r

eq

u

e

n

t

I

te

ms

et

s

,

F

P

Growth

A

l

go

r

i

t

h

m

,

E

v

a

l

ua

t

i

o

n

of

Association

P

a

tte

r

n

s

UNIT – 5

PART - B

6 Hours

Classification -1 : Basics,

General

approach to

solve classification

p

r

ob

l

e

m

,

Decision

Trees, Rule Based

Classifiers,

Nearest

N

e

i

g

h

bo

r

Classifiers.

UNIT – 6

Classification - 2

: Bayesian

Classifiers,

E

s

t

i

m

a

t

i

n

g

P

r

ed

i

c

t

i

v

e

accuracy of

c

l

a

ss

i

f

i

c

a

t

i

o

n

6 Hours

m

et

h

od

s

,

I

m

p

r

o

v

i

n

g

accuracy

of

clarification

m

et

h

od

s

,

E

v

a

l

ua

t

i

o

n

criteria

for

classification

m

et

h

od

s

,

M

u

l

t

i

c

l

a

ss

Problem.

Discover Study notes of Data Structures and Algorithms Visvesvaraya Technological University

Partial preview of the text

Download Data warehousing & data mining (sjbit) [10is74] and more Study notes Data Structures and Algorithms in PDF only on Docsity!

DATA WAREHOUSING AND DATA MINING

PART – A

UNIT – 1

Data Warehousing:

6 Hours

Int r od u ct i o n, O pe ra t i o na l Da t a S to r es ( O DS ), E x t ra ct i o n Tran s f o rm a t io n Lo a d i n g (E T L) , Da t a W a r e ho u s es. Design Iss u es , Guidelines for Data Warehouse Im p l e m ent a t i o n , Data Warehouse M et a d a t a.

UNIT – 2 6 Hours Online Analytical Processing (OLAP): Int r od u ct i o n, C ha ra cte r i s t i cs of O L A P s ys tems , M u lt i d im e ns i o na l view and Da t a cube, Da t a Cube Im p l e m e nt a t i o ns , Da t a Cube ope ra t i o ns , Im p l e m e nt a t i o n of OLAP and overview on OLAP S o ft wa r e s.

UNIT – 3 6 Hours Data Mining: Int^ rod^ u^ c^ t^ i^ o^ n^ ,^ C^ ha^ ll^ e^ nges^ ,^ Da^ t^ a^ Mining^ Tasks,^ Types^ of Da^ t^ a^ ,^ Da^ t^ a^ Preprocessing, Measures of Similarity and Dissimilarity, Data Mining A pp li ca t i o ns

UNIT – 4 8 Hours Association Analysis: Basic Concepts and Algorithms: Fr eq u e nt Items et G e n e ra t i o n , Rule G e n e ra t i o n, Compact R ep r es e nt a t i o n of F r eq u e nt Ite ms et s , Alternative methods for ge n e ra t i n g Fr eq u e nt Ite ms et s , FP Growth A l go r it hm ,E v a l ua t i o n of Association P a tte rn s

UNIT – 5

PART - B

6 Hours Classification -1 : Basics, General approach to solve classification p r ob l e m, Decision Trees, Rule Based Classifiers, Nearest N ei g h bo r Classifiers.

UNIT – 6 Classification - 2 : Bayesian Classifiers, E s t ima t i n g P r ed i ct iv e accuracy of c la ss if i ca t i o n

6 Hours m et h od s , Im p r o v i n g accuracy of clarification m et h od s , E v a l ua t i o n criteria for classification m et hod s , M u lt i c l a ss Problem.

UNIT – 7 8 Hours Clustering Techniques: Overview, Fea t u r es of c l us te r analysis, Types of Data and C o m p ut i n g D is t an c e, Types of Cluster Analysis M et h od s , Partitional M et h od s , Hierarchical M et h od s , Density Based M et h od s , Quality and V a li d it y of Cluster Analysis.

UNIT – 8 6 Hours Web Mining: Int rod u ct i o n, Web c o ntent mining, T ex t Mining, U ns t r u ct ur ed T ex t , Te x t c l us ter in g , Mining Spatial and Te m po ra l Da t a b a s es.

Text Books:

P an g- N i n g Tan, Michael Steinbach, Vipin K u ma r: Int r od u ct i o n to Data Mining, Pearson E d u ca t i o n ,
G. K. G upt a : Int r od uct i o n to Data Mining with Case Studies, 3 r d E d it i o n , PHI, New Delhi, 2009. Reference Books:
Arun K P uja r i: Data Mining Techniques, 2nd Edition, UniversitiesPress, 2009.
Jiawei Han and Micheline K a m be r: Data Mining - Concepts and Techniques, 2nd Edition, Morgan K auf mann Publisher, 2006.
Alex Berson and Stephen J. S mit h: Data Warehousing,

UNIT – 7 Clustering Techniques:

4.3Rule Generation,
4.4 Compact Representation of Frequent Itemsets,
4.5 Alternative methods for generating Frequent Itemsets,
4.6 FP Growth Algorithm,Evaluation of Association Patterns
UNIT – 5 & UNIT –
5.1Classification - 1: Basics,
5.2 General approach to solve classification problem,
5.3 Decision Trees,
5.4 Rule Based Classifiers,
5.5 Nearest Neighbor Classifiers.
5 .6 Classification - 2 : Bayesian Classifiers,
7.1Overview,
7.2 Features of cluster analysis,
7.3 Types of Data and Computing Distance,
- Based Methods,
7.5 Quality and Validity of Cluster Analysis.
8 1 Introduction, UNIT – 8 Web Mining:
8.2 Web content mining,
8.3 Text Mining,
8 4 Unstructured Text,
8.5 Text clustering,
8.6 Mining Spatial and Temporal Databases.

UNIT 1

DATA WAREHOUSING

11 INTRODUCTION

Major e nter p r is e s have many computers that run a variety of e nter p r is e a pp li ca t i o ns. For an e nter p ris e wit h branches in many l o ca t io n s , t h e branches may have t h e ir own systems. For example, in a university with only one ca m p u s , the library may run its own ca t a l og and bo rr o w i n g database system while the student a d m in i s t ra t io n may have own s ys te ms running on an ot h e r ma ch i n e. There m ig ht be a separate finance system, a p r opert y and facilities mana ge m e nt s ys te m a n d a n o th er fo r hu man res o u rces mana gem e nt. A l a rge company mi g ht have the following s ys te m. · Human Resources · Financials · Billi ng · Sales leads · Web s a l es · Customer s u ppo rt Such s ys te ms are called online t rans a ct i o n p r o c e ss i n g ( O LT P ) s ys te ms. The O LT P systems are mostly r e la t i o na l database systems de s i g n ed for t rans a ct i o n p ro c ess i n g. Th e performance of OLTP systems is usually very i m po rt ant since such systems are used to support the users ( i. e. staff) that p ro v i de service to the c u s to m er s. The s ys te ms

P op ula t i n g an ODS involves an a c q u is it i o n process of ex t r a ct i n g , t rans f o r m i n g an d l o a d i n g d a t a from O LT P s o urc e s ys te ms. Th is process is E TL. C o mp l et in g pop ula t i n g t h e d a t a b a s e , c h e ck i n g for an o ma li es and tes t i n g for pe rf o rman c e are necessary before an ODS system can go online.

Source S ys te ms E TL O DS End Users

E x t ra c t io n T ra n s f o rma t i o n L oad i n g

M a n age m en t repo rt s

Ora c l e

O pera t i on a l Data Source (^) W efsba-fbda s ed A pp li ca t i o ns

IMS

S A P

In i t i a l l oad i n g + ref r es h i n g

Ot h e r A pp li ca t i o ns

CI CS E T L

Flat Fil es (^) Da t a W a r e ho u s e

Da t a M a rts

Fig : 1. 1 A po ss i b l e O pe ra t i o na l Data Store s t r u ct ur e

Zero Latency Enterprise (ZLE)

The Gantner Group has used a term Zero Latency Enterprise (ZLE) for near r ea l -t i m e i nteg ra t i o n of opera t i o na l data so that there is no significant delay in gett i n g in f o r ma t i o n from one part or one system of an e nter p r is e to another system that needs the i nf o r ma t io n. The heart of a ZLE system is an opera t i o na l data s to r e. A ZLE data store is s o m et h i n g like an ODS that is i nteg ra ted and u p-to-d a te. Th e aim of a ZLE data store is to allow management a single view of e n ter p ri s e i nf o r ma t i o n

by b r i n g i n g together r el e v ant data in r e a l -t i m e and p r o v i d i n g management a ― 360 -deg r ee‖ view of the c u s to m er. A Z LE usually has t h e following characteristics. It has a unified view of t h e e n ter p ri s e ope ra t io na l d a t a. It has a high level of a v a ila b ilit y and it involves online re f r es h i n g of i nf o r m a t i o n. The a c h i ev e t h es e , a Z LE r eq u ir e s i nf o r ma t i o n t ha t is as current as po ss i b l e. Since a ZLE needs to support a la r ge number of concurrent u s e rs , for ex a m p l e call centre us e rs , a fast turnaround t i m e for t rans a ct i o ns and 24 / 7 availability is required.

1.3 ETL

An ODS or a data warehouse is based on a single global s c h ema t ha t i nteg ra tes and c o ns o li d a te s e nter p r is e in f o r ma t i o n from many s o urc es. Building such a system requires data a c q u is it i o n from OLTP and legacy systems. The ETL process involves e x t ra ct i ng , t rans f o r m i n g and lo a d i n g data from source systems. The process may sound very simple since it only involves r e a d in g i n fo r ma t i o n from source d a t a b a s es , t ran s f o r m i n g it to f it t h e ODS database mode l and l o a d i n g it in the O DS. As d i ff e r e nt d a t a s o u rc es te n d to have d iff e r e nt c o nv ent i o ns for coding i nf o r ma t i o n and d iff er e n t s t an d a r d s for t h e quality of i nf o r ma t i o n , b u il d i n g an O DS req u i r es data filtering, data cleaning, and i nteg ra t i o n. The following ex a m p l es show the i m po rt an c e of data c l ean i ng : · If an e nte r p r is e wishes to contact its customers or its suppliers, it is e ss e nt ia l that a c o mp l ete , accurate and up-to-date list of contact a dd r ess es , email addresses and te l ep ho n e numbers be available. Correspondence sent to a wrong address that is then r ed ir e cted does not create a very good im p r ess i o n about the e nter p ri s e. · If a customer or s u pp li er calls, the staff r e s po n d i n g s h o ul d be quickly ale to find the person in the e n te r p r is e database but t h i s r eq u ir es that the ca ll er‘s name or h is / h er company name is accurately lis ted in the d a t a b a s e. · If a customer appears in the databases with two or more slightly d i ff er e nt names or d iff e r e nt a cc o unt nu m ber s , it be c o m e s d i ff i c u lt to u pd a te t h e c us to m er‘s i nf o r ma t i o n.

· Coding of some values in one database may not ma t c h with c od i n g in other databases (i.e. d iff er e nt codes with the same m eani n g or same code for d i ff e r e nt meanings) · M ean i n g s of some code values may not be known. · There may be d u p li ca te records. · There may be wrong a gg r eg a t i o ns. · There may be i n c o ns is te nt use of nulls, spaces and empty values. · Some a tt r i b ute values may be i n c o ns is te nt (i.e. o ut s i de t h ei r domain) · Some data may be wrong because of i n p ut errors. · There may be ina pp r op ria te use of address lines. · There may be n o n - un i q u e identifiers. The ETL process needs to ensure that all these types of errors and others are re s o lv ed us i n g a sound Technology.

3. Record linkage problem: Record linkage re la tes to t h e p r ob l e m of linking i nf o r ma t i o n from d iff e r e nt databases that r e la te to the same customer or cli e nt. The p r ob l e m can arise if a uni q u e identifier is not available in all databases t ha t are bei n g linked. Perhaps records from a database are bei n g linked to records from a legacy system or to i nf o r ma t i o n from a s p rea d s h eet. Record linkage can involve a la r ge number of record c o m p a r is o ns to ensure linkages that have a high level of accuracy. 4. Semantic integration problem: This deals with the i nteg ra t i o n of i nf o rma t i o n found in heterogeneous OLTP and legacy s o u rc e s. Some of the sources may be relational, some may not be. Some may be even in text do c u m e nt s. Some d a t a may be character s t ri n g s while others may be i nteger s. 5. Data integrity problem: Th is deals wit h issues like re f er e nt ia l i nteg r it y, null values, do ma i n of values, et c.

Overcoming all these p r ob l e ms is often a very ted i o us work. Many errors can be d iff i cu lt to identify. In some cases one may be forced to ask the q u es t i o n how accurate the d a t a ought to be since i m p r o v i n g t h e accuracy is always go i n g to req u i r e m o r e and more resources and c o m p l ete l y fixing all p rob l e ms may be un r ea li s t i c. Checking for d u p li ca tes is not always easy. The data can be sorted and d up li ca tes re m o v ed a lt h o u g h for large files t h is can be expensive. In s o m e cases t h e d u p li ca te records are not identical. In these cases checks for primary key may be r eq u i r ed. If more than one record has the same primary key then it is likely to be because of d u p li ca tes.

A sound t h eo r et i ca l background is be i n g dev e l oped for data cl ean i n g te c hn i q u es. I t has been suggested that data c l ean i n g s h o u l d be based on the following five s tep s :

1. Parsing: Parsing identifies various components of the source data files and t he n es t a b lis h es re la t i o ns h i p s bet w ee n t h o s e and t h e fields in t h e t a r get fil es. Th e classical ex a m p l e of p a rs i n g is identifying the various components of a pe rs o n‘s name and add ress. 2. Correcting: Correcting the i de nt i f i ed components is usually based on a v a ri et y of s op h is t i ca ted te c hn i q u es including ma t h e ma t i c a l a l go r it h ms. Correcting may involve use of other r e la ted i nf o r ma t i o n that may be available in the e nte r p ri s e. 3. Standardizing: Business rules of the e nte r p ri s e may now be used to t ran s f o r m the data to standard form. For example, in some c o m p an i es there mi g ht be rules on how name and address are to be r ep r es e nted. 4. Matching: Much of the data extracted from a number of source systems is lik el y to be r ela ted. Such data needs to be ma t c h ed. 5. Consolidating: A^ ll^ c^ o^ rr^ e^ cted^ ,^ s^ t^ an^ d^ a^ r^ d^ iz^ ed^ and^ ma^ t^ c^ h^ ed^ d^ a^ t^ a^ can^ now be c o ns o li d a ted to build a single version of the e nte r p ri s e d a t a.

Selecting an ETL Tool

Selection of an a pp rop r ia te ETL Tool is an i mpo rt ant de c is i o n that has to be made in c h oo s i n g c o mpo n e nt s of an ODS or d a t a wa r e h o us i n g a pp li ca t i o n. The E TL too l is req u i r ed to p r o v i de c oo r d i na ted access to multiple data sources so that r e l ev ant data may

i nf o r ma t i o n managers.

they require, it is likely to red u c e t h e b ur eau cra c y a ro un d t h e

· To p r o v i de a d a t a b a s e that s to r e s r e la t iv e l y clean d a t a. By using a good E TL p r o c e ss , the data warehouse s h o ul d have data of high q ua lit y. When errors are d is c o v er ed it may be des ira b l e to correct them directly in the data warehouse and then propagate the c o rr e ct i o ns to the OLTP s ys te ms. · To p r o v i de a database that stores historical data that may have been de l eted from t h e O LT P systems. To i mp r o v e re s po n s e t im e , h is to r i ca l d a t a is usually n ot ret a i n ed in O L TP s ys te ms ot h er t han t ha t which is req u i r ed to respond to customer q u er i es. The data warehouse can then store the data that is purged from the OLTP systems.

A useful way of s h o w i n g the r e la t io n s h i p between OLTP systems, a data warehouse and an ODS is given in Figure 7.2. The data warehouse is more like lo n g term memory of an en t erp r is e. The ob j e ct iv es in b u il d i n g the two systems, O D S and data w a r e ho u s e , are s o m ew ha t s ch em es.

c o nf li ct i n g and t h er e fo r e t h e t wo d a t a b a s es are likely to have d iff er e nt

ODS

Data w a r eh o u s e (^) OLTP s ys te m

Figure 7.2 Relationship between OLTP, ODS and DW systems.

In b u il d i n g and ODS, data w a r e h o us i n g is a process of integ ra t in g e nter p r is e-w i de d a t a , o r i g ina t i n g from a variety of s o u rc es , i nto a single repo s ito r y. As shown in Figure 7.3, t h e data warehouse may be a c e n t ra l e nte r p r is e-w i de data warehouse for use by all t h e de c is i o n makers in the e nter p r is e or it may c o ns is t of a number of smaller data w a r e h o u s e (often called data marts or local data warehouses) A d a t a ma rt s to r es i n f o r ma t i o n for a lim ited num be r of s u b j e ct areas. Fo r example, a company m i g ht have a data mart about ma rk et i n g that supports ma rk et i n g and s a les. The data ma rt approach is a tt ra ct iv e since beg i nn i n g with a single data mart is relatively inexpensive and ea s i e r to i m p l e m e nt. A data mart may be used as a proof of data warehouse c o n c ept. Data marts can a ls o cr ea te difficulties by s ett i n g up ―silos of i nf o r ma t i o n‖ a lt h o u g h one may build dependent data ma rt s , which are pop u la ted form the c ent ra l data warehouse. Data marts are often the common approach for b u il d i n g a data warehouse since t h e c o s t curve for d a t a ma r t s tends to be m o r e linear. A c e nt ra liz ed d a t a warehouse p r o j e ct can be very r e s o u rc e i nten s iv e and r eq u ir e s s i g n if i cant i nv es t m e nt at t h e beg i nn i n g a lt h o u g h overall costs over a number of years for a c ent ra li z ed data w a r e ho u s e and for de c e nt ra liz ed data marts are likely to be similar. A c e nt ra liz ed wa r eh o us e can p r o v i de bette r q ua lit y d a t a and minimize d a t a i n co n s i s ten c i es since the data quality is c o nt r o ll ed centrally. The too ls and procedures for p u tt i n g data in the warehouse can then be better c o nt r o ll ed. Controlling data quality w it h a de c e nt ra liz ed approach is obviously more difficult. As with any c e nt ra liz ed f un ct i o n , t h o ug h, the unit s or branches of an e nter p ris e may feel no o wn e r s h i p of the c e nt ra liz ed wa r eh o us e may in s o m e cases n ot fu ll y c oope ra te wit h t h e a d mi n i s t ra t i o n of t h e w a re ho us e. A ls o , ma i nt a in i n g a ce nt ra liz ed wa r e h o us e would r eq u i r e considerable c oo r d i na t i o n among the various un it s if the e nte r p ri s e is la r ge and t h is c oo r d i na t i o n may incur significant costs for the en t erp r is e. A s a n ex a m p l e of a d a t a w a reh o u s e a pp li ca t i o n we co n s ider th e te l e co mm uni ca t i o ns i n d us t r y which in m o s t c o unt r i es has be c o m e very c o m pet it iv e d u r i n g t h e la s t few years. If a company is able to identify a ma r k et t r e n d bef o r e it s c o mpet ito r s do , then that can l ea d to a c o m pet it iv e a d v ant a ge. What is therefore needed is to ana l ys e customer needs and be ha v i o u r in an attempt to bette r understand wha t t h e

B us i n e ss In teT lli oo ge l s n ce

E T L p r o c ess E xtr a c t T r ansf o r m and L o a d

E T L p r o c ess

E T L p r o c ess (^) p rE o^ T c^ L ess

DM a a t r at

DM a a t a rt

DM a a t r at

BI Tool

Da il y Change P r o c ess

(S tagi ng A r ea )

Da il y Change P r o c ess OData pe r a S t i ot o n r a el ( O D S)

W a^ D r e^ a h^ t o^ a use ( DW)

Figure 7.4 Another structure for ODS and DW

The a rc h ite ct ur e of a system that i n cl u de s an ODS and a data warehouse shown in Figure 7.4 is more complex. It involves e x t ra ct i n g i nf o r m a t io n from source systems by us i n g an ETL process and then s to r i n g the i nf o rm a t io n in a s t a g i n g d a t a b a s e. The daily changes also come to the s t a g i n g area. Another ETL process is used to transform i nf o r ma t i o n from t h e s t a g i n g area to pop ula te t h e O DS. The O DS is t he n used for supplying i nf o r ma t i o n via another ETL process to the area warehouse which in turn feeds a num be r of data marts that generate the reports r eq u ir ed by m ana ge m e nt. It s h o u l d be noted t ha t not all ETL processes in t h is a rc h ite ct u r e involve data cleaning, some may only involve data e x t ra ct io n and t rans fo r ma t i o n to s u it the target systems.

1.5 DATA WAREHOUSE DESIGN

There are a number of ways of c o n c ept ua liz i n g a data w a r e h o us e. One approach is to view it as a t h r ee- l ev e l s t r u ct ur e. The l o w es t level c o ns is t s of t h e O L T P and legacy systems, the m i dd l e level c o ns is t s of the g l ob a l or c e nt ra l data warehouse while the top level c o ns i s t s of local data warehouses or data ma rt s. Another approach is po ss i b l e if t h e e n ter p ri s e has an ODS. The three levels then m ig ht c o ns is t of OLTP and legacy s ys te ms at the botto m , the ODS in the mi dd l e and the data warehouse at the top. Whatever the a rc h itect u r e , a data warehouse needs to have a data m ode l that can form the b a s is for i m p l e m e nt in g it. To de v e lop a data m odel we view a data warehouse as a m u lt i d i m en s io na l structure c o ns is t i n g of d i m e n s i o ns , since that is an intuitive model t ha t ma t c h e s t h e t ypes of O LA P q u e r i es posed by mana ge m e nt. A d i m e n s i o n is an o r d i na te wit hi n a m ult i d i m en s io na l s t r uct u r e c o ns is t i n g of a lis t of o r de r ed values (s o m et i m es called members) j u s t like the x-axis and y-axis values on a t w o-d i m e ns i o na l g r ap h.

A characteristic of a star schema is that all the d i m e ns i o ns directly link to the fact t a b l e. The fact t a b l e may look like t a b l e 7.1 and the d i m e ns i o n t a b l es may look Tables 7.2 to 7.5.

Table 7.1 An^ e^ x^ a^ m^ p^ l^ e^ of^ the fact^ t^ a^ b^ l^ e

_ Year Degree name Country name Scholarship name Number 200301 BSc Australia Govt 35 199902 MBBS Canada None 50 200002 LLB USA ABC 22 199901 BCom UK Commonwealth 7 200102 LLB Australia Equity 2

The first d i m e ns i o n is the degree d im e n s i o n. An ex a m p l e of t hi s d im e n s i o n t a b l e is Table 7.2.

Table 7.2 An e x a m p l e of the degree d i m e ns i o n t a b l e

_ Name Faculty Scholarship eligibility Number of semesters B S c S c i en c e Y es 6 M BB S Medicine No 10 LLB La w Y es 8 B Co m LLB

Business No 6 Arts No 6

We now present the second d im e n s i o n , the country d i m e ns i o n. An e x a m p l e of t h is d i m en s i o n t a b l e is Table 7.3. Table 7.3 An e x a m p l e of the country d im e ns i o n t a b l e

_

N a m e religion N epa l I n do n es ia

Continent E du c at io n Level

Asia Lo w Asia Lo w

Major

Hinduism Islam N o rw a y S in g a po r e C o l o mb ia

Eu rope Asia South America

H ig h H ig h Lo w

Ch ris t i a n it y N U LL Ch ris t i a n it y

The t h ir d d i m e ns i o n is the s c h o la r s h i p d im e ns i o n. The d im e n s i o n t a b l e is given in Table 7.4.

Table 7.4 An e x a m p l e of the s c ho la rs h i p d i m e ns i o n t a b l e

_ Name Amount (%) Scholarship eligibility Number

Colombo 100 All 6 Equity 100 Low income 10 Asia 50 Top 5 % 8 Merit 75 Top 5 % 5 Bursary 25 Low income 12

The fourth d i m en s i o n is the year d i m e ns i o n. The d i m e ns i o n t a b l e is given in Table 7.5.

Table 7.5 An^ ex^ a^ m^ p^ l^ e^ of^ the year^ d^ im^ e^ ns^ i^ o^ n^ t^ a^ b^ l^ e

N a m e New programs 2001 2002

J o u rn a lis m M u l t im ed i a

Data warehousing & data mining (sjbit) [10is74], Study notes of Data Structures and Algorithms

Related documents

Partial preview of the text

Download Data warehousing & data mining (sjbit) [10is74] and more Study notes Data Structures and Algorithms in PDF only on Docsity!

DATA WAREHOUSING AND DATA MINING

PART – A

UNIT – 1

UNIT – 5

PART - B

UNIT – 7 Clustering Techniques:

UNIT 1

DATA WAREHOUSING

11 INTRODUCTION

CI CS E T L

ODS

_