




























































































Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
computer science 7th sem
Typology: Study notes
1 / 143
This page cannot be seen from the preview
Don't miss anything!





























































































Data Warehousing:
6 Hours
Int r od u ct i o n, O pe ra t i o na l Da t a S to r es ( O DS ), E x t ra ct i o n Tran s f o rm a t io n Lo a d i n g (E T L) , Da t a W a r e ho u s es. Design Iss u es , Guidelines for Data Warehouse Im p l e m ent a t i o n , Data Warehouse M et a d a t a.
UNIT – 2 6 Hours Online Analytical Processing (OLAP): Int r od u ct i o n, C ha ra cte r i s t i cs of O L A P s ys tems , M u lt i d im e ns i o na l view and Da t a cube, Da t a Cube Im p l e m e nt a t i o ns , Da t a Cube ope ra t i o ns , Im p l e m e nt a t i o n of OLAP and overview on OLAP S o ft wa r e s.
UNIT – 3 6 Hours Data Mining: Int^ rod^ u^ c^ t^ i^ o^ n^ ,^ C^ ha^ ll^ e^ nges^ ,^ Da^ t^ a^ Mining^ Tasks,^ Types^ of Da^ t^ a^ ,^ Da^ t^ a^ Preprocessing, Measures of Similarity and Dissimilarity, Data Mining A pp li ca t i o ns
UNIT – 4 8 Hours Association Analysis: Basic Concepts and Algorithms: Fr eq u e nt Items et G e n e ra t i o n , Rule G e n e ra t i o n, Compact R ep r es e nt a t i o n of F r eq u e nt Ite ms et s , Alternative methods for ge n e ra t i n g Fr eq u e nt Ite ms et s , FP Growth A l go r it hm ,E v a l ua t i o n of Association P a tte rn s
6 Hours Classification -1 : Basics, General approach to solve classification p r ob l e m, Decision Trees, Rule Based Classifiers, Nearest N ei g h bo r Classifiers.
UNIT – 6 Classification - 2 : Bayesian Classifiers, E s t ima t i n g P r ed i ct iv e accuracy of c la ss if i ca t i o n
6 Hours m et h od s , Im p r o v i n g accuracy of clarification m et h od s , E v a l ua t i o n criteria for classification m et hod s , M u lt i c l a ss Problem.
UNIT – 7 8 Hours Clustering Techniques: Overview, Fea t u r es of c l us te r analysis, Types of Data and C o m p ut i n g D is t an c e, Types of Cluster Analysis M et h od s , Partitional M et h od s , Hierarchical M et h od s , Density Based M et h od s , Quality and V a li d it y of Cluster Analysis.
UNIT – 8 6 Hours Web Mining: Int rod u ct i o n, Web c o ntent mining, T ex t Mining, U ns t r u ct ur ed T ex t , Te x t c l us ter in g , Mining Spatial and Te m po ra l Da t a b a s es.
Text Books:
Major e nter p r is e s have many computers that run a variety of e nter p r is e a pp li ca t i o ns. For an e nter p ris e wit h branches in many l o ca t io n s , t h e branches may have t h e ir own systems. For example, in a university with only one ca m p u s , the library may run its own ca t a l og and bo rr o w i n g database system while the student a d m in i s t ra t io n may have own s ys te ms running on an ot h e r ma ch i n e. There m ig ht be a separate finance system, a p r opert y and facilities mana ge m e nt s ys te m a n d a n o th er fo r hu man res o u rces mana gem e nt. A l a rge company mi g ht have the following s ys te m. · Human Resources · Financials · Billi ng · Sales leads · Web s a l es · Customer s u ppo rt Such s ys te ms are called online t rans a ct i o n p r o c e ss i n g ( O LT P ) s ys te ms. The O LT P systems are mostly r e la t i o na l database systems de s i g n ed for t rans a ct i o n p ro c ess i n g. Th e performance of OLTP systems is usually very i m po rt ant since such systems are used to support the users ( i. e. staff) that p ro v i de service to the c u s to m er s. The s ys te ms
P op ula t i n g an ODS involves an a c q u is it i o n process of ex t r a ct i n g , t rans f o r m i n g an d l o a d i n g d a t a from O LT P s o urc e s ys te ms. Th is process is E TL. C o mp l et in g pop ula t i n g t h e d a t a b a s e , c h e ck i n g for an o ma li es and tes t i n g for pe rf o rman c e are necessary before an ODS system can go online.
Source S ys te ms E TL O DS End Users
E x t ra c t io n T ra n s f o rma t i o n L oad i n g
M a n age m en t repo rt s
Ora c l e
O pera t i on a l Data Source (^) W efsba-fbda s ed A pp li ca t i o ns
IMS
S A P
In i t i a l l oad i n g + ref r es h i n g
Ot h e r A pp li ca t i o ns
Flat Fil es (^) Da t a W a r e ho u s e
Da t a M a rts
Fig : 1. 1 A po ss i b l e O pe ra t i o na l Data Store s t r u ct ur e
Zero Latency Enterprise (ZLE)
The Gantner Group has used a term Zero Latency Enterprise (ZLE) for near r ea l -t i m e i nteg ra t i o n of opera t i o na l data so that there is no significant delay in gett i n g in f o r ma t i o n from one part or one system of an e nter p r is e to another system that needs the i nf o r ma t io n. The heart of a ZLE system is an opera t i o na l data s to r e. A ZLE data store is s o m et h i n g like an ODS that is i nteg ra ted and u p-to-d a te. Th e aim of a ZLE data store is to allow management a single view of e n ter p ri s e i nf o r ma t i o n
by b r i n g i n g together r el e v ant data in r e a l -t i m e and p r o v i d i n g management a ― 360 -deg r ee‖ view of the c u s to m er. A Z LE usually has t h e following characteristics. It has a unified view of t h e e n ter p ri s e ope ra t io na l d a t a. It has a high level of a v a ila b ilit y and it involves online re f r es h i n g of i nf o r m a t i o n. The a c h i ev e t h es e , a Z LE r eq u ir e s i nf o r ma t i o n t ha t is as current as po ss i b l e. Since a ZLE needs to support a la r ge number of concurrent u s e rs , for ex a m p l e call centre us e rs , a fast turnaround t i m e for t rans a ct i o ns and 24 / 7 availability is required.
1.3 ETL
An ODS or a data warehouse is based on a single global s c h ema t ha t i nteg ra tes and c o ns o li d a te s e nter p r is e in f o r ma t i o n from many s o urc es. Building such a system requires data a c q u is it i o n from OLTP and legacy systems. The ETL process involves e x t ra ct i ng , t rans f o r m i n g and lo a d i n g data from source systems. The process may sound very simple since it only involves r e a d in g i n fo r ma t i o n from source d a t a b a s es , t ran s f o r m i n g it to f it t h e ODS database mode l and l o a d i n g it in the O DS. As d i ff e r e nt d a t a s o u rc es te n d to have d iff e r e nt c o nv ent i o ns for coding i nf o r ma t i o n and d iff er e n t s t an d a r d s for t h e quality of i nf o r ma t i o n , b u il d i n g an O DS req u i r es data filtering, data cleaning, and i nteg ra t i o n. The following ex a m p l es show the i m po rt an c e of data c l ean i ng : · If an e nte r p r is e wishes to contact its customers or its suppliers, it is e ss e nt ia l that a c o mp l ete , accurate and up-to-date list of contact a dd r ess es , email addresses and te l ep ho n e numbers be available. Correspondence sent to a wrong address that is then r ed ir e cted does not create a very good im p r ess i o n about the e nter p ri s e. · If a customer or s u pp li er calls, the staff r e s po n d i n g s h o ul d be quickly ale to find the person in the e n te r p r is e database but t h i s r eq u ir es that the ca ll er‘s name or h is / h er company name is accurately lis ted in the d a t a b a s e. · If a customer appears in the databases with two or more slightly d i ff er e nt names or d iff e r e nt a cc o unt nu m ber s , it be c o m e s d i ff i c u lt to u pd a te t h e c us to m er‘s i nf o r ma t i o n.
· Coding of some values in one database may not ma t c h with c od i n g in other databases (i.e. d iff er e nt codes with the same m eani n g or same code for d i ff e r e nt meanings) · M ean i n g s of some code values may not be known. · There may be d u p li ca te records. · There may be wrong a gg r eg a t i o ns. · There may be i n c o ns is te nt use of nulls, spaces and empty values. · Some a tt r i b ute values may be i n c o ns is te nt (i.e. o ut s i de t h ei r domain) · Some data may be wrong because of i n p ut errors. · There may be ina pp r op ria te use of address lines. · There may be n o n - un i q u e identifiers. The ETL process needs to ensure that all these types of errors and others are re s o lv ed us i n g a sound Technology.
3. Record linkage problem: Record linkage re la tes to t h e p r ob l e m of linking i nf o r ma t i o n from d iff e r e nt databases that r e la te to the same customer or cli e nt. The p r ob l e m can arise if a uni q u e identifier is not available in all databases t ha t are bei n g linked. Perhaps records from a database are bei n g linked to records from a legacy system or to i nf o r ma t i o n from a s p rea d s h eet. Record linkage can involve a la r ge number of record c o m p a r is o ns to ensure linkages that have a high level of accuracy. 4. Semantic integration problem: This deals with the i nteg ra t i o n of i nf o rma t i o n found in heterogeneous OLTP and legacy s o u rc e s. Some of the sources may be relational, some may not be. Some may be even in text do c u m e nt s. Some d a t a may be character s t ri n g s while others may be i nteger s. 5. Data integrity problem: Th is deals wit h issues like re f er e nt ia l i nteg r it y, null values, do ma i n of values, et c.
Overcoming all these p r ob l e ms is often a very ted i o us work. Many errors can be d iff i cu lt to identify. In some cases one may be forced to ask the q u es t i o n how accurate the d a t a ought to be since i m p r o v i n g t h e accuracy is always go i n g to req u i r e m o r e and more resources and c o m p l ete l y fixing all p rob l e ms may be un r ea li s t i c. Checking for d u p li ca tes is not always easy. The data can be sorted and d up li ca tes re m o v ed a lt h o u g h for large files t h is can be expensive. In s o m e cases t h e d u p li ca te records are not identical. In these cases checks for primary key may be r eq u i r ed. If more than one record has the same primary key then it is likely to be because of d u p li ca tes.
A sound t h eo r et i ca l background is be i n g dev e l oped for data cl ean i n g te c hn i q u es. I t has been suggested that data c l ean i n g s h o u l d be based on the following five s tep s :
1. Parsing: Parsing identifies various components of the source data files and t he n es t a b lis h es re la t i o ns h i p s bet w ee n t h o s e and t h e fields in t h e t a r get fil es. Th e classical ex a m p l e of p a rs i n g is identifying the various components of a pe rs o n‘s name and add ress. 2. Correcting: Correcting the i de nt i f i ed components is usually based on a v a ri et y of s op h is t i ca ted te c hn i q u es including ma t h e ma t i c a l a l go r it h ms. Correcting may involve use of other r e la ted i nf o r ma t i o n that may be available in the e nte r p ri s e. 3. Standardizing: Business rules of the e nte r p ri s e may now be used to t ran s f o r m the data to standard form. For example, in some c o m p an i es there mi g ht be rules on how name and address are to be r ep r es e nted. 4. Matching: Much of the data extracted from a number of source systems is lik el y to be r ela ted. Such data needs to be ma t c h ed. 5. Consolidating: A^ ll^ c^ o^ rr^ e^ cted^ ,^ s^ t^ an^ d^ a^ r^ d^ iz^ ed^ and^ ma^ t^ c^ h^ ed^ d^ a^ t^ a^ can^ now be c o ns o li d a ted to build a single version of the e nte r p ri s e d a t a.
Selecting an ETL Tool
Selection of an a pp rop r ia te ETL Tool is an i mpo rt ant de c is i o n that has to be made in c h oo s i n g c o mpo n e nt s of an ODS or d a t a wa r e h o us i n g a pp li ca t i o n. The E TL too l is req u i r ed to p r o v i de c oo r d i na ted access to multiple data sources so that r e l ev ant data may
i nf o r ma t i o n managers.
they require, it is likely to red u c e t h e b ur eau cra c y a ro un d t h e
· To p r o v i de a d a t a b a s e that s to r e s r e la t iv e l y clean d a t a. By using a good E TL p r o c e ss , the data warehouse s h o ul d have data of high q ua lit y. When errors are d is c o v er ed it may be des ira b l e to correct them directly in the data warehouse and then propagate the c o rr e ct i o ns to the OLTP s ys te ms. · To p r o v i de a database that stores historical data that may have been de l eted from t h e O LT P systems. To i mp r o v e re s po n s e t im e , h is to r i ca l d a t a is usually n ot ret a i n ed in O L TP s ys te ms ot h er t han t ha t which is req u i r ed to respond to customer q u er i es. The data warehouse can then store the data that is purged from the OLTP systems.
A useful way of s h o w i n g the r e la t io n s h i p between OLTP systems, a data warehouse and an ODS is given in Figure 7.2. The data warehouse is more like lo n g term memory of an en t erp r is e. The ob j e ct iv es in b u il d i n g the two systems, O D S and data w a r e ho u s e , are s o m ew ha t s ch em es.
c o nf li ct i n g and t h er e fo r e t h e t wo d a t a b a s es are likely to have d iff er e nt
Data w a r eh o u s e (^) OLTP s ys te m
Figure 7.2 Relationship between OLTP, ODS and DW systems.
In b u il d i n g and ODS, data w a r e h o us i n g is a process of integ ra t in g e nter p r is e-w i de d a t a , o r i g ina t i n g from a variety of s o u rc es , i nto a single repo s ito r y. As shown in Figure 7.3, t h e data warehouse may be a c e n t ra l e nte r p r is e-w i de data warehouse for use by all t h e de c is i o n makers in the e nter p r is e or it may c o ns is t of a number of smaller data w a r e h o u s e (often called data marts or local data warehouses) A d a t a ma rt s to r es i n f o r ma t i o n for a lim ited num be r of s u b j e ct areas. Fo r example, a company m i g ht have a data mart about ma rk et i n g that supports ma rk et i n g and s a les. The data ma rt approach is a tt ra ct iv e since beg i nn i n g with a single data mart is relatively inexpensive and ea s i e r to i m p l e m e nt. A data mart may be used as a proof of data warehouse c o n c ept. Data marts can a ls o cr ea te difficulties by s ett i n g up ―silos of i nf o r ma t i o n‖ a lt h o u g h one may build dependent data ma rt s , which are pop u la ted form the c ent ra l data warehouse. Data marts are often the common approach for b u il d i n g a data warehouse since t h e c o s t curve for d a t a ma r t s tends to be m o r e linear. A c e nt ra liz ed d a t a warehouse p r o j e ct can be very r e s o u rc e i nten s iv e and r eq u ir e s s i g n if i cant i nv es t m e nt at t h e beg i nn i n g a lt h o u g h overall costs over a number of years for a c ent ra li z ed data w a r e ho u s e and for de c e nt ra liz ed data marts are likely to be similar. A c e nt ra liz ed wa r eh o us e can p r o v i de bette r q ua lit y d a t a and minimize d a t a i n co n s i s ten c i es since the data quality is c o nt r o ll ed centrally. The too ls and procedures for p u tt i n g data in the warehouse can then be better c o nt r o ll ed. Controlling data quality w it h a de c e nt ra liz ed approach is obviously more difficult. As with any c e nt ra liz ed f un ct i o n , t h o ug h, the unit s or branches of an e nter p ris e may feel no o wn e r s h i p of the c e nt ra liz ed wa r eh o us e may in s o m e cases n ot fu ll y c oope ra te wit h t h e a d mi n i s t ra t i o n of t h e w a re ho us e. A ls o , ma i nt a in i n g a ce nt ra liz ed wa r e h o us e would r eq u i r e considerable c oo r d i na t i o n among the various un it s if the e nte r p ri s e is la r ge and t h is c oo r d i na t i o n may incur significant costs for the en t erp r is e. A s a n ex a m p l e of a d a t a w a reh o u s e a pp li ca t i o n we co n s ider th e te l e co mm uni ca t i o ns i n d us t r y which in m o s t c o unt r i es has be c o m e very c o m pet it iv e d u r i n g t h e la s t few years. If a company is able to identify a ma r k et t r e n d bef o r e it s c o mpet ito r s do , then that can l ea d to a c o m pet it iv e a d v ant a ge. What is therefore needed is to ana l ys e customer needs and be ha v i o u r in an attempt to bette r understand wha t t h e
B us i n e ss In teT lli oo ge l s n ce
E T L p r o c ess E xtr a c t T r ansf o r m and L o a d
E T L p r o c ess
E T L p r o c ess (^) p rE o^ T c^ L ess
DM a a t r at
DM a a t a rt
DM a a t r at
BI Tool
BI Tool
BI Tool
Da il y Change P r o c ess
(S tagi ng A r ea )
Da il y Change P r o c ess OData pe r a S t i ot o n r a el ( O D S)
W a^ D r e^ a h^ t o^ a use ( DW)
Figure 7.4 Another structure for ODS and DW
The a rc h ite ct ur e of a system that i n cl u de s an ODS and a data warehouse shown in Figure 7.4 is more complex. It involves e x t ra ct i n g i nf o r m a t io n from source systems by us i n g an ETL process and then s to r i n g the i nf o rm a t io n in a s t a g i n g d a t a b a s e. The daily changes also come to the s t a g i n g area. Another ETL process is used to transform i nf o r ma t i o n from t h e s t a g i n g area to pop ula te t h e O DS. The O DS is t he n used for supplying i nf o r ma t i o n via another ETL process to the area warehouse which in turn feeds a num be r of data marts that generate the reports r eq u ir ed by m ana ge m e nt. It s h o u l d be noted t ha t not all ETL processes in t h is a rc h ite ct u r e involve data cleaning, some may only involve data e x t ra ct io n and t rans fo r ma t i o n to s u it the target systems.
1.5 DATA WAREHOUSE DESIGN
There are a number of ways of c o n c ept ua liz i n g a data w a r e h o us e. One approach is to view it as a t h r ee- l ev e l s t r u ct ur e. The l o w es t level c o ns is t s of t h e O L T P and legacy systems, the m i dd l e level c o ns is t s of the g l ob a l or c e nt ra l data warehouse while the top level c o ns i s t s of local data warehouses or data ma rt s. Another approach is po ss i b l e if t h e e n ter p ri s e has an ODS. The three levels then m ig ht c o ns is t of OLTP and legacy s ys te ms at the botto m , the ODS in the mi dd l e and the data warehouse at the top. Whatever the a rc h itect u r e , a data warehouse needs to have a data m ode l that can form the b a s is for i m p l e m e nt in g it. To de v e lop a data m odel we view a data warehouse as a m u lt i d i m en s io na l structure c o ns is t i n g of d i m e n s i o ns , since that is an intuitive model t ha t ma t c h e s t h e t ypes of O LA P q u e r i es posed by mana ge m e nt. A d i m e n s i o n is an o r d i na te wit hi n a m ult i d i m en s io na l s t r uct u r e c o ns is t i n g of a lis t of o r de r ed values (s o m et i m es called members) j u s t like the x-axis and y-axis values on a t w o-d i m e ns i o na l g r ap h.
A characteristic of a star schema is that all the d i m e ns i o ns directly link to the fact t a b l e. The fact t a b l e may look like t a b l e 7.1 and the d i m e ns i o n t a b l es may look Tables 7.2 to 7.5.
Table 7.1 An^ e^ x^ a^ m^ p^ l^ e^ of^ the fact^ t^ a^ b^ l^ e
_ Year Degree name Country name Scholarship name Number 200301 BSc Australia Govt 35 199902 MBBS Canada None 50 200002 LLB USA ABC 22 199901 BCom UK Commonwealth 7 200102 LLB Australia Equity 2
The first d i m e ns i o n is the degree d im e n s i o n. An ex a m p l e of t hi s d im e n s i o n t a b l e is Table 7.2.
Table 7.2 An e x a m p l e of the degree d i m e ns i o n t a b l e
_ Name Faculty Scholarship eligibility Number of semesters B S c S c i en c e Y es 6 M BB S Medicine No 10 LLB La w Y es 8 B Co m LLB
Business No 6 Arts No 6
We now present the second d im e n s i o n , the country d i m e ns i o n. An e x a m p l e of t h is d i m en s i o n t a b l e is Table 7.3. Table 7.3 An e x a m p l e of the country d im e ns i o n t a b l e
N a m e religion N epa l I n do n es ia
Continent E du c at io n Level
Asia Lo w Asia Lo w
Major
Hinduism Islam N o rw a y S in g a po r e C o l o mb ia
Eu rope Asia South America
H ig h H ig h Lo w
Ch ris t i a n it y N U LL Ch ris t i a n it y
The t h ir d d i m e ns i o n is the s c h o la r s h i p d im e ns i o n. The d im e n s i o n t a b l e is given in Table 7.4.
Table 7.4 An e x a m p l e of the s c ho la rs h i p d i m e ns i o n t a b l e
_ Name Amount (%) Scholarship eligibility Number
Colombo 100 All 6 Equity 100 Low income 10 Asia 50 Top 5 % 8 Merit 75 Top 5 % 5 Bursary 25 Low income 12
The fourth d i m en s i o n is the year d i m e ns i o n. The d i m e ns i o n t a b l e is given in Table 7.5.
Table 7.5 An^ ex^ a^ m^ p^ l^ e^ of^ the year^ d^ im^ e^ ns^ i^ o^ n^ t^ a^ b^ l^ e
N a m e New programs 2001 2002
J o u rn a lis m M u l t im ed i a