Data Duplication Elimination, BSN Method - Data Warehousing - Lecture Slides, Slides of Data Warehousing

Data Duplication Elimination, BSN Method, Problems due to data duplication, Non Unique PK, House Holding, Individualization, Formal definition and Nomenclature. Some Other terms are also described in these data warehousing lecture slides.

Typology: Slides

2011/2012

Uploaded on 11/03/2012

padmal
padmal 🇮🇳

4.5

(15)

75 documents

1 / 17

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
1
Data Warehousing
Lecture-20
Data Duplication Elimination & BSN Method
Docsity.com
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff

Partial preview of the text

Download Data Duplication Elimination, BSN Method - Data Warehousing - Lecture Slides and more Slides Data Warehousing in PDF only on Docsity!

1

Data Warehousing

Lecture-

Data Duplication Elimination & BSN Method

2

Why data duplicated?

A data warehouse is created from heterogeneous sources,

with heterogeneous databases (different

schema/representation) of the same entity.

The data coming from outside the organization owning the

DWH, can have even lower quality data i.e. different

representation for same entity, transcription or typographical

errors.

4

Unable to determine customer relationships (CRM)

Unable to analyze employee benefits trends

Name Phone Number Cust. No. M. Ismail Siddiqi 021.666.1244 780701 M. Ismail Siddiqi 021.666.1244 780203 M. Ismail Siddiqi 021.666.1244 780009

Bonus Date Name Department Emp. No. Jan. 2000 Khan Muhammad 213 (MKT) 5353536 Dec. 2001 Khan Muhammad 567 (SLS) 4577833 Mar. 2002 Khan Muhammad^ 349 (HR) 3457642

  • Duplicate Identification Numbers
    • Multiple Customer Numbers
  • Multiple Employee Numbers

Data Duplication: Non-Unique PK

5

Data Duplication: House Holding

 Group together all records that belong to the same

household.

Why bother?

……… S. Ahad^ 440, Munir Road, Lahore

……… ………….…^ ………………………………

……… Shiekh Ahad^ No. 440, Munir Rd, Lhr

……… Shiekh^ Ahed^ House # 440, Munir Road, Lahore

……… ………….…^ ………………………………

7

Formal definition & Nomenclature

 Problem statement:

 “Given two databases, identify the potentially

matched records Efficiently and Effectively”

 Many names, such as:

 Record linkage

 Merge/purge

 Entity reconciliation

 List washing and data cleansing.

 Current market and tools heavily centered

towards customer lists.

8

Need & Tool Support

 Logical solution to dirty data is to clean it in some way.

 Doing it manually is very slow and prone to errors.

 Tools are required to do it “cost” effectively to achieve

reasonable quality.

 Tools are there, some for specific fields, others for specific

cleaning phase.

 Since application specific, so work very well, but need

support from other tools for broad spectrum of cleaning

problems.

10

Basic Sorted Neighborhood (BSN) Method

Concatenate data into one sequential list of N records

Steps 1: Create KeysCompute a key for each record in the list by extracting relevant fields or portions of fields

Effectiveness of the this method highly depends on a properly chosen key

Step 2: Sort DataSort the records in the data list using the key of step 1

Step 3: MergeMove a fixed size window through the sequential list of records limiting the comparisons for matching records to those records in the window

If the size of the window is w records then every new record entering the window is compared with the previous w-1 records.

11

BSN Method : Sliding Window

Current window

of records w Next window of records w

13

BSN Method: Problem with keys

 Since data is dirty, so keys WILL also be dirty, and

matching records will not come together.

 Data becomes dirty due to data entry errors or use of

abbreviations. Some real examples are as follows:

 Solution is to use external standard source files to validate the

data and resolve any data conflicts.

Technology Tech. Techno. Tchnlgy

14

BSN Method: Problem with keys (e.g.)

No Name Address Gender 1 Syed N Jaffri 420 15 4 Chaklala No Rawalpindi Street M 2 Syed Noman 420 4 Rwp Scheme M 3 Saiam Noor 5 Afshan Colony Flat Lahore Road Saidpur F

No Name Address Gender 1 N. Jaffri, Syed No. 420, Street 15, Chaklala 4, Rawalpindi M

2 S. Noman 420, Scheme 4, Rwp M 3 Saiam Noor Flat 5, Afshan Colony, Saidpur Road, Lahore F

If contents of fields are not properly ordered, similar records will NOT fall in the same window. Example: Records 1 and 2 are similar but will occur far apart.

Solution is to TOKENize the fields i.e. break them further. Use the tokens in different fields for sorting to fix the error. Example: Either using the name or the address field records 1 and 2 will fall close.

16

 Time Complexity: O(n log n)

 O (n) for Key Creation

 O (n log n) for Sorting

 O (w n) for matching, where w ≤ 2 ≤ n

 Constants vary a lot

 At least three passes required on the dataset.

 Complexity or rule and window size detrimental.

 For large sets disk I/O is detrimental.

Complexity Analysis of BSN Method

17

BSN Method: Equational Theory

To specify the inferences we need equational Theory.

 Logic is NOT based on string equivalence.

 Logic based on domain equivalence.

 Requires declarative rule language.