









Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Data Duplication Elimination, BSN Method, Problems due to data duplication, Non Unique PK, House Holding, Individualization, Formal definition and Nomenclature. Some Other terms are also described in these data warehousing lecture slides.
Typology: Slides
1 / 17
This page cannot be seen from the preview
Don't miss anything!










1
2
4
Unable to determine customer relationships (CRM)
Unable to analyze employee benefits trends
Name Phone Number Cust. No. M. Ismail Siddiqi 021.666.1244 780701 M. Ismail Siddiqi 021.666.1244 780203 M. Ismail Siddiqi 021.666.1244 780009
Bonus Date Name Department Emp. No. Jan. 2000 Khan Muhammad 213 (MKT) 5353536 Dec. 2001 Khan Muhammad 567 (SLS) 4577833 Mar. 2002 Khan Muhammad^ 349 (HR) 3457642
Data Duplication: Non-Unique PK
5
Data Duplication: House Holding
Why bother?
7
Problem statement:
Many names, such as:
Current market and tools heavily centered
towards customer lists.
8
10
Basic Sorted Neighborhood (BSN) Method
Concatenate data into one sequential list of N records
Steps 1: Create Keys Compute a key for each record in the list by extracting relevant fields or portions of fields
Effectiveness of the this method highly depends on a properly chosen key
Step 2: Sort Data Sort the records in the data list using the key of step 1
Step 3: Merge Move a fixed size window through the sequential list of records limiting the comparisons for matching records to those records in the window
If the size of the window is w records then every new record entering the window is compared with the previous w-1 records.
11
Current window
of records w Next window of records w
13
Technology Tech. Techno. Tchnlgy
14
No Name Address Gender 1 Syed N Jaffri 420 15 4 Chaklala No Rawalpindi Street M 2 Syed Noman 420 4 Rwp Scheme M 3 Saiam Noor 5 Afshan Colony Flat Lahore Road Saidpur F
No Name Address Gender 1 N. Jaffri, Syed No. 420, Street 15, Chaklala 4, Rawalpindi M
2 S. Noman 420, Scheme 4, Rwp M 3 Saiam Noor Flat 5, Afshan Colony, Saidpur Road, Lahore F
If contents of fields are not properly ordered, similar records will NOT fall in the same window. Example: Records 1 and 2 are similar but will occur far apart.
Solution is to TOKENize the fields i.e. break them further. Use the tokens in different fields for sorting to fix the error. Example: Either using the name or the address field records 1 and 2 will fall close.
16
Time Complexity: O(n log n)
At least three passes required on the dataset.
Complexity or rule and window size detrimental.
For large sets disk I/O is detrimental.
17
To specify the inferences we need equational Theory.