Data Duplication Elimination, BSN Method - Data Warehousing - Lecture Slides, Slides of Data Warehousing

Topics covered: data duplication elimination, the BSN method, problems due to data duplication, non-unique PKs, house holding, individualization, and formal definition and nomenclature. Some other terms are also described in these data warehousing lecture slides.

Typology: Slides

2011/2012

Uploaded on 11/03/2012

padmal

Partial preview of the text

Slide 1: Data Warehousing, Lecture 20
Data Duplication Elimination & BSN Method

Slide 2: Why is data duplicated?
A data warehouse is created from heterogeneous sources, with heterogeneous databases (different schemas/representations) of the same entity.
Data coming from outside the organization that owns the DWH can be of even lower quality, i.e. different representations of the same entity, and transcription or typographical errors.

Slide 5: Data Duplication: House Holding
• Group together all records that belong to the same household.
Why bother?

    …   S. Ahad       440, Munir Road, Lahore           …
    …   …             …
    …   Shiekh Ahad   No. 440, Munir Rd, Lhr            …
    …   Shiekh Ahed   House # 440, Munir Road, Lahore   …
    …   …             …

Slide 6: Data Duplication: Individualization
• Identify multiple records in each household which represent the same individual.
Address field is standardized. By coincidence?

    …   M. Ahad    440, Munir Road, Lahore   …
    …   …          …
    …   Maj Ahad   440, Munir Road, Lahore

Slide 7: Formal definition & Nomenclature
• Problem statement:
    - "Given two databases, identify the potentially matched records efficiently and effectively."
• Many names, such as:
    - Record linkage
    - Merge/purge
    - Entity reconciliation
    - List washing and data cleansing
• The current market and tools are heavily centered on customer lists.

Slide 10: Basic Sorted Neighborhood (BSN) Method
• Concatenate the data into one sequential list of N records.
• Step 1: Create keys
    - Compute a key for each record in the list by extracting relevant fields or portions of fields.
    - The effectiveness of this method depends heavily on a properly chosen key.
• Step 2: Sort data
    - Sort the records in the data list using the key from Step 1.
• Step 3: Merge
    - Move a fixed-size window through the sequential list of records, limiting the comparisons for matching records to those records in the window.
    - If the size of the window is w records, then every new record entering the window is compared with the previous w-1 records.

Slide 11: BSN Method: Sliding Window
[Figure: a window of w records slides down the sorted list; labels: "Current window of records" (w) and "Next window of records" (w).]

Slide 12: BSN Method: Selection of Keys
• Effectiveness is highly dependent on the key selected to sort the records, e.g. middle name vs. family name.
• A key is a sequence of a subset of attributes, or of sub-strings within the attributes, chosen from the record.
• The keys are used for sorting the entire dataset with the intention that matched candidates will appear close to each other.

    First      Middle   Address            NID        Key
    Muhammed   Ahmad    440 Munir Road     34535322   AHM440MUN345
    Muhammad   Ahmad    440 Munir Road     34535322   AHM440MUN345
    Muhammed   Ahmed    440 Munir Road     34535322   AHM440MUN345
    Muhammad   Ahmar    440 Munawar Road   34535334   AHM440MUN345

Slide 15: BSN Method: Matching Candidates
Merging of records is a complex inferential process.
Example-1: Two persons with names spelled nearly, but not identically, alike have the exact same address. We infer they are the same person, e.g. Noma Abdullah and Noman Abdullah.
Example-2: Two persons have the same National ID number, but their names and addresses are completely different. We infer either that this is the same person, who changed his name and moved, or that the records represent different persons and the NID is incorrect for one of them. Use of further information such as age, gender etc. can alter the decision.
Example-3: Given gender, Noma (F) and Noman (M), we could perhaps infer that Noma and Noman are siblings, i.e. sister and brother. Given age, Noma (30) and Noman (5), we could perhaps infer mother and son.

Slide 16: Complexity Analysis of BSN Method
• Time complexity: O(n log n), since sorting dominates as long as w grows no faster than log n
    - O(n) for key creation
    - O(n log n) for sorting
    - O(w n) for matching, where 2 ≤ w ≤ n
    - Constants vary a lot
• At least three passes are required over the dataset.
• The complexity of the matching rules and the window size are detrimental to running time.
• For large sets, disk I/O is detrimental.

Slide 17: BSN Method: Equational Theory
To specify the inferences we need an equational theory.
• Logic is NOT based on string equivalence.
• Logic is based on domain equivalence.
• Requires a declarative rule language.
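
The three BSN steps and the idea of a rule-based match can be sketched in a few lines of Python. This is only an illustrative sketch, not the lecturer's implementation: the record field names (`id`, `first`, `middle`, `address`, `nid`), the key layout (inferred from the key table on Slide 12), the `is_match` rule and the 0.8 similarity threshold are all assumptions, and `difflib.SequenceMatcher` merely stands in for a real equational theory.

```python
from difflib import SequenceMatcher

# Sample records taken from the key table on Slide 12; the "id" field is
# a hypothetical addition so that matched pairs can be reported.
RECORDS = [
    {"id": 1, "first": "Muhammed", "middle": "Ahmad", "address": "440 Munir Road", "nid": "34535322"},
    {"id": 2, "first": "Muhammad", "middle": "Ahmad", "address": "440 Munir Road", "nid": "34535322"},
    {"id": 3, "first": "Muhammed", "middle": "Ahmed", "address": "440 Munir Road", "nid": "34535322"},
    {"id": 4, "first": "Muhammad", "middle": "Ahmar", "address": "440 Munawar Road", "nid": "34535334"},
]


def make_key(rec):
    """Step 1: build a key from portions of fields.

    Assumed layout: 3 letters of the middle name + house number +
    3 letters of the street name + 3 digits of the NID.
    """
    house_no, street = rec["address"].split(maxsplit=1)
    return (rec["middle"][:3] + house_no + street[:3] + rec["nid"][:3]).upper()


def similar(a, b, threshold=0.8):
    """Approximate string comparison; the threshold is an arbitrary assumption."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def is_match(r1, r2):
    """Hypothetical matching rule: same NID, or near-identical names at the same address."""
    if r1["nid"] == r2["nid"]:
        return True
    names_close = similar(r1["first"], r2["first"]) and similar(r1["middle"], r2["middle"])
    return names_close and r1["address"] == r2["address"]


def sorted_neighborhood(records, w):
    """Steps 2 and 3: sort by key, then compare each record with the previous w-1 records."""
    ordered = sorted(records, key=make_key)  # stable sort, O(n log n)
    pairs = []
    for i, rec in enumerate(ordered):
        # Window of the w-1 records that precede the current one.
        for prev in ordered[max(0, i - (w - 1)):i]:
            if is_match(prev, rec):
                pairs.append((prev["id"], rec["id"]))
    return pairs


if __name__ == "__main__":
    print([make_key(r) for r in RECORDS])
    print(sorted_neighborhood(RECORDS, w=3))
```

All four sample records receive the key AHM440MUN345, so they sit next to each other after sorting. With w=3 the sketch reports the pairs (1, 2), (1, 3) and (2, 3); record 4 is rejected by the assumed rule because both its NID and its address differ. Shrinking the window to w=2 drops the pair (1, 3), which shows how the window size limits the comparisons.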