Data Warehousing
Lecture-20
Data Duplication Elimination & BSN Method
Docsity.com

Why is data duplicated?

A data warehouse is created from heterogeneous sources, with heterogeneous databases (different schemas/representations) of the same entity.

Data coming from outside the organization that owns the DWH can be of even lower quality, i.e. different representations of the same entity, and transcription or typographical errors.

Data Duplication: House Holding

Group together all records that belong to the same household.

    ...   S. Ahad        440, Munir Road, Lahore            ...
    ...   ...            ...                                ...
    ...   Shiekh Ahad    No. 440, Munir Rd, Lhr             ...
    ...   Shiekh Ahed    House # 440, Munir Road, Lahore    ...
    ...   ...            ...                                ...

Why bother?

Data Duplication: Individualization

Identify multiple records in each household which represent the same individual.

    ...   M. Ahad        440, Munir Road, Lahore            ...
    ...   ...            ...                                ...
    ...   Maj Ahad       440, Munir Road, Lahore            ...

The address field is standardized. By coincidence?

Formal definition & Nomenclature

Problem statement: "Given two databases, identify the potentially matched records efficiently and effectively."

Many names, such as:
    Record linkage
    Merge/purge
    Entity reconciliation
    List washing and data cleansing

The current market and tools are heavily centered on customer lists.
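The house holding rows above differ only in how the same address is written ("No. 440, Munir Rd, Lhr" vs. "House # 440, Munir Road, Lahore"). A minimal sketch of how such addresses might be standardized before grouping is given here. The abbreviation table, the prefix pattern, and the function names `standardize_address` and `households` are illustrative assumptions, not part of the lecture; a real cleansing tool would use far richer reference data.

```python
import re

# Hypothetical abbreviation table, built only from the sample rows.
ABBREVIATIONS = {"rd": "Road", "lhr": "Lahore"}

# Hypothetical house-number prefixes seen in the sample rows.
PREFIX = re.compile(r"^(?:house\s*#|no\.)\s*", re.IGNORECASE)


def standardize_address(raw: str) -> str:
    """Strip a house-number prefix and expand known abbreviations."""
    text = PREFIX.sub("", raw.strip())
    fixed_parts = []
    for part in text.split(","):
        words = [
            ABBREVIATIONS.get(word.lower().rstrip("."), word)
            for word in part.split()
        ]
        fixed_parts.append(" ".join(words))
    return ", ".join(fixed_parts)


def households(records):
    """Group (name, address) pairs by their standardized address."""
    groups = {}
    for name, address in records:
        groups.setdefault(standardize_address(address), []).append(name)
    return groups


sample = [
    ("S. Ahad", "440, Munir Road, Lahore"),
    ("Shiekh Ahad", "No. 440, Munir Rd, Lhr"),
    ("Shiekh Ahed", "House # 440, Munir Road, Lahore"),
]
grouped = households(sample)
```

With the three sample rows, all names end up under the single standardized address, which is the grouping that house holding aims for. Deciding which of those names denote the same individual is a separate step.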
Basic Sorted Neighborhood (BSN) Method

Concatenate the data into one sequential list of N records.

Step 1: Create Keys
    Compute a key for each record in the list by extracting relevant fields or portions of fields.
    The effectiveness of this method depends heavily on a properly chosen key.

Step 2: Sort Data
    Sort the records in the data list using the key of Step 1.

Step 3: Merge
    Move a fixed-size window through the sequential list of records, limiting the comparisons for matching records to those records in the window.
    If the size of the window is w records, then every new record entering the window is compared with the previous w-1 records.

BSN Method: Sliding Window

[Figure: the sequential list of records, showing the current window of w records and the next window of w records.]

BSN Method: Selection of Keys

Effectiveness is highly dependent on the key selected to sort the records, e.g. middle name vs. family name.

A key is a sequence of a subset of attributes, or of sub-strings within the attributes, chosen from the record.

The keys are used for sorting the entire dataset with the intention that matched candidates will appear close to each other.

    First      Middle   Address            NID        Key
    Muhammed   Ahmad    440 Munir Road     34535322   AHM440MUN345
    Muhammad   Ahmad    440 Munir Road     34535322   AHM440MUN345
    Muhammed   Ahmed    440 Munir Road     34535322   AHM440MUN345
    Muhammad   Ahmar    440 Munawar Road   34535334   AHM440MUN345

BSN Method: Matching Candidates

Merging of records is a complex inferential process.

Example-1: Two persons whose names are spelled nearly, but not exactly, the same have the exact same address. We infer they are the same person, e.g. Noma Abdullah and Noman Abdullah.

Example-2: Two persons have the same National ID number, but their names and addresses are completely different. We infer either that this is the same person who changed his name and moved, or that the records represent different persons and the NID is incorrect for one of them. Further information such as age, gender etc. can alter the decision.
Example-3: Given Noma-F and Noman-M, we could perhaps infer that Noma and Noman are siblings, i.e. sister and brother. Given Noma-30 and Noman-5, we could perhaps infer mother and son.

Complexity Analysis of BSN Method

Time complexity: O(n log n) overall, as long as w stays below log n; otherwise the O(wn) matching term dominates.
    O(n) for key creation
    O(n log n) for sorting
    O(wn) for matching, where 2 ≤ w ≤ n

Constants vary a lot.

At least three passes are required over the dataset.

The complexity of the rules and the window size are detrimental to performance.

For large data sets, disk I/O is detrimental to performance.

BSN Method: Equational Theory

To specify the inferences, we need an equational theory.
    The logic is NOT based on string equivalence.
    The logic is based on domain equivalence.
    It requires a declarative rule language.
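The three BSN steps (create keys, sort, merge with a sliding window) can be sketched as follows, using the four rows from the key-selection table. This is an illustrative sketch under stated assumptions, not the lecture's implementation. The key recipe in `make_key` (first three letters of the middle name, the house number, the first three letters of the street, the first three digits of the NID) is inferred from the sample keys. The rule in `is_match` is a deliberately crude, hypothetical stand-in for an equational theory, and the names `Person`, `make_key`, `is_match` and `bsn` are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    first: str
    middle: str
    address: str
    nid: str


def make_key(rec: Person) -> str:
    """Step 1: build a sort key from portions of fields (assumed recipe)."""
    house_no, street = rec.address.split(maxsplit=1)
    return (rec.middle[:3] + house_no + street[:3] + rec.nid[:3]).upper()


def is_match(a: Person, b: Person) -> bool:
    """Hypothetical matching rule: same NID and same address."""
    return a.nid == b.nid and a.address == b.address


def bsn(records, w):
    """Steps 2 and 3: sort by key, then compare each record
    with the previous w-1 records in the sorted list."""
    ordered = sorted(records, key=make_key)
    pairs = []
    for i, rec in enumerate(ordered):
        for j in range(max(0, i - (w - 1)), i):
            if is_match(ordered[j], rec):
                pairs.append((ordered[j], rec))
    return pairs


people = [
    Person("Muhammed", "Ahmad", "440 Munir Road", "34535322"),
    Person("Muhammad", "Ahmad", "440 Munir Road", "34535322"),
    Person("Muhammed", "Ahmed", "440 Munir Road", "34535322"),
    Person("Muhammad", "Ahmar", "440 Munawar Road", "34535334"),
]
keys = [make_key(p) for p in people]
matches = bsn(people, w=3)
```

All four rows produce the same key, so sorting places them next to each other, yet the fourth row (different street and NID) is still rejected by the matching rule. The inner loop performs at most w-1 comparisons per record, which is where the O(wn) matching cost comes from.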