Hash-Based Improvements - Advanced Database System - Lecture Slides, Slides of Database Management Systems (DBMS)

Some concepts of Advanced Database Systems are: Types Supported, Simple Data Model, Concurrency Control Two, Continuously Adaptive, Cost-Based Optimization, Data Access From Disks, Data Warehousing. Main points of this lecture are: Hash-Based Improvements, Memory, Condition, Picture, Item Counts, Frequent Items, Bitmap, Organize Main Memory, Many Integers, Representing Buckets.

Typology: Slides

2012/2013

Uploaded on 04/27/2013

dhanapati

Hash-Based Improvements to A-Priori

PCY Algorithm

  • Hash-based improvement to A-Priori.
  • During Pass 1 of A-Priori, most memory is idle.
  • Use that memory to keep counts of buckets into which pairs of items are hashed.
    • Just the count, not the pairs themselves.
  • Gives extra condition that candidate pairs must satisfy on Pass 2.

PCY Algorithm --- Before Pass 1

  • Organize main memory:
    • Space to count each item.
      • One (typically) 4-byte integer per item.
    • Use the rest of the space for as many integers, representing buckets, as we can.
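
As a rough illustration of this layout, the sketch below reserves one integer per item and turns whatever memory is left into bucket counters. The function name `pcy_bucket_budget` and the memory figures in the usage note are assumptions made for the example, not values from the slides.

```python
def pcy_bucket_budget(memory_bytes, n_items,
                      bytes_per_item_count=4, bytes_per_bucket=4):
    """Return how many bucket counters fit after reserving the item counts."""
    item_space = n_items * bytes_per_item_count
    if item_space > memory_bytes:
        raise ValueError("item counts alone do not fit in main memory")
    # Everything left over is divided into fixed-size bucket counters.
    return (memory_bytes - item_space) // bytes_per_bucket
```

For example, with 10^9 bytes of memory and one million items, 4 * 10^6 bytes go to item counts and the remainder holds 249,000,000 four-byte buckets.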

PCY Algorithm --- Pass 1

FOR (each basket) {
    FOR (each item in the basket)
        add 1 to item’s count;
    FOR (each pair of items in the basket) {
        hash the pair to a bucket;
        add 1 to the count for that bucket
    }
}
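
The loop above might be written in Python roughly as follows. This is a minimal sketch: items are assumed to be integers, and `hash_pair` is an arbitrary illustrative hash function, not one prescribed by the slides.

```python
from collections import Counter
from itertools import combinations


def hash_pair(i, j, n_buckets):
    # Illustrative hash only (an assumption); any function that
    # spreads pairs over the buckets would do.
    return (i * 31 + j) % n_buckets


def pcy_pass1(baskets, n_buckets):
    """Scan the baskets once; return (item_counts, bucket_counts)."""
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for basket in baskets:
        items = sorted(set(basket))          # de-duplicate, fix pair order
        for item in items:
            item_counts[item] += 1
        for i, j in combinations(items, 2):  # every pair with i < j
            # Only the bucket's count is kept, never the pair itself.
            bucket_counts[hash_pair(i, j, n_buckets)] += 1
    return item_counts, bucket_counts


baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]
item_counts, bucket_counts = pcy_pass1(baskets, n_buckets=5)
```

Note that several distinct pairs can land in the same bucket, so a bucket's count is an upper bound on the count of any pair that hashes to it.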

PCY Algorithm --- Pass 2

  • Count all pairs {i, j} that meet the conditions:
    1. Both i and j are frequent items.
    2. The pair {i, j} hashes to a bucket whose bit in the bit vector is 1, that is, a bucket whose Pass 1 count reached the support threshold s.
  • Note that both conditions are necessary for the pair to have a chance of being frequent.
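
The two conditions can be sketched end to end as below. This is a sketch under stated assumptions: integer items, an illustrative `hash_pair` function, and a bit vector built between the passes by setting a bucket's bit to 1 when its Pass 1 count is at least s.

```python
from collections import Counter
from itertools import combinations


def hash_pair(i, j, n_buckets):
    # Illustrative hash only (an assumption, not from the slides).
    return (i * 31 + j) % n_buckets


def pcy(baskets, n_buckets, s):
    """Two-pass PCY sketch: frequent pairs for support threshold s."""
    # Pass 1: count items and hash every pair to a bucket.
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for i, j in combinations(items, 2):
            bucket_counts[hash_pair(i, j, n_buckets)] += 1

    # Between passes: keep the frequent items and one bit per bucket.
    frequent_items = {x for x, c in item_counts.items() if c >= s}
    bit_vector = [c >= s for c in bucket_counts]

    # Pass 2: count only the pairs that meet both conditions.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)      # condition 1
        for i, j in combinations(items, 2):
            if bit_vector[hash_pair(i, j, n_buckets)]:    # condition 2
                pair_counts[(i, j)] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}


baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]
frequent_pairs = pcy(baskets, n_buckets=5, s=3)
```

With these four baskets and s = 3, only the pair (1, 2) survives, with a count of 3. The final filter is still needed because a pair can pass both conditions without being frequent itself.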

Memory Details

  • Hash table requires buckets of 2-4 bytes.
    • Number of buckets thus almost 1/4-1/2 of the number of bytes of main memory.
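  • On the second pass, a table of (item, item, count) triples is essential, because the surviving candidate pairs are scattered and cannot be laid out in a dense array.
    • A triple takes three integers per pair, so we need to eliminate 2/3 of the candidate pairs to beat A-Priori.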
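
The 2/3 figure follows from simple arithmetic, sketched below. The comparison assumes that plain A-Priori stores one integer for every pair of frequent items, while PCY stores three integers per surviving candidate pair; the function names are made up for this example.

```python
def dense_pair_bytes(n_frequent_items, bytes_per_int=4):
    """A-Priori baseline: one count for every pair of frequent items."""
    n_pairs = n_frequent_items * (n_frequent_items - 1) // 2
    return n_pairs * bytes_per_int


def triples_bytes(n_candidate_pairs, bytes_per_int=4):
    """PCY Pass 2: (item, item, count) costs three integers per pair."""
    return 3 * n_candidate_pairs * bytes_per_int


def pcy_beats_apriori(n_frequent_items, n_candidate_pairs):
    """True when the triples table is smaller than the dense pair counts."""
    return triples_bytes(n_candidate_pairs) < dense_pair_bytes(n_frequent_items)
```

With 1,000 frequent items there are 499,500 possible pairs. The triples table is smaller only if fewer than 166,500 of them remain candidates, which is one third of the total.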

Multistage Picture

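[Figure: main-memory layout in each of the three passes. Pass 1: item counts and the first hash table. Pass 2: frequent items, Bitmap 1, and the second hash table. Pass 3: frequent items, Bitmap 1, Bitmap 2, and counts of candidate pairs.]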

Multistage --- Pass 3

  • Count only those pairs {i, j} that satisfy:
    1. Both i and j are frequent items.
    2. Using the first hash function, the pair hashes to a bucket whose bit in the first bit-vector is 1.
    3. Using the second hash function, the pair hashes to a bucket whose bit in the second bit-vector is 1.
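
A compact sketch of all three multistage passes follows. It assumes integer items and two illustrative hash functions, `h1` and `h2`, which are not from the slides; on the second pass only pairs that already satisfy the first two conditions are rehashed.

```python
from collections import Counter
from itertools import combinations


def h1(i, j, n):
    return (i * 31 + j) % n        # illustrative first hash (assumption)


def h2(i, j, n):
    return (i * 17 + j * 13) % n   # illustrative second hash (assumption)


def multistage(baskets, n1, n2, s):
    """Three-pass multistage sketch: frequent pairs for threshold s."""
    # Pass 1: item counts plus the first hash table.
    item_counts = Counter()
    table1 = [0] * n1
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for i, j in combinations(items, 2):
            table1[h1(i, j, n1)] += 1
    frequent = {x for x, c in item_counts.items() if c >= s}
    bitmap1 = [c >= s for c in table1]

    # Pass 2: rehash only pairs of frequent items in a frequent bucket.
    table2 = [0] * n2
    for basket in baskets:
        items = sorted(set(basket) & frequent)
        for i, j in combinations(items, 2):
            if bitmap1[h1(i, j, n1)]:
                table2[h2(i, j, n2)] += 1
    bitmap2 = [c >= s for c in table2]

    # Pass 3: count pairs that satisfy all three conditions.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket) & frequent)
        for i, j in combinations(items, 2):
            if bitmap1[h1(i, j, n1)] and bitmap2[h2(i, j, n2)]:
                pair_counts[(i, j)] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}
```

Because far fewer pairs reach the second table, its buckets are less crowded and more of them stay below s, which is where the extra filtering comes from.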

Multihash

  • Key idea: use several independent hash tables on the first pass.
  • Risk: halving the number of buckets doubles the average count. We have to be sure most buckets will still not reach count s.
  • If so, we can get a benefit like multistage, but in only 2 passes.
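
For comparison, a two-pass multihash sketch is shown below. It assumes two tables that split the bucket budget evenly, integer items, and illustrative hash functions `h1` and `h2`; none of these specifics come from the slides.

```python
from collections import Counter
from itertools import combinations


def h1(i, j, n):
    return (i * 31 + j) % n        # illustrative first hash (assumption)


def h2(i, j, n):
    return (i * 17 + j * 13) % n   # illustrative second hash (assumption)


def multihash(baskets, total_buckets, s):
    """Two-pass multihash sketch with two tables sharing total_buckets."""
    n = total_buckets // 2         # each table gets half the buckets
    item_counts = Counter()
    table1 = [0] * n
    table2 = [0] * n
    # Pass 1: every pair is hashed into both tables.
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for i, j in combinations(items, 2):
            table1[h1(i, j, n)] += 1
            table2[h2(i, j, n)] += 1
    frequent = {x for x, c in item_counts.items() if c >= s}
    bitmap1 = [c >= s for c in table1]
    bitmap2 = [c >= s for c in table2]

    # Pass 2: a pair must fall in a frequent bucket of both tables.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket) & frequent)
        for i, j in combinations(items, 2):
            if bitmap1[h1(i, j, n)] and bitmap2[h2(i, j, n)]:
                pair_counts[(i, j)] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}
```

Unlike multistage, both tables here see every pair, so each bucket carries about twice the load it would in a single table of `total_buckets` entries. That is the risk named above.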

Multihash Picture

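[Figure: main-memory layout in each of the two passes. Pass 1: item counts, with the first and second hash tables sharing the remaining memory. Pass 2: frequent items, Bitmap 1, Bitmap 2, and counts of candidate pairs.]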

All (Or Most) Frequent Itemsets in ≤ 2 Passes

  • Simple algorithm.
  • SON (Savasere, Omiecinski, and Navathe).
  • Toivonen.

Simple Algorithm --- (1)

  • Take a main-memory-sized random sample of the market baskets.
  • Run A-Priori or one of its improvements (for sets of all sizes, not just pairs) in main memory, so you don’t pay for disk I/O each time you increase the size of itemsets.
    • Be sure you leave enough space for counts.

Simple Algorithm --- (2)

  • Use as your support threshold a suitable, scaled-back number.
    • E.g., if your sample is 1/100 of the baskets, use s/100 as your support threshold instead of s.
  • Verify that your guesses are truly frequent in the entire data set by a second pass.
  • But you don’t catch sets frequent in the whole but not in the sample.
    • Smaller threshold, e.g., s/125, helps.
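
The sampling approach might look roughly like this. The sketch makes several assumptions that are not in the slides: each basket is sampled independently with probability `fraction`, a small level-wise search stands in for A-Priori, itemsets are capped at `max_size`, and a `slack` factor below 1 lowers the sample threshold further (0.8 turns s/100 into s/125).

```python
import random
from collections import Counter
from itertools import combinations


def frequent_itemsets(baskets, threshold, max_size=3):
    """In-memory level-wise search; returns {frozenset: count}."""
    result, prev = {}, None
    for k in range(1, max_size + 1):
        counts = Counter()
        for basket in baskets:
            for combo in combinations(sorted(set(basket)), k):
                # Count a k-set only if all its (k-1)-subsets were frequent.
                if k == 1 or all(frozenset(sub) in prev
                                 for sub in combinations(combo, k - 1)):
                    counts[frozenset(combo)] += 1
        prev = {x for x, c in counts.items() if c >= threshold}
        if not prev:
            break
        result.update({x: counts[x] for x in prev})
    return result


def simple_sampling(baskets, s, fraction, slack=1.0, seed=0):
    """Find candidates in a sample, then verify them on all baskets."""
    rng = random.Random(seed)
    sample = [b for b in baskets if rng.random() < fraction]
    # Scaled-back threshold for the sample.
    candidates = frequent_itemsets(sample, s * fraction * slack)
    # Second pass over the full data: keep only truly frequent sets.
    counts = Counter()
    for basket in baskets:
        items = set(basket)
        for cand in candidates:
            if cand <= items:
                counts[cand] += 1
    return {c: n for c, n in counts.items() if n >= s}
```

The second pass removes false positives, so everything returned is frequent in the whole data set. Itemsets that were never frequent in the sample are still missed, which is why a lower `slack` trades more candidate counting for fewer misses.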

SON Algorithm --- (1)

  • Repeatedly read small subsets of the baskets into main memory and perform the first pass of the simple algorithm on each subset.
  • An itemset becomes a candidate if it is found to be frequent in any one or more subsets of the baskets.
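
Candidate generation in SON can be sketched as below. The assumptions here are not from the slides: baskets are split into consecutive chunks of `chunk_size`, the threshold for each chunk is s scaled by the chunk's share of the data, and a brute-force counter limited to `max_size` stands in for in-memory A-Priori.

```python
from collections import Counter
from itertools import combinations


def frequent_in_chunk(chunk, threshold, max_size=2):
    """Brute-force stand-in for in-memory A-Priori on one chunk."""
    counts = Counter()
    for basket in chunk:
        items = sorted(set(basket))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    return {x for x, c in counts.items() if c >= threshold}


def son_candidates(baskets, s, chunk_size):
    """Union of the itemsets found frequent in at least one chunk."""
    candidates = set()
    n = len(baskets)
    for start in range(0, n, chunk_size):
        chunk = baskets[start:start + chunk_size]
        # Scale the threshold by the chunk's share of all baskets.
        threshold = s * len(chunk) / n
        candidates |= frequent_in_chunk(chunk, threshold)
    return candidates


baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4], [1, 2], [3, 4]]
candidates = son_candidates(baskets, s=3, chunk_size=3)
```

In this run the candidates are {1}, {2}, {3}, {4}, {1, 2}, and {2, 3}. Only {1}, {2}, {3}, and {1, 2} reach a count of 3 over all six baskets, so {4} and {2, 3} are false positives that a later full count would have to remove.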