Data Mining - Advanced Database System - Lecture Slides, Slides of Database Management Systems (DBMS)

Some concepts of Advanced Database Systems are: Types Supported, Simple Data Model, Concurrency Control Two, Continuously Adaptive, Cost-Based Optimization, Data Access From Disks, Data Warehousing. Main points of this lecture are: Data Mining, Subsidiary Issues, Data Cleansing, Visualization, Warehousing of Data, Megabyte, Bogus Data, Decision Trees, Clusters, Hidden-Markov

Typology: Slides

2012/2013

Uploaded on 04/27/2013

dhanapati

Advanced Database Systems

Data Mining

What is Data Mining?

  • Discovery of useful, possibly unexpected, patterns in data.
  • Subsidiary issues:
    • Data cleansing: detection of bogus data.
      • E.g., age = 150.
    • Visualization: something better than megabyte files of output.
    • Warehousing of data (for retrieval).

Example: Clusters

[Figure: scatter plot of points, each marked x, grouped into clusters, with a few isolated points.]

Example: Frequent Itemsets

  • A common marketing problem: examine what people buy together to discover patterns.
    1. What pairs of items are unusually often found together at Kroger checkout?
      • Answer: diapers and beer.
    2. What books are likely to be bought by the same Amazon customer?

Rhine Paradox --- (1)

  • Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception (ESP).
  • He devised an experiment where subjects were asked to guess the color of each of 10 hidden cards, red or blue.
  • He discovered that almost 1 in 1000 had ESP:
    • they were able to get all 10 right!
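
The "almost 1 in 1000" figure can be compared with what pure guessing would produce. The following is a minimal sketch, assuming each card is an independent 50/50 guess; the pool of 1000 subjects is a hypothetical number chosen only for illustration.

```python
from fractions import Fraction

# Probability of guessing all 10 two-color cards correctly by pure chance,
# assuming each guess is an independent 50/50 choice.
p_all_correct = Fraction(1, 2) ** 10

# Hypothetical pool of subjects (not a figure from the slides).
n_subjects = 1000
expected_lucky = float(n_subjects * p_all_correct)

print(p_all_correct)   # 1/1024
print(expected_lucky)  # 0.9765625
```

Under these assumptions, about one subject in every thousand would get all 10 right by luck alone.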

Rhine Paradox --- (2)

  • He told these people they had ESP and called them in for another test of the same type.
  • Alas, he discovered that almost all of them had lost their ESP.
  • What did he conclude?
    • Answer on next slide.

“Association Rules”

  • Market Baskets
  • Frequent Itemsets
  • A-priori Algorithm

The Market-Basket Model

  • A large set of items, e.g., things sold in a supermarket.
  • A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day.

Support

  • Simplest question: find sets of items that appear “frequently” in the baskets.
  • Support for itemset I = the number of baskets containing all items in I.
  • Given a support threshold s, sets of items that appear in at least s baskets are called frequent itemsets.

Example

  • Items = {milk, coke, pepsi, beer, juice}, abbreviated below by their first letters.
  • Support threshold = 3 baskets.
      B1 = {m, c, b}      B2 = {m, p, j}
      B3 = {m, b}         B4 = {c, j}
      B5 = {m, p, b}      B6 = {m, c, b, j}
      B7 = {c, b, j}      B8 = {b, c}
  • Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}.
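
The example can be checked mechanically. The sketch below is a brute-force enumeration, not the A-priori algorithm; the function names `support` and `frequent_itemsets` are illustrative, and an itemset is treated as frequent when it appears in at least s baskets, which is the reading that matches the list above.

```python
from itertools import combinations

# The eight baskets from the example, items abbreviated by first letter.
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)

def frequent_itemsets(baskets, s):
    """Brute force: test every non-empty itemset against threshold s."""
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            count = support(set(combo), baskets)
            if count >= s:  # "at least s baskets"
                result[frozenset(combo)] = count
    return result

frequent = frequent_itemsets(baskets, s=3)
for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Enumerating every subset is only practical for a handful of items; it is used here purely to verify the small example.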

Applications --- (2)

  • “Baskets” = documents; “items” = words in those documents.
    • Lets us find words that appear together unusually frequently, i.e., linked concepts.
  • “Baskets” = sentences; “items” = documents containing those sentences.
    • Items that appear together too often could represent plagiarism.

Applications --- (3)

  • “Baskets” = Web pages; “items” = linked pages.
    • Pairs of pages with many common references may be about the same topic.
  • “Baskets” = Web pages p; “items” = pages that link to p.
    • Pages with many of the same links may be mirrors or about the same topic.

Scale of Problem

  • WalMart sells 100,000 items and can store billions of baskets.
  • The Web has over 100,000,000 words and billions of pages.
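
To get a feel for this scale, the number of candidate item pairs can be counted directly. This is a rough sketch: the 100,000 items come from the slide, while the 4-byte counter per pair is an assumption made only for illustration.

```python
import math

n_items = 100_000                # item count quoted on the slide
n_pairs = math.comb(n_items, 2)  # number of distinct unordered pairs
print(n_pairs)                   # 4999950000

# Assumption for illustration: one 4-byte counter per candidate pair.
bytes_per_counter = 4
gigabytes = n_pairs * bytes_per_counter / 1e9
print(round(gigabytes))          # 20
```

Even before looking at a single basket, counting every pair naively would need on the order of 20 GB of counters under this assumption.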

Association Rules

  • If-then rules about the contents of baskets.
  • $\{i_1, i_2, \ldots, i_k\} \rightarrow j$ means: “if a basket contains all of $i_1, \ldots, i_k$, then it is likely to contain $j$.”
  • Confidence of this association rule is the probability of $j$ given $i_1, \ldots, i_k$.
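
Confidence can be estimated from basket counts as support of the whole rule divided by support of its left-hand side. The sketch below reuses the eight baskets from the earlier example slide; the function names `support` and `confidence` are illustrative, not taken from the slides.

```python
# The eight baskets from the earlier example, items abbreviated by first letter.
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support(itemset):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also contain `consequent`."""
    return support(antecedent | {consequent}) / support(antecedent)

print(confidence({"m", "b"}, "c"))  # 0.5
print(confidence({"c"}, "b"))       # 0.8
```

For the rule {m, b} → c, four baskets contain both m and b, and two of those also contain c, giving a confidence of 2/4.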