Data Mining Primitives - Data Warehousing - Lecture Slide, Slides of Data Warehousing

Some concepts of Data Warehousing are Aggregate Functions, Applications and Trends in Data Mining, Classification and Prediction, Cluster Analysis, Data Mining Primitives, Data Warehousing Design. Main points of this lecture are: Data Mining Primitives, Mining Query Language, Design Graphical, Interfaces Based, Query Language, Mining Systems, Architecture of Data, Summary, Patterns Autonomously, Database.

Typology: Slides

2012/2013

Uploaded on 04/25/2013

khushia

Chapter 4: Data Mining Primitives, Languages, and System Architectures

• Data mining primitives: What defines a data mining task?
• A data mining query language
• Design graphical user interfaces based on a data mining query language
• Architecture of data mining systems
• Summary

What Defines a Data Mining Task?

• Task-relevant data
  • Typically interested in only a subset of the entire database
  • Specify:
    • the name of the database or data warehouse (AllElectronics_db)
    • the names of the tables or data cubes containing the relevant data (item, customer, purchases, items_sold)
    • the conditions for selecting the relevant data (purchases made in Canada for the relevant year)
    • the relevant attributes or dimensions (name and price from item, income and age from customer)
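The four items above amount to a query that extracts a minable view. A minimal sketch using Python's built-in sqlite3 module follows. The table and attribute names come from the slide, but the key columns (item_id, cust_id, trans_id), the country and year columns, and every row of data are assumptions invented for illustration.

```python
import sqlite3

# Hypothetical schema and rows: only the table names and the attributes
# name, price, income and age come from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item       (item_id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE customer   (cust_id INTEGER PRIMARY KEY, income REAL, age INTEGER);
CREATE TABLE purchases  (trans_id INTEGER PRIMARY KEY, cust_id INTEGER,
                         country TEXT, year INTEGER);
CREATE TABLE items_sold (trans_id INTEGER, item_id INTEGER);

INSERT INTO item VALUES (1, 'laptop', 1200.0), (2, 'mouse', 25.0);
INSERT INTO customer VALUES (10, 55000.0, 34), (11, 72000.0, 51);
INSERT INTO purchases VALUES (100, 10, 'Canada', 2012),
                             (101, 11, 'Mexico', 2012),
                             (102, 11, 'Canada', 2011);
INSERT INTO items_sold VALUES (100, 1), (100, 2), (101, 1), (102, 2);
""")


def minable_view(conn, country, year):
    """Return (name, price, income, age) rows for purchases matching the conditions."""
    query = """
        SELECT i.name, i.price, c.income, c.age
        FROM purchases p
        JOIN items_sold s ON s.trans_id = p.trans_id
        JOIN item i       ON i.item_id  = s.item_id
        JOIN customer c   ON c.cust_id  = p.cust_id
        WHERE p.country = ? AND p.year = ?
        ORDER BY i.name
    """
    return conn.execute(query, (country, year)).fetchall()


# Purchases made in Canada for one (assumed) relevant year.
rows = minable_view(conn, "Canada", 2012)
```

Only the selected attributes of the matching purchases reach the mining step; the rest of the database is never touched.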

What Defines a Data Mining Task? (continued)

• Type of knowledge to be mined
  • Concept description, association, classification, prediction, clustering, and evolution analysis
    • When studying the buying habits of customers, mine associations between customer profiles and the items they like to buy. Use this information to recommend items to put on sale to increase revenue.
    • When studying real estate transactions, mine clusters to determine the house characteristics that make for fast sales. Use this information to make recommendations to sellers who want or need to sell their house quickly.
    • Study the relationship between an individual's sports statistics and salary. Use this information to help sports agents and sports team owners negotiate an individual's salary.

What Defines a Data Mining Task?

• Task-relevant data
• Type of knowledge to be mined
• Background knowledge
• Pattern interestingness measurements
• Visualization of discovered patterns

Task-Relevant Data (Minable View)

• Database or data warehouse name
• Database tables or data warehouse cubes
• Condition for data selection
• Relevant attributes or dimensions
• Data grouping criteria

Background Knowledge: Concept Hierarchies

• Allow discovery of knowledge at multiple levels of abstraction
• Represented as a set of nodes organized in a tree
  • Each node represents a concept
  • A special node, all, is reserved for the root of the tree
• Concept hierarchies allow raw data to be handled at a higher, more generalized level of abstraction
• Four major types of concept hierarchies: schema, set-grouping, operation-derived, and rule-based

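A Concept Hierarchy: Dimension (location)

[Figure: concept hierarchy for the dimension location. Levels, from most general to most specific: all, region, country, city, office. Nodes shown: all (root); Europe, North_America (region); Germany, Spain, Canada, Mexico (country); Frankfurt, Vancouver, Toronto (city); L. Chan, M. Wind (office).]

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.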
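The mapping from low-level to higher-level concepts can be sketched as a child-to-parent table. The node names below are taken from the figure and the links follow ordinary geography. The office level is left out, and the function names and the sample list of cities are hypothetical.

```python
from collections import Counter

# Child -> parent links for the location dimension (city, country, region, all).
PARENT = {
    "Frankfurt": "Germany",
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Germany": "Europe",
    "Spain": "Europe",
    "Canada": "North_America",
    "Mexico": "North_America",
    "Europe": "all",
    "North_America": "all",
}


def path_to_root(concept):
    """Return the chain of concepts from `concept` up to the root node `all`."""
    path = [concept]
    while path[-1] != "all":
        path.append(PARENT[path[-1]])
    return path


def generalize(concept, levels_up):
    """Climb `levels_up` levels, stopping at the root if the chain is shorter."""
    path = path_to_root(concept)
    return path[min(levels_up, len(path) - 1)]


# Roll a (made-up) list of city-level records up to the country level.
sales_cities = ["Vancouver", "Toronto", "Frankfurt", "Vancouver"]
by_country = Counter(generalize(city, 1) for city in sales_cities)
```

Counting at the country level instead of the city level is what lets patterns be discovered at a higher level of abstraction.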

Background Knowledge: Concept Hierarchies

• Operation-derived hierarchy: based on operations specified by users, experts, or the data mining system
  • An e-mail address or a URL contains hierarchy information relating departments, universities (or companies), and countries
  • E-mail address: [email protected]
  • Partial concept hierarchy: login-name < department < university < country
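A minimal sketch of such an operation, assuming addresses of the exact form login@department.university.country. The function name and the sample address are hypothetical, and real addresses often have more or fewer domain parts than this sketch accepts.

```python
def email_hierarchy(address):
    """Decode login-name < department < university < country from an address."""
    login, _, domain = address.partition("@")
    parts = domain.split(".")
    if not login or len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected login@department.university.country, got {address!r}"
        )
    department, university, country = parts
    # Keys are ordered from the most specific concept to the most general one.
    return {
        "login-name": login,
        "department": department,
        "university": university,
        "country": country,
    }


# Hypothetical address used only to exercise the function.
levels = email_hierarchy("jdoe@cs.example.ca")
```

The hierarchy is not stored anywhere; it is derived by applying a string operation to the attribute value.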

Background Knowledge: Concept Hierarchies

• Rule-based hierarchy: either a whole concept hierarchy or a portion of it is defined by a set of rules and is evaluated dynamically based on the current data and rule definition
  • The following rules categorize items as low profit margin, medium profit margin, and high profit margin:
    • Low profit margin: < $
    • Medium profit margin: between $50 and $
    • High profit margin: > $
  • Rule-based concept hierarchy:
    • low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $
    • medium_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) >= $50 and (P1 - P2) <= $
    • high_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) > $
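The three rules can be evaluated dynamically with a small function. In this sketch the cutoffs are parameters because most dollar amounts are missing from the text above. The $50 lower bound of the medium range comes from the slide; reusing it as the low cutoff and setting the high cutoff to 250.0 are assumptions made only so the example runs.

```python
def profit_margin_level(price, cost, low_cutoff=50.0, high_cutoff=250.0):
    """Classify an item by its margin (price - cost).

    low_cutoff=50.0 reuses the slide's $50 bound (an assumption for the low rule);
    high_cutoff=250.0 is a hypothetical placeholder, not a value from the slide.
    """
    margin = price - cost
    if margin < low_cutoff:
        return "low_profit_margin"
    if margin <= high_cutoff:
        return "medium_profit_margin"
    return "high_profit_margin"


# Made-up (price, cost) pairs covering each rule and both boundaries.
examples = [(100.0, 70.0), (100.0, 50.0), (400.0, 150.0), (400.0, 100.0)]
levels = [profit_margin_level(p, c) for p, c in examples]
```

Because the level is computed from the current price and cost, an item moves between levels as soon as either value changes, with no stored hierarchy to update.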

Measurements of Pattern Interestingness (continued)

• Simplicity: a factor contributing to the interestingness of a pattern is its overall simplicity for comprehension
  • Objective measures can be viewed as functions of the pattern structure, or of the number of attributes or operators
  • The more complex a rule is, the more difficult it is to interpret, and thus the less interesting it is
  • Example measures: rule length, or the number of leaves in a decision tree
• Certainty: a measure of certainty associated with a pattern assesses its validity or trustworthiness
  • Confidence(A => B) = (# tuples containing both A and B) / (# tuples containing A)
  • A confidence of 85% for the association rule buys(X, computer) => buys(X, software) means that 85% of all customers who bought a computer also bought software
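The confidence formula can be checked on a handful of transactions. This is a minimal sketch: the transactions are made up, and returning 0.0 when no tuple contains A is a choice made here, since the ratio is undefined in that case.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of the tuples containing A that also contain B."""
    a, b = set(antecedent), set(consequent)
    with_a = [t for t in transactions if a <= t]
    if not with_a:
        return 0.0  # undefined in the formula; 0.0 is this sketch's convention
    with_both = [t for t in with_a if b <= t]
    return len(with_both) / len(with_a)


# Hypothetical purchase records, one set of items per customer.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "software"},
]

conf_value = confidence(transactions, {"computer"}, {"software"})
```

Four of the five customers bought a computer and three of those four also bought software, so the rule buys(X, computer) => buys(X, software) has a confidence of 3/4 on this data.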

Measurements of Pattern Interestingness (continued)

• Utility: the potential usefulness of a pattern is a factor determining its interestingness
  • Estimated by a utility function such as support, the percentage of task-relevant data tuples for which the pattern is true
    • Support(A => B) = (# tuples containing both A and B) / (total # of tuples)
• Novelty: novel patterns are those that contribute new information or increased performance to the pattern set
  • Not previously known; surprising
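Support, as defined above, divides by the total number of tuples rather than by the tuples containing A. A minimal sketch on made-up baskets follows; the function name and the data are hypothetical.

```python
def support(transactions, antecedent, consequent):
    """Fraction of all tuples that contain both A and B."""
    if not transactions:
        return 0.0  # no data, so no tuple can satisfy the pattern
    both = set(antecedent) | set(consequent)
    matching = sum(1 for t in transactions if both <= t)
    return matching / len(transactions)


# Hypothetical task-relevant tuples, one set of items per transaction.
baskets = [
    {"computer", "software"},
    {"computer"},
    {"printer", "paper"},
    {"computer", "software", "paper"},
]

support_value = support(baskets, {"computer"}, {"software"})
```

Two of the four baskets contain both items, so the pattern holds for half of the task-relevant data, regardless of how many baskets contain a computer.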

A Data Mining Query Language (DMQL)

• Motivation
  • A DMQL can provide the ability to support ad hoc and interactive data mining
  • By providing a standardized language like SQL, a DMQL:
    • aims to achieve an effect similar to the one SQL has had on relational databases
    • serves as a foundation for system development and evolution
    • facilitates information exchange, technology transfer, commercialization, and wide acceptance
• Design
  • DMQL is designed with the primitives described earlier

Syntax for DMQL

• Syntax for the specification of
  • task-relevant data
  • the kind of knowledge to be mined
  • concept hierarchies
  • interestingness measures
  • pattern presentation and visualization
• Putting it all together: a DMQL query
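One way to picture how the five kinds of specification combine is a small record that renders query text. The preview does not show the DMQL grammar, so everything here is an assumption: the clause wording is only modeled on commonly published DMQL-style examples, and the class name, field names, aliases, selection condition, and threshold value are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MiningTask:
    """Hypothetical container for the five primitives of a mining task."""
    database: str
    relations: list
    condition: str
    attributes: list
    knowledge_kind: str                              # e.g. "associations"
    hierarchies: dict = field(default_factory=dict)  # hierarchy name -> attribute
    thresholds: dict = field(default_factory=dict)   # measure name -> minimum
    display_as: str = "table"

    def to_query(self):
        """Render DMQL-style text; the clause wording is an assumption."""
        lines = [f"use database {self.database}"]
        lines += [f"use hierarchy {h} for {a}" for h, a in self.hierarchies.items()]
        lines.append(f"mine {self.knowledge_kind}")
        lines.append("in relevance to " + ", ".join(self.attributes))
        lines.append("from " + ", ".join(self.relations))
        lines.append(f"where {self.condition}")
        lines += [f"with {m} threshold = {v}" for m, v in self.thresholds.items()]
        lines.append(f"display as {self.display_as}")
        return "\n".join(lines)


task = MiningTask(
    database="AllElectronics_db",
    relations=["item I", "customer C", "purchases P", "items_sold S"],
    condition="P.country = 'Canada'",  # hypothetical column and value
    attributes=["I.name", "I.price", "C.income", "C.age"],
    knowledge_kind="associations",
    thresholds={"support": 0.05},      # hypothetical threshold
)
query_text = task.to_query()
```

Each field corresponds to one primitive: the database, relations, condition, and attributes give the task-relevant data, and the remaining fields cover the kind of knowledge, background hierarchies, interestingness thresholds, and presentation.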