Download Data Mining Primitives - Data Warehousing - Lecture Slide and more Slides Data Warehousing in PDF only on Docsity!
Data Mining Primitives,
Languages, and System
Architectures
Chapter 4: Data Mining Primitives,
Languages, and System Architectures
Data mining primitives: What defines a data
mining task?
A data mining query language
Design graphical user interfaces based on a
data mining query language
Architecture of data mining systems
Summary
What Defines a Data Mining Task?
Task-relevant data
- Typically interested in only a subset of the entire database
- Specify the name of database/data warehouse (AllElectronics_db) names of tables/data cubes containing relevant data (item, customer, purchases, items_sold) conditions for selecting the relevant data (purchases made in Canada for relevant year) relevant attributes or dimensions (name and price from item, income and age from customer)
What Defines a Data Mining Task?
(continued)
Type of knowledge to be mined
- Concept description, association, classification, prediction, clustering, and evolution analysis Studying buying habits of customers, mine associations between customer profile and the items they like to buy - Use this info to recommend items to put on sale to increase revenue Studying real estate transactions, mine clusters to determine house characteristics that make for fast sales - Use this info to make recommendations to house sellers who want/need to sell their house quickly Study relationship between individual’s sport statistics and salary - Use this info to help sports agents and sports team owners negotiate an individual’s salary
What Defines a Data Mining Task?
Task-relevant data
Type of knowledge to be mined
Background knowledge
Pattern interestingness measurements
Visualization of discovered patterns
Task-Relevant Data (Minable View)
Database or data warehouse name
Database tables or data warehouse cubes
Condition for data selection
Relevant attributes or dimensions
Data grouping criteria
Background Knowledge:
Concept Hierarchies
Allow discovery of knowledge at multiple levels of abstraction
Represented as a set of nodes organized in a tree
- Each node represents a concept
- Special node, all, reserved for root of tree
Concept hierarchies allow raw data to be handled at a higher, more generalized level of abstraction
Four major types of concept hierarchies, schema, set- grouping, operation derived, rule based
A Concept Hierarchy: Dimension
(location)
Mexico
all
Europe North_America
Germany Spain Canada
Vancouver
L. Chan M. Wind
all
region
office
country
city Frankfurt Toronto
Define a sequence of mappings from a set of low
level concepts to higher-level, more general concepts
Background Knowledge:
Concept Hierarchies
Operation-derived hierarchy – based on
operations specified by users, experts, or the
data mining system
- email address or a URL contains hierarchy info relating departments, universities (or companies) and countries
- E-mail address [email protected]
- Partial concept hierarchy login-name < department < university < country
Background Knowledge:
Concept Hierarchies
Rule-based hierarchy – either a whole concept hierarchy or a portion of it is defined by a set of rules and is evaluated dynamically based on the current data and rule definition
- Following rules used to categorize items as low profit margin, medium profit margin and high profit margin Low profit margin - < $ Medium profit margin – between $50 & $ High profit margin - > $
- Rule based concept hierarchy low_profit_margin (X) <= price(X, P1) and cost (X, P2) and (P1 - P2) < $ medium_profit_margin (X) <= price(X, P1) and cost (X, P2) and (P1 - P2) >= $50 and (P1 – P2) <= $ high_profit_margin (X) <= price(X, P1) and cost (X, P2) and (P1 - P2) > $
Measurements of Pattern
Interestingness (continued)
Simplicity – A factor contributing to interestingness of
pattern is overall simplicity for comprehension
- Objective measures viewed as functions of the pattern structure or number of attributes or operators
- More complex a rule, more difficult it is to interpret, thus less interesting
- Example measures: rule length or number of leaves in a decision tree
Certainty – Measure of certainty associated with
pattern that assesses validity or trustworthiness
- Confidence (A=>B) = # tuples containing both A & B/ #tuples containing A
- Confidence of 85% for association rule buys (X, computer) => buys (X, software) means 85% of all customers who bought a computer bought software also
Measurements of Pattern
Interestingness (continued)
Utility – potential usefulness of a pattern is a
factor determining its interestingness
- Estimated by a utility function such as support – percentage of task relevant data tuples for which pattern is true Support (A=>B) = # tuples containing both A & B/ total # of tuples
Novelty – those patterns that contribute new
information or increased performance to the
pattern set
- not previously known, surprising
A Data Mining Query
Language (DMQL)
Motivation
- A DMQL can provide the ability to support ad-hoc and interactive data mining
- By providing a standardized language like SQL Hope to achieve a similar effect like that SQL has on relational database Foundation for system development and evolution Facilitate information exchange, technology transfer, commercialization and wide acceptance
Design
- DMQL is designed with the primitives described earlier
Syntax for DMQL
Syntax for specification of
- task-relevant data
- the kind of knowledge to be mined
- concept hierarchy specification
- interestingness measure
- pattern presentation and visualization
Putting it all together — a DMQL query