Data Ware House Mining, Lecture notes of Data Warehousing

All the important questions are covered in these notes

Typology: Lecture notes

2018/2019

Uploaded on 10/26/2019

prajakta-bagde

CLUSTERING

A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in other clusters.

What is Clustering?

Clustering is the process of making a group of abstract objects into classes of similar objects.

Points to Remember

 A cluster of data objects can be treated as one group.

 While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign the labels to the groups.

 The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.

Applications of Cluster Analysis

 Clustering analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.

 Clustering can also help marketers discover distinct groups in their customer base and characterize those groups by their purchasing patterns.

 In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionalities and gain insight into structures inherent to populations.

 Clustering also helps in identification of areas of similar land use in an earth observation database. It also helps in the identification of groups of houses in a city according to house type, value, and geographic location.

 Clustering also helps in classifying documents on the web for information discovery.

 Clustering is also used in outlier detection applications such as detection of credit card fraud.

 As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data to observe characteristics of each cluster.

Requirements of Clustering in Data Mining

The following points describe the requirements that data mining places on clustering algorithms −

Scalability − We need highly scalable clustering algorithms to deal with large databases.

Ability to deal with different kinds of attributes − Algorithms should be applicable to any kind of data, such as interval-based (numerical), categorical, and binary data.

Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be restricted to distance measures that tend to find only small spherical clusters.

High dimensionality − The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional data.

Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may lead to poor quality clusters.

Interpretability − The clustering results should be interpretable, comprehensible, and usable.

Clustering Methods

Clustering methods can be classified into the following categories −

 Partitioning Method

 Hierarchical Method

 Density-based Method

 Grid-Based Method

 Model-Based Method

 Constraint-based Method

Partitioning Method

Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’ partitions of the data. Each partition represents a cluster, and k ≤ n. That is, the method classifies the data into k groups that satisfy the following requirements −

 Each group contains at least one object.

 Each object must belong to exactly one group.

Points to remember −

 For a given number of partitions (say k), the partitioning method will create an initial partitioning.

 It then uses an iterative relocation technique to improve the partitioning by moving objects from one group to another.
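The notes do not name a specific partitioning algorithm. As a minimal sketch, assuming k-means as the example method, the initial partitioning and the iterative relocation step could look like this. The function name, the sample points, and the choice of the first k objects as starting centers are all illustrative assumptions.

```python
def k_means(points, k, iterations=100):
    """Partition 2-D points into k groups by iterative relocation (k-means sketch)."""
    # Initial partitioning: use the first k objects as starting centers
    # (an arbitrary choice made only to keep the example deterministic).
    centers = list(points[:k])
    assignment = None
    for _ in range(iterations):
        # Relocation step: move every object to the group with the nearest center.
        new_assignment = [
            min(
                range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2,
            )
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed group, so the partitioning is stable
        assignment = new_assignment
        # Recompute each center as the mean of the objects now in its group.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a group happens to be empty
                centers[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return assignment, centers


points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centers = k_means(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each object ends up in exactly one group, which matches the second requirement listed above. Plain k-means does not by itself guarantee that every group stays non-empty, so a production implementation would need to handle that case explicitly.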

Hierarchical Methods

This method creates a hierarchical decomposition of the given set of data objects. We can classify

hierarchical methods on the basis of how the hierarchical decomposition is formed. There are two

approaches here −

 Agglomerative Approach

 Divisive Approach

Agglomerative Approach

This approach is also known as the bottom-up approach. We start with each object forming a separate group. The method keeps merging the objects or groups that are close to one another, and continues until all of the groups are merged into one or until the termination condition holds.
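The bottom-up procedure can be sketched in a few lines. This is an illustration only: it assumes one-dimensional values, measures closeness between groups by their nearest members (single linkage), and uses a target number of clusters as the termination condition. None of these choices come from the notes.

```python
def agglomerative(values, target_clusters=1):
    """Bottom-up clustering of 1-D values using single-linkage distance."""
    # Start with each object forming a separate group.
    clusters = [[v] for v in values]
    merges = []
    while len(clusters) > target_clusters:  # termination condition
        best = None
        # Find the two groups whose closest members are nearest to each other.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((list(clusters[i]), list(clusters[j]), d))
        # Merge group j into group i and drop group j.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges


clusters, merges = agglomerative([1, 2, 10, 11, 25], target_clusters=3)
print(clusters)  # [[1, 2], [10, 11], [25]]
```

Calling the function with target_clusters=1 keeps merging until all of the groups form a single cluster.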

Divisive Approach

This approach is also known as the top-down approach. We start with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters. This is done until each object is in its own cluster or the termination condition holds.

Audience

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

Prerequisites

Before proceeding with this tutorial, you should have an understanding of basic database concepts such as schema, ER model, and Structured Query Language (SQL), and a basic knowledge of data warehousing concepts.

There is a huge amount of data available in the Information Industry. This data is of no use until it is

converted into useful information. It is necessary to analyze this huge amount of data and extract

useful information from it.

Extraction of information is not the only process we need to perform; data mining also involves other

processes such as Data Cleaning, Data Integration, Data Transformation, Data Mining, Pattern

Evaluation and Data Presentation. Once all these processes are over, we would be able to use this

information in many applications such as Fraud Detection, Market Analysis, Production Control,

Science Exploration, etc.

What is Data Mining?

Data mining is defined as extracting information from huge sets of data. In other words, data mining is the procedure of mining knowledge from data. The information or knowledge so extracted can be used for any of the following applications −

 Market Analysis

 Fraud Detection

 Customer Retention

 Production Control

 Science Exploration

Data Mining Applications

Data mining is highly useful in the following domains −

 Market Analysis and Management

 Corporate Analysis & Risk Management

 Fraud Detection

Apart from these, data mining can also be used in the areas of production control, customer retention, science exploration, sports, astronomy, and Internet Web Surf-Aid.

Market Analysis and Management

Listed below are the various fields of market where data mining is used −

Customer Profiling − Data mining helps determine what kind of people buy what kind of products.

Identifying Customer Requirements − Data mining helps in identifying the best products for different customers. It uses prediction to find the factors that may attract new customers.

Cross Market Analysis − Data mining finds associations and correlations between product sales.

Target Marketing − Data mining helps to find clusters of model customers who share the same characteristics such as interests, spending habits, income, etc.

Determining Customer Purchasing Patterns − Data mining helps in determining customer purchasing patterns.

Providing Summary Information − Data mining provides us various multidimensional summary reports.

Corporate Analysis and Risk Management

Data mining is used in the following fields of the Corporate Sector −

Finance Planning and Asset Evaluation − It involves cash flow analysis and prediction, and contingent claim analysis to evaluate assets.

Resource Planning − It involves summarizing and comparing the resources and spending.

Competition − It involves monitoring competitors and market directions.

Fraud Detection

Data mining is also used in credit card services and telecommunications to detect fraud. For fraudulent telephone calls, it helps to find the destination of the call, the duration of the call, the time of day or week, etc. It also analyzes patterns that deviate from expected norms.

Data Mining - Tasks

Data mining deals with the kind of patterns that can be mined. On the basis of the kind of data to be

mined, there are two categories of functions involved in Data Mining −

 Descriptive

 Classification and Prediction

Descriptive Function

The descriptive function deals with the general properties of data in the database. Here is the list of

descriptive functions −

 Class/Concept Description

 Mining of Frequent Patterns

 Mining of Associations

 Mining of Correlations

 Mining of Clusters

Class/Concept Description

 Mathematical Formulae

 Neural Networks

The functions involved in these processes are as follows −

Classification − It predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e. data objects whose class labels are known.

Prediction − It is used to predict missing or unavailable numerical data values rather than class labels. Regression Analysis is generally used for prediction. Prediction can also be used for identification of distribution trends based on available data.

Outlier Analysis − Outliers may be defined as the data objects that do not comply with the general behavior or model of the data available.

Evolution Analysis − Evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
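To make the idea of outlier analysis concrete, here is a small sketch that flags values lying far from the general behavior of the data. It assumes a simple z-score rule with a threshold of 2 standard deviations. Both the rule and the sample amounts are illustrative assumptions, not a method given in these notes.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=2.0):
    """Return the values that lie more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Made-up daily card spend; the last entry is far from the rest.
amounts = [10, 12, 11, 13, 12, 95]
outliers = z_score_outliers(amounts)
print(outliers)  # [95]
```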

Data Mining Task Primitives

 We can specify a data mining task in the form of a data mining query.

 This query is input to the system.

 A data mining query is defined in terms of data mining task primitives.

Note − These primitives allow us to communicate in an interactive manner with the data mining

system. Here is the list of Data Mining Task Primitives −

 Set of task relevant data to be mined.

 Kind of knowledge to be mined.

 Background knowledge to be used in discovery process.

 Interestingness measures and thresholds for pattern evaluation.

 Representation for visualizing the discovered patterns.

Set of task relevant data to be mined

This is the portion of database in which the user is interested. This portion includes the following −

 Database Attributes

 Data Warehouse dimensions of interest

Kind of knowledge to be mined

It refers to the kind of functions to be performed. These functions are −

 Characterization

 Discrimination

 Association and Correlation Analysis

 Classification

 Prediction

 Clustering

 Outlier Analysis

 Evolution Analysis

Background knowledge

Background knowledge allows data to be mined at multiple levels of abstraction. For example, concept hierarchies are one form of background knowledge that allows data to be mined at multiple levels of abstraction.
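As an illustration of mining at multiple levels of abstraction, the sketch below rolls sales figures up a small concept hierarchy (city, then country, then all). The hierarchy, the city names, and the sales amounts are hypothetical and chosen only for the example.

```python
# Hypothetical concept hierarchy for a location attribute: city -> country -> all.
CITY_TO_COUNTRY = {"Mumbai": "India", "Pune": "India", "Chicago": "USA", "Boston": "USA"}


def generalize(city, level):
    """Map a city to the requested level of abstraction."""
    if level == "city":
        return city
    if level == "country":
        return CITY_TO_COUNTRY[city]
    if level == "all":
        return "all"
    raise ValueError(f"unknown level: {level}")


def total_sales(rows, level):
    """Aggregate (city, amount) rows at the given level of the hierarchy."""
    totals = {}
    for city, amount in rows:
        key = generalize(city, level)
        totals[key] = totals.get(key, 0) + amount
    return totals


sales = [("Mumbai", 120), ("Pune", 80), ("Chicago", 200), ("Boston", 50)]
print(total_sales(sales, "country"))  # {'India': 200, 'USA': 250}
print(total_sales(sales, "all"))      # {'all': 450}
```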

Interestingness measures and thresholds for pattern evaluation

These are used to evaluate the patterns discovered by the process of knowledge discovery. There are different interestingness measures for different kinds of knowledge.
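The notes do not list specific measures here. As an example, two measures commonly used for association rules are support and confidence. The sketch below computes both for the rule bread ⇒ milk and compares them with thresholds. The transactions and the threshold values are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of the whole rule divided by the support of its antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
MIN_SUPPORT, MIN_CONFIDENCE = 0.4, 0.6  # hypothetical thresholds

s = support(transactions, {"bread", "milk"})       # 2 of 4 transactions
c = confidence(transactions, {"bread"}, {"milk"})  # 2 of the 3 bread transactions
interesting = s >= MIN_SUPPORT and c >= MIN_CONFIDENCE
print(s, interesting)  # 0.5 True
```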

Representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed. These representations may include the following −

 Rules

 Tables

 Charts

 Graphs

 Decision Trees

 Cubes

Data Mining - Issues

Data mining is not an easy task, as the algorithms used can get very complex and data is not always

available at one place. It needs to be integrated from various heterogeneous data sources. These

factors also create some issues. Here in this tutorial, we will discuss the major issues regarding −

 Mining Methodology and User Interaction

 Performance Issues

 Diverse Data Types Issues

The following diagram describes the major issues.

Mining Methodology and User Interaction Issues

It refers to the following kinds of issues −

Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge discovery tasks.

suppliers, sales, revenue, etc. The data warehouse does not focus on the ongoing operations, rather it focuses on modelling and analysis of data for decision-making.

Integrated − Data warehouse is constructed by integration of data from heterogeneous sources such as relational databases, flat files etc. This integration enhances the effective analysis of data.

Time Variant − The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.

Non-volatile − Non-volatile means that previous data is not removed when new data is added. The data warehouse is kept separate from the operational database, so frequent changes in the operational database are not reflected in the data warehouse.

Data Warehousing

Data warehousing is the process of constructing and using the data warehouse. A data warehouse

is constructed by integrating the data from multiple heterogeneous sources. It supports analytical

reporting, structured and/or ad hoc queries, and decision making.

Data warehousing involves data cleaning, data integration, and data consolidation. To integrate heterogeneous databases, we have the following two approaches −

 Query Driven Approach

 Update Driven Approach

Query-Driven Approach

This is the traditional approach to integrate heterogeneous databases. This approach is used to build

wrappers and integrators on top of multiple heterogeneous databases. These integrators are also

known as mediators.

Process of Query Driven Approach

 When a query is issued on the client side, a metadata dictionary translates it into queries appropriate for each of the heterogeneous sites involved.

 Now these queries are mapped and sent to the local query processor.

 The results from heterogeneous sites are integrated into a global answer set.
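The three steps above can be sketched with a toy mediator. Everything here is hypothetical: the two in-memory "sources", their field names, and the metadata dictionary stand in for real heterogeneous databases and their local query processors.

```python
# Two hypothetical heterogeneous sources that use different field names.
SOURCE_A = [{"cust": "Asha", "amt": 120}, {"cust": "Ravi", "amt": 40}]
SOURCE_B = [{"customer_name": "Meera", "total": 300}, {"customer_name": "John", "total": 90}]
SOURCES = {"A": SOURCE_A, "B": SOURCE_B}

# Metadata dictionary: maps global field names to each site's local names.
METADATA = {
    "A": {"customer": "cust", "amount": "amt"},
    "B": {"customer": "customer_name", "amount": "total"},
}


def run_local_query(rows, local_fields, min_amount):
    """Stand-in for a local query processor: filter rows using local field names."""
    return [r for r in rows if r[local_fields["amount"]] >= min_amount]


def mediator_query(min_amount):
    """Translate the global query per site, run it locally, and merge the results."""
    answer = []
    for site, rows in SOURCES.items():
        fields = METADATA[site]  # translation step
        for r in run_local_query(rows, fields, min_amount):
            # Map local names back to the global schema for the global answer set.
            answer.append({"customer": r[fields["customer"]], "amount": r[fields["amount"]]})
    return answer


result = mediator_query(min_amount=100)
print(result)
# [{'customer': 'Asha', 'amount': 120}, {'customer': 'Meera', 'amount': 300}]
```

Note that every call to mediator_query visits every source again, which is one way to see why this approach becomes expensive for frequent queries.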

Disadvantages

This approach has the following disadvantages −

 The Query Driven Approach needs complex integration and filtering processes.

 It is very inefficient and very expensive for frequent queries.

 This approach is expensive for queries that require aggregations.

Update-Driven Approach

Today's data warehouse systems follow the update-driven approach rather than the traditional approach discussed earlier. In the update-driven approach, information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.

Advantages

This approach has the following advantages −

 This approach provides high performance.

 The data can be copied, processed, integrated, annotated, summarized and restructured in the semantic data store in advance.

 Query processing does not require an interface with the processing at local sources.

From Data Warehousing (OLAP) to Data Mining (OLAM)

Online Analytical Mining (OLAM) integrates Online Analytical Processing (OLAP) with data mining and with mining knowledge in multidimensional databases. Here is the diagram that shows the integration of both OLAP and OLAM −

Importance of OLAM

OLAM is important for the following reasons −

High quality of data in data warehouses − The data mining tools are required to work on integrated, consistent, and cleaned data. These steps are very costly in the preprocessing of data. The data warehouses constructed by such preprocessing are valuable sources of high quality data for OLAP and data mining as well.

Available information processing infrastructure surrounding data warehouses − Information processing infrastructure refers to accessing, integration, consolidation, and transformation of multiple heterogeneous databases, web-accessing and service facilities, reporting and OLAP analysis tools.

OLAP-based exploratory data analysis − Exploratory data analysis is required for effective data mining. OLAM provides facilities for data mining on various subsets of data and at different levels of abstraction.

Online selection of data mining functions − Integrating OLAP with multiple data mining functions and online analytical mining provide users with the flexibility to select desired data mining functions and swap data mining tasks dynamically.

Data Mining - Terminologies

Data Mining

Data mining is defined as extracting the information from a huge set of data. In other words we can

say that data mining is mining the knowledge from data. This information can be used for any of the

following applications −

 Market Analysis

 Evaluate mined patterns.

 Visualize the patterns in different forms.

Data Integration

Data Integration is a data preprocessing technique that merges the data from multiple heterogeneous

data sources into a coherent data store. Data integration may involve inconsistent data and therefore

needs data cleaning.

Data Cleaning

Data cleaning is a technique that is applied to remove the noisy data and correct the inconsistencies

in data. Data cleaning involves transformations to correct the wrong data. Data cleaning is performed

as a data preprocessing step while preparing the data for a data warehouse.

Data Selection

Data Selection is the process where data relevant to the analysis task are retrieved from the

database. Sometimes data transformation and consolidation are performed before the data selection

process.

Clusters

A cluster is a group of similar objects. Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.

Data Transformation

In this step, data is transformed or consolidated into forms appropriate for mining, by performing

summary or aggregation operations.

Data Mining - Systems

There is a large variety of data mining systems available. Data mining systems may integrate

techniques from the following −

 Spatial Data Analysis

 Information Retrieval

 Pattern Recognition

 Image Analysis

 Signal Processing

 Computer Graphics

 Web Technology

 Business

 Bioinformatics

Data Mining System Classification

A data mining system can be classified according to the following criteria −

 Database Technology

 Statistics

 Machine Learning

 Information Science

 Visualization

 Other Disciplines

Apart from these, a data mining system can also be classified based on the kind of (a) databases

mined, (b) knowledge mined, (c) techniques utilized, and (d) applications adapted.

Classification Based on the Databases Mined

We can classify a data mining system according to the kind of databases mined. Database systems can be classified according to different criteria such as data models, types of data, etc., and the data mining system can be classified accordingly.

For example, if we classify a database according to the data model, then we may have a relational,

transactional, object-relational, or data warehouse mining system.

Classification Based on the kind of Knowledge Mined

We can classify a data mining system according to the kind of knowledge mined. It means the data

mining system is classified on the basis of functionalities such as −

 Characterization

 Discrimination

 Association and Correlation Analysis

 Classification

 Prediction

 Outlier Analysis

 Evolution Analysis

Classification Based on the Techniques Utilized

We can classify a data mining system according to the kind of techniques used. We can describe

these techniques according to the degree of user interaction involved or the methods of analysis

employed.

Classification Based on the Applications Adapted

We can classify a data mining system according to the applications adapted. These applications are

as follows −

from relation(s)/cube(s) [where condition]

order by order_list

group by grouping_list

Syntax for Specifying the Kind of Knowledge

Here we will discuss the syntax for Characterization, Discrimination, Association, Classification, and

Prediction.

Characterization

The syntax for characterization is −

mine characteristics [as pattern_name]

analyze {measure(s) }

The analyze clause specifies aggregate measures, such as count, sum, or count%.

For example −

Description describing customer purchasing habits.

mine characteristics as customerPurchasing

analyze count%

Discrimination

The syntax for Discrimination is −

mine comparison [as {pattern_name}]

for {target_class} where {target_condition}

{versus {contrast_class_i}

where {contrast_condition_i}}

analyze {measure(s)}

For example, a user may define big spenders as customers who purchase items that cost $100 or

more on an average; and budget spenders as customers who purchase items at less than $100 on

an average. The mining of discriminant descriptions for customers from each of these categories can

be specified in the DMQL as −

mine comparison as purchaseGroups

for bigSpenders where avg(I.price) ≥ $100

versus budgetSpenders where avg(I.price) < $100

analyze count
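To see what such a comparison would compute, here is a rough Python equivalent run over a few made-up purchase records. It is a sketch only, not how a DMQL engine evaluates the query. The customer names and prices are hypothetical, and the $100 threshold is the one given in the big spender definition above.

```python
# Hypothetical purchase records: (customer, item price in dollars).
purchases = [("ann", 150), ("ann", 90), ("bob", 20), ("bob", 60), ("cy", 300)]

# avg(I.price) per customer.
totals = {}
for customer, price in purchases:
    running_sum, n = totals.get(customer, (0, 0))
    totals[customer] = (running_sum + price, n + 1)
avg_price = {c: s / n for c, (s, n) in totals.items()}

# Target class versus contrasting class, split at the $100 threshold.
big_spenders = [c for c, a in avg_price.items() if a >= 100]
budget_spenders = [c for c, a in avg_price.items() if a < 100]

# "analyze count": report how many customers fall in each class.
counts = {"bigSpenders": len(big_spenders), "budgetSpenders": len(budget_spenders)}
print(counts)  # {'bigSpenders': 2, 'budgetSpenders': 1}
```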

Association

The syntax for Association is−

mine associations [ as {pattern_name} ]

{matching {metapattern} }

For Example −

mine associations as buyingHabits

matching P(X:customer,W) ^ Q(X,Y) ⇒ buys(X,Z)

where X is the key of the customer relation; P and Q are predicate variables; and W, Y, and Z are object variables.

Classification

The syntax for Classification is −

mine classification [as pattern_name]

analyze classifying_attribute_or_dimension

For example, to mine patterns classifying customer credit rating, where the classes are determined by the attribute credit_rating and the pattern is named classifyCustomerCreditRating −

mine classification as classifyCustomerCreditRating

analyze credit_rating

Prediction

The syntax for prediction is −

mine prediction [as pattern_name]

analyze prediction_attribute_or_dimension

{set {attribute_or_dimension_i= value_i}}

Syntax for Concept Hierarchy Specification

To specify concept hierarchies, use the following syntax −

use hierarchy <hierarchy> for <attribute_or_dimension>

We use different syntaxes to define different types of hierarchies such as−

-schema hierarchies

define hierarchy time_hierarchy on date as [date, month, quarter, year]

-set-grouping hierarchies

define hierarchy age_hierarchy for age on customer as

level1: {young, middle_aged, senior} < level0: all

level2: {20, ..., 39} < level1: young

level2: {40, ..., 59} < level1: middle_aged

level2: {60, ..., 89} < level1: senior

-operation-derived hierarchies

define hierarchy age_hierarchy for age on customer as

{age_category(1), ..., age_category(5)}

:= cluster(default, age, 5) < all(age)

-rule-based hierarchies

define hierarchy profit_margin_hierarchy on item as

level_1: low_profit_margin < level_0: all

if (price - cost) < $50

level_1: medium_profit_margin < level_0: all

if ((price - cost) > $50) and ((price - cost) ≤ $250)

level_1: high_profit_margin < level_0: all

Data Mining - Classification & Prediction

There are two forms of data analysis that can be used for extracting models describing important

classes or to predict future data trends. These two forms are as follows −

 Classification

 Prediction

Classification models predict categorical class labels, and prediction models predict continuous-valued functions. For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation.

What is classification?

Following are the examples of cases where the data analysis task is Classification −

 A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.

 A marketing manager at a company needs to analyze whether a customer with a given profile will buy a new computer.

In both of the above examples, a model or classifier is constructed to predict the categorical labels.

These labels are risky or safe for loan application data and yes or no for marketing data.

What is prediction?

Following are the examples of cases where the data analysis task is Prediction −

Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example we are concerned with predicting a numeric value, so the data analysis task is an example of numeric prediction. In this case, a model or predictor is constructed that predicts a continuous-valued function or ordered value.

Note − Regression analysis is a statistical methodology that is most often used for numeric

prediction.
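As a small worked example of numeric prediction, the sketch below fits a straight line by ordinary least squares and uses it to predict spending for a new customer. The income and spending figures are invented for illustration, and a single predictor variable is assumed for simplicity.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up training data: annual income (in thousands) vs. amount spent at the sale.
income = [20, 40, 60, 80]
spend = [110, 210, 310, 410]

slope, intercept = fit_line(income, spend)
predicted = slope * 50 + intercept  # predicted spend for an income of 50
print(slope, intercept, predicted)  # 5.0 10.0 260.0
```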

How Does Classification Work?

With the help of the bank loan application that we have discussed above, let us understand the

working of classification. The Data Classification process includes two steps −

 Building the Classifier or Model

 Using Classifier for Classification

Building the Classifier or Model

 This step is the learning step or the learning phase.

 In this step the classification algorithms build the classifier.

 The classifier is built from the training set made up of database tuples and their associated class labels.

 Each tuple in the training set is assumed to belong to a predefined category or class. These tuples are also referred to as samples, objects, or data points.

Using Classifier for Classification

In this step, the classifier is used for classification. Here the test data is used to estimate the accuracy

of classification rules. The classification rules can be applied to the new data tuples if the accuracy

is considered acceptable.
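The two steps can be sketched end to end. The example below assumes a 1-nearest-neighbour classifier, chosen only because it is short. The notes do not prescribe a particular algorithm, and the loan figures and the 0.9 acceptance threshold are hypothetical.

```python
def build_classifier(training_set):
    """Learning step: a 1-nearest-neighbour 'model' simply stores the labelled tuples."""
    return list(training_set)


def classify(model, features):
    """Classification step: return the class label of the closest training tuple."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(model, key=lambda item: squared_distance(item[0], features))
    return label


def accuracy(model, test_set):
    """Fraction of test tuples whose predicted label matches the known label."""
    correct = sum(1 for features, label in test_set if classify(model, features) == label)
    return correct / len(test_set)


# Made-up loan data: (income, existing debt) in thousands -> class label.
training = [((80, 5), "safe"), ((75, 10), "safe"), ((30, 40), "risky"), ((25, 35), "risky")]
test = [((70, 8), "safe"), ((28, 38), "risky"), ((60, 15), "safe")]

model = build_classifier(training)
acc = accuracy(model, test)
print(acc)  # 1.0

# Apply the classifier to a new tuple only if the estimated accuracy is acceptable.
if acc >= 0.9:
    print(classify(model, (35, 30)))  # risky
```

The test tuples are kept separate from the training tuples so that the accuracy estimate is not measured on data the classifier has already seen.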

Classification and Prediction Issues

The major issue is preparing the data for Classification and Prediction. Preparing the data involves

the following activities −

Data Cleaning − Data cleaning involves removing noise and treating missing values. Noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.

Relevance Analysis − The database may also contain irrelevant attributes. Correlation analysis is used to determine whether any two given attributes are related.

Data Transformation and reduction − The data can be transformed by any of the following methods.

o Normalization − The data is transformed using normalization. Normalization involves scaling all values of a given attribute so that they fall within a small specified range. Normalization is used when the learning step involves neural networks or methods based on distance measurements.

o Generalization − The data can also be transformed by generalizing it to the higher concept. For this purpose we can use the concept hierarchies.

Note − Data can also be reduced by some other methods such as wavelet transformation, binning,

histogram analysis, and clustering.
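Two of the preparation steps described above are easy to show in code: replacing a missing value with the most commonly occurring value, and normalizing an attribute into a small range. The sketch assumes min-max scaling to [0, 1] as the normalization method and uses made-up attribute values.

```python
from collections import Counter


def fill_missing(values):
    """Replace None with the most commonly occurring value of the attribute."""
    most_common = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale numeric values linearly into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # constant attribute, nothing to scale
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


occupation = ["clerk", None, "engineer", "clerk", None]
income = [20000, 50000, 80000]

print(fill_missing(occupation))   # ['clerk', 'clerk', 'engineer', 'clerk', 'clerk']
print(min_max_normalize(income))  # [0.0, 0.5, 1.0]
```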

Comparison of Classification and Prediction Methods

Here are the criteria for comparing the methods of Classification and Prediction −

Accuracy − The accuracy of a classifier refers to its ability to predict the class label correctly; the accuracy of a predictor refers to how well it can guess the value of the predicted attribute for new data.

Speed − This refers to the computational cost in generating and using the classifier or predictor.

Robustness − It refers to the ability of classifier or predictor to make correct predictions from given noisy data.

Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently, given a large amount of data.

Interpretability − It refers to the level of understanding and insight that the classifier or predictor provides.