
























These notes cover all the important questions.
Typology: Lecture notes

























A cluster of data objects can be treated as one group.
While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign the labels to the groups.
The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.
Clustering analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.
Clustering can also help marketers discover distinct groups in their customer base and characterize those groups based on purchasing patterns.
In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionalities and gain insight into structures inherent to populations.
Clustering also helps in identification of areas of similar land use in an earth observation database. It also helps in the identification of groups of houses in a city according to house type, value, and geographic location.
Clustering also helps in classifying documents on the web for information discovery.
Clustering is also used in outlier detection applications such as detection of credit card fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data to observe characteristics of each cluster.
Scalability − We need highly scalable clustering algorithms to deal with large databases.
Ability to deal with different kinds of attributes − Algorithms should be applicable to any kind of data, such as interval-based (numerical), categorical, and binary data.
Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be limited to distance measures that tend to find only spherical clusters of small size.
High dimensionality − The clustering algorithm should not only be able to handle low-dimensional data but also the high dimensional space.
Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may lead to poor quality clusters.
Interpretability − The clustering results should be interpretable, comprehensible, and usable.
Clustering Methods
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Each group contains at least one object.
Each object must belong to exactly one group.
For a given number of partitions (say k), the partitioning method will create an initial partitioning.
Then it uses the iterative relocation technique to improve the partitioning by moving objects from one group to another.
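The iterative relocation idea can be illustrated with a minimal k-means-style loop. This is only a sketch under stated assumptions: the notes do not name a specific algorithm, so the choice of k-means, the function name `kmeans`, the way the initial partitioning is chosen, and the sample points are all hypothetical.

```python
import math

def kmeans(points, k, max_iter=100):
    """Partition points into k groups by iterative relocation (k-means style)."""
    # Initial partitioning: use the first k points as starting centroids
    # (a simplistic, deterministic choice made only for this sketch).
    centroids = [points[i] for i in range(k)]
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Relocation step: move each object to the group with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed group, so the partitioning is stable
        assignment = new_assignment
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a group became empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids

# Hypothetical 2-D data with two visually separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
labels, centers = kmeans(points, k=2)
```

Each object ends up with exactly one label, which matches the requirement that every object belongs to exactly one group. `math.dist` requires Python 3.8 or later.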
Hierarchical Methods
Agglomerative Approach
Divisive Approach
Agglomerative Approach
Divisive Approach
Audience
Prerequisites
What is Data Mining?
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Data Mining Applications
Market Analysis and Management
Corporate Analysis & Risk Management
Fraud Detection
Market Analysis and Management
Customer Profiling − Data mining helps determine what kind of people buy what kind of products.
Identifying Customer Requirements − Data mining helps in identifying the best products for different customers. It uses prediction to find the factors that may attract new customers.
Cross Market Analysis − Data mining performs association and correlation analysis between product sales.
Target Marketing − Data mining helps to find clusters of model customers who share the same characteristics such as interests, spending habits, income, etc.
Determining Customer Purchasing Patterns − Data mining helps in determining customer purchasing patterns.
Providing Summary Information − Data mining provides various multidimensional summary reports.
Corporate Analysis and Risk Management
Finance Planning and Asset Evaluation − It involves cash flow analysis and prediction, and contingent claim analysis to evaluate assets.
Resource Planning − It involves summarizing and comparing the resources and spending.
Competition − It involves monitoring competitors and market directions.
Fraud Detection
Data Mining - Tasks
Descriptive
Classification and Prediction
Descriptive Function
Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters
Class/Concept Description
Mathematical Formulae
Neural Networks
Classification − It predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e. data objects whose class labels are known.
Prediction − It is used to predict missing or unavailable numerical data values rather than class labels. Regression Analysis is generally used for prediction. Prediction can also be used for identification of distribution trends based on available data.
Outlier Analysis − Outliers may be defined as the data objects that do not comply with the general behavior or model of the data available.
Evolution Analysis − Evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
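The Prediction item above says that regression analysis is generally used to predict numerical values. A minimal sketch of that idea is a least-squares line fit. The function name `fit_line` and the experience/salary figures are hypothetical and serve only as an illustration; they are not taken from the notes.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of experience vs. salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]

slope, intercept = fit_line(years, salary)
# Predict the unavailable numerical value for 6 years of experience.
predicted = slope * 6 + intercept
```

Unlike classification, the output here is a continuous number rather than a class label.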
Data Mining Task Primitives
We can specify a data mining task in the form of a data mining query. This query is input to the system. A data mining query is defined in terms of data mining task primitives.
Set of task relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in discovery process.
Interestingness measures and thresholds for pattern evaluation.
Representation for visualizing the discovered patterns.
Set of task relevant data to be mined
Database Attributes
Data Warehouse dimensions of interest
Kind of knowledge to be mined
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
Background knowledge
Interestingness measures and thresholds for pattern evaluation
Representation for visualizing the discovered patterns
Rules
Tables
Charts
Graphs
Decision Trees
Cubes
Data Mining - Issues
Mining Methodology and User Interaction
Performance Issues
Diverse Data Types Issues
Mining Methodology and User Interaction Issues
Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore, data mining needs to cover a broad range of knowledge discovery tasks.
suppliers, sales, revenue, etc. The data warehouse does not focus on ongoing operations; rather, it focuses on modelling and analysis of data for decision-making.
Integrated − Data warehouse is constructed by integration of data from heterogeneous sources such as relational databases, flat files etc. This integration enhances the effective analysis of data.
Time Variant − The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.
Non-volatile − Non-volatile means that previous data is not removed when new data is added. The data warehouse is kept separate from the operational database, so frequent changes in the operational database are not reflected in the data warehouse.
Data Warehousing
Query Driven Approach
Update Driven Approach
Query-Driven Approach
Process of Query Driven Approach
When a query is issued on the client side, a metadata dictionary translates it into queries appropriate for each heterogeneous site involved.
These queries are then mapped and sent to the local query processors.
The results from heterogeneous sites are integrated into a global answer set.
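The three steps above can be sketched as a small mediator. Everything in this example is hypothetical and chosen for illustration: the two sources, their local attribute names, the `METADATA` dictionary layout, and the functions `local_query` and `global_query` are assumptions, not part of any real system described in the notes.

```python
# Two hypothetical heterogeneous sources with different local schemas.
SOURCE_A = [{"cust": "Ann", "amt": 120}, {"cust": "Bob", "amt": 80}]
SOURCE_B = [{"customer_name": "Cid", "total": 200}]
SITES = {"site_a": SOURCE_A, "site_b": SOURCE_B}

# Metadata dictionary: maps global attribute names to each site's local names.
METADATA = {
    "site_a": {"customer": "cust", "amount": "amt"},
    "site_b": {"customer": "customer_name", "amount": "total"},
}

def local_query(rows, attr, minimum):
    """Stand-in for a site's local query processor."""
    return [row for row in rows if row[attr] >= minimum]

def global_query(min_amount):
    """Translate a global query per site, run it locally, integrate the results."""
    answer = []
    for site, mapping in METADATA.items():
        # Step 1: translate the global attribute into the site's vocabulary.
        local_attr = mapping["amount"]
        # Step 2: send the translated query to the local query processor.
        rows = local_query(SITES[site], local_attr, min_amount)
        # Step 3: map local results back to the global schema and integrate them.
        for row in rows:
            answer.append({g: row[l] for g, l in mapping.items()})
    return answer

result = global_query(100)
```

Because translation and integration happen on every call, repeating the same query repeats all of this work, which is the cost the disadvantages below refer to.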
Disadvantages
The Query Driven Approach needs complex integration and filtering processes.
It is very inefficient and very expensive for frequent queries.
This approach is expensive for queries that require aggregations.
Update-Driven Approach
Advantages
This approach provides high performance.
The data can be copied, processed, integrated, annotated, summarized and restructured in the semantic data store in advance.
From Data Warehousing (OLAP) to Data Mining (OLAM)
Importance of OLAM
High quality of data in data warehouses − The data mining tools are required to work on integrated, consistent, and cleaned data. These steps are very costly in the preprocessing of data. The data warehouses constructed by such preprocessing are valuable sources of high quality data for OLAP and data mining as well.
Available information processing infrastructure surrounding data warehouses − Information processing infrastructure refers to accessing, integration, consolidation, and transformation of multiple heterogeneous databases, web-accessing and service facilities, reporting and OLAP analysis tools.
OLAP−based exploratory data analysis − Exploratory data analysis is required for effective data mining. OLAM provides facilities for data mining on various subsets of data and at different levels of abstraction.
Online selection of data mining functions − Integrating OLAP with multiple data mining functions and online analytical mining provides users with the flexibility to select desired data mining functions and swap data mining tasks dynamically.
Data Mining - Terminologies
Data Mining
Market Analysis
Evaluate mined patterns.
Visualize the patterns in different forms.
Data Integration
Data Cleaning
Data Selection
Clusters
Data Transformation
Data Mining - Systems
Spatial Data Analysis
Information Retrieval
Pattern Recognition
Image Analysis
Signal Processing
Computer Graphics
Web Technology
Business
Bioinformatics
Data Mining System Classification
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines
Classification Based on the Databases Mined
Classification Based on the kind of Knowledge Mined
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Outlier Analysis
Evolution Analysis
Classification Based on the Techniques Utilized
Classification Based on the Applications Adapted
Syntax for Specifying the Kind of Knowledge
Characterization
Discrimination
Association
Classification
Prediction
Syntax for Concept Hierarchy Specification
Data Mining - Classification & Prediction
Classification
Prediction
What is classification?
A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
A marketing manager at a company needs to analyze whether a customer with a given profile will buy a new computer.
What is prediction?
How Does Classification Work?
Building the Classifier or Model
Using Classifier for Classification
Building the Classifier or Model
This step is the learning step or the learning phase.
In this step the classification algorithms build the classifier.
The classifier is built from the training set made up of database tuples and their associated class labels.
Each tuple in the training set belongs to a predefined category or class. These tuples can also be referred to as samples, objects, or data points.
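The learning step can be sketched with a very simple nearest-centroid classifier. This is an assumption made for illustration: the notes do not prescribe a classification algorithm, and the function names `build_classifier` and `classify`, the two attributes, and the training tuples below are all hypothetical.

```python
import math
from collections import defaultdict

def build_classifier(training_set):
    """Learning step: compute one centroid (mean tuple) per class label."""
    grouped = defaultdict(list)
    for features, label in training_set:
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in grouped.items()
    }

def classify(model, features):
    """Classification step: pick the class whose centroid is nearest."""
    return min(model, key=lambda label: math.dist(features, model[label]))

# Hypothetical training tuples: (income, existing debt) in thousands,
# each paired with its known class label.
training_set = [
    ((60, 5), "safe"), ((75, 10), "safe"),
    ((20, 30), "risky"), ((25, 40), "risky"),
]

model = build_classifier(training_set)
label = classify(model, (70, 8))
```

The model returned by `build_classifier` is built once from the labelled tuples and can then be reused to label new loan applicants whose class is unknown.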
Using Classifier for Classification
Classification and Prediction Issues
Data Cleaning − Data cleaning involves removing noise and treating missing values. Noise is removed by applying smoothing techniques, and missing values are handled by replacing each one with the most commonly occurring value for that attribute.
Relevance Analysis − A database may also contain irrelevant attributes. Correlation analysis is used to determine whether any two given attributes are related.
Data Transformation and reduction − The data can be transformed by any of the following methods.
o Normalization − The data is transformed using normalization. Normalization involves scaling all values of a given attribute so that they fall within a small specified range. Normalization is used when the learning step involves neural networks or methods based on distance measurements.
o Generalization − The data can also be transformed by generalizing it to higher-level concepts. For this purpose we can use concept hierarchies.
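Two of the preparation steps described above, replacing missing values with the most common value and normalizing an attribute into a small range, can be sketched as follows. Min-max scaling is one common normalization choice and is an assumption here, as are the function names and the sample values.

```python
from collections import Counter

def fill_missing(values):
    """Replace None with the most commonly occurring value of the attribute."""
    most_common = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so map them to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

# Hypothetical attribute columns.
colors = fill_missing(["red", None, "blue", "red"])
incomes = min_max_normalize([20, 50, 80])
```

Scaling matters for distance-based methods because an attribute with a large numeric range would otherwise dominate the distance computation.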
Comparison of Classification and Prediction Methods
Accuracy − The accuracy of a classifier refers to its ability to predict the class label correctly. The accuracy of a predictor refers to how well it can guess the value of the predicted attribute for new data.
Speed − This refers to the computational cost in generating and using the classifier or predictor.
Robustness − It refers to the ability of classifier or predictor to make correct predictions from given noisy data.
Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently, given a large amount of data.
Interpretability − It refers to the extent to which the classifier or predictor can be understood.
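The accuracy criterion above can be made concrete with a short sketch. Classifier accuracy is computed here as the fraction of correctly predicted labels. For predictors, the notes do not name an error measure, so mean absolute error is used as an assumption. The function names and sample values are hypothetical.

```python
def classifier_accuracy(true_labels, predicted_labels):
    """Fraction of objects whose class label was predicted correctly."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def mean_absolute_error(true_values, predicted_values):
    """Average absolute gap between actual and predicted numerical values."""
    total = sum(abs(t - p) for t, p in zip(true_values, predicted_values))
    return total / len(true_values)

# Hypothetical test results for a classifier and a predictor.
acc = classifier_accuracy(
    ["safe", "risky", "safe", "safe"],
    ["safe", "risky", "risky", "safe"],
)
mae = mean_absolute_error([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
```

A higher accuracy is better for a classifier, while a lower error is better for a predictor.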