Data Mining Concepts: Q&A on Data Preprocessing and Analysis, Exams of Advanced Education

A comprehensive overview of data mining concepts, focusing on data preprocessing techniques. It covers central tendencies, OLTP and OLAP, and various OLAP operations. It also delves into data cleaning, transformation, and reduction methods, including handling noisy, incomplete, and inconsistent data, and into data mining tasks such as concept description, association analysis, classification, clustering, outlier detection, and evolution analysis. It explains classification methods, decision trees, Bayes' theorem, and clustering requirements, offering a structured approach to understanding data mining principles and practices. Useful for students and professionals seeking to understand the fundamentals of data mining and data analysis.

Typology: Exams

2024/2025

Available from 08/10/2025

wamunyorojj
wamunyorojj 🇺🇸

Data Mining test
What are the 3 measures of central tendency? - answer: Mean, Median, Mode
Define dispersion - answer: The degree to which numerical data tends to spread
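As a quick illustration of the two answers above, the sketch below computes the three central tendencies and two common dispersion measures with Python's standard `statistics` module. The sample values are hypothetical, and the choice of range and population standard deviation as dispersion measures is an assumption, not something the test specifies.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(values)      # arithmetic average
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequent value

# Two common ways to quantify dispersion (how far the data spreads).
value_range = max(values) - min(values)
std_dev = statistics.pstdev(values)  # population standard deviation

print(mean, median, mode, value_range, round(std_dev, 2))
```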
What do OLTP and OLAP stand for, and what does each mean? - answer:
Online Transaction Processing - day-to-day transactions
Online Analytical Processing - tool provided by a data warehouse management system for data analysis and decision making
What are the 4 OLAP operations and what do they do? - answer:
Drill up - summarises data
Drill down - moves from summary to more detailed data
Slice & Dice - project & select
Pivot - re-orientates the cube
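The four operations above can be sketched on a tiny in-memory "cube". This is only an illustrative sketch: the fact table, its dimensions, and the function names are all hypothetical, and a real OLAP tool would work against a data warehouse rather than a Python list.

```python
from collections import defaultdict

# Hypothetical fact table: (year, quarter, region, product, sales).
facts = [
    (2024, "Q1", "North", "Laptop", 10),
    (2024, "Q1", "South", "Laptop", 7),
    (2024, "Q2", "North", "Phone", 5),
    (2024, "Q2", "South", "Phone", 8),
]

def roll_up(rows, dim_index):
    """Drill up: summarise sales along one dimension."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dim_index]] += row[-1]
    return dict(totals)

def slice_cube(rows, dim_index, value):
    """Slice: select the rows where one dimension has a fixed value."""
    return [r for r in rows if r[dim_index] == value]

def dice_cube(rows, conditions):
    """Dice: select on several dimensions at once ({dim_index: allowed values})."""
    return [r for r in rows
            if all(r[i] in allowed for i, allowed in conditions.items())]

def pivot(rows, row_dim, col_dim):
    """Pivot: re-orientate the data as {row value: {column value: total sales}}."""
    table = defaultdict(lambda: defaultdict(int))
    for r in rows:
        table[r[row_dim]][r[col_dim]] += r[-1]
    return {k: dict(v) for k, v in table.items()}

print(roll_up(facts, 2))           # sales summarised per region
print(slice_cube(facts, 1, "Q1"))  # only the Q1 rows
print(pivot(facts, 2, 1))          # regions as rows, quarters as columns
```

Drill down is the reverse of `roll_up`: going from the summarised totals back to the detailed rows in `facts`.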
Difference between data mining and OLAP - answer:
OLAP - tells what's happening to the data
DM - tells what's happening to the data, why it's happening, and can predict the future
What is the purpose of pre-processing? - answer: To deal with
-Noisy data
-Incomplete data
-Inconsistent data
What is noisy data and what causes it? - answer: Random errors, outliers
Causes: faulty data collection equipment, human/computer input error
What is incomplete data and what causes it? - answer: Data missing attribute values or missing attributes of interest
Causes: value unknown or not considered important at data collection, equipment malfunction
What is inconsistent data and what causes it? - answer: Data containing discrepancies in naming convention, data domain or format
Causes:
-different sources
-changes in strategy of data warehousing
-functional dependency violations
Purpose of data cleaning and what it involves (4) - answer: Removing noisy data & correcting inconsistencies
-filling in missing values
-smoothing noisy data

-identifying and removing outliers
-resolving inconsistencies
What are the methods for dealing with missing values? Give Adv/Dis - answer:
Ignore tuples with missing values
A - small effect if the no. of affected tuples is small
D - large effect if the no. of affected tuples is large (causes bias)
Manually fill in missing values
D - time consuming, error prone
Replace missing value with a constant
A - simple, smaller bias than ignoring
D - noise may be introduced
Predict missing value with a model
A - most effective
D - complex, computationally expensive
What is the method of Z-score Normalisation? - answer: Maps the data to a new dataset using the mean & standard deviation (helps detect outliers)
V' = (V - mean) / standard deviation
What is the formula to calculate a new value in Min-Max Normalisation? - answer: V' = (V - min) / (max - min) x (newMax - newMin) + newMin
Define the 4 methods of dealing with noise - answer:
Binning - smoothing a value using the values around it
Regression - smoothing using a regression function
Clustering - detect and remove outliers
Combined human & computer inspection
What are the 3 steps of binning? - answer:
-Sort the dataset
-Partition the data into equal buckets (equal depth = frequency, equal width = distance)
-Replace each value in a bin with an appropriate value (normally the bin mean)
What are the issues with data integration (3), and the benefits if done correctly (2)? - answer:
Issues:
-schema integration
-entity identification problem (soccer/football)
-data value conflicts (miles/km)
If done correctly:
-helps reduce/avoid redundancies & inconsistencies
-improves mining speed and quality
Define 4 methods of data transformation - answer:
Smoothing - removing noise from data
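The Z-score and Min-Max formulas and the three binning steps given above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and the `prices` sample are hypothetical, the population standard deviation is assumed for the Z-score, and the binning helper assumes equal-depth buckets with a dataset size divisible by the number of bins.

```python
import statistics

def z_score(values):
    """Z-score normalisation: v' = (v - mean) / standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population std dev; must be non-zero
    return [(v - mean) / std for v in values]

def min_max(values, new_min=0.0, new_max=1.0):
    """Min-Max normalisation: v' = (v - min) / (max - min) * (newMax - newMin) + newMin."""
    lo, hi = min(values), max(values)  # assumes max > min
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def smooth_by_bin_means(values, n_bins):
    """Equal-depth binning: sort, partition into equal buckets, replace by bin mean."""
    ordered = sorted(values)                      # step 1: sort
    depth = len(ordered) // n_bins                # assumes len(values) divides evenly
    smoothed = []
    for start in range(0, len(ordered), depth):   # step 2: partition
        bucket = ordered[start:start + depth]
        smoothed.extend([statistics.mean(bucket)] * len(bucket))  # step 3: replace
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sample data

print(min_max(prices))
print(smooth_by_bin_means(prices, 3))
```

With three bins, the sorted prices fall into the buckets 4, 8, 15 / 21, 21, 24 / 25, 28, 34, and every value is replaced by the mean of its bucket.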

What is the principle behind the Apriori algorithm? - answer: "All nonempty subsets of a frequent itemset must also be frequent"
What are the advantages of the Apriori algorithm? - answer: Reduces the complexity of mining association rules by reducing the number of
-Candidates
-Transactions
-Comparisons in the database
Increases algorithm efficiency
What are the issues with the thresholds (Smin, Cmin)? - answer:
Smin
-If set too high, rare itemsets may be lost
-If set too low, many itemsets that pass the threshold do not actually occur frequently
Cmin
-Too high, only a few rules may be found
-Too low, many rules are uncertain
Why are inexplicable and redundant rules not interesting? - answer:
Inexplicable - not actionable
Redundant - don't meet threshold
What's the equation for lift? - answer: C(A=>B) / S(B), where C = confidence and S = support
Define the 3 types of nodes in a decision tree - answer:
-One root node - first attribute selected to group samples (e.g. Age)
-Intermediate nodes - other attributes selected to further divide samples
-Leaf nodes - resulting class labels (e.g. Buy_computer)
What are the advantages of decision trees? - answer:
Easy to classify a new sample
Learning speed is faster than other methods
Accuracy is comparable to other methods
Robust to noise in the dataset
Copes with both nominal and numerical data
Easily converted into a set of classification rules that are simple and easy to understand
In a decision tree, when does partitioning stop? - answer:
•All samples for a given node belong to the same class
•There are no more attributes for further partitioning
•There are no samples left
What are the 2 algorithms used to determine the attribute for splitting (and how do they determine it)? - answer:
ID3 algorithm - selects the attribute with the highest info gain
C4.5 algorithm - selects the attribute with the highest info gain ratio
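Going back to the association-rule answers above, the sketch below computes support, confidence and lift, and checks the Apriori property on one itemset. The transactions, item names and function names are hypothetical and chosen only for illustration; this is not a full Apriori implementation.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    """S(X): fraction of transactions that contain every item in X."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    """C(A=>B) = S(A and B together) / S(A)."""
    return support(set(a) | set(b)) / support(a)

def lift(a, b):
    """Lift = C(A=>B) / S(B)."""
    return confidence(a, b) / support(b)

def subsets_are_frequent(itemset, min_support):
    """Apriori property: every nonempty proper subset of a frequent itemset is frequent."""
    items = sorted(itemset)
    return all(support(subset) >= min_support
               for size in range(1, len(items))
               for subset in combinations(items, size))

print(support({"bread", "milk"}))
print(confidence({"bread"}, {"milk"}))
print(lift({"bread"}, {"milk"}))
print(subsets_are_frequent({"bread", "milk"}, 0.5))
```

A lift below 1 suggests that A and B occur together less often than would be expected if they were independent; a lift above 1 suggests the opposite.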

How is the info expected calculated? - answer: no. of tuples in the class / total no. of tuples
How is the expected info (entropy) calculated? - answer: -( t/(t+f) x log2(t/(t+f)) + f/(t+f) x log2(f/(t+f)) )
where t = no. of true tuples, f = no. of false tuples
How to calculate the information needed? - answer: (A1/N x I1) + (A2/N x I2) + ... + (An/N x In)
where A = no. of instances in group, I = info expected of group, N = total instances
How is the info gain calculated? - answer: info expected - info needed
How is split info calculated? - answer: -(A1/N) x log2(A1/N) - (A2/N) x log2(A2/N) - ... - (An/N) x log2(An/N)
where A = no. of instances in group, N = total instances
How is the info gain ratio calculated? - answer: info gain / split info
What's the difference between info gain and info gain ratio? - answer:
info gain - biased towards multi-valued attributes and may lead to overfitting
info gain ratio - tends to prefer unbalanced splits in which one partition is much smaller than the others
What is Occam's Razor? - answer: "Given two models with the same testing error, the simpler one is preferred."
What is overfitting and what causes it? - answer: Too many branches
Causes
•Small number of representative samples in the training data
•Noisy data or outliers
•Model complexity
Aims of tree pruning - answer:
Avoid overfitting
Reduce model complexity
Make the tree easier to understand
What are the 2 different methods of decision tree evaluation? - answer:
Partition (holdout)
Cross validation
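The entropy, information needed, info gain, split info and gain ratio formulas above can be worked through in a few lines. This is a sketch only: the function names are hypothetical, and the class counts (9 positive and 5 negative tuples, split three ways by one attribute) are an assumed example, not data from the test.

```python
import math

def entropy(class_counts):
    """Expected info: -sum(p * log2(p)) over the classes, with p = count / total."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

def info_needed(groups):
    """Weighted entropy after a split: sum(A_i / N * I_i) over the groups."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * entropy(g) for g in groups)

def info_gain(class_counts, groups):
    """Info gain = info expected - info needed."""
    return entropy(class_counts) - info_needed(groups)

def split_info(groups):
    """Split info: -sum(A_i / N * log2(A_i / N)) over the groups."""
    n = sum(sum(g) for g in groups)
    return -sum((sum(g) / n) * math.log2(sum(g) / n) for g in groups if sum(g) > 0)

def gain_ratio(class_counts, groups):
    """Info gain ratio = info gain / split info (as used by C4.5)."""
    return info_gain(class_counts, groups) / split_info(groups)

# Assumed example: [true, false] counts overall and within each attribute value.
overall = [9, 5]
by_attribute_value = [[2, 3], [4, 0], [3, 2]]

print(round(entropy(overall), 3))
print(round(info_gain(overall, by_attribute_value), 3))
print(round(gain_ratio(overall, by_attribute_value), 3))
```

ID3 would compare `info_gain` across candidate attributes and C4.5 would compare `gain_ratio`, in each case splitting on the attribute with the highest value.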

What are some requirements of clustering? - answer:
Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Able to deal with noise and outliers
Insensitive to the order of input records
High dimensionality
Interpretability and usability
What is a distance function used for? - answer: To measure similarity between two instances
What are two popular distance functions? - answer:
•Manhattan distance
•Euclidean distance
What are some features of ordinal variables (3)? - answer:
Either discrete or continuous
Order is important
Can be treated like interval-scaled variables
For nominal variables, how can the distance be calculated? - answer: (V - m) / V
where V = no. of variables, m = no. of matches
What is a centroid? - answer: The central point (mean point) of a cluster
Define the two popular cluster partitioning methods - answer:
K-means - each cluster is represented by its centroid
K-medoids - each cluster is represented by one of the points in the cluster
What are the steps of K-means clustering? - answer:
1. Randomly select K points as the initial centroids
2. Loop (steps 3 & 4)
3. Assign each point to the nearest centroid to form a cluster
4. Compute the centroid of each cluster
5. Stop if there are no more new assignments, i.e. the centroids do not change any more
Advantages of K-means clustering - answer:
Easy to understand
Relatively efficient
Scalable
Disadvantages of K-means clustering - answer:
Applicable only to numerical data
Need to specify the no. of clusters in advance
Sensitive to noisy data and outliers

(remember the above, it's easier)
Unable to handle clusters of different sizes and densities
Unclear as to which attributes are more important
Lack of explanation about the nature of the clusters discovered
What is the advantage and disadvantage of K-medoids clustering? - answer:
A - more robust in the presence of noise and outliers
D - more costly, suitable only for small datasets (not scalable)
Give two examples of when outliers could be useful - answer:
-Credit card fraud detection
-Medical analysis
To identify anomalies we consider (2) - answer:
No. of attributes used to define the anomaly
Whether it's an anomaly globally or locally
How does a statistical detection method approach work (2 Dis)? - answer: Assumes a distribution or probability model for a given dataset and then identifies outliers with respect to the model
-Most tests are for a single value
-Data distribution may be unknown
How does a distance-based detection method approach work (1 Adv, 1 Dis)? - answer: Anomalies are those objects that are distant from the other objects
-Enables multi-dimensional analysis without knowing the data distribution
-Cannot handle data with uneven densities
How does a density-based detection method approach work (1 Adv, 1 Dis), and define LOF - answer: An object is an outlier if its density is relatively much lower than that of its neighbours (local outliers)
-LOF is useful for discovering local outliers
-Costly to compute the LOF for every object
Local Outlier Factor = the degree to which the object is isolated with respect to its neighbours
How does a cluster-based detection method approach work (1 Adv, 1 Dis)? - answer: An object is an outlier if it:
-Does not belong to any cluster
-Has a large distance between itself and the closest cluster
-Belongs to a small/sparse cluster
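Returning to the five K-means steps listed earlier in the clustering answers, they can be sketched in plain Python as below. This is an illustrative sketch, not a production implementation: the function names, the `seed` and `max_iter` parameters, and the sample points are hypothetical, Euclidean distance is assumed as the distance function, and an empty cluster simply keeps its previous centroid.

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: pick K points as initial centroids
    assignment = None
    for _ in range(max_iter):          # step 2: loop over steps 3 and 4
        # Step 3: assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop when no point changes cluster any more.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: recompute each centroid as the mean point of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

# Hypothetical 2-D points forming two loose groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignment = k_means(points, k=2)
print(centroids)
print(assignment)
```

Because the initial centroids are chosen at random, different seeds can give different cluster numbering, and on harder data different final clusters, which is one reason K must be chosen and checked with care.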