Data Mining Concepts: Q&A on Data Preprocessing and Analysis, Exams of Advanced Education

Yale University Advanced Education

A comprehensive overview of data mining concepts, focusing on data preprocessing techniques. It covers central tendencies, OLTP and OLAP, and various OLAP operations. It also delves into data cleaning, transformation, and reduction methods, including handling noisy, incomplete, and inconsistent data. It addresses data mining tasks such as concept description, association analysis, classification, clustering, outlier detection, and evolution analysis. It explains classification methods, decision trees, Bayes' theorem, and clustering requirements, offering a structured approach to understanding data mining principles and practices. Useful for students and professionals seeking to understand the fundamentals of data mining and data analysis.

Typology: Exams

2024/2025

Available from 08/10/2025

wamunyorojj 🇺🇸

339 documents


Data Mining test

What are the 3 central tendencies? - answer: Mean, median, mode

Define dispersion - answer: The degree to which numerical data tends to spread
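As a minimal sketch of the two answers above, the three central tendencies and two common dispersion measures can be computed with Python's standard `statistics` module. The sample values are made up for illustration, and the choice of range and population standard deviation as dispersion measures is an assumption, not something the notes specify.

```python
import statistics

# Hypothetical sample data, chosen only for illustration
values = [2, 3, 3, 5, 7, 10]

# Central tendencies
mean = statistics.mean(values)      # arithmetic average
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequent value

# Dispersion: how far the data spreads
value_range = max(values) - min(values)
std_dev = statistics.pstdev(values)  # population standard deviation

print(mean, median, mode, value_range, std_dev)
```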

What do OLTP and OLAP stand for, and what do they mean? - answer:
Online Transaction Processing - day-to-day transactions
Online Analytical Processing - tool provided by a data warehouse management system for data analysis and decision making

What are the 4 OLAP operations and what do they do? - answer:
Drill up - summarises data
Drill down - goes from summary to more detailed data
Slice & dice - project & select
Pivot - re-orientates the cube
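The four operations can be illustrated on a tiny in-memory fact table. This is only a sketch in plain Python: the `sales` rows, the column layout, and the `aggregate` helper are all hypothetical, and a real OLAP tool would run these operations against a data cube rather than a list.

```python
from collections import defaultdict

# Hypothetical fact table: (year, country, city, product, sales amount)
sales = [
    (2023, "UK", "London", "laptop", 10),
    (2023, "UK", "Leeds", "laptop", 4),
    (2023, "UK", "London", "phone", 7),
    (2024, "UK", "London", "laptop", 12),
    (2024, "FR", "Paris", "phone", 9),
]

def aggregate(rows, key):
    """Sum the sales amount over rows grouped by key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[4]
    return dict(totals)

# Drill down: the more detailed (country, city) level
by_city = aggregate(sales, lambda r: (r[1], r[2]))

# Drill up: summarise cities into countries
by_country = aggregate(sales, lambda r: r[1])

# Slice: select on one dimension (year 2023)
slice_2023 = [r for r in sales if r[0] == 2023]

# Dice: select on several dimensions, then project the columns of interest
dice = [(r[2], r[4]) for r in sales if r[0] == 2023 and r[3] == "laptop"]

# Pivot: re-orientate the view as product rows by year columns
pivot = defaultdict(dict)
for year, _country, _city, product, amount in sales:
    pivot[product][year] = pivot[product].get(year, 0) + amount
```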

What is the difference between data mining and OLAP? - answer:
OLAP - tells what is happening to the data
DM - tells what is happening to the data, why it is happening, and can predict the future

What is the purpose of pre-processing? - answer: To deal with
-Noisy data
-Incomplete data
-Inconsistent data

What is noisy data and what causes it? - answer:
Random errors, outliers
Causes: faulty data collection equipment, human/computer input error

What is incomplete data and what causes it? - answer:
Data missing attribute values or missing attributes of interest
Causes: data unknown or not considered important at data collection, equipment malfunction

What is inconsistent data and what causes it? - answer:
Data containing discrepancies in naming convention, data domain or format
Causes:
-different sources
-changes in data warehousing strategy
-functional dependency violations

Purpose of data cleaning and what it involves (4) - answer: Removing noisy data & correcting inconsistencies
-filling in missing values
-smoothing noisy data


-identifying and removing outliers
-resolving inconsistencies

What are the methods for dealing with missing values? Give advantages (A) and disadvantages (D) - answer:
Ignore tuples with missing values
A - small effect if the no. of affected tuples is small
D - large effect if the no. of affected tuples is large (causes bias)
Manually fill in missing values
D - time consuming/error prone
Replace missing value with a constant
A - simple, smaller bias than ignoring
D - noise may be introduced
Predict missing value with a model
A - most effective
D - complex, computationally expensive

What is the method of Z-score normalisation? - answer: Maps the values to a new dataset using the mean & standard deviation (helps detect outliers)
v' = (v - mean) / standard deviation

What is the formula to calculate a new value in Min-Max normalisation? - answer: v' = (v - min) / (max - min) x (newMax - newMin) + newMin

Define the 4 methods of dealing with noise - answer:
Binning - smoothing using the values around it
Regression - smoothing using a regression function
Clustering - detect and remove outliers
Combined human & computer inspection

What are the 3 steps of binning? - answer:
-Sort the dataset
-Partition the data into equal buckets (equal depth = frequency, equal width = distance)
-Replace each value in a bin with an appropriate value (normally the mean)

What are the issues with data integration (3), and the benefits if done correctly (2)? - answer:
-schema integration
-entity identification problem (soccer/football)
-data value conflicts (miles/km)
-helps reduce/avoid redundancies & inconsistencies
-improves mining speed and quality

Define 4 methods of data transformation - answer: Smoothing - removing noise from data
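The Z-score and Min-Max formulas and the three binning steps from the answers above can be sketched as follows. The function names and the `prices` sample are illustrative assumptions; the sketch also assumes the population standard deviation and that the values are not all identical (otherwise both normalisations would divide by zero).

```python
import statistics

def z_score(values):
    """v' = (v - mean) / standard deviation (population std dev assumed)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max(values, new_min=0.0, new_max=1.0):
    """v' = (v - min) / (max - min) x (newMax - newMin) + newMin."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def smooth_by_bin_means(values, depth):
    """Equal-depth binning: sort, partition into bins of `depth` values,
    then replace every value in a bin with the bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), depth):
        bin_values = ordered[start:start + depth]
        bin_mean = statistics.mean(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# Hypothetical sample data
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

print(z_score(prices))
print(min_max(prices, 0, 10))
print(smooth_by_bin_means(prices, depth=3))
```

Here the bins are equal depth (three values each); an equal-width variant would instead split the value range into intervals of the same size.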

What is the Apriori algorithm? - answer: "All nonempty subsets of a frequent itemset must also be frequent"

What are the advantages of the Apriori algorithm? - answer:
Reduces the complexity of mining association rules by reducing the number of
-Candidates
-Transactions
-Comparisons in the database
Increases algorithm efficiency

What are the issues with thresholds (Smin, Cmin)? - answer:
Smin
-If set too high, rare itemsets may be lost
-If set too low, many itemsets accepted as valid do not occur frequently
Cmin
-Too high, only a few rules may be found
-Too low, many rules are uncertain

Why are inexplicable and redundant rules not interesting? - answer:
Inexplicable - not actionable
Redundant - don't meet threshold

What is the equation for lift? - answer: C(A=>B) / S(B)

Define the 3 types of nodes in a decision tree - answer:
-One root node - first attribute selected to group samples (e.g. Age)
-Intermediate nodes - other attributes selected to further divide samples
-Leaf nodes - resulting class labels (e.g. Buy_computer)

What are the advantages of decision trees? - answer:
Easy to classify a new sample
Learning speed is faster than other methods
Accuracy is comparable to other methods
Robust to noise in the dataset
Copes with both nominal and numerical data
Easily converted into a set of classification rules that are simple and easy to understand

In a decision tree, when does partitioning stop? - answer:
•All samples for a given node belong to the same class
•There are no more attributes for further partitioning
•There are no samples left

What are the 2 algorithms used to determine the attribute for splitting (and how do they determine it)? - answer:
ID3 algorithm - selects the attribute with the highest info gain
C4.5 algorithm - selects the attribute with the highest info gain ratio
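To make the association-rule answers earlier in this part concrete, here is a small sketch of support S, confidence C, and the lift equation C(A=>B) / S(B). The transactions and item names are invented for illustration, and the helper functions are hypothetical rather than part of any library.

```python
# Hypothetical market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, data):
    """S(X): fraction of transactions that contain every item in X."""
    return sum(1 for t in data if itemset <= t) / len(data)

def confidence(a, b, data):
    """C(A=>B) = S(A and B together) / S(A)."""
    return support(a | b, data) / support(a, data)

def lift(a, b, data):
    """lift(A=>B) = C(A=>B) / S(B)."""
    return confidence(a, b, data) / support(b, data)

a, b = {"bread"}, {"milk"}
print(support(a | b, transactions))
print(confidence(a, b, transactions))
print(lift(a, b, transactions))
```

The quoted Apriori statement shows up here as a simple inequality: the support of `{"bread", "milk"}` can never exceed the support of `{"bread"}` or `{"milk"}` alone, which is what lets the algorithm discard candidates whose subsets are already infrequent.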

How is the info expected calculated? - answer: no. of tuples in the class / no. of tuples altogether

How is the expected info (entropy) calculated? - answer: -( (t/(t+f)) x log2(t/(t+f)) + (f/(t+f)) x log2(f/(t+f)) )
where t = no. of true tuples, f = no. of false tuples

How to calculate the information needed - answer: (A1/N x I1 + A2/N x I2 + ... + An/N x In)
where A = no. of instances in group, I = info expected of group, N = total instances

How is the info gain calculated? - answer: info expected - info needed

How is split info calculated? - answer: - A1/N x log2(A1/N) - A2/N x log2(A2/N) - ... - An/N x log2(An/N)
where A = no. of instances in group, N = total instances

How is the info gain ratio calculated? - answer: info gain / split info

What is the difference between info gain and info gain ratio? - answer:
info gain - biased towards multi-valued attributes and may lead to overfitting
info gain ratio - tends to prefer unbalanced splits in which one partition is much smaller than the others

What is Occam's Razor? - answer: "Given two models with the same testing error, the simpler one is preferred."

What is overfitting and what causes it? - answer: Too many branches
Causes
•Small number of representative samples in the training data
•Noisy data or outliers
•Model complexity

Aims of tree pruning - answer:
Avoid overfitting
Reduce model complexity
Make the tree easier to understand

What are the 2 different methods of decision tree evaluation? - answer:
Partition (holdout)
Cross validation
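The entropy, information needed, info gain, split info, and gain ratio formulas given earlier in this part can be checked with a short sketch. The class counts below (9 true and 5 false tuples, split into three groups by one attribute) are illustrative assumptions, and the function names are hypothetical.

```python
import math

def entropy(counts):
    """Expected info: -sum(p x log2(p)) over the class counts of one group."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_needed(groups):
    """A1/N x I1 + ... + An/N x In, where each group is a list of class counts."""
    n = sum(sum(g) for g in groups)
    return sum((sum(g) / n) * entropy(g) for g in groups)

def split_info(groups):
    """-A1/N x log2(A1/N) - ... - An/N x log2(An/N) over the group sizes."""
    n = sum(sum(g) for g in groups)
    return -sum((sum(g) / n) * math.log2(sum(g) / n) for g in groups if sum(g) > 0)

# Hypothetical split: each group holds [true count, false count]
groups = [[2, 3], [4, 0], [3, 2]]
overall = [9, 5]  # class totals before the split

gain = entropy(overall) - info_needed(groups)  # info expected - info needed
gain_ratio = gain / split_info(groups)         # info gain / split info

print(round(gain, 3), round(gain_ratio, 3))
```

Under the answers above, ID3 would compare `gain` across candidate attributes and C4.5 would compare `gain_ratio`.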

What are some requirements of clustering? - answer:
Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Interpretability and usability

What is a distance function used for? - answer: To measure similarity between two instances

What are two popular distance functions? - answer:
•Manhattan distance
•Euclidean distance

What are some features of ordinal variables (3)? - answer:
Either discrete/continuous
Order is important
Can be treated like interval-scaled

For nominal variables, how can the distance be calculated? - answer: (V - m) / V
V = no. of variables
m = number of matches

What is a centroid? - answer: The central point (mean point) of a cluster

Define the two popular cluster partitioning methods - answer:
K-means - each cluster is represented by its centroid
K-medoids - each cluster is represented by one of the points in the cluster

What are the steps of K-means clustering? - answer:
1. Randomly select K points as the initial centroids
2. Loop (steps 3 & 4)
3. Assign each point to the nearest centroid to form a cluster
4. Compute the centroid of each cluster
5. Stop if there are no more new assignments, i.e. the centroids no longer change

Advantages of K-means clustering - answer:
Easy to understand
Relatively efficient
Scalable

Disadvantages of K-means clustering - answer:
Applicable only to numerical data
Need to specify no. of clusters in advance
Sensitive to noisy data and outliers
(remember the above, it's easier)
Unable to handle clusters of different sizes and densities
Unclear as to which attributes are more important
Lack of explanation about the nature of the clusters discovered

What are the advantage and disadvantage of K-medoids clustering? - answer:
A - more robust in the presence of noise and outliers
D - more costly, suitable for small datasets (not scalable)

Give two examples of when outliers could be useful - answer:
-Credit card fraud detection
-Medical analysis

To identify anomalies we consider (2) - answer:
No. of attributes used to define the anomaly
Whether it is an anomaly globally or locally

How does a statistical detection approach work (2 Dis)? - answer: Assumes a distribution or probability model for a given dataset and then identifies outliers with respect to the model
-Most tests are for a single value
-Data distribution may be unknown

How does a distance-based detection approach work (1 Adv, 1 Dis)? - answer: Anomalies are objects that are distant from the other objects
A - enables multi-dimensional analysis without knowing the data distribution
D - cannot handle data with uneven densities

How does a density-based detection approach work (1 Adv, 1 Dis), and define LOF - answer: An object is an outlier if its density is relatively much lower than that of its neighbours (local outliers)
A - LOF is useful for discovering local outliers
D - costly to compute the LOF for every object
Local Outlier Factor (LOF) = the degree to which the object is isolated with respect to its neighbours

How does a cluster-based detection approach work (1 Adv, 1 Dis)? - answer: An object is an outlier if it:
-Does not belong to any cluster
-Is a large distance from the closest cluster
-Belongs to a small/sparse cluster
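The K-means steps listed earlier, together with the cluster-based idea that an object far from its closest cluster is a candidate outlier, can be sketched in plain Python. This is a minimal illustration under several assumptions: Euclidean distance via `math.dist` (Python 3.8+), a fixed random seed, made-up 2-D points, and a hypothetical `distance_to_closest_centroid` helper; no threshold for "large distance" is given in the notes, so none is applied here.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means on a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: random initial centroids
    assignment = None
    for _ in range(max_iter):          # step 2: loop
        # step 3: assign each point to its nearest centroid
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # step 5: no new assignments, stop
            break
        assignment = new_assignment
        # step 4: recompute each centroid as the mean point of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment

def distance_to_closest_centroid(point, centroids):
    """Cluster-based outlier score: distance from a point to its closest cluster."""
    return min(math.dist(point, c) for c in centroids)

# Hypothetical data: two well-separated groups
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
centroids, labels = kmeans(points, k=2)

print(labels)
print(distance_to_closest_centroid((50, 50), centroids))
```

A point such as (50, 50) scores far higher than any of the clustered points, which is the kind of gap a cluster-based detector would look for.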
