







































The Data Mining Ultimate Exam is a comprehensive study and practice resource designed for students, analysts, and data professionals seeking to master the principles of data mining, predictive analytics, and knowledge discovery. This exam covers essential topics such as classification, clustering, association rules, regression analysis, data preprocessing, visualization, machine learning foundations, and big data concepts. Learners will strengthen their ability to analyze complex datasets, identify meaningful patterns, and apply statistical techniques to real-world business and research problems. The Ultimate Exam includes realistic practice questions, scenario-based exercises, and detailed explanations to help candidates build confidence for academic assessments, professional certifications, and data-driven careers.
Typology: Exams

Question 1. Which phase of the KDD process involves transforming raw data into a suitable format for mining?
A) Data selection
B) Data cleaning
C) Data transformation
D) Pattern evaluation
Answer: C
Explanation: Data transformation converts cleaned data into appropriate forms (e.g., normalization, aggregation) before mining algorithms are applied.

Question 2. In a multi-tier data-warehouse architecture, which layer is primarily responsible for executing OLAP queries?
A) Data source layer
B) Data-warehouse layer
C) OLAP engine layer
D) Front-end layer
Answer: C
Explanation: The OLAP engine interprets and processes multidimensional queries, generating results for the front-end.

Question 3. What distinguishes a data mart from an enterprise data warehouse?
A) Data mart stores only transactional data.
B) Data mart is oriented to a specific business line, while an enterprise warehouse integrates data across the whole organization.
C) Data mart uses ROLAP exclusively.
D) Data mart cannot support drill-down operations.
Answer: B
Explanation: Data marts are subject-area subsets of an enterprise warehouse, designed for departmental use.

Question 4. Which OLAP operation creates a sub-cube by fixing one dimension to a single value?
A) Drill-down
B) Roll-up
C) Slice
D) Pivot
Answer: C
Explanation: Slicing selects a single value for one dimension, producing a lower-dimensional view.

Question 5. In a MOLAP system, the data is stored as:
A) Relational tables
B) Multidimensional arrays
C) Flat files
D) Key-value pairs
Answer: B
Explanation: MOLAP pre-aggregates data in multidimensional cubes, enabling fast query performance.

Question 6. Which schema design reduces redundancy by normalizing dimension tables?
A) Star schema
B) Snowflake schema
C) Fact constellation
D) Galaxy schema
Answer: B
Explanation: Snowflake schema normalizes dimensions into multiple related tables, decreasing redundancy.

Question 7. Which imputation technique replaces missing numeric values with the median of the attribute?
A) Hot-deck imputation
B) Mean substitution
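
The slice operation from Question 4 can be illustrated with a minimal sketch. The cube below is hypothetical data, represented as a plain dictionary keyed by (year, region, product); `slice_cube` is an illustrative helper name, not part of any OLAP product.

```python
# Hypothetical sales cube: (year, region, product) -> units sold.
cube = {
    (2023, "East", "Laptop"): 120,
    (2023, "West", "Laptop"): 95,
    (2023, "East", "Phone"): 200,
    (2024, "East", "Laptop"): 140,
    (2024, "West", "Phone"): 180,
}


def slice_cube(cube, axis, value):
    """Fix the dimension at position `axis` to `value` and drop it from the keys."""
    return {
        key[:axis] + key[axis + 1:]: measure
        for key, measure in cube.items()
        if key[axis] == value
    }


# Slice on year = 2023: the result is a two-dimensional (region, product) view.
sales_2023 = slice_cube(cube, axis=0, value=2023)
print(sales_2023)
```

Fixing the year removes that dimension from every key, which is why the result has one dimension fewer than the original cube.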
Question 11. Which feature-selection method evaluates subsets of attributes using a predictive model’s performance?
A) Filter method
B) Wrapper method
C) Embedded method
D) Correlation-based method
Answer: B
Explanation: Wrapper methods search over attribute subsets and assess them with a learning algorithm, directly measuring predictive accuracy.

Question 12. Principal Component Analysis (PCA) seeks to:
A) Maximize class separation.
B) Minimize reconstruction error while reducing dimensionality.
C) Preserve original feature meanings.
D) Increase the number of features.
Answer: B
Explanation: PCA finds orthogonal components that capture maximal variance, reducing dimensions while retaining most information.

Question 13. Which descriptive statistic is most affected by extreme outliers?
A) Mean
B) Median
C) Mode
D) Interquartile range
Answer: A
Explanation: The mean incorporates all values and shifts substantially when outliers are present.

Question 14. A box plot displays all the following EXCEPT:
A) Median
B) Mean
C) Quartiles
D) Outliers
Answer: B
Explanation: Traditional box plots show median, quartiles, and potential outliers, but not the mean.

Question 15. In the Apriori algorithm, the “downward closure property” states that:
A) If an itemset is frequent, all its supersets are also frequent.
B) If an itemset is infrequent, all its supersets are infrequent.
C) The support of an itemset increases with size.
D) Confidence is monotonic with respect to itemset size.
Answer: B
Explanation: The property allows pruning of candidate supersets once a subset is found infrequent.

Question 16. Which data structure is central to the FP-Growth algorithm?
A) Hash table
B) Trie
C) Frequent pattern tree (FP-tree)
D) Adjacency matrix
Answer: C
Explanation: FP-Growth builds a compact FP-tree to mine frequent patterns without candidate generation.

Question 17. For an association rule X → Y, a lift value greater than 1 indicates:
A) X and Y are independent.
B) X negatively influences Y.
C) X and Y occur together more often than expected by chance.
D) The rule has low confidence.
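
The pruning described in Question 15 can be sketched as a small level-wise search. This is a simplified illustration on made-up transactions, not an optimized Apriori implementation; the function names and the `min_count` threshold are assumptions chosen for the example.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def apriori(transactions, min_count):
    """Return {frequent itemset: support count} using level-wise search."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items
               if support_count(frozenset([i]), transactions) >= min_count]
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support_count(itemset, transactions)
        previous = set(current)
        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune step (downward closure): discard any candidate that has an
        # infrequent k-item subset, without counting its support at all.
        candidates = {c for c in candidates
                      if all(frozenset(s) in previous for s in combinations(c, k))}
        current = [c for c in candidates
                   if support_count(c, transactions) >= min_count]
        k += 1
    return frequent


frequent = apriori(transactions, min_count=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With these transactions, {bread, beer} is infrequent, so the candidate {bread, diapers, beer} is pruned before its support is ever counted.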
Question 21. Naïve Bayes assumes that:
A) All attributes are continuous.
B) Attributes are conditionally independent given the class.
C) Classes have equal prior probabilities.
D) The decision boundary is linear.
Answer: B
Explanation: The naïve independence assumption simplifies probability calculations.

Question 22. Which of the following is a characteristic of a Bayesian belief network?
A) It requires no prior probabilities.
B) Nodes represent random variables and edges encode conditional dependencies.
C) It can only model binary variables.
D) Inference is deterministic.
Answer: B
Explanation: A belief network is a directed acyclic graph where edges denote probabilistic dependencies.

Question 23. The kernel trick in SVM allows:
A) Training on categorical data without encoding.
B) Solving non-linear classification problems by implicitly mapping data to a higher-dimensional space.
C) Reducing the number of support vectors.
D) Performing feature selection automatically.
Answer: B
Explanation: Kernels compute inner products in the transformed space without performing the mapping explicitly.

Question 24. In k-Nearest Neighbors, the choice of distance metric most directly influences:
A) Model interpretability.
B) Computational complexity of training.
C) Classification boundaries.
D) Number of support vectors.
Answer: C
Explanation: The distance metric defines similarity, shaping the decision regions.

Question 25. Which activation function suffers from the “vanishing gradient” problem in deep networks?
A) ReLU
B) Sigmoid
C) Leaky ReLU
D) Softmax
Answer: B
Explanation: Sigmoid saturates for large-magnitude inputs, so its gradient there is close to zero.

Question 26. Random Forests improve over a single decision tree primarily by:
A) Using boosting to re-weight misclassified instances.
B) Averaging predictions of many de-correlated trees built on bootstrapped samples and random feature subsets.
C) Pruning each tree to the same depth.
D) Applying gradient descent on tree parameters.
Answer: B
Explanation: Bagging and random feature selection reduce variance, which makes the ensemble less prone to overfitting than a single tree.

Question 27. AdaBoost focuses on reducing:
A) Model bias only.
B) Model variance only.
C) Both bias and variance by iteratively re-weighting misclassified instances.
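
The kernel trick from Question 23 can be checked numerically on a small case. The sketch below assumes a degree-2 homogeneous polynomial kernel on 2-D inputs, chosen only because its explicit feature map is short enough to write out; the function names are illustrative.

```python
import math


def poly_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel: k(x, y) = (x . y)^2."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot ** 2


def explicit_map(x):
    """Explicit 3-D feature map whose inner product equals poly_kernel (2-D input only)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x, y = (1.0, 2.0), (3.0, 0.5)

# Implicit route: one dot product in the original 2-D space, then a square.
implicit = poly_kernel(x, y)

# Explicit route: map both points to 3-D first, then take the inner product.
explicit = sum(a * b for a, b in zip(explicit_map(x), explicit_map(y)))

print(implicit, explicit)
```

Both routes give the same value, but the kernel never constructs the higher-dimensional vectors, which is what makes kernels with very large or infinite feature spaces practical.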
Question 31. The primary objective of the k-means algorithm is to:
A) Maximize inter-cluster distance.
B) Minimize the sum of squared distances between points and their assigned cluster centroids.
C) Produce clusters of equal size.
D) Preserve hierarchical relationships.
Answer: B
Explanation: k-means iteratively updates centroids to reduce within-cluster variance.

Question 32. Which method is commonly used to determine the optimal number of clusters in k-means?
A) Silhouette analysis
B) Chi-square test
C) Pearson correlation
D) Gini impurity
Answer: A
Explanation: Silhouette scores evaluate cohesion and separation for different k values.

Question 33. In hierarchical agglomerative clustering, the “single-link” method defines the distance between two clusters as:
A) Average distance between all pairs.
B) Maximum distance between any pair.
C) Minimum distance between any pair.
D) Distance between cluster centroids.
Answer: C
Explanation: Single-link (nearest-neighbor) uses the smallest pairwise distance, which can cause chaining effects.

Question 34. DBSCAN classifies points as “core”, “border”, or “noise” based on:
A) Number of nearest neighbors within ε and a minimum points threshold.
B) Distance to the global centroid.
C) Membership in a predefined grid.
D) Hierarchical tree depth.
Answer: A
Explanation: Core points have ≥ MinPts neighbors within ε; border points lie within ε of a core point but have fewer than MinPts neighbors themselves; all other points are noise.

Question 35. The Silhouette coefficient for a data point is computed as (b – a) / max(a, b). What do “a” and “b” represent?
A) a = average intra-cluster distance, b = average nearest-cluster distance.
B) a = distance to centroid, b = distance to farthest point.
C) a = density, b = sparsity.
D) a = number of neighbors, b = number of outliers.
Answer: A
Explanation: “a” measures cohesion, “b” measures separation; the coefficient ranges from –1 to 1.

Question 36. Which outlier-detection technique relies on the concept of local density comparison?
A) Z-score
B) Mahalanobis distance
C) Local Outlier Factor (LOF)
D) Grubbs’ test
Answer: C
Explanation: LOF compares the density around a point to that of its neighbors to identify anomalies.

Question 37. In text mining, TF-IDF weighting increases the importance of a term that is:
A) Frequently occurring in the entire corpus.
B) Rare across documents but frequent within a specific document.
C) Present in every document.
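
The silhouette formula from Question 35 can be worked through on a toy example. This is a minimal sketch assuming Euclidean distance, at least two clusters, and the common convention of scoring singleton clusters as 0; the data and function names are made up for illustration.

```python
import math


def mean_distance(p, group):
    """Average Euclidean distance from point p to every point in group."""
    return sum(math.dist(p, q) for q in group) / len(group)


def silhouette(i, points, labels):
    """Silhouette coefficient (b - a) / max(a, b) for the point at index i."""
    own = [p for j, p in enumerate(points) if labels[j] == labels[i] and j != i]
    if not own:
        return 0.0  # assumed convention for a cluster with a single member
    # a: cohesion, the mean distance to the other members of the same cluster.
    a = mean_distance(points[i], own)
    # b: separation, the smallest mean distance to any other cluster.
    b = min(
        mean_distance(points[i], [p for j, p in enumerate(points) if labels[j] == lab])
        for lab in set(labels) if lab != labels[i]
    )
    return (b - a) / max(a, b)


points = [(0.0,), (1.0,), (10.0,), (11.0,)]

good = silhouette(0, points, labels=[0, 0, 1, 1])  # well-separated clusters
bad = silhouette(3, points, labels=[0, 0, 1, 0])   # last point assigned to the far cluster
print(round(good, 3), round(bad, 3))
```

For the first point, a = 1 and b = 10.5, so the score is 9.5 / 10.5, close to 1. Assigning the last point to the distant cluster swaps the roles of a and b and pushes its score below zero.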
Question 41. Differential privacy provides privacy guarantees by:
A) Removing all personally identifiable information.
B) Adding calibrated random noise to query results.
C) Encrypting the entire dataset.
D) Limiting data access to a single analyst.
Answer: B
Explanation: The added noise ensures that the presence or absence of any individual has a limited impact on outputs.

Question 42. Under GDPR, “right to be forgotten” obligates data controllers to:
A) Anonymize data permanently.
B) Delete personal data upon a valid request, unless exemptions apply.
C) Store data for at least 10 years.
D) Share data with third parties.
Answer: B
Explanation: The regulation grants individuals the ability to request erasure of their personal data.

Question 43. Which technique can mitigate algorithmic bias caused by imbalanced class distributions?
A) Using only accuracy as evaluation metric.
B) Oversampling the minority class (e.g., SMOTE).
C) Removing all minority instances.
D) Ignoring class labels during training.
Answer: B
Explanation: Synthetic Minority Over-sampling Technique (SMOTE) generates new minority samples to balance the training set.

Question 44. Hadoop’s MapReduce paradigm consists of which two main functions?
A) Sort and Join
B) Map and Reduce
C) Filter and Aggregate
D) Load and Store
Answer: B
Explanation: “Map” processes input key-value pairs, and “Reduce” aggregates intermediate results.

Question 45. Spark improves over Hadoop MapReduce primarily by:
A) Using disk-based storage for all operations.
B) Providing in-memory processing via Resilient Distributed Datasets (RDDs).
C) Requiring only a single node.
D) Eliminating the need for a scheduler.
Answer: B
Explanation: Spark’s RDDs keep data in memory across iterations, drastically speeding up iterative algorithms.

Question 46. Which NoSQL data model is best suited for storing graph-structured data?
A) Column-family store
B) Document store
C) Key-value store
D) Graph database
Answer: D
Explanation: Graph databases (e.g., Neo4j) are optimized for nodes and edges, enabling efficient traversals.

Question 47. AutoML platforms typically automate all EXCEPT:
A) Hyperparameter tuning
B) Feature engineering
C) Model deployment to production environments
D) Selection of the learning algorithm
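
The Map and Reduce roles from Question 44 can be imitated in a single process. The sketch below is a word count in plain Python, not Hadoop code: the explicit `shuffle` step stands in for the grouping that the framework performs between the two phases, and all names and documents are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate all values emitted for one key."""
    return key, sum(values)


documents = ["spark and hadoop", "hadoop map reduce", "map and reduce"]

intermediate = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(intermediate).items())
print(counts)
```

Because each call to `map_phase` looks at one record and each call to `reduce_phase` looks at one key, both phases can be spread across machines, which is the point of the paradigm.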
Question 51. In a support vector machine with a radial basis function (RBF) kernel, the parameter γ controls:
A) The margin width.
B) The degree of the polynomial kernel.
C) The influence of a single training example.
D) The number of support vectors.
Answer: C
Explanation: γ determines the spread of the RBF; higher γ leads to tighter influence around each point.

Question 52. Which evaluation metric is most appropriate for imbalanced binary classification when the cost of false negatives is high?
A) Accuracy
B) Precision
C) Recall
D) Specificity
Answer: C
Explanation: Recall emphasizes correctly identifying positive cases, minimizing false negatives.

Question 53. In hierarchical clustering, the “complete-link” method defines inter-cluster distance as:
A) Minimum pairwise distance.
B) Average pairwise distance.
C) Maximum pairwise distance.
D) Distance between centroids.
Answer: C
Explanation: Complete-link uses the farthest points, producing compact, spherical clusters.
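
The effect of γ described in Question 51 can be seen by evaluating the RBF kernel directly. This sketch assumes the usual form k(x, y) = exp(-γ‖x − y‖²); the two points and the γ values are arbitrary choices for illustration.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel k(x, y) = exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


x, y = (0.0, 0.0), (1.0, 1.0)  # squared distance between the points is 2

# Similarity of the same pair of points under increasing gamma.
similarities = {gamma: rbf_kernel(x, y, gamma) for gamma in (0.1, 1.0, 10.0)}
for gamma, value in similarities.items():
    print(f"gamma={gamma}: k(x, y)={value:.6f}")
```

At a fixed distance the kernel value falls as γ grows, so with a large γ each training example influences only points very close to it.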
Question 54. Which of the following is a key advantage of using the “silhouette” method over the “elbow” method for cluster validation?
A) It does not require a distance metric.
B) It provides a per-point measure of clustering quality.
C) It works only with density-based algorithms.
D) It automatically determines the optimal number of clusters.
Answer: B
Explanation: Silhouette scores evaluate each point’s cohesion and separation, offering detailed insight.

Question 55. In DBSCAN, decreasing ε while keeping MinPts constant will generally:
A) Increase the number of identified clusters.
B) Decrease the number of core points, potentially labeling more points as noise.
C) Have no effect on clustering results.
D) Convert DBSCAN into k-means.
Answer: B
Explanation: A smaller ε reduces neighborhood size, causing fewer points to satisfy the core-point condition.

Question 56. The “curse of dimensionality” most directly affects which of the following?
A) Model interpretability
B) Distance-based algorithms (e.g., k-NN, DBSCAN) because distances become less discriminative.
C) Data storage costs only.
D) Decision-tree depth.
Answer: B
Explanation: In high dimensions, all points tend to be equally far apart, weakening similarity measures.

Question 57. Which of the following is NOT a typical step in text preprocessing?
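
The ε effect from Question 55 can be observed by counting core points on a toy dataset. This sketch only applies the core-point test and does not perform the clustering itself; it assumes Euclidean distance and the convention that a point counts as its own neighbor. The data and function name are illustrative.

```python
import math


def count_core_points(points, eps, min_pts):
    """Count points that have at least min_pts points (itself included) within eps."""
    core = 0
    for p in points:
        neighbours = sum(1 for q in points if math.dist(p, q) <= eps)
        if neighbours >= min_pts:
            core += 1
    return core


# Four points on a unit square plus one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]

wide = count_core_points(points, eps=1.5, min_pts=3)
narrow = count_core_points(points, eps=0.5, min_pts=3)
print(wide, narrow)
```

With ε = 1.5 the four square corners are all core points; shrinking ε to 0.5 leaves every point alone in its neighborhood, so none qualifies and all of them would end up as noise.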
C) An itemset that appears in every transaction.
D) An itemset with lift equal to 1.
Answer: B
Explanation: Closed itemsets capture maximal groups of items sharing identical support, reducing redundancy.

Question 61. Which method can be used to discretize a continuous attribute while preserving class information?
A) Equal-width binning
B) Equal-frequency binning
C) Entropy-based (MDL) discretization
D) Random binning
Answer: C
Explanation: Entropy-based discretization selects cut points that minimize class entropy within bins.

Question 62. In a star schema, the fact table typically contains:
A) Primary keys of dimension tables only.
B) Foreign keys to dimensions and measurable metrics (measures).
C) Only textual attributes.
D) Normalized data for all dimensions.
Answer: B
Explanation: Fact tables store foreign keys linking to dimensions and numeric measures for analysis.

Question 63. Which of the following is a common cause of “data drift” in production machine-learning systems?
A) Changing hardware specifications.
B) Evolution of the underlying data distribution over time.
C) Increase in model training time.
D) Use of batch normalization.
Answer: B
Explanation: Data drift occurs when the statistical properties of input data shift, degrading model performance.

Question 64. The “Lift” of a rule X → Y can be interpreted as:
A) The ratio of confidence to support.
B) How many times more often X and Y occur together than expected if they were independent.
C) The probability of X given Y.
D) The difference between support and confidence.
Answer: B
Explanation: Lift = confidence(X → Y) / support(Y), quantifying deviation from independence.

Question 65. Which evaluation technique is most appropriate for time-series forecasting?
A) Random k-fold cross-validation
B) Leave-one-out cross-validation
C) Rolling-origin (walk-forward) validation
D) Hold-out with shuffled data
Answer: C
Explanation: Rolling-origin respects temporal order, training on past data and testing on future points.

Question 66. In a fuzzy association-rule mining context, the support of an itemset is computed using:
A) Binary presence/absence only.
B) Minimum membership degree of items in each transaction.
C) Average transaction length.
D) Standard deviation of item frequencies.
Answer: B
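
The rolling-origin scheme from Question 65 can be sketched as an index generator. This is one possible variant, assuming an expanding training window and a fixed forecast horizon; the function name and parameters are illustrative and not taken from any library.

```python
def rolling_origin_splits(n_samples, initial_train, horizon=1):
    """Yield (train_indices, test_indices) pairs that respect temporal order."""
    origin = initial_train
    while origin + horizon <= n_samples:
        # Train on everything before the origin, test on the next `horizon` points.
        yield list(range(origin)), list(range(origin, origin + horizon))
        origin += horizon


splits = list(rolling_origin_splits(n_samples=6, initial_train=3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Every test index is later than every training index in the same split, which is the property that shuffled k-fold or shuffled hold-out validation would break.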