This practice exam for predictive analytics and data mining certification features 29 multiple-choice questions. It covers key concepts such as CRISP-DM, KDD, analytics types (descriptive, diagnostic, predictive, prescriptive), data types (nominal, ordinal, interval, ratio), unstructured data, ETL, imputation, outlier identification, data transformation (Box-Cox), encoding (one-hot encoding), scaling, PCA, feature selection, the normal distribution, hypothesis testing, Bayesian concepts, data visualization, regression, and classification trees. Each question includes a detailed explanation, making it a useful resource for exam preparation and for reviewing the fundamentals of data mining and predictive analytics.
Typology: Exams

Question 1. Which phase of the CRISP‑DM methodology focuses on understanding business objectives and converting them into a data‑driven problem?
A) Data Preparation
B) Business Understanding
C) Modeling
D) Deployment
Answer: B
Explanation: The Business Understanding phase translates business goals into a data mining problem, defines success criteria, and creates a project plan.

Question 2. In the KDD process, which step directly follows data cleaning and integration?
A) Data Selection
B) Data Transformation
C) Data Mining
D) Evaluation
Answer: B
Explanation: After cleaning and integrating data, the next step is transforming it into suitable formats (e.g., normalization, aggregation) for mining.

Question 3. Which analytics type is primarily concerned with answering “What will happen next?”
A) Descriptive
B) Diagnostic
C) Predictive
D) Prescriptive
Answer: C
Explanation: Predictive analytics uses historical data and statistical models to forecast future events.

Question 4. A bank wants to predict customer churn. The variable indicating whether a customer left is the:
A) Feature
B) Target variable
C) Predictor variable
D) Independent variable
Answer: B
Explanation: The target (or dependent) variable is the outcome the model is trained to predict, in this case churn.

Question 5. Which error type is most costly for a fraud detection system that blocks legitimate transactions?
A) Type I error (false positive)
B) Type II error (false negative)
C) Both are equally costly
D) Neither is costly
Answer: A
Explanation: A false positive blocks a legitimate transaction, causing customer dissatisfaction and potential revenue loss.

Question 6. Which data type best describes “Customer satisfaction rating on a scale of 1‑5”?
A) Nominal
B) Ordinal
C) K‑Nearest Neighbors imputation
D) Listwise deletion
Answer: C
Explanation: K‑NN imputation predicts missing values based on similar records, leveraging more data structure.

Question 10. An outlier is identified using the IQR method. Which rule defines an extreme outlier?
A) Value < Q1 – 1.5·IQR or > Q3 + 1.5·IQR
B) Value < Q1 – 3·IQR or > Q3 + 3·IQR
C) Z‑score > 2 or < –2
D) Value > 99th percentile
Answer: B
Explanation: Extreme outliers are those beyond 3·IQR from the quartiles; the 1.5·IQR rule defines regular outliers.

Question 11. Applying a Box‑Cox transformation is most appropriate when:
A) Data contain categorical variables
B) Data are already normally distributed
C) Data are positively skewed
D) Data have missing values
Answer: C
Explanation: Box‑Cox, which requires strictly positive values, can reduce positive skewness and help achieve normality.

Question 12. Which encoding technique can cause the “dummy variable trap” if not handled properly?
A) One‑hot encoding
B) Label encoding
C) Target encoding
D) Frequency encoding
Answer: A
Explanation: One‑hot encoding with one column per category makes the dummy columns sum to 1 in every row, so they are perfectly collinear with the intercept; dropping one dummy avoids the trap.

Question 13. Which scaling method is robust to outliers?
A) Min‑Max scaling
B) Z‑score standardization
C) Robust scaling (median and IQR)
D) Decimal scaling
Answer: C
Explanation: Robust scaling uses median and IQR, reducing outlier influence.

Question 14. Principal Component Analysis (PCA) primarily aims to:
A) Increase the number of features
B) Reduce dimensionality while preserving variance
C) Select the most correlated features
D) Encode categorical variables
Answer: B
Explanation: PCA creates orthogonal components that capture the maximum variance with fewer dimensions.
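
The IQR fences from Question 10 and the scaling contrast from Question 13 can be illustrated with a small sketch. The data values and function names below are hypothetical, and the quartiles use Python's "inclusive" convention, which is only one of several common definitions; other conventions give slightly different fences.

```python
from statistics import median, quantiles


def quartiles(xs):
    """Return (Q1, Q3); the 'inclusive' method is an assumed convention."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    return q1, q3


def classify_outlier(x, q1, q3):
    """Label x using the 1.5*IQR (mild) and 3*IQR (extreme) fences."""
    iqr = q3 - q1
    if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
        return "extreme"
    if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
        return "mild"
    return "none"


def min_max_scale(xs):
    """Map values to [0, 1] using the minimum and maximum."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def robust_scale(xs):
    """Center on the median and divide by the IQR."""
    q1, q3 = quartiles(xs)
    med = median(xs)
    return [(x - med) / (q3 - q1) for x in xs]


values = [10, 12, 14, 16, 18, 1000]  # hypothetical data with one outlier
q1, q3 = quartiles(values)
labels = [classify_outlier(x, q1, q3) for x in values]
mm = min_max_scale(values)
rb = robust_scale(values)

print(labels)
print([round(v, 4) for v in mm])
print([round(v, 2) for v in rb])
```

With this data, min-max scaling squeezes the five typical values into a sliver near 0 because the single outlier sets the range, whereas robust scaling keeps them spread between -1 and 0.6.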
Question 18. A two‑tailed t‑test with a p‑value of 0.03 indicates:
A) Fail to reject the null hypothesis at α = 0.
B) Reject the null hypothesis at α = 0.
C) Reject the null hypothesis at α = 0.05
D) No conclusion can be drawn
Answer: C
Explanation: p = 0.03 < 0.05, so the null hypothesis is rejected at the 5% significance level.

Question 19. Which test is appropriate for comparing the means of three independent groups?
A) Paired t‑test
B) One‑way ANOVA
C) Chi‑square test
D) Wilcoxon rank‑sum test
Answer: B
Explanation: One‑way ANOVA assesses whether at least one group mean differs from the others.

Question 20. In a chi‑square test for independence, a large χ² statistic relative to the critical value suggests:
A) Variables are independent
B) Variables are dependent
C) Sample size is too small
D) Data are normally distributed
Answer: B
Explanation: A large χ² indicates observed frequencies deviate significantly from expected frequencies, implying dependence.

Question 21. Which Bayesian concept updates prior beliefs with new evidence?
A) Likelihood function
B) Posterior probability
C) P‑value
D) Confidence interval
Answer: B
Explanation: The posterior combines the prior distribution and the likelihood of observed data.

Question 22. A histogram is most suitable for visualizing:
A) Categorical frequencies
B) Distribution of a continuous variable
C) Correlation between two variables
D) Time‑series trends
Answer: B
Explanation: Histograms display the frequency of continuous data across bins.

Question 23. In a scatter plot, a strong linear pattern indicates:
A) High variance
B) Strong correlation
C) Non‑linear relationship
D) Presence of outliers
Answer: B
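
As an illustration of the posterior update described in Question 21, the following minimal sketch applies Bayes' rule to a binary hypothesis. The prior, the detection rate, and the false-positive rate are hypothetical numbers chosen for the example, not values from the exam.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Bayes' rule for a binary hypothesis H given one positive observation.

    prior               = P(H)
    likelihood          = P(positive | H)
    false_positive_rate = P(positive | not H)
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical inputs: 1% prior, 95% detection rate, 5% false-positive rate.
p1 = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)

# A second independent positive observation: the old posterior is the new prior.
p2 = posterior(prior=p1, likelihood=0.95, false_positive_rate=0.05)

print(round(p1, 4))
print(round(p2, 4))
```

Feeding the first posterior back in as the prior for the second observation is what "updating beliefs with new evidence" means in practice.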
Explanation: Lasso (L1 penalty) can shrink coefficients to zero, performing variable selection.

Question 27. In logistic regression, the odds ratio for a predictor of 1.5 means:
A) The predictor increases odds by 50% per unit increase
B) The probability increases by 1.
C) The log‑odds increase by 1.
D) The predictor has no effect
Answer: A
Explanation: An odds ratio of 1.5 indicates a 50% increase in odds for each unit increase in the predictor.

Question 28. Which impurity measure is used by the CART algorithm for classification trees?
A) Entropy
B) Gini impurity
C) Information gain ratio
D) Chi‑square
Answer: B
Explanation: CART uses Gini impurity (or variance for regression) to decide splits.

Question 29. Pruning a decision tree primarily helps to:
A) Increase depth
B) Reduce overfitting
C) Add more features
D) Convert it to a linear model
Answer: B
Explanation: Pruning removes branches that do not improve validation performance, mitigating overfitting.

Question 30. Random Forests achieve variance reduction by:
A) Boosting weak learners sequentially
B) Using a single deep tree
C) Bagging multiple decorrelated trees
D) Applying gradient descent
Answer: C
Explanation: Random Forests train many trees on bootstrap samples with random feature subsets, reducing variance.

Question 31. In Gradient Boosting, each new tree is trained to predict:
A) The original target variable
B) The residual errors of the previous ensemble
C) Random noise
D) Feature importance scores
Answer: B
Explanation: Boosting fits each successive learner to the residuals (errors) of the current model to improve performance. The residuals are the fitting target under squared‑error loss; more generally, each learner fits the negative gradient of the loss.

Question 32. Which kernel function maps data into an infinite‑dimensional space?
A) Linear kernel
B) Polynomial kernel
C) Radial Basis Function (RBF) kernel
D) Sigmoid kernel
D) The clustering algorithm speed
Answer: B
Explanation: Plotting within‑cluster sum of squares versus k shows an “elbow” point indicating diminishing returns.

Question 36. Silhouette score values close to 1 indicate:
A) Overlapping clusters
B) Well‑separated, cohesive clusters
C) Poor clustering structure
D) Random assignment of points
Answer: B
Explanation: High silhouette values mean each point is close to its own cluster and far from others.

Question 37. In hierarchical agglomerative clustering, the “complete linkage” criterion defines cluster distance as:
A) Minimum distance between any two points in the clusters
B) Maximum distance between any two points in the clusters
C) Average distance between all point pairs
D) Distance between cluster centroids
Answer: B
Explanation: Complete linkage uses the farthest pairwise distance, producing compact clusters.

Question 38. DBSCAN can discover clusters of arbitrary shape because it:
A) Uses k‑means centroids
B) Relies on density reachability and connectivity
C) Requires pre‑specifying the number of clusters
D) Optimizes a global variance criterion
Answer: B
Explanation: DBSCAN groups points based on dense neighborhoods, allowing non‑convex shapes.

Question 39. An internal clustering evaluation metric that does NOT require ground‑truth labels is:
A) Adjusted Rand Index
B) Purity
C) Silhouette coefficient
D) F‑measure
Answer: C
Explanation: Silhouette coefficient assesses cohesion and separation using only the data and cluster assignments.

Question 40. In market basket analysis, the “support” of an itemset represents:
A) The probability that the rule is correct
B) The proportion of transactions containing the itemset
C) The lift value relative to independence
D) The confidence of the rule
Answer: B
Explanation: Support is the fraction of transactions in the dataset that contain the itemset.

Question 41. Which of the following is a limitation of the Apriori algorithm?
A) It cannot handle continuous variables
Question 44. Which cross‑validation technique provides the most unbiased estimate of model performance for small datasets?
A) Hold‑out validation
B) 5‑fold cross‑validation
C) Leave‑One‑Out Cross‑Validation (LOOCV)
D) Bootstrap validation
Answer: C
Explanation: LOOCV uses every observation once as a test set, minimizing bias for limited data.

Question 45. When dealing with imbalanced classes, which metric is more informative than overall accuracy?
A) Mean Squared Error
B) F1‑score
C) R²
D) Adjusted R²
Answer: B
Explanation: F1‑score balances precision and recall, highlighting performance on minority classes.

Question 46. In time‑series forecasting, the “seasonality” component refers to:
A) Random noise
B) Long‑term trend
C) Repeating patterns at fixed intervals
D) Exogenous variables
Answer: C
Explanation: Seasonality captures systematic, periodic fluctuations (e.g., monthly sales peaks).
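
The point of Question 45 can be seen in a small worked example. This is a minimal sketch with hypothetical labels (5 positives among 100 cases) and two hypothetical classifiers; the function names are illustrative and do not come from any particular library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); defined here as 0.0 when there are no positives at all."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical imbalanced data: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Classifier 1 always predicts the majority class.
always_negative = [0] * 100

# Classifier 2 finds 4 of the 5 positives but raises 6 false alarms.
detector = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89

for name, pred in [("always_negative", always_negative), ("detector", detector)]:
    print(name, round(accuracy(y_true, pred), 3), round(f1_score(y_true, pred), 3))
```

The majority-class predictor scores higher on accuracy (0.95 versus 0.93) while finding no positives at all, which its F1 of 0 makes visible.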
Question 47. Which method decomposes a time series into trend, seasonal, and residual components?
A) ARIMA
B) Exponential Smoothing
C) STL (Seasonal‑Trend decomposition using Loess)
D) Prophet
Answer: C
Explanation: STL separates the three components via locally weighted regression (LOESS).

Question 48. In an ARIMA(p,d,q) model, the parameter d stands for:
A) Number of autoregressive terms
B) Degree of differencing to achieve stationarity
C) Number of moving‑average terms
D) Seasonal period
Answer: B
Explanation: d indicates how many times the series is differenced to remove trends.

Question 49. Which regularization technique is appropriate when you suspect many irrelevant features but also want to keep correlated predictors?
A) Lasso (L1)
B) Ridge (L2)
C) Elastic Net (L1 + L2)
D) No regularization
Answer: C
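
The role of d in Question 48 can be illustrated with a short sketch. The `difference` helper and the two toy series are hypothetical; the example only shows how repeated differencing removes polynomial trends and is not a full ARIMA fit.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: y[t] - y[t-1]."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


linear = [3, 5, 7, 9, 11]                # linear trend
quadratic = [t * t for t in range(6)]    # 0, 1, 4, 9, 16, 25

print(difference(linear, d=1))      # constant after one difference
print(difference(quadratic, d=1))   # still trending after one difference
print(difference(quadratic, d=2))   # constant after two differences
```

A linear trend disappears after one difference (d = 1), while a quadratic trend needs two (d = 2). Each round of differencing also shortens the series by one observation.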
Answer: B
Explanation: Concept drift occurs when the underlying relationship between features and target evolves, degrading model performance.

Question 53. Which of the following best describes “over‑sampling” in handling class imbalance?
A) Removing majority class observations
B) Replicating minority class observations
C) Adding noise to all features
D) Reducing dimensionality
Answer: B
Explanation: Over‑sampling adds minority class samples, either by replicating existing ones or by synthesizing new ones (e.g., SMOTE), to balance the dataset.

Question 54. The “bias‑variance trade‑off” implies that:
A) Reducing bias always reduces variance
B) A model with low bias will always have low variance
C) Decreasing bias typically increases variance, and vice versa
D) Bias and variance are unrelated
Answer: C
Explanation: Improving model fit (lower bias) often makes it more sensitive to training data (higher variance).

Question 55. Which performance metric is appropriate for evaluating a regression model when large errors are particularly undesirable?
A) Mean Absolute Error (MAE)
B) Mean Squared Error (MSE)
D) Accuracy
Answer: B
Explanation: MSE squares errors, penalizing larger deviations more heavily than MAE.

Question 56. In a business context, a “lift chart” is used to:
A) Compare model accuracy across different algorithms
B) Visualize the improvement of a predictive model over random selection
C) Display residuals distribution
D) Show ROC curves for multiple thresholds
Answer: B
Explanation: Lift charts illustrate how much better a model is at identifying positive cases compared to random guessing.

Question 57. Which of the following is NOT a typical step in feature selection using a wrapper method?
A) Training a model on a subset of features
B) Evaluating performance on validation data
C) Ranking features by correlation with the target
D) Iteratively adding/removing features based on model score
Answer: C
Explanation: Correlation ranking is a filter technique; wrappers involve model training and evaluation.

Question 58. In a data warehouse, “star schema” is characterized by:
A) Multiple many‑to‑many relationships