




























































































This certification exam guide focuses on predictive modeling and data-driven forecasting techniques. It covers statistical analysis, machine learning fundamentals, model validation, and business applications. Candidates gain skills to interpret data trends and support informed strategic decision-making.





























































































Question 1. Which of the following best describes the primary purpose of stakeholder requirement gathering in the business problem framing stage?
A) To select the most advanced machine‑learning algorithm
B) To define the project’s scope, objectives, and success criteria
C) To clean and preprocess the raw data
D) To deploy the model into production
Answer: B
Explanation: Stakeholder requirement gathering is used to understand what the business wants to achieve, set clear objectives, and establish measurable success criteria (KPIs) before any technical work begins.

Question 2. In distinguishing symptoms from root causes, which analytical approach is most appropriate?
A) Descriptive statistics on raw data
B) Conducting a Root Cause Analysis (RCA) using techniques like the 5 Whys
C) Randomly sampling data points for model training
D) Deploying a pretrained neural network
Answer: B
Explanation: RCA methods such as the 5 Whys help trace observed symptoms back to underlying root causes, enabling a focused analytics solution.

Question 3. When establishing project constraints, which factor is least likely to be a limitation?
A) Budgetary ceiling
B) Availability of relevant data
C) Desired model interpretability
D) The color scheme of the final dashboard
Answer: D
Explanation: While visual design matters, it does not constrain the core analytics project; budget, data, and interpretability are typical constraints.

Question 4. An initial benefit assessment estimates a 15% increase in revenue after implementing a predictive churn model. Which metric best captures this estimate?
A) Accuracy
B) Return on Investment (ROI)
C) Mean Absolute Error (MAE)
D) Confusion matrix
Answer: B
Explanation: ROI quantifies the financial benefit relative to the cost of the analytics solution, making it the appropriate metric for benefit assessment.

Question 5. Which type of analytics is most suitable when a company wants to understand why sales dropped in the last quarter?
A) Descriptive
B) Diagnostic
C) Predictive
D) Prescriptive
Answer: B
C) Precision at top‑k
D) Mean Absolute Percentage Error
Answer: C
Explanation: Precision at top‑k evaluates how many of the highest‑scored applicants (the top‑k) are actually high‑risk, directly linking to the KPI.

Question 9. Which assumption is critical when applying ARIMA models to time‑series data?
A) Data must be normally distributed
B) The series is stationary (constant mean and variance)
C) All variables are categorical
D) The dataset contains no missing values
Answer: B
Explanation: The ARMA part of an ARIMA model assumes a stationary series; non‑stationary data must first be differenced (the “I” term) or otherwise transformed before modeling.

Question 10. In the context of model building, what does the term “overfitting” refer to?
A) A model that performs equally well on training and unseen data
B) A model that captures noise in the training data, reducing generalization
C) Using too few features in the model
D) Deploying the model without validation
Answer: B
Explanation: Overfitting occurs when a model learns patterns specific to the training set, including noise, leading to poor performance on new data.

Question 11. Which data source is considered “external” for most predictive analytics projects?
A) Company’s ERP system
B) Customer transaction logs
C) Public weather API
D) Internal employee performance database
Answer: C
Explanation: External data originates outside the organization, such as a weather API, whereas the others are internal sources.

Question 12. When handling missing values in a dataset, which approach is most appropriate for a categorical variable with a large number of unique levels?
A) Mean imputation
B) Median imputation
C) Mode imputation
D) Creating a separate “Missing” category
Answer: D
Explanation: For high‑cardinality categorical variables, assigning a distinct “Missing” category preserves information without biasing the distribution.

Question 13. Which technique is best suited for detecting outliers in a univariate numeric feature?
Answer: C
Explanation: Scatter plots display the relationship between two continuous variables, making correlation patterns visible.

Question 16. When scaling numeric features, which method preserves the original distribution shape while limiting values to a specific range?
A) Min‑Max scaling
B) Standardization (z‑score)
C) Log transformation
D) Rank transformation
Answer: A
Explanation: Min‑Max scaling rescales values to a defined range (e.g., 0–1) while maintaining the original distribution shape.

Question 17. Which supervised learning algorithm is most appropriate for predicting a binary outcome with a highly imbalanced class distribution?
A) Linear Regression
B) Logistic Regression with class weighting
C) K‑Means clustering
D) Principal Component Analysis
Answer: B
Explanation: Logistic Regression can incorporate class weights to compensate for imbalance, improving minority class prediction.

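
The class weighting mentioned in Question 17 can be illustrated with a small sketch. It assumes the common "balanced" heuristic, weight = n_samples / (n_classes * class_count), which is one of several possible weighting schemes; the function name `balanced_class_weights` and the 95/5 label split are hypothetical and chosen only for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced target: 95 negatives (0) and 5 positives (1).
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)

# The rare class receives the larger weight, so each class contributes
# the same total weight to a weighted loss function.
print(weights)
```

With these weights, the 5 positive examples carry the same combined weight as the 95 negative ones, which is the effect a class-weighted logistic regression relies on.
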
Question 18. In unsupervised learning, which algorithm groups data based on centroid proximity?
A) Decision Tree
B) K‑Means clustering
C) Random Forest
D) Naïve Bayes
Answer: B
Explanation: K‑Means partitions data into K clusters by minimizing distances to cluster centroids.

Question 19. Which model selection criterion explicitly balances model fit with complexity to avoid overfitting?
A) Accuracy
B) R‑squared
C) Akaike Information Criterion (AIC)
D) Confusion matrix
Answer: C
Explanation: AIC penalizes model likelihood based on the number of parameters, encouraging parsimonious models.

Question 20. When choosing a tool for large‑scale data processing, which platform is specifically designed for distributed computation?
A) Microsoft Excel
B) SAS Base
C) Apache Spark
Question 23. Which validation technique provides an estimate of model performance that is less sensitive to a single train‑test split?
A. Hold‑out validation
B. Leave‑one‑out cross‑validation
C. K‑fold cross‑validation
D. Randomized hold‑out
Answer: C
Explanation: K‑fold cross‑validation averages performance over multiple folds, reducing variance caused by a single split.

Question 24. In hyperparameter tuning, which method systematically evaluates all possible combinations within a predefined grid?
A) Random Search
B) Grid Search
C) Bayesian Optimization
D) Gradient Descent
Answer: B
Explanation: Grid Search exhaustively tests each combination of hyperparameters defined in the grid.

Question 25. Which metric is most appropriate for evaluating a regression model where large errors are particularly undesirable?
A) Mean Absolute Error (MAE)
B) Root Mean Squared Error (RMSE)
C) R‑squared
D) Accuracy
Answer: B
Explanation: RMSE penalizes larger errors more heavily than MAE due to squaring, making it suitable when large deviations are costly.

Question 26. A confusion matrix shows high true‑negative count but low true‑positive count. Which business implication is most likely?
A) Model is over‑predicting the positive class
B) Model is missing many actual positives (low recall)
C) Model has perfect precision
D) Model is balanced across classes
Answer: B
Explanation: Low true‑positives indicate the model fails to identify many actual positive cases, leading to low recall.

Question 27. Which curve visualizes the trade‑off between true‑positive rate and false‑positive rate across thresholds?
A) Precision‑Recall curve
B) ROC curve
C) Lift chart
D) Calibration plot
Answer: B
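
The contrast between MAE and RMSE drawn in Question 25 can be checked with a short worked example. This is a minimal sketch on made-up numbers: the two hypothetical forecasts below have the same total absolute error, but one concentrates it in a single large miss.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical values, for illustration only.
actual = [10.0, 10.0, 10.0, 10.0]
evenly_off = [11.0, 11.0, 11.0, 11.0]    # four errors of size 1
one_big_miss = [10.0, 10.0, 10.0, 14.0]  # a single error of size 4

print(mae(actual, evenly_off), rmse(actual, evenly_off))      # 1.0 1.0
print(mae(actual, one_big_miss), rmse(actual, one_big_miss))  # 1.0 2.0
```

MAE cannot tell the two forecasts apart, while RMSE doubles for the forecast with the single large error, which is why RMSE is preferred when large deviations are costly.
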
C. Visualizing key insights with a dashboard and narrative captions
D. Sharing a PDF of the full statistical report
Answer: C
Explanation: A dashboard combined with concise narrative helps executives grasp insights quickly.

Question 31. Model drift refers to:
A. The initial training error decreasing over time
B. The gradual degradation of model performance due to changing data patterns
C. The model’s parameters becoming unstable during training
D. The model’s ability to adapt automatically to new data
Answer: B
Explanation: Model drift occurs when the statistical properties of input data shift, causing performance decline.

Question 32. Which practice helps mitigate model drift in production?
A. Ignoring new data after deployment
B. Scheduling periodic retraining with recent data
C. Hard‑coding model coefficients
D. Using only static features
Answer: B
Explanation: Regularly retraining the model on fresh data helps it stay aligned with current patterns.

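
Questions 31 and 32 describe drift as a decline in performance that retraining addresses. One simple way to decide when to retrain is to compare recent prediction errors against a baseline. The sketch below assumes a hypothetical rule (retrain when the recent mean absolute error exceeds the baseline by more than 20%); the function name `needs_retraining`, the `tolerance` parameter, and all error values are illustrative assumptions, not a standard.

```python
from statistics import mean


def needs_retraining(baseline_errors, recent_errors, tolerance=0.20):
    """Return True when recent MAE exceeds baseline MAE by more than `tolerance` (relative)."""
    baseline_mae = mean([abs(e) for e in baseline_errors])
    recent_mae = mean([abs(e) for e in recent_errors])
    return recent_mae > baseline_mae * (1 + tolerance)


# Hypothetical prediction errors (actual minus predicted).
baseline_errors = [1.0, -2.0, 1.5, -1.5]  # MAE = 1.5 at deployment time
stable_week = [1.0, -1.5, 2.0, -1.5]      # MAE = 1.5, no degradation
drifted_week = [3.0, -2.5, 3.5, -3.0]     # MAE = 3.0, errors have doubled

print(needs_retraining(baseline_errors, stable_week))   # False
print(needs_retraining(baseline_errors, drifted_week))  # True
```

In practice the threshold and the monitoring window would be chosen per use case; a fixed retraining schedule, as in Question 32, is the simpler alternative when monitoring is not available.
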
Question 33. GDPR primarily regulates:
A. Model interpretability standards
B. Data privacy and individuals’ rights over personal data
C. The computational complexity of algorithms
D. Open‑source licensing
Answer: B
Explanation: The General Data Protection Regulation governs the collection, processing, and storage of personal data in the EU.

Question 34. Which technique can reduce bias introduced by an imbalanced training dataset?
A. Downsampling the majority class
B. Ignoring the minority class
C. Using only continuous variables
D. Applying a higher learning rate
Answer: A
Explanation: Downsampling balances class representation, helping the model learn the minority class better.

Question 35. Documentation of a predictive model should NOT include:
A. Data lineage and source description
B. Hyperparameter settings used in training
C. Personal opinions about the model’s market potential
Question 38. In the context of predictive analytics, “prescriptive analytics” primarily adds:
A. Historical data summarization
B. Recommendations for optimal actions based on predictions
C. Simple correlation analysis
D. Data cleaning procedures
Answer: B
Explanation: Prescriptive analytics goes beyond prediction to suggest the best course of action.

Question 39. Which of the following is an example of a “symptom” rather than a “root cause” in a supply‑chain context?
A. Frequent stockouts of a specific SKU
B. Inefficient demand forecasting algorithm
C. Lack of real‑time inventory visibility
D. High vendor lead‑time variability
Answer: A
Explanation: Stockouts are observable outcomes (symptoms); the underlying forecasting or visibility issues are root causes.

Question 40. Which metric would you use to evaluate a model that predicts the exact dollar amount of next month’s sales?
A) Accuracy
B) RMSE
C) Precision
Answer: B
Explanation: RMSE measures the average magnitude of prediction errors for continuous targets like sales dollars.

Question 41. Which preprocessing step is essential before applying a distance‑based algorithm such as K‑Nearest Neighbors?
A) One‑hot encoding of categorical variables only
B) Scaling numeric features to a comparable range
C) Removing all outliers
D) Converting all variables to binary
Answer: B
Explanation: Distance calculations are sensitive to feature scales; scaling ensures each variable contributes proportionally.

Question 42. In time‑series forecasting, which component captures the regular pattern that repeats over a fixed period?
A) Trend
B) Seasonality
C) Noise
D) Autocorrelation
Answer: B
Explanation: Seasonality represents repeating patterns (e.g., monthly, weekly) in the data.

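
The point of Question 41, that unscaled features distort distance calculations, can be seen in a small nearest-neighbor sketch. The customer records (income in dollars, age in years) and the helper names `min_max_scale` and `nearest` are hypothetical; Min‑Max scaling is used here as one possible choice of scaler.

```python
import math


def min_max_scale(rows, query):
    """Rescale each column to [0, 1] using the min and max observed in `rows`."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid division by zero

    def scale(row):
        return [(v - lo) / s for v, lo, s in zip(row, lows, spans)]

    return [scale(r) for r in rows], scale(query)


def nearest(rows, query):
    """Index of the row with the smallest Euclidean distance to `query`."""
    return min(range(len(rows)), key=lambda i: math.dist(rows[i], query))


# Hypothetical customers: [annual income in dollars, age in years].
rows = [[50000, 25], [50500, 60], [90000, 40]]
query = [51000, 26]

raw_neighbor = nearest(rows, query)
scaled_rows, scaled_query = min_max_scale(rows, query)
scaled_neighbor = nearest(scaled_rows, scaled_query)

print(raw_neighbor)     # 1: income dominates, the 34-year age gap is ignored
print(scaled_neighbor)  # 0: after scaling, the customer of similar age is nearest
```

On the raw data, a 500-dollar income difference outweighs a 34-year age difference; after scaling, both features contribute on a comparable range and the nearest neighbor changes.
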
C) They require no hyperparameter tuning
D) They always produce linear decision boundaries
Answer: B
Explanation: Ensembles combine multiple models to lower variance and often achieve higher predictive accuracy.

Question 46. In a classification problem with three classes, which averaging method is appropriate for computing a single F1‑score?
A) Macro averaging
B) Micro averaging
C) Weighted averaging
D) All of the above are possible, depending on the focus
Answer: D
Explanation: All three averaging strategies can be used; macro treats all classes equally, micro aggregates contributions, weighted accounts for class frequency.

Question 47. Which data transformation is commonly applied to right‑skewed monetary variables to approximate normality?
A) Square root transformation
B) Log transformation
C) Z‑score standardization
D) Min‑Max scaling
Answer: B
Explanation: Log transformation compresses large values, reducing right skewness and often improving model performance.

Question 48. When encoding a categorical variable with an inherent order (e.g., “Low”, “Medium”, “High”), which encoding technique is most suitable?
A) One‑hot encoding
B) Label encoding preserving order
C) Binary encoding
D) Frequency encoding
Answer: B
Explanation: Ordinal (label) encoding maintains the natural order, allowing models to interpret the relative magnitude.

Question 49. Which of the following best describes “data provenance”?
A) The process of normalizing data
B) Tracking the origins, lineage, and transformations of data
C) Visualizing data distributions
D) Encrypting data for security
Answer: B
Explanation: Data provenance records where data came from and how it has been processed, essential for reproducibility and compliance.

Question 50. In a predictive maintenance scenario, which type of analytics would you primarily use to schedule service before a failure occurs?
A) Descriptive