






























An intermediate-level certification resource covering data strategy, analytics frameworks, and consulting methodologies. The guide blends conceptual learning with applied case studies, exam simulations, and review exercises designed to elevate consulting proficiency.
Typology: Exams































Question 1. Which phase of CRISP-DM focuses on translating business objectives into data-specific goals?
A) Data Understanding
B) Business Understanding
C) Modeling
D) Deployment
Answer: B
Explanation: Business Understanding defines the problem and objectives before any data work.

Question 2. In hypothesis formulation, a null hypothesis (H₀) typically states that:
A) There is a significant effect
B) There is no effect
C) The effect size is large
D) The data are non-normal
Answer: B
Explanation: H₀ asserts no relationship or difference; it is tested against the alternative.

Question 3. Which distribution is appropriate for modeling the number of website clicks per hour when the mean rate is low?
A) Normal
B) Binomial
C) Poisson
D) Uniform
Answer: C
Explanation: Poisson models count data with low mean rates and independent events.

Question 4. A p-value of 0.03 indicates:
A) 3% probability the null is true
B) 3% chance of observing data at least this extreme if H₀ is true
C) 97% confidence in H₁
D) Insignificant result
Answer: B
Explanation: The p-value is the probability of obtaining results at least as extreme as those observed, assuming H₀ is true.

Question 5. Which statistical test compares means of three independent groups?
A) Paired t-test
B) One-way ANOVA
C) Chi-square
D) Mann-Whitney U
Answer: B
Explanation: One-way ANOVA compares the means of three or more groups by analyzing variance between and within groups.

Question 6. Correlation does NOT imply:
A) Linear relationship
B) Causation
C) Direction
D) Strength
Answer: B
Explanation: Correlation measures association, not cause-effect.

Question 7. In stratified sampling, strata are created to:
A) Increase bias
B) Reduce sample size
C) Ensure representation of sub-groups
D) Randomize selection
Answer: C
Explanation: Stratification guarantees each subgroup is proportionally represented.

Question 8. Which Python library provides the DataFrame object?
A) NumPy
B) Pandas
C) Matplotlib
D) SciPy
Answer: B
Explanation: Pandas defines the DataFrame for tabular data manipulation.

Question 9. To compute the median of a numeric column in Pandas, you would use:
A) df.mean()
B) df.median()
C) df.mode()
D) df.sum()
Answer: B
Explanation: The median() method returns the middle value.

Question 10. Which SQL clause filters rows after aggregation?
A) WHERE
B) GROUP BY
C) HAVING
D) ORDER BY
Answer: C
Explanation: HAVING applies conditions to grouped results.

Question 11. A window function that calculates a running total uses which clause?
A) OVER (PARTITION BY ...)
B) GROUP BY
C) DISTINCT
D) LIMIT
Answer: A
Explanation: OVER defines the window for cumulative calculations.

Question 12. In Jupyter Notebook, the magic command %matplotlib inline:
A) Installs matplotlib
B) Shows plots within the notebook
C) Saves plots to disk
D) Exports notebook as PDF
Answer: B
Explanation: It renders matplotlib graphics inline.

Question 13. Which of the following is a best practice for reusable code in consulting scripts?
A) Hard-code file paths
B) Use global variables
C) Write functions with parameters
D) Duplicate code blocks
Answer: C
Explanation: Functions with parameters make code modular and adaptable.
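The difference between WHERE and HAVING in Question 10 can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The `orders` table, its columns, and the threshold values are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical in-memory table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("North", 100.0), ("North", 250.0), ("South", 40.0),
     ("South", 30.0), ("West", 500.0)],
)

# WHERE filters individual rows before grouping;
# HAVING filters whole groups after the aggregate is computed.
rows = conn.execute(
    """
    SELECT region, SUM(amount) AS total
    FROM orders
    WHERE amount > 35
    GROUP BY region
    HAVING SUM(amount) > 200
    ORDER BY region
    """
).fetchall()

print(rows)  # [('North', 350.0), ('West', 500.0)]
```

Here the South row of 30 is removed by WHERE, and the remaining South group (total 40) is then removed by HAVING.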
Question 21. Which algorithm is best suited for predicting a categorical outcome with two classes?
A) Linear regression
B) K-means
C) Logistic regression
D) PCA
Answer: C
Explanation: Logistic regression models binary classification probabilities.

Question 22. In a regression context, multicollinearity refers to:
A) High correlation among predictors
B) Non-linear relationship
C) Overfitting
D) Missing values
Answer: A
Explanation: Multicollinearity occurs when independent variables are highly correlated.

Question 23. The R-squared metric measures:
A) Prediction error
B) Proportion of variance explained
C) Model complexity
D) Classification accuracy
Answer: B
Explanation: R² indicates how much of the variance in the dependent variable the model captures.

Question 24. Which metric is most appropriate for evaluating a highly imbalanced binary classifier?
A) Accuracy
B) Precision
C) Recall
D) F1-score
Answer: D
Explanation: F1 balances precision and recall, useful when classes are uneven.

Question 25. A 5-fold cross-validation splits data into:
A) 5 training sets
B) 5 test sets
C) 5 pairs of train/test
D) 5 validation scores
Answer: C
Explanation: Each fold acts once as the test set while the remaining folds are used for training.

Question 26. To avoid data leakage, you should:
A) Scale data before split
B) Use future data in training
C) Include target variable in features
D) Perform feature engineering after split
Answer: D
Explanation: Feature engineering that learns from the data (for example, scaling or imputation) must be fitted after the train/test split, on the training set only, to prevent leakage.
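The train/test pairing described in Question 25 can be sketched in plain Python. The helper name `k_fold_indices` is hypothetical, and this sketch does not shuffle the data, which a real workflow usually would.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

With 10 samples and 5 folds, this produces 5 train/test pairs, each with 8 training indices and 2 test indices.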
Question 27. In a production data pipeline, the component that transforms raw logs into structured tables is called:
A) Ingestion
B) Storage
C) Transformation
D) Visualization
Answer: C
Explanation: Transformation (or ETL processing) reshapes data for analysis.

Question 28. Hadoop’s core storage layer is:
A) HDFS
B) Spark
C) Hive
D) Pig
Answer: A
Explanation: HDFS (Hadoop Distributed File System) stores data across nodes.

Question 29. Which Spark abstraction represents a distributed collection of data that can be processed in parallel?
A) RDD
B) DataFrame
C) Dataset
D) All of the above
Answer: D
Explanation: All three are Spark abstractions for parallel data handling.

Question 30. A NoSQL database is preferred when:
A) Strong ACID transactions are required
B) Data is highly relational
C) Schema flexibility is needed
D) Data size is tiny
Answer: C
Explanation: NoSQL offers flexible schemas for unstructured or semi-structured data.

Question 31. Which cloud service provides a fully managed data warehouse called Redshift?
A) Azure
B) Google Cloud
C) AWS
D) IBM Cloud
Answer: C
Explanation: Amazon Redshift is AWS’s data warehousing solution.

Question 32. GDPR primarily protects:
A) Intellectual property
B) Personal data of EU citizens
C) Corporate trade secrets
D) Open-source licenses
Answer: B
Explanation: GDPR regulates processing of EU residents’ personal data.

Question 33. An ethical AI principle that discourages using models for discriminatory profiling is:
A) Transparency
B) Fairness
C) Explainability
D) Reproducibility
Answer: B
Explanation: Fairness ensures models do not produce biased outcomes.
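The transformation step from Question 27 might look like the following minimal sketch. The log format (`<timestamp> <LEVEL> <message>`), the regular expression, and the `transform` helper are all assumptions made for illustration, not part of any specific pipeline tool.

```python
import re

# Hypothetical log format: "<ISO timestamp> <LEVEL> <message>"
LOG_PATTERN = re.compile(r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def transform(raw_lines):
    """Turn raw log lines into structured rows, skipping lines that do not match."""
    rows = []
    for line in raw_lines:
        match = LOG_PATTERN.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


raw = [
    "2024-01-01T10:00:00 INFO user login",
    "malformed line",
    "2024-01-01T10:05:00 ERROR payment failed",
]
table = transform(raw)
for row in table:
    print(row)
```

Each surviving row is a dictionary with `timestamp`, `level`, and `message` keys, which could then be loaded into a table.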
Question 40. The Central Limit Theorem justifies using the normal distribution for:
A) Small samples
B) Any data
C) Sample means of large n
D) Categorical variables
Answer: C
Explanation: With sufficient sample size, the distribution of sample means approximates normality.

Question 41. Which metric is appropriate for evaluating a regression model’s prediction error?
A) Accuracy
B) ROC-AUC
C) Mean Absolute Error
D) Confusion Matrix
Answer: C
Explanation: MAE measures average absolute deviation between predicted and actual values.

Question 42. In Pandas, the method to detect duplicate rows is:
A) df.duplicated()
B) df.unique()
C) df.dropna()
D) df.isnull()
Answer: A
Explanation: duplicated() returns a boolean Series indicating duplicate rows.

Question 43. A “window function” in SQL that calculates the average salary over the current row and the two preceding rows would use which frame clause?
A) ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
B) RANGE UNBOUNDED PRECEDING
C) PARTITION BY department
D) ORDER BY salary
Answer: A
Explanation: The ROWS frame defines a moving window of the prior two rows plus the current row.

Question 44. In data visualization, the “visual encoding” that maps a variable to color intensity is called:
A) Position
B) Shape
C) Hue
D) Saturation
Answer: D
Explanation: Saturation (or lightness) encodes magnitude via color intensity.

Question 45. Which of the following is a key advantage of using Power BI over static Excel reports?
A) Unlimited data rows
B) Real-time interactivity
C) No learning curve
D) Automatic AI modeling
Answer: B
Explanation: Power BI dashboards refresh and allow users to explore data dynamically.

Question 46. When deploying a model as a REST API, the most common container technology is:
A) VirtualBox
B) Docker
C) Hadoop
D) Spark
Answer: B
Explanation: Docker packages the model and its environment for scalable API deployment.

Question 47. In K-means clustering, the term “centroid” refers to:
A) The farthest point
B) The mean of points in a cluster
C) The median of the dataset
D) The distance metric
Answer: B
Explanation: The centroid is the arithmetic mean of all points assigned to that cluster.

Question 48. Which evaluation metric is suitable for ranking problems such as recommendation lists?
A) R-squared
B) Mean Reciprocal Rank
C) F1-score
D) Log-loss
Answer: B
Explanation: MRR measures the average reciprocal rank of the first relevant item.

Question 49. A “feature importance” plot in tree-based models helps consultants to:
A) Identify overfitting
B) Select most predictive variables
C) Calculate p-values
D) Perform clustering
Answer: B
Explanation: Importance scores rank predictors by their contribution to model performance.

Question 50. In time-series forecasting, the “seasonal decomposition” technique separates data into:
A) Trend, Noise, Outliers
B) Trend, Seasonal, Residual
C) Training, Validation, Test
D) Input, Output, Feedback
Answer: B
Explanation: Seasonal decomposition extracts trend, seasonal, and residual components.

Question 51. Which SQL function returns the nth highest salary in an employee table?
A) MAX()
B) RANK() OVER (ORDER BY salary DESC)
C) SUM()
D) COUNT()
Answer: B
Explanation: The windowed RANK function can be filtered to retrieve the nth rank. When salaries tie, RANK leaves gaps in the numbering, so DENSE_RANK is often used instead.

Question 52. The “bias-variance trade-off” describes:
A) The balance between model simplicity and data size
B) The relationship between underfitting and overfitting
C) The conflict between training and test accuracy
D) The need for
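The Mean Reciprocal Rank from Question 48 can be computed with a short sketch. The function name, the sample recommendation lists, and the convention of scoring 0 when no relevant item appears are assumptions for illustration.

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant item per query (0 if none is found)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for position, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / position
                break
    return total / len(ranked_lists)


# Hypothetical recommendation lists and the items each user actually wanted.
recommendations = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant = [{"a"}, {"e"}, {"z"}]

mrr = mean_reciprocal_rank(recommendations, relevant)
print(mrr)  # (1/1 + 1/2 + 0) / 3 = 0.5
```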
Answer: B
Explanation: Provenance tracks where data comes from and how it has been transformed.

Question 59. A “pivot table” in Excel is primarily used for:
A) Data entry
B) Summarizing and aggregating data
C) Writing macros
D) Creating charts
Answer: B
Explanation: Pivot tables reorganize data to compute totals, averages, etc.

Question 60. In Spark, the operation that shuffles data across partitions is called:
A) map()
B) reduce()
C) join()
D) repartition()
Answer: D
Explanation: repartition forces a shuffle to redistribute rows.

Question 61. Which metric would you monitor to detect model drift after deployment?
A) Training loss
B) Inference latency
C) Prediction distribution change
D) CPU usage
Answer: C
Explanation: Changes in the distribution of predictions indicate drift.

Question 62. The “Pareto principle” applied to defect analysis suggests:
A) 80% of defects arise from 20% of causes
B) Defects are evenly distributed
C) All defects have equal impact
D) Defects cannot be prioritized
Answer: A
Explanation: The 80/20 rule helps focus remediation on key drivers.

Question 63. A “heatmap” of website click-through rates across regions is an example of:
A) Spatial analysis
B) Temporal analysis
C) Sentiment analysis
D) Text mining
Answer: A
Explanation: Heatmaps visualize intensity over geographic or layout dimensions.

Question 64. In a logistic regression output, the coefficient sign indicates:
A) The probability value
B) Direction of effect on log-odds
C) Model accuracy
D) Number of iterations
Answer: B
Explanation: Positive coefficients increase log-odds; negative decrease them.
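To make the point of Question 64 concrete, the sketch below converts log-odds to a probability for a single-feature logistic model. The intercept and coefficient values are made up for illustration and do not come from a fitted model.

```python
import math


def predict_probability(intercept, coefficient, x):
    """Logistic model with one feature: probability = sigmoid(intercept + coefficient * x)."""
    log_odds = intercept + coefficient * x
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients: one positive, one negative.
for coefficient in (0.8, -0.8):
    probabilities = [round(predict_probability(0.0, coefficient, x), 3) for x in (0, 1, 2)]
    print(coefficient, probabilities)
```

With the positive coefficient the predicted probability rises as x increases; with the negative coefficient it falls. In both cases the probability is 0.5 at x = 0 because the intercept is zero.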
Question 65. Which of the following is a primary benefit of using version control (e.g., Git) in data consulting projects?
A) Faster computation
B) Automatic model tuning
C) Traceability of code changes
D) Data encryption
Answer: C
Explanation: Version control records who changed what and when.

Question 66. The “bootstrap” method in statistics is used to:
A) Sample with replacement to estimate variability
B) Reduce dimensionality
C) Perform gradient descent
D) Encode categorical variables
Answer: A
Explanation: Bootstrapping creates many resamples to approximate sampling distributions.

Question 67. In an A/B test, the “control group” receives:
A) The new feature
B) No treatment
C) Randomized variations
D) Both treatments simultaneously
Answer: B
Explanation: The control experiences the status-quo for comparison.

Question 68. Which Python function checks for normality using the Shapiro-Wilk test?
A) scipy.stats.shapiro
B) numpy.normaltest
C) pandas.shapiro
D) statsmodels.shapiro
Answer: A
Explanation: scipy.stats.shapiro returns the test statistic and p-value.

Question 69. When visualizing a categorical variable with many levels, the most readable chart is:
A) Pie chart
B) Stacked bar chart
C) Histogram
D) Line chart
Answer: B
Explanation: Stacked bars compare proportions across categories without overcrowding.

Question 70. A “data lake” differs from a data warehouse mainly because it:
A) Stores only structured data
B) Enforces strict schema on write
C) Holds raw, unprocessed data
D) Provides real-time analytics
Answer: C
Explanation: Data lakes retain data in its original format for flexible future use.
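The bootstrap idea from Question 66 can be sketched with the standard library alone. The sample values, the number of resamples, and the `bootstrap_means` helper are illustrative assumptions.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Resample with replacement and record the mean of each resample."""
    rng = random.Random(seed)
    n = len(sample)
    # rng.choices draws with replacement, which is what the bootstrap requires.
    return [statistics.mean(rng.choices(sample, k=n)) for _ in range(n_resamples)]


data = [12, 15, 9, 20, 14, 11, 17, 13]  # hypothetical observations
means = bootstrap_means(data)

# The spread of the resampled means approximates the standard error of the mean.
std_error = statistics.stdev(means)
print(round(std_error, 3))
```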
Question 77. In Tableau, a “parameter” allows users to:
A) Change data source
B) Dynamically adjust calculations
C) Export dashboards
D) Create hierarchies
Answer: B
Explanation: Parameters act as variables that can be swapped in formulas or filters.

Question 78. The “Elbow method” is used to determine:
A) Optimal number of clusters in K-means
B) Best learning rate for neural nets
C) Ideal depth of decision trees
D) Number of principal components
Answer: A
Explanation: Plotting within-cluster sum of squares helps locate the elbow point.

Question 79. In a regression model, the “adjusted R-squared” is preferred over R-squared when:
A) Adding more predictors
B) Dealing with categorical data
C) Using non-linear models
D) Performing classification
Answer: A
Explanation: Adjusted R² penalizes unnecessary predictors, preventing over-estimation.

Question 80. Which of the following best describes “data democratization”?
A) Limiting data access to executives
B) Providing self-service analytics to all users
C) Encrypting all datasets
D) Outsourcing data storage
Answer: B
Explanation: Democratization empowers broader audiences to explore data independently.

Question 81. In a decision tree, the “Gini impurity” metric measures:
A) Node purity
B) Model accuracy
C) Feature importance
D) Prediction variance
Answer: A
Explanation: Gini quantifies how mixed the classes are within a node.

Question 82. When deploying a model with Flask, the endpoint that receives JSON input should use which HTTP method?
A) GET
B) POST
C) PUT
D) DELETE
Answer: B
Explanation: POST is standard for sending data payloads to a server.

Question 83. The “MLOps” lifecycle adds which of the following to traditional data science?
A) Model training only
B) Continuous integration, delivery, and monitoring
C) Manual code reviews
D) Static reporting
Answer: B
Explanation: MLOps emphasizes CI/CD and operational monitoring of models.

Question 84. Which of the following is a common technique to reduce overfitting in tree-based models?
A) Increase depth
B) Decrease number of trees
C) Use pruning or max_depth limits
D) Remove regularization
Answer: C
Explanation: Pruning or setting max depth prevents overly complex trees.

Question 85. In a business context, “KPIs” stand for:
A) Key Performance Indicators
Answer: A
Explanation: KPIs are measurable values that demonstrate how effectively objectives are being met.

Question 86. A “scatter matrix” plot helps analysts to:
A) Visualize pairwise relationships among multiple variables
B) Show time-series trends
C) Display hierarchical data
D) Create geographic maps
Answer: A
Explanation: Scatter matrix (pair plot) displays scatter plots for each variable pair.

Question 87. When using the “groupby” operation in Pandas, the function that computes the sum of each group is:
A) df.sum()
B) df.groupby(...).sum()
C) df.aggregate('sum')
D) df.apply('sum')
Answer: B
Explanation: groupby(...).sum() aggregates each group's values.

Question 88. In a binary classification confusion matrix, high false-positive rate mainly affects:
A) Precision
B) Recall
C) Accuracy
D) Specificity
Answer: A
Explanation: Precision = TP/(TP+FP); more FP lowers precision.

Question 89. Which of the following is a recommended practice for handling categorical variables before modeling?
A) One-hot encoding
B) Normalization
C) Log transformation
D) Standard scaling
Answer: A
Explanation: One-hot creates binary columns for each category.
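The Gini impurity from Question 81 follows the standard formula 1 minus the sum of squared class proportions. The sketch below is a plain-Python illustration; the function name and sample labels are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


pure_node = ["yes", "yes", "yes", "yes"]
mixed_node = ["yes", "yes", "no", "no"]

print(gini_impurity(pure_node))   # 0.0 (only one class present)
print(gini_impurity(mixed_node))  # 0.5 (two classes, evenly split)
```

A value of 0 means the node is perfectly pure; for two classes the maximum is 0.5.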
Small residuals
Answer: B
Explanation: A Variance Inflation Factor (VIF) above the chosen threshold indicates collinearity.

Question 97. In Tableau, “dual-axis” charts are used to:
A) Combine two measures with different scales
B) Duplicate a worksheet
C) Export data to Excel
D) Create a map view
Answer: A
Explanation: Dual-axis overlays two graphs, each with its own axis.

Question 98. The “ROC curve” plots:
A) Accuracy vs. Threshold
B) True Positive Rate vs. False Positive Rate
C) Precision vs. Recall
D) Loss vs. Epochs
Answer: B
Explanation: ROC displays sensitivity against 1-specificity across thresholds.

Question 99. Which of the following is an unsupervised dimensionality-reduction technique?
A) Linear Regression
B) Principal Component Analysis
C) Decision Trees
D) Logistic Regression
Answer: B
Explanation: PCA reduces dimensions without using target labels.

Question 100. In the context of consulting, a “quick win” refers to:
A) Long-term strategic project
B) Immediate, low-effort improvement
C) High-cost technology overhaul
D) Data migration
Answer: B
Explanation: Quick wins deliver fast value with minimal resources.

Question 101. When using the “apply” method in Pandas on a DataFrame, the function is applied:
A) Row-wise or column-wise depending on axis
Answer: A
Explanation: apply passes each column to the function with axis=0 (the default) or each row with axis=1.

Question 102. In a Monte Carlo simulation, the primary purpose is to:
A) Optimize hyperparameters
B) Estimate uncertainty by random sampling
Answer: B
Explanation: Monte Carlo uses repeated random draws to model variability.
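The Monte Carlo approach in Question 102 can be sketched as follows. The cost components, their distributions, and every numeric parameter are hypothetical assumptions chosen only to show the mechanics.

```python
import random
import statistics


def simulate_total_cost(n_trials=10_000, seed=42):
    """Draw random cost scenarios and return the simulated totals."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_trials):
        labor = rng.uniform(80, 120)   # hypothetical labor cost range
        materials = rng.gauss(50, 5)   # hypothetical materials cost (mean 50, sd 5)
        totals.append(labor + materials)
    return totals


totals = simulate_total_cost()
mean_cost = statistics.mean(totals)

# Empirical 5th and 95th percentiles summarize the uncertainty in the total.
ordered = sorted(totals)
p5 = ordered[int(0.05 * len(ordered))]
p95 = ordered[int(0.95 * len(ordered))]
print(round(mean_cost, 1), round(p5, 1), round(p95, 1))
```

The interval between the two percentiles gives a rough 90% range for the total cost under these assumed inputs.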
Question 103. Which SQL keyword is used to combine rows from two tables with matching columns?
A) UNION
B) INTERSECT
C) JOIN
D) EXCEPT
Answer: C
Explanation: JOIN merges rows based on a related column.

Question 104. The “precision-recall trade-off” is most critical in which scenario?
A) Balanced classes
B) Rare event detection (e.g., fraud)
Answer: B
Explanation: When positives are scarce, recall is vital, but precision must also be managed.

Question 105. In a data-driven consulting engagement, the “statement of work” (SOW) typically includes:
A) Technical code only
B) Scope, deliverables, timeline, and responsibilities
Answer: B
Explanation: SOW defines what will be done, by whom, and when.

Question 106. Which Python library is commonly used for creating interactive dashboards?
A) Matplotlib
B) Seaborn
C) Plotly
D) Scikit-learn
Answer: C
Explanation: Plotly (with Dash) enables web-based interactive visualizations.

Question 107. In a regression residual plot, a funnel shape indicates:
A) Heteroscedasticity
B) Perfect fit
C) Multicollinearity
D) Normality
Answer: A
Explanation: Non-constant variance (heteroscedasticity) creates a funnel pattern.

Question 108. The “bagging” ensemble technique reduces variance by:
A) Training on the same data multiple times
B) Using subsets of data and aggregating predictions
Answer: B
Explanation: Bootstrap aggregating (bagging) builds many models on bootstrapped samples and averages them.

Question 109. Which of the following is a key component of a “data governance” framework?
A) Data visualization
B) Data quality standards
Answer: B
Explanation: Governance defines policies for data quality, security, and compliance.
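The JOIN from Question 103 can be tried out with Python's built-in sqlite3 module. The `customers` and `orders` tables and their contents are hypothetical.

```python
import sqlite3

# Hypothetical tables; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex'), (3, 'Initech');
    INSERT INTO orders VALUES (10, 1, 99.0), (11, 1, 25.0), (12, 2, 40.0);
    """
)

# An inner JOIN keeps only rows whose key appears in both tables.
rows = conn.execute(
    """
    SELECT c.name, o.amount
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.id
    """
).fetchall()

print(rows)  # [('Acme', 99.0), ('Acme', 25.0), ('Globex', 40.0)]
```

Initech has no orders, so it does not appear in the result; a LEFT JOIN would keep it with a NULL amount.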
Question 117. Which of the following is a common metric for evaluating clustering quality?
A) Silhouette score
Answer: A
Explanation: Silhouette measures how similar an object is to its own cluster versus others.

Question 118. When using “one-hot encoding” on a high-cardinality categorical variable, a common mitigation is:
A) Drop the variable
B) Use target encoding
Answer: B
Explanation: Target encoding reduces dimensionality while preserving information.

Question 119. In a time-series model, the “lag” feature represents:
A) Future value
B) Past observation shifted by a time step
Answer: B
Explanation: Lag variables capture autocorrelation from previous periods.

Question 120. Which of the following is an advantage of using “Docker” for model deployment?
A) Automatic hyperparameter tuning
B) Consistent environment across machines
Answer: B
Explanation: Docker containers encapsulate dependencies, ensuring reproducibility.

Question 121. The “Mallows’ Cp” statistic is used to:
A) Select the best subset of predictors in regression
Answer: A
Explanation: Cp balances model fit and complexity to avoid overfitting.

Question 122. In a regression context, “heteroscedasticity” violates which classical assumption?
A) Linear relationship
B) Constant variance of errors
Answer: B
Explanation: Homoscedasticity assumes equal error variance across predictions.

Question 123. Which of the following is a key reason to perform “data anonymization” before sharing datasets?
A) Reduce file size
B) Protect privacy and comply with regulations
Answer: B
Explanation: Anonymization removes personally identifiable information.
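The lag feature from Question 119 can be built with a small helper. The function name `add_lag` and the sample sales figures are hypothetical; missing history is represented as None in this sketch.

```python
def add_lag(series, lag=1):
    """Return a list where position t holds the value observed at t - lag (None if unavailable)."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    # Pad the front with None, then trim back to the original length.
    return ([None] * lag + list(series))[:len(series)]


sales = [10, 20, 30, 40]  # hypothetical monthly sales
lag_1 = add_lag(sales, lag=1)
lag_2 = add_lag(sales, lag=2)

print(lag_1)  # [None, 10, 20, 30]
print(lag_2)  # [None, None, 10, 20]
```

Each lagged list lines up with the original series, so `lag_1[t]` is the previous period's value for observation `t`.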
Question 124. In a SQL window function, “ROW_NUMBER() OVER (PARTITION BY … ORDER BY …)” is used to:
A) Rank rows within each partition
Answer: A
Explanation: ROW_NUMBER assigns sequential numbers based on the order within each group.

Question 125. The “Kullback-Leibler divergence” measures:
A) Symmetric distance between two distributions
B) Information loss when approximating one distribution with another
Answer: B
Explanation: KL divergence quantifies how one probability distribution diverges from a reference.

Question 126. Which of the following best describes “transfer learning” in deep learning?
A) Training a model from scratch
B) Re-using a pre-trained model on a new but related task
Answer: B
Explanation: Transfer learning leverages existing knowledge to accelerate new tasks.

Question 127. In a consulting engagement, “stakeholder mapping” helps to:
A) Identify data sources
B) Understand influence, interest, and communication needs
Answer: B
Explanation: Mapping clarifies who to involve and how to manage expectations.

Question 128. The “precision-recall curve” is particularly useful when:
A) Classes are balanced
B) Positive class is rare
Answer: B
Explanation: It focuses on the positive class, so it is not inflated by a large number of true negatives the way the ROC curve can be.

Question 129. Which Python function from Scikit-learn splits data into training and test sets?
A) train_test_split
Answer: A
Explanation: It randomly partitions arrays while preserving class distribution if stratify is used.
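Question 125 rules out the "symmetric distance" option, and a short sketch shows why. The function below uses the standard discrete formula with the natural logarithm; the two example distributions are hypothetical, and the sketch assumes q is nonzero wherever p is nonzero.

```python
import math


def kl_divergence(p, q):
    """Discrete KL divergence D(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]  # hypothetical reference distribution
q = [0.9, 0.1]  # hypothetical approximating distribution

forward = kl_divergence(p, q)
reverse = kl_divergence(q, p)
print(round(forward, 4), round(reverse, 4))
```

The two directions give different values, which is why KL divergence is described as a divergence and not a distance metric.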