

































This comprehensive exam preparation guide covers advanced data science topics from Chapter 09 of the SSM, including deep learning (transformers, GNNs, VAEs), natural language processing (BERT, attention, topic modeling), reinforcement learning (DQN, reward shaping), Bayesian methods (Gaussian processes, hierarchical models), causal inference (ATE, propensity scores, IV), and ethical AI. With 100 rigorous, graduate-level questions and detailed answers already graded A+, this resource is ideal for mastering high-dimensional modeling, probabilistic inference, and real-world applications like healthcare and finance.
Typology: Exams


































Subject Area: Advanced Data Science Applications
Description: This exam covers advanced topics in data science including deep learning, natural language processing, reinforcement learning, Bayesian methods, causal inference, and ethical AI, as presented in Chapter 09 of the SSM. Emphasis is on rigorous theoretical understanding and practical application to complex, real-world problems.
Expected Grade: A+
Total Questions: 100
Duration: 3 hours
Learning Outcomes: 1. Analyze and compare advanced machine learning architectures for specific data modalities.
Accreditation: This exam meets the standards for graduate-level data science programs at R research universities in the United States.
1. In a high-dimensional setting with p >> n, where ordinary least squares fails due to multicollinearity and overfitting, which of the following regularization techniques yields a sparse solution by inherently performing variable selection, and under what condition does it guarantee consistent selection of the true model?
Answer: Lasso regression, under the irrepresentable condition on the design matrix.
Lasso regression uses an L1 penalty, which shrinks some coefficients exactly to zero and thereby performs variable selection. The irrepresentable condition is required for consistent model selection in high dimensions; it essentially requires that irrelevant variables are not too correlated with the relevant ones. Ridge does not select variables; elastic net can but requires additional conditions; PCR does not perform variable selection.
2. A research team applies a deep neural network to classify medical images into disease categories. The model achieves 99% accuracy on the test set but only 55% accuracy on a new dataset collected from a different hospital with a different patient demographic. Which of the following is the most likely cause of this performance drop, and which mitigation strategy is most appropriate?
Answer: Data drift (covariate shift) between training and new data; use domain adaptation techniques.
The drop in performance on a different hospital's data is typical of covariate shift where the distribution of input features differs. Domain adaptation methods (e.g., adversarial training, importance weighting) adjust the model to the new distribution. Overfitting would show poor generalization within the same distribution; label noise would not cause such a systematic drop; concept drift is less likely as the disease definition is unchanged.
3. In a reinforcement learning setting with a continuous state space, which of the following best describes the role of the Bellman error in the context of fitted Q-iteration (FQI) using neural networks?
Answer: It is the mean squared error between the Q-network's output and the target computed as the immediate reward plus the maximum Q-value of the next state, and FQI minimizes this error iteratively.
In FQI, the Bellman error is the squared difference between the current Q-value estimate and the target (reward + γ max Q(s', a')). The algorithm fits the Q-network to minimize this error using regression, analogous to supervised learning. Options B, C, and D misrepresent the error or the method: FQI does not use policy gradient, does not adjust the discount factor, and is not tabular.
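The target and error described above can be sketched in a few lines of plain Python. This is only an illustration, not a full FQI implementation: the transition layout `(reward, next_state_index, done)`, the `done` flag for terminal states, and the tiny Q-table standing in for a network are all assumptions made for the example.

```python
def bellman_targets(batch, q_next, gamma=0.99):
    """Regression targets r + gamma * max_a' Q(s', a') for each transition.

    batch:  list of (reward, next_state_index, done) tuples (hypothetical layout).
    q_next: per-state lists of Q-values taken from the current Q estimate.
    """
    targets = []
    for reward, next_idx, done in batch:
        # Terminal transitions do not bootstrap from the next state (assumption).
        bootstrap = 0.0 if done else gamma * max(q_next[next_idx])
        targets.append(reward + bootstrap)
    return targets


def mean_squared_bellman_error(q_pred, targets):
    """Mean squared difference between predicted Q(s, a) and the targets."""
    return sum((q - y) ** 2 for q, y in zip(q_pred, targets)) / len(targets)


batch = [(1.0, 0, False), (0.0, 1, True)]
q_next = [[0.5, 2.0], [3.0, 1.0]]
targets = bellman_targets(batch, q_next, gamma=0.9)      # [2.8, 0.0]
loss = mean_squared_bellman_error([2.0, 0.5], targets)   # 0.445
```

In an actual FQI loop the targets would be recomputed after each regression step, with the Q-network refit to them each iteration.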
6. In causal inference, an investigator wants to estimate the average treatment effect (ATE) from observational data using propensity score matching. After matching, the standardized mean difference for a covariate X is 0.05, and the variance ratio is 0.9. Which of the following conclusions is most appropriate regarding covariate balance?
Answer: Balance is achieved because the standardized mean difference is below the common threshold of 0.1 and the variance ratio falls within the common range of 0.5 to 2.
Commonly accepted thresholds for good covariate balance are standardized mean difference < 0.1 (or 0.25) and variance ratio between 0.5 and 2. Here both are within these ranges, so balance is adequate. Option B incorrectly interprets variance ratio; values less than 1 are fine. Option C relies on p-values which are sample-size dependent; balance is about distributional similarity, not significance. Option D is unrealistic; exact zero is not required.
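The two balance diagnostics can be computed directly from the matched samples. The sketch below assumes the pooled-standard-deviation form of the standardized mean difference and uses made-up covariate values; the thresholds are the ones quoted above.

```python
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = ((variance(treated) + variance(control)) / 2) ** 0.5
    return (mean(treated) - mean(control)) / pooled_sd


def variance_ratio(treated, control):
    """Ratio of sample variances, treated over control."""
    return variance(treated) / variance(control)


def is_balanced(treated, control, smd_max=0.1, vr_range=(0.5, 2.0)):
    """Apply the SMD < 0.1 and 0.5 <= variance ratio <= 2 rules of thumb."""
    smd = abs(standardized_mean_difference(treated, control))
    vr = variance_ratio(treated, control)
    return smd < smd_max and vr_range[0] <= vr <= vr_range[1]


# Hypothetical covariate values after matching.
treated_x = [1.0, 2.0, 3.0, 4.0, 5.0]
control_x = [1.1, 2.0, 3.0, 4.0, 5.0]
balanced = is_balanced(treated_x, control_x)
```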
7. A data science team builds a model to predict loan defaults. The model is trained on historical data that includes a feature 'race' (categorical). To ensure fairness, the team removes the race feature from the model. However, the model still exhibits disparate impact across racial groups. Which of the following is the most likely explanation?
Answer: Other features in the model (e.g., zip code, income) act as proxies for race, leading to indirect discrimination.
Even when protected attributes are removed, correlated features can act as proxies, leading to disparate impact. This is a well-known issue in fairness (e.g., redlining via zip code). Overfitting (A) does not directly cause disparate impact; non-linearity (C) is not inherently biased; sampling bias (D) can affect accuracy but is not the primary reason for persistent disparity after removing the feature.
8. In the context of variational autoencoders (VAEs), the evidence lower bound (ELBO) consists of a reconstruction term and a KL divergence term. If the KL divergence term is multiplied by a factor β > 1 (as in β-VAE), what is the effect on the learned latent representation?
Answer: The latent representation becomes more compact and disentangled, but reconstruction quality may degrade.
β-VAE increases the weight on the KL term, enforcing a stronger prior constraint (e.g., a standard Gaussian) on the latent space. This encourages independence among latent dimensions (disentanglement) and a more compact representation, but can reduce reconstruction quality because the model is less flexible. Option B is opposite; C is a general property of VAEs; D is not the primary effect.
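As a rough illustration of the weighted objective, the sketch below assumes a diagonal Gaussian encoder and a standard normal prior, for which the KL term has a closed form. The function names and the scalar `recon_error` input are hypothetical; a real model would compute the reconstruction error from a decoder.

```python
import math


def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def beta_vae_loss(recon_error, mu, log_var, beta=4.0):
    """Negative ELBO with the KL term weighted by beta (beta = 1 is a plain VAE)."""
    return recon_error + beta * kl_diag_gaussian(mu, log_var)


# A latent code that matches the prior exactly incurs no KL penalty.
no_penalty = kl_diag_gaussian([0.0, 0.0], [0.0, 0.0])        # 0.0
# Moving the mean away from zero is penalized beta times as strongly.
loss = beta_vae_loss(1.0, [1.0], [0.0], beta=4.0)            # 1.0 + 4 * 0.5 = 3.0
```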
9. A time series dataset exhibits both trend and seasonality. A data scientist fits an ARIMA model after differencing the series once. The autocorrelation function (ACF) of the residuals shows a significant spike at lag 12, and the partial autocorrelation function (PACF) shows a significant spike at lag 12 as well. Which modification to the model is most appropriate?
Answer: Add a seasonal component with period 12, using SARIMA(p,d,q)(P,D,Q)_12.
The significant spike at lag 12 in both ACF and PACF suggests seasonality with period 12.
Answer: Use a custom partition function that distributes keys more evenly, such as based on a hash of the key combined with a random salt.
Data skew often arises from a non-uniform distribution of keys. A custom partition function can balance the load across reducers by, e.g., salting keys to spread out popular keys. Increasing mappers (A) doesn't address skew; combiners (C) can reduce data volume but not skew if keys are still skewed; switching frameworks (D) is not a targeted solution.
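Key salting can be illustrated outside any particular framework. The sketch below is plain Python rather than a real MapReduce or Spark partitioner: the function name, the `key#salt` format, and the use of MD5 as the hash are assumptions chosen to keep the example deterministic.

```python
import hashlib
import random
from collections import Counter


def salted_partition(key, num_partitions, num_salts, rng):
    """Assign a record to a partition from a hash of the key plus a random salt."""
    salt = rng.randrange(num_salts)
    digest = hashlib.md5(f"{key}#{salt}".encode()).hexdigest()
    return int(digest, 16) % num_partitions


rng = random.Random(0)
hot_key_records = ["popular_key"] * 10_000  # hypothetical skewed input
load = Counter(salted_partition(k, 8, 16, rng) for k in hot_key_records)
# Without the salt, all 10,000 records would hash to a single partition.
```

Because one logical key now lands in several partitions, the reduce step has to run in two stages: first per salted key, then once more to merge the partial results for each original key.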
13. Consider a dataset with 100 features and 10,000 samples. You perform PCA and find that the first principal component explains 40% of the variance. Which of the following interpretations is most accurate regarding the utility of the first principal component for downstream supervised learning?
Answer: The first principal component may not be predictive if the variance it captures is unrelated to the target variable, and using it could reduce model performance.
PCA is unsupervised and maximizes variance, not correlation with the target. Thus, the first PC may capture noise or irrelevant variance, and using it could harm predictive performance. Option A is incorrect because high variance does not imply predictive power. Option C is false because interpretability is not guaranteed and accuracy may drop. Option D misstates the objective of PCA.
14. A researcher is comparing two clustering algorithms on the same dataset. Algorithm 1 produces clusters with high intra-cluster similarity but low inter-cluster separation. Algorithm 2 produces clusters with moderate intra-cluster similarity but high inter-cluster separation. Which of the following cluster validity indices would most clearly favor Algorithm 2 over Algorithm 1?
Answer: Silhouette coefficient
The silhouette coefficient measures both intra-cluster cohesion and inter-cluster separation, and higher values indicate better-defined clusters. Algorithm 2's high separation would yield higher silhouette scores. Option A's Davies-Bouldin index favors low intra-cluster scatter and high inter-cluster distance, but it is a ratio that could be ambiguous. Option C's SSE only captures intra-cluster variance and would favor Algorithm 1. Option D's Rand index requires ground truth labels, which are not provided.
15. In a deep reinforcement learning setting for autonomous driving, the agent receives a sparse reward signal (+1 for reaching the destination, 0 otherwise). To improve learning efficiency, which technique is most appropriate for generating a dense reward signal?
Answer: Reward shaping by adding a potential-based function that gives positive rewards for approaching the destination.
Reward shaping with a potential-based function (e.g., negative distance to goal) provides a dense reward signal that guides the agent, and because it is potential-based, it preserves optimal policies. Option B's high gamma helps but does not create dense rewards. Options C and D improve learning but do not directly address reward sparsity.
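A minimal sketch of potential-based shaping follows, using the standard shaping term γΦ(s') − Φ(s) with the negative-distance potential mentioned above. The one-dimensional positions and the function names are assumptions for illustration only.

```python
def potential(position, goal):
    """Potential function: negative distance to the goal (1-D for simplicity)."""
    return -abs(goal - position)


def shaped_reward(reward, state, next_state, goal, gamma=0.99):
    """Environment reward plus the shaping term gamma * phi(s') - phi(s)."""
    return reward + gamma * potential(next_state, goal) - potential(state, goal)


# Sparse environment reward is 0 on both steps; shaping makes the signal dense.
toward = shaped_reward(0.0, state=5.0, next_state=4.0, goal=0.0, gamma=1.0)   # 1.0
away = shaped_reward(0.0, state=4.0, next_state=5.0, goal=0.0, gamma=1.0)     # -1.0
```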
16. A data scientist is analyzing a text corpus using Latent Dirichlet Allocation (LDA) with 10 topics. After training, they observe that many topics are dominated by common stopwords and lack semantic coherence. Which of the following modifications is most likely to improve topic interpretability?
Answer: Remove stopwords and apply TF-IDF weighting before LDA.
Stopwords dilute topic-specific words, so removing them and using TF-IDF (which downweights common words) helps reveal meaningful topics. Option A may lead to even more fragmented topics. Option C's HDP does not directly address stopword dominance. Option D's asymmetric priors can help but are secondary to preprocessing.
17. In the context of differential privacy, a data curator wants to release the mean of a numeric attribute from a dataset of size n. They apply the Laplace mechanism with scale parameter b = 1/ε. Which of the following statements about the privacy-accuracy trade-off is correct?
Answer: For a fixed ε, increasing n reduces the variance of the noise added to the mean.
The Laplace mechanism adds noise with scale b = Δf/ε, where Δf is the sensitivity of the query. For the mean, the sensitivity is 1/n, so the noise scale is 1/(nε). Thus, for a fixed ε, increasing n reduces the noise variance. Option B is false: decreasing ε increases the noise. [...] but not about the trade-off. Option D is false: the Gaussian mechanism provides (ε, δ)-differential privacy, not pure ε-differential privacy, and may require larger noise [...]
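The dependence of the noise scale on n can be sketched as follows. The example assumes values are clipped to [0, 1], so that changing one record moves the mean by at most 1/n; the function names are hypothetical, and the Laplace draw uses the fact that the difference of two independent exponential variables is Laplace-distributed.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, epsilon, rng, lower=0.0, upper=1.0):
    """Release the mean of values clipped to [lower, upper] with Laplace noise.

    Returns (noisy_mean, noise_scale). The clipping bounds are an assumption
    needed to bound the sensitivity of the mean by (upper - lower) / n.
    """
    n = len(values)
    clipped = [min(max(v, lower), upper) for v in values]
    scale = (upper - lower) / (n * epsilon)  # sensitivity / epsilon
    return sum(clipped) / n + laplace_noise(scale, rng), scale


rng = random.Random(0)
_, scale_small_n = private_mean([0.5] * 100, epsilon=1.0, rng=rng)    # 0.01
_, scale_large_n = private_mean([0.5] * 1000, epsilon=1.0, rng=rng)   # 0.001
```

Since the variance of a Laplace variable is 2b², a tenfold increase in n cuts the noise variance by a factor of 100 at the same ε.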
18. A machine learning engineer is tuning a gradient boosting model (XGBoost) for a binary classification task with imbalanced classes (1% positive). They want to maximize the area under the precision-recall curve (AUPRC). Which combination of hyperparameter adjustments is most likely to improve AUPRC?
Answer: Increase max_depth and reduce learning_rate, while using scale_pos_weight set to the ratio of negative to positive samples.
For imbalanced data, setting scale_pos_weight to the ratio of negative to positive samples helps the model focus on the minority class. Increasing max_depth allows capturing complex patterns in the minority class, while reducing learning_rate with more trees improves generalization. Option B's decrease in depth may underfit. Option C's high min_child_weight may prevent learning rare patterns. Option D's use of logitraw and fixed threshold is not optimal for imbalanced data.
22. In a high-dimensional sparse dataset with 10,000 features and 500 samples, you apply LASSO regression for feature selection. The optimal regularization parameter λ is chosen via cross-validation. Which of the following best describes the number of non-zero coefficients as λ increases from 0 to a value large enough to shrink all coefficients to zero?
Answer: The number of non-zero coefficients decreases monotonically, with some coefficients becoming exactly zero at distinct λ values.
LASSO performs L1 regularization, which shrinks coefficients and can set some to exactly zero as λ increases. The regularization path shows that coefficients can enter or leave the model at different λ values, but overall the number of non-zero coefficients decreases monotonically. Option B describes ridge regression (L2). Option C is incorrect; the grouping effect is not monotonic in number. Option D describes a step function not typical for LASSO.
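The way coefficients hit exactly zero can be seen in the special case of an orthonormal design, where the lasso solution for the objective (1/2)||y − Xb||² + λ||b||₁ reduces to soft-thresholding each least-squares coefficient. The sketch below assumes that special case, and the coefficient values are made up.

```python
def soft_threshold(z, lam):
    """Lasso solution for one coefficient under an orthonormal design."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def count_nonzero(ols_coefs, lam):
    """Number of coefficients that survive soft-thresholding at level lam."""
    return sum(1 for z in ols_coefs if soft_threshold(z, lam) != 0.0)


ols_coefs = [3.0, -1.5, 0.4, 0.0]  # hypothetical least-squares estimates
counts = [count_nonzero(ols_coefs, lam) for lam in (0.0, 0.5, 2.0, 4.0)]
# counts == [3, 2, 1, 0]: each coefficient drops out at its own lambda value
```

With correlated features there is no such closed form, and the path has to be traced numerically (for example by coordinate descent).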
23. A data scientist uses a transformer-based model for time series forecasting of electricity demand. The model incorporates positional encoding and multi-head self-attention. To improve performance, they consider adding a recurrent layer after the transformer encoder. Which of the following is the most accurate statement about this modification?
Answer: It would introduce inductive bias favoring sequential order, potentially improving modeling of local temporal patterns.
Adding a recurrent layer (e.g., LSTM) after a transformer can help capture local sequential patterns that self-attention might overlook, as transformers are less biased towards local structure. Option A is less relevant because transformers already handle long-range dependencies. Option C is false; recurrent layers have linear complexity in sequence length. Option D is incorrect because transformers may not fully capture local patterns without explicit inductive bias.
24. In a federated learning setting with 100 clients, each having non-IID data, the server aggregates model updates using FedAvg. After several rounds, the global model accuracy plateaus. Which of the following modifications is most likely to improve convergence under heterogeneous data distributions?
Answer: Introducing a proximal term in the local objective to penalize deviations from the global model.
FedProx (proximal term) addresses non-IID data by keeping local updates close to the global model, improving stability and convergence. Option A may worsen heterogeneity effects. Option B is already part of FedAvg (weighted by dataset size). Option D is standard but does not directly address the plateau caused by heterogeneity.
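The proximal term can be written down in a few lines. The sketch below assumes model weights are flat lists of floats and that the client already has its local loss and gradient; the coefficient name `mu` follows common FedProx usage, and the function names are hypothetical.

```python
def fedprox_local_objective(local_loss, weights, global_weights, mu=0.1):
    """Local loss plus the proximal penalty (mu / 2) * ||w - w_global||^2."""
    sq_dist = sum((w - g) ** 2 for w, g in zip(weights, global_weights))
    return local_loss + 0.5 * mu * sq_dist


def fedprox_gradient(local_grad, weights, global_weights, mu=0.1):
    """Gradient of the objective above: local gradient plus mu * (w - w_global)."""
    return [g + mu * (w - gw) for g, w, gw in zip(local_grad, weights, global_weights)]


objective = fedprox_local_objective(1.0, [1.0, 2.0], [0.0, 0.0], mu=0.1)   # 1.25
grad = fedprox_gradient([0.0, 0.0], [1.0, 2.0], [0.0, 0.0], mu=0.1)
# grad points away from the global model, so a descent step pulls w back toward it.
```

Setting `mu=0` recovers the plain local objective used by FedAvg.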
25. A causal inference study uses instrumental variables (IV) to estimate the effect of a treatment on an outcome. The instrument Z is binary, and the treatment D is continuous. Which of the following conditions must hold for the IV estimator to be consistent?
Answer: Z must be correlated with D, and Z must affect the outcome only through D, and there must be no confounding between Z and the outcome.
IV consistency requires relevance (correlation with D), exclusion restriction (Z affects outcome only through D), and independence (no unmeasured confounding of Z and outcome). Option B is the conditional independence assumption for propensity scores, not IV. Option C is too strong; random assignment is not required for IV. Option D describes LATE assumptions but is incomplete; monotonicity is needed for binary treatment.
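With a binary instrument, the IV estimate takes the simple Wald form: the difference in mean outcome between the two instrument groups, divided by the difference in mean treatment. The sketch below uses made-up data in which the outcome is exactly twice the treatment, so the estimate should recover 2; it illustrates the estimator only and does not test the exclusion or independence assumptions, which cannot be verified from the data alone.

```python
from statistics import mean


def wald_estimate(z, d, y):
    """IV (Wald) estimate for a binary instrument z, treatment d, and outcome y."""
    d1 = mean([di for zi, di in zip(z, d) if zi == 1])
    d0 = mean([di for zi, di in zip(z, d) if zi == 0])
    y1 = mean([yi for zi, yi in zip(z, y) if zi == 1])
    y0 = mean([yi for zi, yi in zip(z, y) if zi == 0])
    first_stage = d1 - d0  # relevance: must be nonzero
    if first_stage == 0:
        raise ValueError("instrument is irrelevant: no first-stage effect")
    return (y1 - y0) / first_stage


z = [0, 0, 1, 1]                 # hypothetical binary instrument
d = [1.0, 2.0, 3.0, 4.0]         # continuous treatment
y = [2.0, 4.0, 6.0, 8.0]         # outcome generated as 2 * d
effect = wald_estimate(z, d, y)  # 2.0
```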
26. A data scientist applies the DBSCAN clustering algorithm to a dataset with varying density clusters. They set epsilon = 0.5 and minPts = 5. After visualization, they notice that a sparse cluster is split into multiple small clusters while a dense cluster is correctly identified. Which adjustment is most appropriate to capture the sparse cluster as a single entity?
Answer: Use OPTICS instead, which does not require a global epsilon.
DBSCAN with a single global epsilon fails on varying densities. OPTICS produces a reachability plot that allows extracting clusters of different densities. Option A might merge dense clusters incorrectly. Option B would make it harder to find sparse clusters. Option C might increase noise points. OPTICS is designed for such scenarios.
30. In a natural language processing task, you fine-tune a pre-trained BERT model on a small domain-specific corpus for sentiment analysis. The model achieves high accuracy on the training set but poor performance on a held-out test set from the same domain. Which of the following is the most likely cause?
Answer: Catastrophic forgetting of pre-trained knowledge due to a high learning rate during fine-tuning.
Fine-tuning a large pre-trained model on a small dataset with a high learning rate can cause catastrophic forgetting, where the model overfits to the small corpus and loses general knowledge. Option B is incorrect; BERT can be fine-tuned for sentiment analysis. Option C is less likely because BERT uses subword tokenization. Option D is not the primary issue; BERT's attention is typically sufficient.
31. In the context of advanced data science applications for healthcare, consider a deep learning model trained on electronic health records to predict patient readmission within 30 days. The model uses demographic, clinical, and social determinants of health features. After deployment, the model's performance degrades over time due to changes in patient population and clinical practices. Which of the following approaches is MOST appropriate to maintain model performance while minimizing bias and ensuring regulatory compliance?
Answer: Implement a continuous learning framework with automated retraining triggered by performance drift detection, while monitoring for fairness across subgroups.
Continuous learning with drift detection and fairness monitoring addresses performance degradation while proactively mitigating bias. Retraining from scratch annually (A) may not capture gradual shifts and could introduce bias if not monitored. Post-processing calibration (C) does not adapt to feature distribution changes. Ensemble averaging (D) may improve robustness but does not systematically handle drift or fairness.
32. A data scientist is developing a recommendation system for a large e-commerce platform. They want to incorporate both collaborative filtering and content-based filtering to handle cold-start problems and provide diverse recommendations. Which of the following hybrid approaches is MOST likely to achieve this goal while maintaining scalability for millions of users and items?
Answer: Implement a matrix factorization model that incorporates side information (user and item features) as regularization terms in the objective function.
Matrix factorization with side information effectively blends collaborative and content-based signals, handling cold-start by leveraging features. Weighted linear combination (A) may not capture complex interactions. Two-stage pipeline (B) can suffer from error propagation and limited diversity. End-to-end deep learning (D) requires large amounts of data and may not scale as efficiently as matrix factorization for large-scale systems.
33. In a clinical trial for a new drug, the primary endpoint is binary (response vs. no response). The trial uses a Bayesian adaptive design with a prior that is weakly informative. At an interim analysis, the posterior probability that the drug's response rate exceeds the control rate by at least 10% is 0.85. The trial's stopping boundary for efficacy is a posterior probability of at least 0.95. Which of the following actions is MOST appropriate according to the prespecified adaptive design?
Answer: Continue the trial as planned because the stopping boundary has not been met.
The prespecified stopping boundary requires a posterior probability of at least 0.95; 0.85 does not meet that threshold, so the trial should continue. Stopping early (A) would violate the design. Increasing sample size (C) is not automatically triggered by failing to stop; it would require a futility or sample size re-estimation rule. Modifying the prior post-hoc (D) introduces bias and undermines the Bayesian framework.
36. A data science team is building a model to predict customer churn for a subscription-based service. They have historical data including customer demographics, usage patterns, support interactions, and billing information. The team wants to ensure the model is interpretable to business stakeholders. Which of the following modeling approaches would provide the BEST balance between predictive performance and interpretability for this binary classification task?
Answer: Gradient boosting machine (GBM) with SHAP values for explanation.
GBM often provides high predictive accuracy, and SHAP values offer consistent, local and global interpretability grounded in game theory. Logistic regression (B) is interpretable but may underperform if relationships are complex, and stepwise selection can be unstable. Random forest (C) yields feature importance but not as nuanced as SHAP. SVM with RBF kernel (D) is a black box; LIME provides local explanations but can be unstable and not consistent globally.
37. In a study examining the effect of a new teaching method on student performance, data is collected from multiple schools. Students are nested within classrooms, and classrooms within schools. The outcome is a continuous test score. The researcher wants to estimate the treatment effect while accounting for clustering and potential confounding at the school level. Which of the following statistical models is MOST appropriate?
Answer: Multilevel model (hierarchical linear model) with random intercepts for classrooms and schools, and treatment as a fixed effect.
A multilevel model appropriately handles the nested structure (students in classrooms in schools) and allows for partitioning of variance across levels, providing correct standard errors and the ability to include school-level covariates. OLS with fixed effects (A) would remove between-school variation but ignore classroom clustering. GEE (C) is population-averaged and may not capture the hierarchical random variation. Propensity score matching (D) does not account for clustering, leading to underestimated standard errors.
38. A data analyst is working with a dataset containing missing values in several features. They decide to use multiple imputation (MI) to handle the missing data. After imputation, they fit a linear regression model on each imputed dataset and pool the results using Rubin's rules. Which of the following statements about this approach is TRUE?
Answer: Rubin's rules combine within-imputation and between-imputation variance to produce valid standard errors.
Rubin's rules correctly combine variability from both within and across imputations to yield standard errors that reflect uncertainty due to missing data. MI is valid under the less restrictive missing at random (MAR) assumption, not MCAR (A). Pooled coefficients are the average (B is true but not the best answer; C is more specific to the question about validity). The imputation model can be more general than the analysis model (D is false).
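The pooling step can be sketched as below, assuming each imputed dataset yields one point estimate and its squared standard error for the coefficient of interest. The total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance; the input numbers are made up.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors via Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (sample variance)
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5


# Hypothetical coefficient estimates from m = 3 imputed datasets.
pooled_coef, pooled_se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

Here the pooled standard error exceeds the within-imputation value of about 0.71 because the estimates disagree across imputations, which is exactly the extra uncertainty due to missing data.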
39. Consider a reinforcement learning problem where an agent learns to play a video game. The state space is continuous (screen pixels) and the action space is discrete (e.g., move left, right, jump). The agent uses a deep Q-network (DQN) with experience replay and a target network. Which of the following modifications is MOST likely to improve the stability of learning?
Answer: Implement double DQN to reduce overestimation bias.
Double DQN addresses the overestimation bias of standard DQN by using separate networks for action selection and evaluation, which often stabilizes learning and improves performance. Increasing learning rate (A) can cause instability. RNN (B) may help with partial observability but not directly with stability. Removing the target network (D) would likely increase instability due to correlated updates.
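The difference between the two targets is small in code. The sketch below assumes the Q-values for the next state are already available as plain lists from the online and target networks; the numbers are made up to show a case where the standard target is inflated.

```python
def dqn_target(reward, q_target_next, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates the action."""
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: online network selects the action, target network evaluates it."""
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]


q_online_next = [1.0, 2.0]   # online network prefers action 1
q_target_next = [5.0, 0.5]   # target network has a noisy high value for action 0
standard = dqn_target(0.0, q_target_next, gamma=1.0)                      # 5.0
double = double_dqn_target(0.0, q_online_next, q_target_next, gamma=1.0)  # 0.5
```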
42. In a deep learning model for natural language processing, the attention mechanism computes a weighted sum of values based on query-key similarity. Consider a transformer layer with multi-head attention using 8 heads, each with key dimension d_k = 64. If the input sequence length is 128 and the model dimension d_model = 512, what is the total number of parameters in the multi-head attention layer (excluding biases)?
Answer: 1,048,576
Each head has its own weight matrices for queries, keys, and values: W_Q, W_K, and W_V, each of size d_model × d_k = 512 × 64 = 32,768, so 3 × 32,768 = 98,304 per head. For 8 heads, that is 8 × 98,304 = 786,432. The output projection matrix W_O has size d_model × d_model = 512 × 512 = 262,144. Total = 786,432 + 262,144 = 1,048,576 parameters. The typical implementation, in which the query, key, and value projections are concatenated into a single matrix of size d_model × 3d_model, gives the same count: 512 × 1,536 = 786,432 for QKV, plus 262,144 for the output projection. The sequence length of 128 does not affect the parameter count. Of the options (A: 4,194,304; B: 2,097,152; C: 1,048,576; D: 8,388,608), C is correct.
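The per-head count can be checked with a short function. It is a sketch that assumes the value dimension equals d_k unless stated otherwise and that biases are excluded, as in the question.

```python
def mha_param_count(d_model, num_heads, d_k, d_v=None):
    """Weight count of multi-head attention, excluding biases."""
    d_v = d_k if d_v is None else d_v           # assumption: d_v defaults to d_k
    query_key = 2 * num_heads * d_model * d_k   # W_Q and W_K for every head
    value = num_heads * d_model * d_v           # W_V for every head
    output = num_heads * d_v * d_model          # W_O on the concatenated heads
    return query_key + value + output


total = mha_param_count(d_model=512, num_heads=8, d_k=64)  # 1,048,576
```

Sequence length does not appear as an argument because attention weights are shared across positions.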
43. A data scientist is evaluating a binary classifier for a rare disease detection task. The dataset has a 1% prevalence of the positive class. The model achieves 99% sensitivity and 95% specificity. What is the positive predictive value (precision) of this model?
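Answer: approximately 0.167
Use Bayes' theorem: PPV = (sensitivity * prevalence) / (sensitivity * prevalence + (1 - specificity) * (1 - prevalence)). Sensitivity = 0.99, specificity = 0.95, prevalence = 0.01. Numerator = 0.99 * 0.01 = 0.0099. Denominator = 0.0099 + (0.05 * 0.99) = 0.0099 + 0.0495 = 0.0594. PPV = 0.0099 / 0.0594 ≈ 0.167, so only about one in six positive results is a true positive despite the high sensitivity and specificity.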
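The Bayes' theorem calculation can be checked numerically with a small helper; the function name is hypothetical and the inputs are the values from the question.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = TP rate / (TP rate + FP rate), both weighted by class prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


ppv = positive_predictive_value(sensitivity=0.99, specificity=0.95, prevalence=0.01)
# ppv is approximately 0.1667, i.e. 0.0099 / 0.0594
```

Raising the prevalence to 10% with the same test gives a PPV of about 0.69, which shows how strongly precision depends on the base rate.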
44. In a distributed computing environment using Apache Spark, a data engineer wants to perform a reduceByKey operation on a large RDD with millions of keys. The cluster has 10 executors, each with 4 cores. The data is partitioned into 100 partitions initially. After a shuffle, how many partitions will the reduceByKey output have by default?
Answer: 100
By default, reduceByKey in Spark uses the number of partitions of the parent RDD, unless a partition count is passed explicitly or spark.default.parallelism is configured. The parent RDD has 100 partitions, so the output will also have 100 partitions. Option B (40) is the total number of cores (10 executors * 4 cores). Option C (10) is the number of executors. Option D (200) is double the partitions, which is not the default.
45. A data scientist is training a gradient boosting model (XGBoost) on a dataset with 500 features and 50,000 samples. After tuning, the model uses a learning rate of 0.1, max_depth=6, and 500 trees. The training time is 2 hours. To reduce overfitting while maintaining similar predictive performance, which modification is most effective?
Answer: Set subsample=0.8 and colsample_bytree=0.8.
Subsampling (row and column) introduces randomness and reduces overfitting by making trees less correlated, which is a standard regularization technique in gradient boosting. Option A increases learning rate, which may speed up convergence but can increase overfitting if not carefully tuned. Option C increases depth, which typically increases overfitting. Option D reduces training data, which likely harms performance and does not address overfitting directly.