ALBERTA DATA SCIENTIST EXAM, Exams of Advanced Data Analysis

ALBERTA DATA SCIENTIST EXAM | QUESTIONS AND CORRECT ANSWERS (VERIFIED ANSWERS) PLUS RATIONALES 2026 Q&A | INSTANT DOWNLOAD PDF

Typology: Exams

2025/2026

Available from 04/22/2026

wergnkses254

Question 1
A model performs well on training data but poorly on unseen data. What
is the most likely issue?
A. Underfitting
B. Overfitting
C. Data leakage
D. Feature scaling
Correct Answer: B
Rationale: Overfitting occurs when a model learns training data too
closely and fails to generalize.
Question 2
What is the primary goal of exploratory data analysis (EDA)?
A. Build final models
B. Understand data structure and patterns
C. Deploy production systems
D. Encrypt datasets
Correct Answer: B
Rationale: EDA helps understand distributions, anomalies, and
relationships in data.
Question 3
Which metric is best for evaluating classification performance on
imbalanced datasets?
A. Mean squared error
B. Accuracy only
C. F1-score
D. R-squared
Correct Answer: C
Rationale: The F1-score is the harmonic mean of precision and recall, so
it is not inflated by a large majority class the way accuracy is.
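As a worked illustration of Question 3, the following minimal sketch computes accuracy, precision, recall, and F1 by hand. The label lists are hypothetical data invented for the example, and the function names are illustrative rather than taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical imbalanced data: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
# A classifier that always predicts the majority class.
y_majority = [0] * 100

print(accuracy(y_true, y_majority))             # 0.95
print(precision_recall_f1(y_true, y_majority))  # (0.0, 0.0, 0.0)
```

The always-negative classifier scores 95% accuracy while finding none of the positive cases, which is why the F1-score is the more informative number here.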

Question 4
What does “feature engineering” involve?
A. Collecting hardware
B. Creating or transforming variables to improve model performance
C. Deleting data
D. Encrypting datasets
Correct Answer: B
Rationale: Feature engineering improves model performance by enhancing
input variables.
Question 5
Which algorithm is best for clustering unlabeled data?
A. Linear regression
B. K-means clustering
C. Logistic regression
D. Decision tree classification
Correct Answer: B
Rationale: K-means groups similar data points without labels.
Question 6
What is a confusion matrix used for?
A. Regression analysis
B. Evaluating classification performance
C. Data storage
D. Feature selection
Correct Answer: B
Rationale: It summarizes prediction outcomes (TP, FP, TN, FN).
Question 7
What is “data leakage”?
A. Missing data
B. When test data information is used during training
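Returning to Question 5, the sketch below shows the idea behind k-means on one-dimensional data: assign each point to its nearest centroid, then move each centroid to the mean of its points. The data, the starting centroids, and the function name `kmeans_1d` are assumptions made for illustration; real implementations choose starting centroids more carefully and work in many dimensions.

```python
def kmeans_1d(points, initial_centroids, iterations=10):
    """Run a fixed number of k-means iterations on 1-D data.

    Returns the final centroids and the clusters from the last assignment.
    """
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Update step: each centroid moves to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical unlabeled data with two visible groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, initial_centroids=[1.0, 12.0])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

No labels are supplied at any point; the grouping comes only from distances between the points.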

Question 11
What is supervised learning?
A. Learning without labels
B. Learning from labeled data
C. Random guessing
D. Data storage
Correct Answer: B
Rationale: Supervised learning uses labeled datasets for training.
Question 12
What is the role of a validation set?
A. Train model
B. Tune hyperparameters
C. Store raw data
D. Encrypt output
Correct Answer: B
Rationale: Validation data is used for model tuning.
Question 13
Which model is best for continuous value prediction?
A. Logistic regression
B. Linear regression
C. KNN classification
D. Naive Bayes
Correct Answer: B
Rationale: Linear regression predicts continuous outcomes.
Question 14
What does “bias” in a model indicate?
A. Random error
B. Systematic error in predictions
C. Perfect prediction
D. Data size

Correct Answer: B
Rationale: Bias refers to error from overly simplistic assumptions.
Question 15
What is variance in machine learning?
A. Stability of model predictions
B. Sensitivity to training data changes
C. Data storage size
D. Feature count
Correct Answer: B
Rationale: High variance indicates over-sensitivity to training data.
Question 16
What is cross-validation used for?
A. Data encryption
B. Model performance evaluation on multiple splits
C. Data collection
D. Feature removal
Correct Answer: B
Rationale: Cross-validation improves reliability of model evaluation.
Question 17
Which algorithm is commonly used for classification problems?
A. K-means
B. Logistic regression
C. PCA
D. Linear regression only
Correct Answer: B
Rationale: Logistic regression is widely used for classification.
Question 18
What is PCA used for?
A. Increasing features
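For Question 16, the "multiple splits" can be pictured with a small k-fold index generator. This is a sketch under simplifying assumptions: folds are contiguous and unshuffled, and the name `k_fold_indices` is hypothetical. In practice the data are usually shuffled or stratified first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    if not 1 < k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


for train, validation in k_fold_indices(n_samples=10, k=3):
    print("train:", train, "validation:", validation)
```

Each sample lands in the validation fold exactly once, so averaging the per-fold scores uses every observation for evaluation without ever scoring a model on data it was trained on.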

Question 22
Which technique handles missing data?
A. Deletion only
B. Imputation
C. Random guessing
D. Feature removal only
Correct Answer: B
Rationale: Imputation replaces missing values with estimates.
Question 23
What is an outlier?
A. Normal data point
B. Data point far from distribution
C. Label error only
D. Feature name
Correct Answer: B
Rationale: Outliers deviate significantly from other observations.
Question 24
What is the purpose of a decision tree?
A. Clustering
B. Rule-based prediction
C. Data storage
D. Encryption
Correct Answer: B
Rationale: Decision trees split data based on feature rules.
Question 25
What is overfitting caused by?
A. Too little data only
B. Excess model complexity
C. Feature removal
D. Low variance only

Correct Answer: B
Rationale: Complex models memorize training data patterns.
Question 26
What is underfitting?
A. Perfect model
B. Model too simple to capture patterns
C. Over-complex model
D. Data leakage
Correct Answer: B
Rationale: Underfitting occurs when the model is too simple.
Question 27
What is feature scaling used for?
A. Increasing dataset size
B. Standardizing feature ranges
C. Removing labels
D. Encrypting data
Correct Answer: B
Rationale: Scaling puts features on comparable ranges so that no feature
dominates a distance-based or gradient-based model purely because of its
units.
Question 28
Which model is best for probability-based classification?
A. Naive Bayes
B. K-means
C. PCA
D. Linear regression
Correct Answer: A
Rationale: Naive Bayes applies Bayes' theorem, assuming features are
conditionally independent given the class, to estimate class probabilities.
Question 29
What is A/B testing used for?
A. Data cleaning
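To make Question 27 concrete, here is a minimal sketch of two common scaling schemes, z-score standardization and min-max scaling, using only the standard library. The function names and the income figures are hypothetical, and the population standard deviation is an assumption of this sketch.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature carries no spread
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature measured in large units (e.g. annual income).
income = [30000, 45000, 60000]
print(min_max_scale(income))  # [0.0, 0.5, 1.0]
print(standardize(income))
```

After either transformation, a feature recorded in tens of thousands sits on the same footing as one recorded in single digits.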

Correct Answer: B
Rationale: Stratified sampling ensures balanced class representation
across splits.
Question 33
Which metric is most appropriate for regression evaluation?
A. F1-score
B. Mean Absolute Error (MAE)
C. Accuracy
D. Confusion matrix
Correct Answer: B
Rationale: MAE measures average prediction error in regression tasks.
Question 34
What is regularization primarily used for?
A. Increasing model complexity
B. Reducing overfitting
C. Increasing data size
D. Improving visualization
Correct Answer: B
Rationale: Regularization penalizes complexity to improve generalization.
Question 35
What does L1 regularization do?
A. Increases all weights
B. Sets some weights to zero
C. Removes data
D. Increases variance
Correct Answer: B
Rationale: L1 (Lasso) can eliminate less important features.
Question 36
What does L2 regularization do?

A. Removes features
B. Shrinks weights evenly
C. Increases bias only
D. Deletes data points
Correct Answer: B
Rationale: L2 (Ridge) shrinks all weights toward zero without setting any
of them exactly to zero.
Question 37
What is the main purpose of a ROC curve?
A. Data cleaning
B. Evaluate classification thresholds
C. Feature engineering
D. Data storage
Correct Answer: B
Rationale: ROC curves show trade-offs between sensitivity and specificity.
Question 38
What does AUC represent?
A. Data size
B. Model’s ability to distinguish classes
C. Training speed
D. Feature count
Correct Answer: B
Rationale: AUC summarizes classifier performance across all thresholds;
0.5 corresponds to chance and 1.0 to perfect separation of the classes.
Question 39
What is the main assumption of linear regression?
A. Non-linear relationship
B. Linear relationship between variables
C. Random data
D. Categorical output
Correct Answer: B
Rationale: Linear regression assumes a linear relationship between the
predictors and the response.
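One way to read the AUC in Question 38 is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pair counting; the scores are hypothetical and the function name `auc` is illustrative. The quadratic loop is fine for small examples but not for large datasets.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and classifier scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(y_true, scores))  # 0.75
```

Three of the four positive-negative pairs are ordered correctly, giving 0.75; no decision threshold had to be chosen to obtain that number.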

C. Number of datasets
D. Number of models
Correct Answer: B
Rationale: K defines how many clusters are created.
Question 44
What is a silhouette score used for?
A. Regression evaluation
B. Measuring clustering quality
C. Data encryption
D. Feature scaling
Correct Answer: B
Rationale: It measures how well clusters are separated.
Question 45
What is the curse of dimensionality?
A. Too little data
B. Performance degradation with many features
C. Faster training
D. Better accuracy
Correct Answer: B
Rationale: As the number of features grows, data become sparse and
distances between points become less informative, so models need far more
data to perform well.
Question 46
What is a Bayesian approach in statistics?
A. Ignoring probabilities
B. Updating beliefs with evidence
C. Random guessing
D. Data encryption
Correct Answer: B
Rationale: Bayesian methods update probability estimates with new data.
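The "updating beliefs with evidence" in Question 46 can be shown with the standard Beta-Binomial model for an unknown success rate. This is a minimal sketch: the uniform Beta(1, 1) prior, the observed counts, and the function names are all assumptions chosen for illustration.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Posterior Beta parameters after observing binomial outcomes."""
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Assumed prior: Beta(1, 1), i.e. every success rate equally plausible.
prior = (1, 1)
# Hypothetical evidence: 7 successes and 3 failures.
posterior = update_beta(*prior, successes=7, failures=3)

print(beta_mean(*prior))      # 0.5
print(posterior)              # (8, 4)
print(beta_mean(*posterior))  # 8 / 12, about 0.667
```

The estimate moves from 0.5 toward the observed rate of 0.7 but does not reach it, because the prior still carries some weight; more data would pull it closer.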

Question 47
What is a hyperparameter?
A. Model output
B. Parameter set before training
C. Dataset label
D. Prediction value
Correct Answer: B
Rationale: Hyperparameters are configured before training begins.
Question 48
What is grid search used for?
A. Data cleaning
B. Hyperparameter optimization
C. Feature deletion
D. Visualization
Correct Answer: B
Rationale: Grid search evaluates every combination in a predefined grid
of hyperparameter values and keeps the best one.
Question 49
What is stochastic gradient descent?
A. Batch processing
B. Optimization using random samples
C. Data visualization
D. Feature selection
Correct Answer: B
Rationale: SGD updates the model parameters using the gradient from a
randomly chosen sample or small batch of the data.
Question 50
What is a learning rate?
A. Data size
B. Step size in optimization
C. Model accuracy
D. Feature count
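Questions 49 and 50 fit together in a few lines of code: stochastic gradient descent picks one random sample per step, and the learning rate scales how far each step moves. The sketch below fits a single slope to noise-free hypothetical data; the function name, the seed, and the default values are assumptions for illustration only.

```python
import random


def sgd_fit_slope(xs, ys, learning_rate=0.01, steps=1000, seed=0):
    """Fit y = w * x by stochastic gradient descent on squared error."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w = 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))       # pick one random sample
        error = w * xs[i] - ys[i]
        gradient = 2.0 * error * xs[i]   # derivative of (w*x - y)**2 w.r.t. w
        w -= learning_rate * gradient    # step size set by the learning rate
    return w


# Hypothetical noise-free data generated with a true slope of 3.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0 * x for x in xs]
print(sgd_fit_slope(xs, ys))  # very close to 3.0
```

A smaller learning rate would approach the same slope more slowly, while a much larger one can overshoot the minimum instead of settling on it.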

B. Predicting future values over time
C. Data encryption
D. Feature scaling
Correct Answer: B
Rationale: Time series models predict sequential data trends.
Question 55
What is stationarity in time series?
A. Changing mean over time
B. Constant statistical properties over time
C. Random data only
D. Missing values
Correct Answer: B
Rationale: A stationary series has a mean, variance, and autocorrelation
structure that do not change over time.
Question 56
What is autocorrelation?
A. Random error
B. Correlation of a signal with itself over time
C. Data cleaning
D. Feature selection
Correct Answer: B
Rationale: Autocorrelation measures how strongly a series is related to
lagged copies of itself.
Question 57
What is the main purpose of deep learning?
A. Simple regression
B. Learning complex patterns using neural networks
C. Data storage
D. Manual labeling
Correct Answer: B
Rationale: Deep learning uses layered neural networks for complex tasks.
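The autocorrelation described in Question 56 can be computed directly from its usual sample definition: the covariance between the series and a lagged copy of itself, divided by the overall variance. The sketch below is a minimal version with a hypothetical alternating series; the function name is illustrative.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given non-negative lag."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mu = sum(series) / n
    denominator = sum((x - mu) ** 2 for x in series)
    if denominator == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    numerator = sum((series[t] - mu) * (series[t + lag] - mu)
                    for t in range(n - lag))
    return numerator / denominator


# Hypothetical series that flips sign at every step.
series = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
print(autocorrelation(series, 0))  # 1.0
print(autocorrelation(series, 1))  # -5/6, strongly negative
print(autocorrelation(series, 2))  # 4/6, positive again
```

The sign pattern across lags mirrors the alternation in the data: neighbouring values disagree, while values two steps apart agree.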

Question 58
What is a neural network?
A. Database system
B. Computational model inspired by the brain
C. File system
D. Query engine
Correct Answer: B
Rationale: Neural networks simulate interconnected neuron structures.
Question 59
What is backpropagation?
A. Forward data flow
B. Method for updating neural network weights
C. Data encryption
D. Clustering method
Correct Answer: B
Rationale: Backpropagation adjusts weights using error gradients.
Question 60
What is the primary goal of a data scientist?
A. Build hardware systems
B. Extract insights and build predictive models from data
C. Design UI systems
D. Manage servers only
Correct Answer: B
Rationale: Data scientists analyze data to generate insights and
predictions.
Question 61
A model performs well during cross-validation but fails in production
after deployment. What is the most likely cause?
A. Proper training procedure
B. Dataset shift (distribution drift)
C. Excess features
D. High training accuracy

Question 65
What is a key limitation of k-nearest neighbors (KNN)?
A. Requires no computation
B. Computationally expensive with large datasets
C. Works only on regression
D. No need for data
Correct Answer: B
Rationale: KNN slows significantly as dataset size increases.
Question 66
What is ensemble learning?
A. Single model training
B. Combining multiple models for better performance
C. Data cleaning method
D. Feature scaling technique
Correct Answer: B
Rationale: Ensemble methods improve accuracy by combining models.
Question 67
What is bagging in machine learning?
A. Sequential learning
B. Training models on random subsets and averaging results
C. Feature selection
D. Data encryption
Correct Answer: B
Rationale: Bagging reduces variance by averaging multiple models.
Question 68
What is boosting?
A. Parallel model training
B. Sequential learning focusing on errors
C. Data normalization
D. Random sampling only

Correct Answer: B
Rationale: Boosting improves performance by correcting previous errors.
Question 69
What is gradient boosting used for?
A. Image storage
B. Improving predictive accuracy using sequential models
C. Data encryption
D. Feature removal
Correct Answer: B
Rationale: Gradient boosting builds models sequentially to reduce error.
Question 70
What is a key risk of using too many features in a model?
A. Better generalization
B. Overfitting and noise introduction
C. Faster training
D. Simpler model
Correct Answer: B
Rationale: Excess features increase complexity and noise.
Question 71
What is the purpose of dimensionality reduction?
A. Increase dataset size
B. Reduce complexity while preserving information
C. Encrypt data
D. Add noise
Correct Answer: B
Rationale: It simplifies data while maintaining essential patterns.
Question 72
What is t-SNE used for?
A. Regression