











































































This exam certifies end-to-end data science proficiency, covering data collection, preprocessing, exploratory data analysis, statistical modeling, machine learning, deep learning, and data visualization. It evaluates practical skills in working with structured and unstructured data, deploying models, interpreting results, and integrating data science solutions into real-world business environments.












































































Question 1. Which Hadoop component is primarily responsible for distributed storage?
A) MapReduce
B) YARN
C) HDFS
D) Hive
Answer: C
Explanation: HDFS (Hadoop Distributed File System) stores data across multiple nodes.

Question 2. In Spark, which abstraction represents an immutable distributed collection of objects?
A) DataFrame
B) RDD
C) Dataset
D) Table
Answer: B
Explanation: RDD (Resilient Distributed Dataset) is the core immutable collection in Spark.

Question 3. Which Python library is best suited for out‑of‑core data frames that do not fit into memory?
A) Pandas
B) NumPy
C) Dask
D) Matplotlib
Answer: C
Explanation: Dask provides parallel, out‑of‑core DataFrame operations.

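The out-of-core idea behind Question 3 can be illustrated without Dask itself. The sketch below is a stdlib-only stand-in, not Dask's API: it streams a CSV in fixed-size chunks so that only one chunk is held in memory at a time, then combines the per-chunk partial sums into a global mean. The names `iter_chunks` and `chunked_mean`, the chunk size, and the sample data are illustrative assumptions.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from an iterable of rows."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(file_obj, column, chunk_size=2):
    """Compute the mean of one numeric column, one chunk at a time."""
    reader = csv.DictReader(file_obj)
    total, count = 0.0, 0
    for chunk in iter_chunks(reader, chunk_size):
        # Only this chunk is materialized; partial results are combined.
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# Hypothetical sample data standing in for a file too large for memory.
sample = io.StringIO("id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n")
mean_value = chunked_mean(sample, "value")
print(mean_value)
```

A library such as Dask applies the same pattern (partition, compute partial results, combine) and can additionally run the partitions in parallel.
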
Question 4. When extracting data from a REST API, which HTTP method is typically used to retrieve resources?
A) POST
B) PUT
C) GET
D) DELETE
Answer: C
Explanation: GET requests retrieve data without modifying the server state.

Question 5. Which NoSQL database is column‑family based and optimized for write‑heavy workloads?
A) MongoDB
B) Cassandra
C) Redis
D) Neo4j
Answer: B
Explanation: Cassandra stores data in column families and excels at high‑throughput writes.

Question 6. In an ETL pipeline, which step is responsible for transforming raw data into a suitable format?
A) Extract
B) Load
C) Transform
D) Validate
Answer: C
Explanation: The Transform stage cleans, aggregates, and reshapes data.

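The three ETL stages named in Question 6 can be sketched end to end with the standard library. This is a minimal illustration under stated assumptions, not a production pipeline: the inline CSV, the `sales` table, and the function names `extract`, `transform`, and `load` are all hypothetical, and an in-memory SQLite database stands in for the target store.

```python
import csv
import io
import sqlite3

# Hypothetical raw input with inconsistent casing, padding, and a missing value.
RAW_CSV = """city,amount
 Paris ,10
paris,5
Berlin,7
Berlin,
"""


def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean values, drop incomplete rows, aggregate per city."""
    totals = {}
    for row in rows:
        city = row["city"].strip().title()
        amount = row["amount"].strip()
        if not amount:  # skip rows with a missing amount
            continue
        totals[city] = totals.get(city, 0) + int(amount)
    return sorted(totals.items())


def load(records, conn):
    """Load: write the transformed records into the target table."""
    conn.execute("CREATE TABLE sales (city TEXT PRIMARY KEY, total INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
result = conn.execute("SELECT city, total FROM sales ORDER BY city").fetchall()
print(result)
```

Keeping each stage as a separate function makes the Transform step testable on its own, independent of the source and the target.
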
A) t‑test
B) Chi‑square test
C) ANOVA
D) Mann‑Whitney U
Answer: C
Explanation: ANOVA (Analysis of Variance) assesses differences among multiple group means.

Question 11. A p‑value of 0.03 indicates what about the null hypothesis at α = 0.05?
A) Fail to reject
B) Reject
C) Inconclusive
D) Accept
Answer: B
Explanation: Since 0.03 < 0.05, the null hypothesis is rejected.

Question 12. Which metric is most appropriate for imbalanced binary classification where the positive class is rare?
A) Accuracy
B) ROC‑AUC
C) Precision
D) F1‑Score
Answer: D
Explanation: F1‑Score balances precision and recall, useful for rare‑class problems.

Question 13. In linear regression, the coefficient of determination (R²) measures what?
A) Average error magnitude
B) Proportion of variance explained
C) Correlation between predictors
D) Model complexity
Answer: B
Explanation: R² indicates how much of the dependent variable’s variance is explained by the model.

Question 14. Which loss function is minimized in ordinary least squares regression?
A) Mean Absolute Error
B) Huber loss
C) Mean Squared Error
D) Log‑loss
Answer: C
Explanation: OLS minimizes the sum of squared residuals, which is equivalent to minimizing MSE.

Question 15. Ridge regression adds which penalty term to the loss function?
A) L1 norm
B) L2 norm
C) Elastic net
D) No penalty
Answer: B
Explanation: Ridge uses an L2 penalty to shrink coefficients.

Question 16. Lasso regression can perform variable selection because its penalty can force some coefficients to become exactly what?
A) Negative
B) Zero

D) Sigmoid
Answer: C
Explanation: The RBF kernel maps inputs into higher‑dimensional space for non‑linear separation.

Question 20. K‑Nearest Neighbors classification predicts the class of a query point based on what?
A) Majority class among K closest training points
B) Weighted average of all training points
C) Decision boundary learned during training
D) Probability density estimation
Answer: A
Explanation: KNN assigns the majority label among the K nearest neighbors.

Question 21. Which clustering algorithm requires the number of clusters K to be specified beforehand?
A) DBSCAN
B) Hierarchical Agglomerative
C) K‑Means
D) Gaussian Mixture Model
Answer: C
Explanation: K‑Means partitions data into K predefined clusters.

Question 22. In Principal Component Analysis, the first principal component captures the greatest amount of what?
A) Noise
B) Correlation
C) Variance
D) Skewness
Answer: C
Explanation: It is the direction of maximum variance in the data.

Question 23. Which dimensionality‑reduction technique preserves local neighborhood structure rather than global variance?
A. PCA
B. LDA
C. t‑SNE
D. ICA
Answer: C
Explanation: t‑SNE emphasizes local similarities for visualization.

Question 24. In a convolutional neural network, the operation that reduces spatial dimensions while preserving depth is called?
A. Pooling
B. Striding
C. Padding
D. Activation
Answer: A
Explanation: Pooling (e.g., max‑pool) down‑samples feature maps.

Question 25. Which activation function suffers from vanishing gradients for large positive inputs?
A. ReLU
B. Sigmoid

D. Tokenization
Answer: C
Explanation: Stop‑word removal eliminates high‑frequency, low‑information tokens.

Question 29. In a Transformer architecture, the mechanism that allows each token to attend to all others is called?
A. Convolution
B. Recurrence
C. Self‑attention
D. Pooling
Answer: C
Explanation: Self‑attention computes pairwise interactions among tokens.

Question 30. Which evaluation metric is appropriate for multi‑class classification when classes are imbalanced?
A. Macro‑averaged F1
B. Micro‑averaged Accuracy
C. Weighted‑averaged Precision only
D. ROC‑AUC (binary only)
Answer: A
Explanation: Macro‑averaged F1 treats each class equally, mitigating imbalance effects.

Question 31. Which hyper‑parameter controls the maximum depth of a decision tree?
A. n_estimators
B. max_depth
C. learning_rate
D. min_samples_split
Answer: B
Explanation: max_depth limits how deep the tree can grow.

Question 32. In XGBoost, the term “boosting rounds” refers to what?
A. Number of trees grown sequentially
B. Number of features selected per tree
C. Number of data partitions
D. Number of GPU cores used
Answer: A
Explanation: Each boosting round adds a new tree to correct residuals.

Question 33. Which regularization technique penalizes the absolute sum of coefficients?
A. L2 (Ridge)
B. L1 (Lasso)
C. Elastic Net
D. Dropout
Answer: B
Explanation: L1 regularization (Lasso) adds a penalty proportional to the sum of the absolute coefficient values, Σ|w_i|.

Question 34. Bayesian Optimization differs from Grid Search primarily by?
A. Exhaustively evaluating all combinations
B. Randomly sampling hyper‑parameter space
C. Building a probabilistic model to select promising points
D. Using only integer hyper‑parameters
Answer: C

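The L1 and L2 penalties contrasted in Question 33 can be written down directly. The sketch below is a minimal illustration on plain Python lists; the function names and the regularization strength `lam` are hypothetical. `soft_threshold` is the standard one-dimensional minimizer for a squared loss plus an L1 penalty and shows why Lasso can set a coefficient to exactly zero, whereas `ridge_shrink` (the one-dimensional minimizer with an L2 penalty, a simplifying assumption made here) only scales the coefficient toward zero.

```python
def l1_penalty(weights, lam):
    """L1 (Lasso) penalty: lam times the sum of absolute coefficients."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 (Ridge) penalty: lam times the sum of squared coefficients."""
    return lam * sum(w * w for w in weights)


def soft_threshold(z, lam):
    """Minimizer over w of 0.5*(w - z)**2 + lam*abs(w)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set to exactly zero


def ridge_shrink(z, lam):
    """Minimizer over w of 0.5*(w - z)**2 + 0.5*lam*w**2."""
    return z / (1.0 + lam)


# Hypothetical unpenalized coefficients.
coefs = [2.0, -0.3, 0.1]
lasso_like = [soft_threshold(z, 0.5) for z in coefs]
ridge_like = [ridge_shrink(z, 0.5) for z in coefs]
print(lasso_like)
print(ridge_like)
```

With these inputs the two small coefficients vanish under the L1 step, while every coefficient stays nonzero under the L2 step.
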
D. Gain chart
Answer: B
Explanation: The ROC (Receiver Operating Characteristic) curve visualizes TPR vs. FPR.

Question 38. In K‑Fold cross‑validation, how many folds are typically used to balance bias and variance?
A. 2
B. 5
C. 10
D. 20
Answer: C
Explanation: 10‑fold CV is a common compromise between training data size and estimate stability.

Question 39. Which metric is NOT appropriate for regression problems?
A. RMSE
B. MAE
C. R²
D. Accuracy
Answer: D
Explanation: Accuracy applies to classification, not continuous predictions.

Question 40. Which Python library provides the “Prefect” workflow orchestration tool?
A. airflow
B. prefect
C. luigi
D. dagster
Answer: B
Explanation: Prefect is a modern orchestration framework for data pipelines.

Question 41. In MongoDB, which data structure is used to store documents?
A. Table
B. Collection
C. Bucket
D. Index
Answer: B
Explanation: Collections group BSON documents similar to tables in RDBMS.

Question 42. Which SQL clause is used to remove duplicate rows from a query result?
A. WHERE
B. GROUP BY
C. DISTINCT
D. HAVING
Answer: C
Explanation: DISTINCT filters out duplicate records.

Question 43. Which Spark operation results in a shuffle across the cluster?
A. map
B. filter
C. reduceByKey
D. take
Answer: C
Explanation: reduceByKey requires data with the same key to be moved to the same executor.

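The shuffle described in Question 43 can be mimicked in pure Python. The sketch below is a simulation, not the PySpark API: lists stand in for partitions, the hypothetical `reduce_by_key` function combines values locally, routes each key to a single reducer by hash (the step that corresponds to the shuffle), and then merges the partial results.

```python
from collections import defaultdict


def reduce_by_key(partitions, func, num_reducers=2):
    """Simulate reduceByKey over a list of partitions of (key, value) pairs."""
    # Stage 1: combine values per key inside each partition (no data movement).
    combined = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        combined.append(local)

    # Shuffle: route every key to one reducer so equal keys end up together.
    shuffled = [defaultdict(list) for _ in range(num_reducers)]
    for local in combined:
        for key, value in local.items():
            shuffled[hash(key) % num_reducers][key].append(value)

    # Stage 2: each reducer merges the partial results for its keys.
    result = {}
    for reducer in shuffled:
        for key, values in reducer.items():
            acc = values[0]
            for value in values[1:]:
                acc = func(acc, value)
            result[key] = acc
    return result


# Hypothetical data: two partitions that share the key "b".
partitions = [[("a", 1), ("b", 2), ("a", 3)], [("b", 4), ("c", 5)]]
totals = reduce_by_key(partitions, lambda x, y: x + y)
print(totals)
```

Operations such as map and filter need no such routing step because each output element depends only on one input element in the same partition.
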
Question 47. Which method reduces dimensionality by maximizing class separability?
A. PCA
B. LDA
C. ICA
D. NMF
Answer: B
Explanation: Linear Discriminant Analysis (LDA) seeks axes that best separate classes.

Question 48. Which activation function outputs values in the range (‑1, 1)?
A. ReLU
B. Sigmoid
C. Tanh
D. Softmax
Answer: C
Explanation: Tanh is a scaled sigmoid centered at zero.

Question 49. In the context of neural networks, “dropout” is used to mitigate what?
A. Vanishing gradients
B. Overfitting
C. Underfitting
D. Data leakage
Answer: B
Explanation: Dropout randomly deactivates neurons during training to improve generalization.

Question 50. Which optimizer adapts learning rates per parameter based on past gradients?
B. Momentum
C. Adam
D. RMSprop
Answer: C
Explanation: Adam combines momentum and adaptive learning rates for each weight.

Question 51. Which metric is best for ranking models in information retrieval tasks?
A. Accuracy
B. Mean Reciprocal Rank (MRR)
C. F1‑Score
D. ROC‑AUC
Answer: B
Explanation: MRR evaluates the position of the first relevant item in a ranked list.

Question 52. In a time‑series split, which technique preserves temporal order while creating train/test folds?
A. Random shuffle split
B. Stratified K‑Fold
C. TimeSeriesSplit
D. Group K‑Fold
Answer: C
Explanation: TimeSeriesSplit respects chronological ordering.

Question 53. Which library implements the “LightGBM” gradient boosting framework?
A. scikit‑learn

B. Binary cross‑entropy
C. Categorical cross‑entropy
D. Mean Squared Error
Answer: B
Explanation: Binary cross‑entropy penalizes deviation between predicted probability and true label.

Question 57. In a convolutional layer, the “stride” parameter controls what?
A. Number of filters
B. Filter size
C. Step size of sliding window
D. Padding amount
Answer: C
Explanation: Stride determines how many pixels the filter moves each step.

Question 58. Which technique converts categorical variables into binary vectors?
A. Label encoding
B. One‑hot encoding
C. Ordinal encoding
D. Frequency encoding
Answer: B
Explanation: One‑hot creates a separate binary column for each category.

Question 59. Which metric quantifies the average absolute difference between predicted and actual values?
A. MAE
B. MSE
Answer: A
Explanation: Mean Absolute Error (MAE) averages absolute residuals.

Question 60. Which model is a probabilistic graphical model that assumes conditional independence among features given the class?
A. Decision Tree
B. Naïve Bayes
C. SVM
D. K‑Means
Answer: B
Explanation: Naïve Bayes applies Bayes' theorem with a strong independence assumption.

Question 61. Which Python package provides the “Plotly” interactive visualization library?
A. matplotlib
B. seaborn
C. plotly
D. bokeh
Answer: C
Explanation: Plotly enables interactive, web‑based charts.

Question 62. In Spark SQL, which command registers a DataFrame as a temporary view?
A. createOrReplaceTempView()
B. registerTempTable()
C. cacheTable()