Machine Learning Practice Exam: Linear Models to Deep Learning, Exams of Technology

A practice exam for machine learning, covering topics from linear models to deep learning. It consists of multiple-choice questions with detailed explanations, intended for students and professionals preparing for certification or reviewing key concepts. The questions cover supervised and unsupervised learning, the bias-variance trade-off, pandas, NumPy, scikit-learn, linear regression, regularization, logistic regression, k-nearest neighbors, support vector machines, decision trees, ensemble methods, k-means clustering, and DBSCAN.

Typology: Exams

2025/2026

Available from 12/20/2025

shilpi-jain-1

Machine Learning with Python from Linear Models
to Deep Learning Certificate Practice Exam
**Question 1.** Which of the following best describes the difference between supervised and
unsupervised learning?
A) Supervised learning uses labeled data, while unsupervised learning does not.
B) Supervised learning always produces a regression model, unsupervised always a clustering
model.
C) Unsupervised learning requires a test set, supervised learning does not.
D) Supervised learning can only be used for image data.
**Answer:** A
**Explanation:** Supervised learning algorithms are trained on input-output pairs (features
with labels). Unsupervised learning works with only input features, discovering structure
without explicit targets.
**Question 2.** In the bias-variance trade-off, increasing model complexity typically:
A) Increases bias and decreases variance.
B) Decreases bias and increases variance.
C) Decreases both bias and variance.
D) Increases both bias and variance.
**Answer:** B
**Explanation:** More complex models can fit training data better (lower bias) but become
more sensitive to fluctuations in the data (higher variance).
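
The trade-off in Question 2 can be illustrated numerically. The following is a minimal sketch, not part of the exam: it assumes a synthetic target `sin(2πx)`, Gaussian noise, and polynomial fits of degree 1 and 5 (all illustrative choices), and estimates squared bias and variance by refitting on many noisy copies of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
true_y = np.sin(2 * np.pi * x)  # assumed ground-truth function


def bias2_and_variance(degree, n_repeats=200, noise_sd=0.3):
    """Refit a polynomial on many noisy datasets and summarize its errors."""
    preds = np.empty((n_repeats, x.size))
    for i in range(n_repeats):
        y_noisy = true_y + rng.normal(0.0, noise_sd, size=x.size)
        coefs = np.polyfit(x, y_noisy, degree)
        preds[i] = np.polyval(coefs, x)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - true_y) ** 2)  # squared bias, averaged over x
    variance = np.mean(preds.var(axis=0))       # spread across refits
    return bias2, variance


simple_bias2, simple_var = bias2_and_variance(degree=1)
complex_bias2, complex_var = bias2_and_variance(degree=5)
print(f"degree 1: bias^2={simple_bias2:.4f}, variance={simple_var:.4f}")
print(f"degree 5: bias^2={complex_bias2:.4f}, variance={complex_var:.4f}")
```

Under these assumptions the degree-5 fit shows lower squared bias and higher variance than the straight line, matching answer B.
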
**Question 3.** Which Pandas method would you use to read an Excel file named *sales.xlsx*
into a DataFrame?
A) `pd.read_excel('sales.xlsx')`
B) `pd.read_csv('sales.xlsx')`
C) `pd.load_excel('sales.xlsx')`
D) `pd.open_excel('sales.xlsx')`
**Answer:** A
**Explanation:** `read_excel` is the dedicated function for loading Excel files into a DataFrame.
**Question 4.** Given a NumPy array `a = np.array([[1,2,3],[4,5,6]])`, what is the shape of `a[:,1:]`?
A) (2, 1)
B) (2, 2)
C) (1, 2)
D) (3, 2)
**Answer:** B
**Explanation:** `:` selects all rows, and `1:` selects columns 1 and 2 (zero-based). The resulting sub-array has 2 rows and 2 columns.
**Question 5.** Which of the following plots is most appropriate for visualizing the distribution of a single continuous variable?
A) Scatter plot
B) Histogram
C) Bar chart
D) Box plot
**Answer:** B
**Explanation:** A histogram aggregates continuous values into bins, revealing the shape of the distribution.
**Question 6.** In a correlation matrix, a value of 0.95 between two features suggests:
A) Strong positive linear relationship, likely multicollinearity.
B) Strong negative linear relationship, likely multicollinearity.
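
The slicing in Question 4 can be checked directly. This is a small sketch using the array from the question; the variable name `sub` is only illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

# ":" keeps every row; "1:" keeps columns from index 1 to the end.
sub = a[:, 1:]

print(sub.shape)  # (2, 2)
print(sub)
```
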

**Question 9.** When applying L2 regularization (Ridge), increasing the regularization strength $\lambda$ will:
A) Shrink coefficients toward zero but never set them exactly to zero.
B) Force some coefficients to become exactly zero.
C) Increase the model’s variance.
D) Remove the intercept term.
**Answer:** A
**Explanation:** Ridge adds $\lambda\sum \beta_j^2$ to the loss, penalizing large weights and shrinking them continuously; it does not produce sparsity.
**Question 10.** Which loss function is used for binary logistic regression?
A) Mean Squared Error
B) Hinge Loss
C) Cross-Entropy (Log Loss)
D) Absolute Error
**Answer:** C
**Explanation:** Logistic regression optimizes the negative log-likelihood, also called binary cross-entropy.
**Question 11.** In a multiclass classification problem tackled with the “one-vs-rest” strategy, how many binary classifiers are trained?
A) One per class (K classifiers)
B) One per pair of classes (K·(K-1)/2 classifiers)
C) Only a single classifier
D) Two classifiers regardless of K
**Answer:** A
**Explanation:** OvR builds one binary model for each class, treating that class as positive and all others as negative.
**Question 12.** Which metric is most appropriate when dealing with heavily imbalanced binary classification and the cost of false negatives is high?
A) Accuracy
B) Precision
C) Recall
D) F1-Score
**Answer:** C
**Explanation:** Recall (TP / (TP+FN)) emphasizes correctly identifying the positive class, crucial when missing positives is costly.
**Question 13.** The ROC curve plots:
A) Precision vs. Recall
B) True Positive Rate vs. False Positive Rate
C) Accuracy vs. Threshold
D) Loss vs. Epochs
**Answer:** B
**Explanation:** ROC visualizes the trade-off between sensitivity (TPR) and 1-specificity (FPR) across thresholds.
**Question 14.** In K-Nearest Neighbors, which distance metric would be most appropriate for high-dimensional sparse data?
A) Euclidean distance
B) Manhattan distance
C) Cosine similarity (converted to distance)
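
To see why Question 12 prefers recall over accuracy on imbalanced data, here is a small sketch in plain Python. The label counts (10 positives among 1000 samples) and the two hypothetical classifiers are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """TP / (TP + FN): share of actual positives that were found."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


# Assumed toy data: 10 positives followed by 990 negatives.
y_true = [1] * 10 + [0] * 990

# Hypothetical classifier 1: always predicts the majority class.
always_negative = [0] * 1000

# Hypothetical classifier 2: finds 8 of 10 positives, with 50 false alarms.
detector = [1] * 8 + [0] * 2 + [1] * 50 + [0] * 940

for name, y_pred in [("always negative", always_negative), ("detector", detector)]:
    print(name, "accuracy:", accuracy(y_true, y_pred), "recall:", recall(y_true, y_pred))
```

The majority-class predictor reaches 99% accuracy while finding no positives, whereas the second classifier has lower accuracy but a recall of 0.8.
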

B) $-0.7\log_2 0.7 - 0.3\log_2 0.3$
C) $0.7 \times 0.3$
D) $0.7 + 0.3$
**Answer:** A
**Explanation:** Gini = $1 - \sum p_i^2 = 1 - (0.49+0.09) = 0.42$.
**Question 18.** Which ensemble method reduces variance by averaging predictions of many decorrelated trees?
A) AdaBoost
B) Gradient Boosting
C) Random Forest
D) Stacking
**Answer:** C
**Explanation:** Random Forest builds many trees on bootstrapped samples with random feature subsets, then averages (regression) or votes (classification) to lower variance.
**Question 19.** In Gradient Boosting, each new tree is trained to predict:
A) The residual errors of the previous ensemble.
B) The original target variable directly.
C) Random noise to improve diversity.
D) The class probabilities from the previous tree.
**Answer:** A
**Explanation:** Boosting sequentially fits trees to the negative gradient of the loss (which equals the residuals for squared-error loss), correcting mistakes of earlier models.
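
The residual-fitting idea in Question 19 can be sketched with one-split trees (stumps) and squared-error loss. This is a simplified illustration, not a production implementation: the synthetic data, the learning rate of 0.5, and the helper names `fit_stump` and `predict_stump` are all assumptions.

```python
import numpy as np


def fit_stump(x, residual):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so both sides are non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 6.0, size=80))
y = np.sin(x) + rng.normal(0.0, 0.1, size=80)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residual = y - prediction  # negative gradient of squared-error loss
    stump = fit_stump(x, residual)
    prediction += learning_rate * predict_stump(stump, x)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"initial MSE: {mse_history[0]:.4f}, final MSE: {mse_history[-1]:.4f}")
```

Each round fits the current residuals rather than the original target, so the training error does not increase from one round to the next.
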

**Question 20.** Which hyperparameter of XGBoost primarily controls model complexity and helps prevent over-fitting?
A) `learning_rate` (eta)
B) `max_depth`
C) `subsample`
D) All of the above
**Answer:** D
**Explanation:** All listed parameters influence complexity: `learning_rate` shrinks updates, `max_depth` limits tree depth, and `subsample` reduces data per tree.
**Question 21.** The elbow method for K-Means clustering evaluates which of the following to determine the optimal number of clusters?
A) Silhouette coefficient
B) Within-cluster sum of squares (WCSS)
C) Between-cluster variance
D) Average distance to nearest neighbor
**Answer:** B
**Explanation:** Plotting WCSS vs. K often shows an “elbow” (or “knee”) where adding more clusters yields diminishing reduction in inertia.
**Question 22.** In hierarchical agglomerative clustering, the “linkage” parameter determines:
A) How many clusters are formed initially.
B) The distance metric used between points.
C) How the distance between clusters is computed (single, complete, average, ward).
D) Whether the algorithm runs bottom-up or top-down.
**Answer:** C
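
The elbow method from Question 21 can be sketched with a bare-bones K-Means in NumPy. The three synthetic blobs, the initialization from random data points, and the function name `kmeans_wcss` are assumptions made for this illustration; a real analysis would typically use a library implementation with several restarts.

```python
import numpy as np


def kmeans_wcss(points, k, n_iter=50, seed=0):
    """Run a basic Lloyd's algorithm and return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return float((distances.min(axis=1) ** 2).sum())


rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])  # assumed blob centers
points = np.vstack([c + rng.normal(0.0, 1.0, size=(50, 2)) for c in centers])

wcss = {k: kmeans_wcss(points, k) for k in range(1, 7)}
for k, value in wcss.items():
    print(k, round(value, 1))
```

Plotting these values against K and looking for the point where the decrease flattens is the elbow heuristic described in the explanation.
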

D) The number of retained components equals the original feature count.
**Answer:** A
**Explanation:** PCA components are orthogonal by construction, thus uncorrelated. Retaining fewer components introduces reconstruction error and may or may not help downstream tasks.
**Question 26.** t-SNE is primarily used for:
A) Reducing dimensionality for downstream modeling.
B) Visualizing high-dimensional data in 2-D or 3-D while preserving local structure.
C) Performing clustering directly.
D) Linear feature extraction.
**Answer:** B
**Explanation:** t-SNE maps high-dimensional points to a low-dimensional space, emphasizing local neighbor relationships for visualization.
**Question 27.** Which cross-validation strategy is most appropriate when the dataset is highly imbalanced and you need to preserve class distribution in each fold?
A) K-Fold
B) Stratified K-Fold
C) Leave-One-Out
D) Shuffle-Split
**Answer:** B
**Explanation:** Stratified K-Fold ensures each fold mirrors the overall class proportions, reducing variance in performance estimates for imbalanced data.
**Question 28.** In a hold-out validation scheme with a 70/30 train-test split, which of the following statements is correct?
A) The test set is used for hyperparameter tuning.
B) The model sees the test data during training.
C) The test set provides an unbiased estimate of final performance.
D) The training set must be further split into validation and test sets.
**Answer:** C
**Explanation:** The test set is kept completely unseen until after training and any hyperparameter tuning, providing an unbiased performance estimate.
**Question 29.** Grid Search differs from Random Search in hyperparameter optimization by:
A) Trying every possible combination within a predefined grid.
B) Sampling random combinations from a continuous distribution.
C) Being faster for high-dimensional spaces.
D) Using Bayesian inference.
**Answer:** A
**Explanation:** Grid Search exhaustively evaluates each point on a discretized hyperparameter grid, whereas Random Search samples randomly.
**Question 30.** Which Scikit-learn function can be used to split a dataset into training and test subsets with a fixed random state for reproducibility?
A) `train_test_split`
B) `cross_val_score`
C) `GridSearchCV`
D) `StratifiedShuffleSplit`
**Answer:** A
**Explanation:** `train_test_split` partitions arrays or matrices into random train and test subsets; the `random_state` argument ensures reproducibility.
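
The difference described in Question 29 can be sketched in plain Python. The `validation_score` function below is a hypothetical stand-in for a cross-validated score, and the parameter names `C` and `gamma` and their ranges are illustrative assumptions.

```python
import itertools
import math
import random


def validation_score(params):
    """Hypothetical score (higher is better), peaking at C=1 and gamma=0.1."""
    return -(math.log10(params["C"]) ** 2 + (math.log10(params["gamma"]) + 1.0) ** 2)


# Grid search: evaluate every combination on a predefined grid.
grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
names = list(grid)
grid_candidates = [dict(zip(names, values)) for values in itertools.product(*grid.values())]
best_grid = max(grid_candidates, key=validation_score)

# Random search: same budget, but each candidate is sampled log-uniformly.
rng = random.Random(0)
random_candidates = [
    {"C": 10 ** rng.uniform(-1, 1), "gamma": 10 ** rng.uniform(-2, 0)}
    for _ in range(len(grid_candidates))
]
best_random = max(random_candidates, key=validation_score)

print("grid search tried", len(grid_candidates), "points; best:", best_grid)
print("random search tried", len(random_candidates), "points; best:", best_random)
```

With two parameters of three values each, the grid holds 3 × 3 = 9 combinations; the number of grid points grows multiplicatively with each added parameter, while the random-search budget is chosen freely.
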

**Explanation:** One-hot encoding creates a binary column for each distinct category (unless you drop one to avoid collinearity).
**Question 34.** In a regression context, which metric is most robust to outliers?
A) Mean Squared Error (MSE)
B) Root Mean Squared Error (RMSE)
C) Mean Absolute Error (MAE)
D) R-squared
**Answer:** C
**Explanation:** MAE uses absolute differences, penalizing errors linearly, whereas MSE squares errors, giving disproportionate weight to outliers.
**Question 35.** The “learning curve” of a model plots:
A) Training and validation error vs. number of epochs or training size.
B) Feature importance vs. feature index.
C) Loss vs. regularization strength.
D) Accuracy vs. number of hidden layers.
**Answer:** A
**Explanation:** Learning curves show how error changes with more data or training iterations, helping diagnose bias-variance issues.
**Question 36.** Which of the following is a correct statement about the “softmax” function used in multiclass logistic regression?
A) It outputs values that sum to 1, representing a probability distribution over classes.
B) It is identical to the sigmoid function for binary classification.
C) It can only be applied to linear models without hidden layers.
D) It always produces sparse outputs (many zeros).
**Answer:** A
**Explanation:** Softmax exponentiates each class score and normalizes by the sum, yielding a valid probability distribution.
**Question 37.** In a Random Forest, the `max_features` hyperparameter controls:
A) The maximum depth of each tree.
B) The number of trees in the forest.
C) The number of features considered when looking for the best split.
D) The minimum number of samples required to split a node.
**Answer:** C
**Explanation:** `max_features` determines how many predictors are randomly selected at each split, promoting tree decorrelation.
**Question 38.** Which of the following statements about the “bagging” ensemble technique is true?
A) It builds models sequentially, each correcting its predecessor.
B) It reduces bias but not variance.
C) Each base learner is trained on a bootstrap sample of the data.
D) It can only be used with decision trees.
**Answer:** C
**Explanation:** Bagging (Bootstrap Aggregating) trains each model on a subset of the training data drawn randomly with replacement.
**Question 39.** When applying `StandardScaler` to a training set, you must:
A) Fit the scaler on the training data and transform both train and test with the same parameters.
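
The softmax computation from Question 36 can be written in a few lines of NumPy. This is a minimal sketch with made-up class scores; subtracting the maximum score is a common numerical-stability step that does not change the result.

```python
import numpy as np


def softmax(scores):
    """Exponentiate each score and normalize so the outputs sum to 1."""
    shifted = scores - np.max(scores)  # avoids overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


scores = np.array([2.0, 1.0, -1.0])  # assumed raw class scores
probs = softmax(scores)

print(probs)        # largest probability for the largest score
print(probs.sum())  # sums to 1 up to floating-point rounding
```

Every output is strictly positive, which is why option D (sparse outputs) does not describe softmax.
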

**Question 42.** Which of the following is a disadvantage of using a very high learning rate in gradient descent?
A) Convergence may be slower.
B) The algorithm may overshoot minima and diverge.
C) It guarantees reaching the global optimum.
D) It reduces computational cost per iteration.
**Answer:** B
**Explanation:** A large step size can cause the loss to bounce around or increase, preventing convergence.
**Question 43.** In Scikit-learn, the `Pipeline` class is useful because it:
A) Automatically selects the best algorithm.
B) Chains preprocessing steps with an estimator, ensuring they are applied consistently.
C) Performs hyperparameter optimization.
D) Generates synthetic data.
**Answer:** B
**Explanation:** `Pipeline` bundles transformers and a final estimator, applying them sequentially during `fit` and `predict`, which avoids data leakage.
**Question 44.** When using the `np.where` function on a boolean array, the result is:
A) Indices of True elements.
B) The original array unchanged.
C) A copy of the array with False values set to zero.
D) A flattened version of the array.
**Answer:** A
**Explanation:** `np.where(condition)` returns the tuple of indices where the condition holds true.
**Question 45.** Which of the following statements about the “bias” term in a linear model is correct?
A) It is multiplied by the input feature.
B) It allows the regression line to be shifted away from the origin.
C) It is always set to zero in scikit-learn.
D) It is equivalent to the regularization parameter.
**Answer:** B
**Explanation:** The bias (intercept) adds a constant offset, enabling the fitted line to intersect the y-axis at a non-zero value.
**Question 46.** The “curse of dimensionality” primarily affects which of the following algorithms?
A) Decision Trees
B) K-Nearest Neighbors
C) Linear Regression
D) Naïve Bayes
**Answer:** B
**Explanation:** K-NN relies on distance calculations; as dimensions increase, distances become less discriminative, degrading performance.
**Question 47.** In the context of model interpretability, SHAP values provide:
A) Global feature importance only.
B) Local explanations of how each feature contributed to a single prediction.
C) A method to reduce dimensionality.
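
The single-argument form of `np.where` from Question 44 can be tried on small arrays. The arrays below are made up for illustration.

```python
import numpy as np

mask = np.array([False, True, True, False, True])

result = np.where(mask)  # a tuple with one index array per dimension
indices = result[0]
print(indices)           # positions where mask is True

# For a 2-D condition the tuple holds row indices and column indices.
matrix = np.array([[0, 5], [7, 0]])
rows, cols = np.where(matrix > 0)
print(rows, cols)
```
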

A) `KNeighborsClassifier`
B) `LinearRegression`
C) `DecisionTreeClassifier`
D) `DecisionTreeRegressor`
**Answer:** None of the above – each estimator is specific to a task.
**Explanation:** Scikit-learn separates classifiers and regressors; you must choose the appropriate class for the problem type.
**Question 51.** When performing feature scaling with `MinMaxScaler`, the transformed value of a feature originally equal to the minimum of the training set becomes:
A) 0
B) 0.
C) 1
D) Depends on the feature’s variance.
**Answer:** A
**Explanation:** `MinMaxScaler` linearly maps the minimum to 0 and the maximum to 1.
**Question 52.** Which of the following is a valid reason to drop a feature before modeling?
A) It has a high Pearson correlation with the target.
B) It contains many missing values that cannot be reliably imputed.
C) It is a categorical variable with many levels.
D) It improves model interpretability, regardless of predictive power.
**Answer:** B
**Explanation:** Features with excessive missingness may introduce noise; dropping them can be safer than unreliable imputation.
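
The mapping in Question 51 can be reproduced by hand. This sketch reimplements the min-max formula in NumPy rather than calling scikit-learn, assumes the default target range of 0 to 1 with no clipping, and uses made-up feature values; the minimum and maximum come from the training data only.

```python
import numpy as np

train = np.array([10.0, 20.0, 15.0, 30.0])  # assumed training feature values
test = np.array([10.0, 35.0])               # assumed unseen values

# "Fit": learn the range from the training data only.
train_min, train_max = train.min(), train.max()


def min_max_scale(values):
    """Apply (x - min) / (max - min) using the training-set statistics."""
    return (values - train_min) / (train_max - train_min)


scaled_train = min_max_scale(train)
scaled_test = min_max_scale(test)

print(scaled_train)  # training minimum maps to 0, training maximum to 1
print(scaled_test)   # values outside the training range fall outside [0, 1]
```
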

**Question 53.** In the context of regularization, the term “shrinkage” refers to:
A) Reducing the number of training samples.
B) Decreasing the magnitude of model coefficients towards zero.
C) Removing features entirely.
D) Decreasing the learning rate during training.
**Answer:** B
**Explanation:** Regularization penalties (L1, L2) encourage smaller coefficient values, effectively “shrinking” them.
**Question 54.** Which of the following is a characteristic of the Lasso (L1) regularization?
A) It always improves model accuracy.
B) It can produce sparse models by setting some coefficients exactly to zero.
C) It penalizes the square of the coefficients.
D) It is equivalent to Ridge regression when λ → 0.
**Answer:** B
**Explanation:** The L1 penalty adds the sum of the absolute values of the coefficients to the loss, which can drive some of them exactly to zero (sparsity, i.e. implicit feature selection).
**Question 55.** In a binary classification problem with highly imbalanced classes, which technique can help improve model training?
A) Using accuracy as the loss function.
B) Oversampling the minority class (e.g., SMOTE).
C) Reducing the number of features.
D) Increasing the learning rate.
**Answer:** B
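
The contrast between L2 shrinkage (Questions 9 and 53) and L1 sparsity (Question 54) can be sketched in NumPy. The synthetic data and coefficient values are assumptions. The ridge part uses the closed-form solution without an intercept; the L1 part uses soft-thresholding, which is the Lasso solution only under the simplifying assumption of an orthonormal design.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_beta = np.array([3.0, -2.0, 0.5])  # assumed ground-truth coefficients
y = X @ true_beta + rng.normal(0.0, 0.5, size=100)


def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


def soft_threshold(beta, lam):
    """L1 shrinkage for an orthonormal design: reduce magnitudes, clip at zero."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)


lambdas = (0.0, 10.0, 100.0, 1000.0)
ridge_path = {lam: ridge_coefficients(X, y, lam) for lam in lambdas}
for lam, beta in ridge_path.items():
    print(f"ridge lambda={lam:>6}: {np.round(beta, 3)}")

lasso_like = soft_threshold(true_beta, lam=1.0)
print("soft-thresholded:", lasso_like)  # the smallest coefficient becomes exactly 0
```

In this setup the ridge coefficients shrink steadily as λ grows but stay non-zero, while soft-thresholding sets the smallest coefficient to exactly zero.
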