Machine Learning Foundations: Practice Exam Questions, Exams of Technology

A set of practice exam questions for machine learning foundations, covering key concepts such as supervised vs. unsupervised learning, model evaluation metrics, regularization techniques, and clustering algorithms. Each question is accompanied by a detailed explanation of the correct answer, making it a valuable resource for students and professionals preparing for certification or seeking to reinforce their understanding of machine learning principles. The questions cover a range of topics, including linear regression, logistic regression, and recommender systems, offering a comprehensive review of fundamental machine learning concepts. This practice exam is designed to test and enhance your knowledge of machine learning foundations, providing a solid preparation tool for exams and real-world applications. It includes questions on bias-variance tradeoff, cross-validation, and text representation techniques.

Typology: Exams

2025/2026

Available from 12/20/2025

shilpi-jain-1

Machine Learning Foundations: A Case Study Approach Certificate Practice Exam
**Question 1.** Which of the following best describes the primary difference between
supervised and unsupervised learning?
A) Supervised learning uses labeled data, while unsupervised learning does not.
B) Supervised learning requires reinforcement signals, while unsupervised learning does not.
C) Supervised learning can only be used for classification, while unsupervised learning can only
be used for regression.
D) Supervised learning always yields higher accuracy than unsupervised learning.
Answer: A
Explanation: Supervised learning algorithms are trained on input‑output pairs (labels), whereas
unsupervised algorithms discover structure from unlabeled data.
**Question 2.** In the typical machine‑learning pipeline, which step comes immediately after
feature engineering?
A) Model deployment
B) Data acquisition
C) Model training
D) Hyperparameter tuning
Answer: C
Explanation: Once features are constructed, the next logical step is to train a model using those
features.
**Question 3.** Which loss function is most commonly used for training a linear regression
model?
A) Hinge loss
B) Cross‑entropy loss
C) Squared error loss
D) Absolute error loss
Answer: C
Explanation: Linear regression minimizes the sum of squared differences between predicted and actual values.
**Question 4.** When evaluating a regression model, the Root Mean Squared Error (RMSE) is preferred over MAE because RMSE:
A) Is less sensitive to outliers.
B) Penalizes larger errors more heavily.
C) Is always smaller than MAE.
D) Does not require square rooting.
Answer: B
Explanation: RMSE squares the errors before averaging, giving higher weight to large deviations.
**Question 5.** In a multiple linear regression model, adding a polynomial feature (e.g., square footage²) primarily helps to:
A) Reduce multicollinearity.
B) Capture non‑linear relationships.
C) Decrease model complexity.
D) Eliminate the need for regularization.
Answer: B
Explanation: Polynomial terms allow the linear model to fit curved patterns in the data.
**Question 6.** Which of the following is a symptom of overfitting in a regression model?
A) Training error and test error are both high.
B) Training error is low but test error is high.
C) Training error is higher than test error.

B) L1 can set coefficients exactly to zero, L2 cannot.
C) L2 always improves model accuracy more than L1.
D) L1 is only applicable to classification problems.
Answer: B
Explanation: L1 adds the absolute value of coefficients to the loss, encouraging sparsity (exact zeros); L2 adds the squared magnitude, shrinking but not eliminating coefficients.
**Question 10.** Which metric is most appropriate for evaluating a binary sentiment classifier when the classes are highly imbalanced?
A) Accuracy
B) Mean Squared Error
C) Area Under the ROC Curve (AUC)
D) R² score
Answer: C
Explanation: AUC measures the ability of the classifier to rank positive instances higher than negatives, independent of class prevalence.
**Question 11.** In logistic regression, the decision boundary is defined by:
A) A hyperplane where the predicted probability equals 0.5.
B) The set of points with maximum Euclidean distance from the origin.
C) The line that maximizes the margin between classes.
D) The point where the gradient of the loss is zero.
Answer: A
Explanation: Logistic regression predicts probabilities; the threshold 0.5 yields the separating hyperplane.
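
As a rough illustration of the explanation to Question 11, the sketch below checks that a logistic model's predicted probability equals 0.5 exactly where the linear score w·x + b is zero, which is the separating hyperplane. The weights `w`, the bias `b`, and the three sample points are made-up values chosen for illustration; they do not come from the exam.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability of the positive class for one point x under weights w and bias b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(score)


# Hypothetical parameters, assumed only for this illustration.
w = [2.0, -1.0]
b = -1.0

on_boundary = [1.0, 1.0]    # score = 2*1 - 1*1 - 1 = 0
positive_side = [2.0, 1.0]  # score = 2*2 - 1*1 - 1 = 2
negative_side = [0.0, 1.0]  # score = 2*0 - 1*1 - 1 = -2

p_boundary = predict_proba(w, b, on_boundary)
p_positive = predict_proba(w, b, positive_side)
p_negative = predict_proba(w, b, negative_side)

print(p_boundary, p_positive, p_negative)
```

Points with a positive score land above 0.5 and points with a negative score land below it. Choosing a threshold other than 0.5 moves the boundary to a parallel hyperplane rather than changing its orientation.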

**Question 12.** Which of the following text representations treats each document as a vector of word occurrence counts?
A) TF‑IDF
B) Word2Vec embeddings
C) Bag‑of‑Words (Count Vector)
D) Latent Semantic Indexing
Answer: C
Explanation: Bag‑of‑Words records raw frequency of each token in a document, ignoring order.
**Question 13.** In a confusion matrix for binary classification, the term “False Positive” refers to:
A) An instance correctly predicted as negative.
B) An instance incorrectly predicted as positive.
C) An instance correctly predicted as positive.
D) An instance incorrectly predicted as negative.
Answer: B
Explanation: A false positive occurs when the model predicts the positive class but the true label is negative.
**Question 14.** Which curve visualizes the trade‑off between true positive rate and false positive rate across different classification thresholds?
A) Precision‑Recall curve
B) Learning curve
C) ROC curve
D) Calibration curve
Answer: C
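
The trade-off named in Question 14 can be made concrete with a small sketch. It sweeps a threshold over hypothetical classifier scores, records the false positive rate and true positive rate at each step, and then estimates the area under the resulting curve with the trapezoidal rule. The labels, the scores, and the helper names `roc_points` and `auc` are assumptions for illustration only.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs, starting at (0, 0), one per distinct threshold.

    Thresholds are taken from the scores themselves, from strictest to loosest;
    a point is predicted positive when its score is >= the threshold.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels (1 = positive) and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
area = auc(points)
print(points)
print(area)
```

Lowering the threshold can only add predicted positives, so both rates move from (0, 0) toward (1, 1). In this toy data one negative outranks one positive, which is why the area comes out slightly below 1.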

D) The distance metric used for high‑dimensional data.
Answer: A
Explanation: Voronoi cells partition space such that each point belongs to the region of its nearest centroid.
**Question 18.** Which of the following statements about hierarchical clustering is true?
A) It requires specifying the number of clusters beforehand.
B) It always produces a partitioning identical to k‑means.
C) It creates a dendrogram representing nested cluster relationships.
D) It cannot handle non‑Euclidean distance metrics.
Answer: C
Explanation: Hierarchical clustering builds a tree (dendrogram) showing how clusters merge or split at various levels.
**Question 19.** Latent Dirichlet Allocation (LDA) is primarily used for:
A) Predicting continuous outcomes.
B) Dimensionality reduction of image data.
C) Topic modeling in collections of documents.
D) Clustering numeric time‑series data.
Answer: C
Explanation: LDA discovers latent topics in a corpus by modeling each document as a mixture of topics.
**Question 20.** Locality‑Sensitive Hashing (LSH) is advantageous for nearest‑neighbor search because it:
A) Guarantees exact nearest neighbors.
B) Reduces dimensionality to a single scalar.
C) Provides sub‑linear query time for high‑dimensional data.
D) Requires no preprocessing of the dataset.
Answer: C
Explanation: LSH hashes similar items into the same buckets, enabling fast approximate neighbor retrieval.
**Question 21.** In the context of recommender systems, collaborative filtering relies on:
A) Content attributes of items only.
B) User‑item interaction patterns.
C) Demographic data of users.
D) Pre‑trained deep neural networks.
Answer: B
Explanation: Collaborative filtering predicts preferences based on observed user‑item rating matrices.
**Question 22.** The “cold start” problem in recommender systems refers to difficulty in:
A) Scaling the algorithm to millions of users.
B) Recommending items to new users or recommending new items.
C) Computing similarity between items.
D) Updating the model in real time.
Answer: B
Explanation: With no historical interactions, the system lacks data to generate personalized recommendations.
**Question 23.** Matrix factorization in collaborative filtering seeks to:

**Question 26.** In deep learning, the term “transfer learning” refers to:
A) Training a model from scratch on a new dataset.
B) Using a pre‑trained network as a fixed feature extractor for a new task.
C) Sharing weights between two unrelated models.
D) Converting a convolutional network into a recurrent network.
Answer: B
Explanation: Transfer learning leverages knowledge learned on a source task (e.g., ImageNet) to improve performance on a target task.
**Question 27.** Which activation function is most commonly used in hidden layers of deep neural networks for image tasks?
A) Sigmoid
B) Tanh
C) ReLU (Rectified Linear Unit)
D) Softmax
Answer: C
Explanation: ReLU mitigates vanishing gradients and is computationally efficient, making it popular for deep vision models.
**Question 28.** In a convolutional neural network (CNN), the operation that reduces spatial dimensions while retaining important features is called:
A) Convolution
B) Pooling
C) Normalization
D) Dropout
Answer: B
Explanation: Pooling (e.g., max‑pool) aggregates nearby activations, decreasing resolution and providing translational invariance.
**Question 29.** When using deep features extracted from a pre‑trained model for image retrieval, the similarity between two images is typically computed using:
A) Cosine similarity on the feature vectors.
B) Euclidean distance on raw pixel values.
C) Hamming distance on binary hash codes.
D) Jaccard index on image tags.
Answer: A
Explanation: Deep feature vectors are high‑dimensional real numbers; cosine similarity measures angular similarity effectively.
**Question 30.** Which of the following best describes the “black‑box” view of a machine‑learning algorithm?
A) The internal parameters are fully interpretable.
B) The algorithm’s internal workings are ignored during early experimentation.
C) The model is always a linear regression.
D) The algorithm requires no data preprocessing.
Answer: B
Explanation: Treating a model as a black box means focusing on inputs and outputs without inspecting inner mechanisms initially.
**Question 31.** In data preprocessing, why is it important to split the dataset into training and test sets before feature scaling?
A) Scaling must be performed only on the test set.

**Question 34.** Which regularization technique is most suitable when you suspect that only a few features are truly predictive?
A) Ridge (L2) regularization
B) Lasso (L1) regularization
C) Elastic Net (L1 + L2) regularization
D) No regularization
Answer: B
Explanation: L1 regularization tends to zero out irrelevant coefficients, performing feature selection.
**Question 35.** In a support vector machine (SVM) for binary classification, the margin is defined as:
A) The distance between the two closest points of opposite classes.
B) The sum of distances from all points to the decision boundary.
C) The number of support vectors.
D) The angle between the weight vector and the bias term.
Answer: A
Explanation: The margin is the width of the gap between the nearest points (support vectors) of each class; SVM maximizes this margin.
**Question 36.** Which kernel function allows an SVM to learn non‑linear decision boundaries by implicitly mapping data into a higher‑dimensional space?
A) Linear kernel
B) Polynomial kernel
C) RBF (Gaussian) kernel
D) Both B and C
Answer: D
Explanation: Both polynomial and radial basis function kernels compute inner products in transformed spaces, enabling non‑linear separation.
**Question 37.** In a decision tree, the impurity measure commonly used for classification splits is:
A) Mean Squared Error
B) Gini impurity
C) Silhouette score
D) Adjusted R²
Answer: B
Explanation: Gini impurity quantifies class mixing within a node; minimizing it leads to purer splits.
**Question 38.** Random Forests improve over a single decision tree primarily by:
A) Using deeper trees.
B) Averaging predictions over many decorrelated trees.
C) Applying gradient descent to leaf nodes.
D) Pruning each tree to a fixed depth.
Answer: B
Explanation: Bagging (bootstrap sampling) and feature randomness decorrelate trees; averaging reduces variance.
**Question 39.** Gradient boosting differs from random forests in that it:
A) Trains trees sequentially, each correcting errors of the previous.
B) Uses only linear base learners.
C) Does not require any hyperparameter tuning.

C) Convert documents into binary vectors.
D) Encode semantic similarity between words.
Answer: B
Explanation: IDF down‑weights terms that are frequent across the corpus, emphasizing discriminative words.
**Question 43.** Which distance metric is most appropriate for comparing two TF‑IDF vectors?
A) Euclidean distance
B) Manhattan distance
C) Cosine similarity (converted to distance)
D) Hamming distance
Answer: C
Explanation: Cosine similarity measures the angle between high‑dimensional sparse vectors, effectively capturing similarity of term distributions.
**Question 44.** The “curse of dimensionality” primarily affects which type of algorithm?
A) Linear regression
B) k‑Nearest Neighbors
C) Decision trees
D) Naïve Bayes
Answer: B
Explanation: As dimensionality grows, distances between points become less informative, degrading k‑NN performance.
**Question 45.** In a Gaussian Mixture Model (GMM), each component represents:
A) A deterministic cluster centroid.
B) A probability distribution with its own mean and covariance.
C) A hard assignment of points to clusters.
D) A decision tree leaf.
Answer: B
Explanation: GMMs model data as a weighted sum of Gaussian distributions, each defined by its own parameters.
**Question 46.** Which of the following best describes the purpose of dropout in deep neural networks?
A) To increase the number of parameters.
B) To prevent overfitting by randomly omitting neurons during training.
C) To accelerate inference time.
D) To convert a CNN into a recurrent network.
Answer: B
Explanation: Dropout randomly deactivates a subset of units each iteration, forcing the network to learn redundant representations.
**Question 47.** When fine‑tuning a pre‑trained convolutional network for a new classification task, the most common practice is to:
A) Freeze all layers and only train the final softmax layer.
B) Retrain the entire network from random initialization.
C) Freeze early layers and fine‑tune later layers plus the classifier head.
D) Convert all convolutional layers to fully connected layers.
Answer: C
Explanation: Early layers capture generic features; keeping them fixed preserves learned low‑level patterns while adapting higher layers to the new domain.
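
A minimal sketch of the dropout idea from Question 46, using only the standard library. It assumes the common "inverted dropout" variant, in which surviving activations are divided by the keep probability during training so that no rescaling is needed at inference; the exam text does not specify a variant. The function name `dropout` and the sample activations are hypothetical.

```python
import random


def dropout(activations, p_drop, rng, training=True):
    """Apply inverted dropout to a list of activations.

    During training each unit is zeroed with probability p_drop and the
    survivors are scaled by 1 / (1 - p_drop). At inference the input is
    returned unchanged.
    """
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)   # fixed seed so the run is repeatable
acts = [1.0] * 10        # hypothetical layer output

train_out = dropout(acts, 0.5, rng)                  # some units zeroed, the rest scaled to 2.0
eval_out = dropout(acts, 0.5, rng, training=False)   # identical to the input

print(train_out)
print(eval_out)
```

Because a different random subset of units is silenced on each training pass, no single unit can be relied on, which is the redundancy effect the explanation describes.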

Explanation: A policy defines the agent’s behavior by specifying which action to take in each possible state.
**Question 51.** Which of the following is NOT a typical step in the data acquisition phase of a machine‑learning project?
A) Collecting raw logs from sensors.
B) Splitting data into training and test sets.
C) Accessing public datasets via APIs.
D) Performing data cleaning and deduplication.
Answer: B
Explanation: Data splitting is part of preprocessing/validation, not acquisition.
**Question 52.** When converting categorical variables into numeric form, one‑hot encoding is preferred over label encoding when:
A) The categorical variable is ordinal.
B) The model can interpret integer values as ordered.
C) The variable has many distinct categories and the model is tree‑based.
D) The categorical levels have no intrinsic ordering.
Answer: D
Explanation: One‑hot encoding avoids imposing an artificial order on nominal categories.
**Question 53.** In a regression model, the coefficient associated with a feature represents:
A) The change in the target variable for a one‑unit increase in that feature, holding others constant.
B) The probability that the feature is important.
C) The correlation between the feature and the residuals.
D) The p‑value of the feature’s statistical significance.
Answer: A
Explanation: Linear regression coefficients quantify the marginal effect of each predictor.
**Question 54.** Which of the following techniques can be used to detect and mitigate overfitting in a deep neural network?
A) Increasing the number of hidden layers.
B) Using a larger learning rate.
C) Early stopping based on validation loss.
D) Removing the activation functions.
Answer: C
Explanation: Early stopping halts training when validation performance stops improving, preventing the model from fitting noise.
**Question 55.** The “bias term” (intercept) in a linear model is necessary because:
A) It controls the regularization strength.
B) It allows the decision boundary to shift away from the origin.
C) It determines the learning rate.
D) It normalizes the input features.
Answer: B
Explanation: The intercept enables the model to fit data where the optimal hyperplane does not pass through the origin.
**Question 56.** In the context of feature scaling, which method preserves the original distribution shape of a feature?
A) Min‑max normalization