




















































































A set of practice exam questions for machine learning foundations, covering key concepts such as supervised vs. unsupervised learning, model evaluation metrics, regularization techniques, and clustering algorithms. Each question is accompanied by an explanation of the correct answer, making it a useful resource for students and professionals preparing for certification or reinforcing their understanding of machine learning principles. Topics include linear regression, logistic regression, recommender systems, the bias-variance tradeoff, cross-validation, and text representation techniques.

Question 1. Which of the following best describes the primary difference between supervised and unsupervised learning?
A) Supervised learning uses labeled data, while unsupervised learning does not.
B) Supervised learning requires reinforcement signals, while unsupervised learning does not.
C) Supervised learning can only be used for classification, while unsupervised learning can only be used for regression.
D) Supervised learning always yields higher accuracy than unsupervised learning.
Answer: A
Explanation: Supervised learning algorithms are trained on input‑output pairs (labels), whereas unsupervised algorithms discover structure from unlabeled data.

Question 2. In the typical machine‑learning pipeline, which step comes immediately after feature engineering?
A) Model deployment
B) Data acquisition
C) Model training
D) Hyperparameter tuning
Answer: C
Explanation: Once features are constructed, the next logical step is to train a model using those features.

Question 3. Which loss function is most commonly used for training a linear regression model?
A) Hinge loss
B) Cross‑entropy loss
C) Squared error loss
D) Absolute error loss
Answer: C
Explanation: Linear regression minimizes the sum of squared differences between predicted and actual values.

Question 4. When evaluating a regression model, the Root Mean Squared Error (RMSE) is preferred over MAE because RMSE:
A) Is less sensitive to outliers.
B) Penalizes larger errors more heavily.
C) Is always smaller than MAE.
D) Does not require square rooting.
Answer: B
Explanation: RMSE squares the errors before averaging, giving higher weight to large deviations.

Question 5. In a multiple linear regression model, adding a polynomial feature (e.g., square footage²) primarily helps to:
A) Reduce multicollinearity.
B) Capture non‑linear relationships.
C) Decrease model complexity.
D) Eliminate the need for regularization.
Answer: B
Explanation: Polynomial terms allow the linear model to fit curved patterns in the data.

Question 6. Which of the following is a symptom of overfitting in a regression model?
A) Training error and test error are both high.
B) Training error is low but test error is high.
C) Training error is higher than test error.

B) L1 can set coefficients exactly to zero, L2 cannot.
C) L2 always improves model accuracy more than L1.
D) L1 is only applicable to classification problems.
Answer: B
Explanation: L1 adds the absolute value of coefficients to the loss, encouraging sparsity (exact zeros); L2 adds the squared magnitude, shrinking but not eliminating coefficients.

Question 10. Which metric is most appropriate for evaluating a binary sentiment classifier when the classes are highly imbalanced?
A) Accuracy
B) Mean Squared Error
C) Area Under the ROC Curve (AUC)
D) R² score
Answer: C
Explanation: AUC measures the ability of the classifier to rank positive instances higher than negatives, independent of class prevalence.

Question 11. In logistic regression, the decision boundary is defined by:
A) A hyperplane where the predicted probability equals 0.5.
B) The set of points with maximum Euclidean distance from the origin.
C) The line that maximizes the margin between classes.
D) The point where the gradient of the loss is zero.
Answer: A
Explanation: Logistic regression predicts probabilities; the threshold 0.5 yields the separating hyperplane.
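
The boundary described in Question 11 can be checked numerically. The following is a minimal sketch, assuming a two-feature model with hand-picked weights and bias (the values and the helper name `predict_proba` are illustrative, not taken from the exam): the sigmoid returns exactly 0.5 where the linear score w·x + b is zero, and moves above or below 0.5 on either side of that hyperplane.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Predicted probability of the positive class for one example."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)


# Hypothetical parameters chosen only for illustration.
weights = [2.0, -1.0]
bias = 0.5

on_boundary = [1.0, 2.5]    # 2*1.0 - 1*2.5 + 0.5 = 0.0
positive_side = [2.0, 1.0]  # 2*2.0 - 1*1.0 + 0.5 = 3.5
negative_side = [0.0, 3.0]  # 2*0.0 - 1*3.0 + 0.5 = -2.5

p_boundary = predict_proba(weights, bias, on_boundary)
p_positive = predict_proba(weights, bias, positive_side)
p_negative = predict_proba(weights, bias, negative_side)

print(p_boundary, p_positive, p_negative)
```

Points with a positive score get a probability above 0.5 and points with a negative score get one below 0.5, so thresholding at 0.5 is the same as checking the sign of w·x + b.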

Question 12. Which of the following text representations treats each document as a vector of word occurrence counts?
A) TF‑IDF
B) Word2Vec embeddings
C) Bag‑of‑Words (Count Vector)
D) Latent Semantic Indexing
Answer: C
Explanation: Bag‑of‑Words records raw frequency of each token in a document, ignoring order.

Question 13. In a confusion matrix for binary classification, the term “False Positive” refers to:
A) An instance correctly predicted as negative.
B) An instance incorrectly predicted as positive.
C) An instance correctly predicted as positive.
D) An instance incorrectly predicted as negative.
Answer: B
Explanation: A false positive occurs when the model predicts the positive class but the true label is negative.

Question 14. Which curve visualizes the trade‑off between true positive rate and false positive rate across different classification thresholds?
A) Precision‑Recall curve
B) Learning curve
C) ROC curve
D) Calibration curve
Answer: C
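
The curve named in Question 14 can be traced by hand on a small example. The sketch below, assuming six made-up labels and classifier scores (the data and the function names `roc_points` and `auc` are illustrative), sweeps the threshold from strictest to loosest, records the false positive rate and true positive rate at each step, and integrates the resulting curve with the trapezoid rule.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs, one per distinct threshold, strictest first.

    Assumes labels are 0/1 and that both classes are present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve via the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up data for illustration: 3 positives, 3 negatives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

curve = roc_points(labels, scores)
area = auc(curve)
print(curve)
print(area)
```

On this data the area comes out to 8/9, which matches the ranking view of AUC: of the nine positive-negative pairs, eight have the positive scored higher.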

D) The distance metric used for high‑dimensional data.
Answer: A
Explanation: Voronoi cells partition space such that each point belongs to the region of its nearest centroid.

Question 18. Which of the following statements about hierarchical clustering is true?
A) It requires specifying the number of clusters beforehand.
B) It always produces a partitioning identical to k‑means.
C) It creates a dendrogram representing nested cluster relationships.
D) It cannot handle non‑Euclidean distance metrics.
Answer: C
Explanation: Hierarchical clustering builds a tree (dendrogram) showing how clusters merge or split at various levels.

Question 19. Latent Dirichlet Allocation (LDA) is primarily used for:
A) Predicting continuous outcomes.
B) Dimensionality reduction of image data.
C) Topic modeling in collections of documents.
D) Clustering numeric time‑series data.
Answer: C
Explanation: LDA discovers latent topics in a corpus by modeling each document as a mixture of topics.

Question 20. Locality‑Sensitive Hashing (LSH) is advantageous for nearest‑neighbor search because it:
A) Guarantees exact nearest neighbors.
B) Reduces dimensionality to a single scalar.
C) Provides sub‑linear query time for high‑dimensional data.
D) Requires no preprocessing of the dataset.
Answer: C
Explanation: LSH hashes similar items into the same buckets, enabling fast approximate neighbor retrieval.

Question 21. In the context of recommender systems, collaborative filtering relies on:
A) Content attributes of items only.
B) User‑item interaction patterns.
C) Demographic data of users.
D) Pre‑trained deep neural networks.
Answer: B
Explanation: Collaborative filtering predicts preferences based on observed user‑item rating matrices.

Question 22. The “cold start” problem in recommender systems refers to difficulty in:
A) Scaling the algorithm to millions of users.
B) Recommending items to new users or recommending new items.
C) Computing similarity between items.
D) Updating the model in real time.
Answer: B
Explanation: With no historical interactions, the system lacks data to generate personalized recommendations.

Question 23. Matrix factorization in collaborative filtering seeks to:

Question 26. In deep learning, the term “transfer learning” refers to:
A) Training a model from scratch on a new dataset.
B) Using a pre‑trained network as a fixed feature extractor for a new task.
C) Sharing weights between two unrelated models.
D) Converting a convolutional network into a recurrent network.
Answer: B
Explanation: Transfer learning leverages knowledge learned on a source task (e.g., ImageNet) to improve performance on a target task.

Question 27. Which activation function is most commonly used in hidden layers of deep neural networks for image tasks?
A) Sigmoid
B) Tanh
C) ReLU (Rectified Linear Unit)
D) Softmax
Answer: C
Explanation: ReLU mitigates vanishing gradients and is computationally efficient, making it popular for deep vision models.

Question 28. In a convolutional neural network (CNN), the operation that reduces spatial dimensions while retaining important features is called:
A) Convolution
B) Pooling
C) Normalization
D) Dropout
Answer: B
Explanation: Pooling (e.g., max‑pool) aggregates nearby activations, decreasing resolution and providing translational invariance.

Question 29. When using deep features extracted from a pre‑trained model for image retrieval, the similarity between two images is typically computed using:
A) Cosine similarity on the feature vectors.
B) Euclidean distance on raw pixel values.
C) Hamming distance on binary hash codes.
D) Jaccard index on image tags.
Answer: A
Explanation: Deep feature vectors are high‑dimensional real numbers; cosine similarity measures angular similarity effectively.

Question 30. Which of the following best describes the “black‑box” view of a machine‑learning algorithm?
A) The internal parameters are fully interpretable.
B) The algorithm’s internal workings are ignored during early experimentation.
C) The model is always a linear regression.
D) The algorithm requires no data preprocessing.
Answer: B
Explanation: Treating a model as a black box means focusing on inputs and outputs without inspecting inner mechanisms initially.

Question 31. In data preprocessing, why is it important to split the dataset into training and test sets before feature scaling?
A) Scaling must be performed only on the test set.

Question 34. Which regularization technique is most suitable when you suspect that only a few features are truly predictive?
A) Ridge (L2) regularization
B) Lasso (L1) regularization
C) Elastic Net (L1 + L2) regularization
D) No regularization
Answer: B
Explanation: L1 regularization tends to zero out irrelevant coefficients, performing feature selection.

Question 35. In a support vector machine (SVM) for binary classification, the margin is defined as:
A) The distance between the two closest points of opposite classes.
B) The sum of distances from all points to the decision boundary.
C) The number of support vectors.
D) The angle between the weight vector and the bias term.
Answer: A
Explanation: The margin is the width of the gap between the nearest points (support vectors) of each class; SVM maximizes this margin.

Question 36. Which kernel function allows an SVM to learn non‑linear decision boundaries by implicitly mapping data into a higher‑dimensional space?
A) Linear kernel
B) Polynomial kernel
C) RBF (Gaussian) kernel
D) Both B and C
Answer: D
Explanation: Both polynomial and radial basis function kernels compute inner products in transformed spaces, enabling non‑linear separation.

Question 37. In a decision tree, the impurity measure commonly used for classification splits is:
A) Mean Squared Error
B) Gini impurity
C) Silhouette score
D) Adjusted R²
Answer: B
Explanation: Gini impurity quantifies class mixing within a node; minimizing it leads to purer splits.

Question 38. Random Forests improve over a single decision tree primarily by:
A) Using deeper trees.
B) Averaging predictions over many decorrelated trees.
C) Applying gradient descent to leaf nodes.
D) Pruning each tree to a fixed depth.
Answer: B
Explanation: Bagging (bootstrap sampling) and feature randomness decorrelate trees; averaging reduces variance.

Question 39. Gradient boosting differs from random forests in that it:
A) Trains trees sequentially, each correcting errors of the previous.
B) Uses only linear base learners.
C) Does not require any hyperparameter tuning.

C) Convert documents into binary vectors.
D) Encode semantic similarity between words.
Answer: B
Explanation: IDF down‑weights terms that are frequent across the corpus, emphasizing discriminative words.

Question 43. Which distance metric is most appropriate for comparing two TF‑IDF vectors?
A) Euclidean distance
B) Manhattan distance
C) Cosine similarity (converted to distance)
D) Hamming distance
Answer: C
Explanation: Cosine similarity measures the angle between high‑dimensional sparse vectors, effectively capturing similarity of term distributions.

Question 44. The “curse of dimensionality” primarily affects which type of algorithm?
A) Linear regression
B) k‑Nearest Neighbors
C) Decision trees
D) Naïve Bayes
Answer: B
Explanation: As dimensionality grows, distances between points become less informative, degrading k‑NN performance.

Question 45. In a Gaussian Mixture Model (GMM), each component represents:
A) A deterministic cluster centroid.
B) A probability distribution with its own mean and covariance.
C) A hard assignment of points to clusters.
D) A decision tree leaf.
Answer: B
Explanation: GMMs model data as a weighted sum of Gaussian distributions, each defined by its own parameters.

Question 46. Which of the following best describes the purpose of dropout in deep neural networks?
A) To increase the number of parameters.
B) To prevent overfitting by randomly omitting neurons during training.
C) To accelerate inference time.
D) To convert a CNN into a recurrent network.
Answer: B
Explanation: Dropout randomly deactivates a subset of units each iteration, forcing the network to learn redundant representations.

Question 47. When fine‑tuning a pre‑trained convolutional network for a new classification task, the most common practice is to:
A) Freeze all layers and only train the final softmax layer.
B) Retrain the entire network from random initialization.
C) Freeze early layers and fine‑tune later layers plus the classifier head.
D) Convert all convolutional layers to fully connected layers.
Answer: C
Explanation: Early layers capture generic features; keeping them fixed preserves learned low‑level patterns while adapting higher layers to the new domain.
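
The random deactivation described in Question 46 can be illustrated with a few lines of plain Python. This is a minimal sketch, assuming the common "inverted dropout" variant, in which surviving activations are rescaled by 1 / (1 - drop_prob) during training so that nothing needs to change at inference; the exam text does not specify a variant, and the function name `dropout` and its parameters are illustrative.

```python
import random


def dropout(activations, drop_prob, training, rng):
    """Apply inverted dropout to a list of activations.

    During training each unit is zeroed with probability drop_prob and the
    survivors are scaled by 1 / (1 - drop_prob). At inference the input is
    returned unchanged.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)  # seeded only to make the demonstration repeatable
hidden = [1.0, 2.0, 3.0, 4.0]

train_out = dropout(hidden, drop_prob=0.5, training=True, rng=rng)
eval_out = dropout(hidden, drop_prob=0.5, training=False, rng=rng)

print(train_out)  # some entries zeroed, the rest doubled
print(eval_out)   # identical to the input
```

Because a different random subset is dropped on every call, no single unit can be relied on during training, which is the regularizing effect the explanation refers to.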

Explanation: A policy defines the agent’s behavior by specifying which action to take in each possible state.

Question 51. Which of the following is NOT a typical step in the data acquisition phase of a machine‑learning project?
A) Collecting raw logs from sensors.
B) Splitting data into training and test sets.
C) Accessing public datasets via APIs.
D) Performing data cleaning and deduplication.
Answer: B
Explanation: Data splitting is part of preprocessing/validation, not acquisition.

Question 52. When converting categorical variables into numeric form, one‑hot encoding is preferred over label encoding when:
A) The categorical variable is ordinal.
B) The model can interpret integer values as ordered.
C) The variable has many distinct categories and the model is tree‑based.
D) The categorical levels have no intrinsic ordering.
Answer: D
Explanation: One‑hot encoding avoids imposing an artificial order on nominal categories.

Question 53. In a regression model, the coefficient associated with a feature represents:
A) The change in the target variable for a one‑unit increase in that feature, holding others constant.
B) The probability that the feature is important.
C) The correlation between the feature and the residuals.
D) The p‑value of the feature’s statistical significance.
Answer: A
Explanation: Linear regression coefficients quantify the marginal effect of each predictor.

Question 54. Which of the following techniques can be used to detect and mitigate overfitting in a deep neural network?
A) Increasing the number of hidden layers.
B) Using a larger learning rate.
C) Early stopping based on validation loss.
D) Removing the activation functions.
Answer: C
Explanation: Early stopping halts training when validation performance stops improving, preventing the model from fitting noise.

Question 55. The “bias term” (intercept) in a linear model is necessary because:
A) It controls the regularization strength.
B) It allows the decision boundary to shift away from the origin.
C) It determines the learning rate.
D) It normalizes the input features.
Answer: B
Explanation: The intercept enables the model to fit data where the optimal hyperplane does not pass through the origin.

Question 56. In the context of feature scaling, which method preserves the original distribution shape of a feature?
A) Min‑max normalization