CS434 – Final Exam Prep: Machine Learning and Data Mining

Typology: Exams

2024/2025

Available from 06/27/2025

Martin-Ray-1


  1. Softmax function - Answers ✅✅ Takes a d-dimensional vector of activations, one per class, exponentiates each activation, and divides it by the sum of all d exponentiated activations. This outputs a normalized d-dimensional vector whose elements sum to one.
  2. In multiclass logistic regression, the total number of learned weight vectors for a C class problem. - Answers ✅✅ C or C-1
  3. Artificial neurons - Answers ✅✅ Behave very differently from their biological counterparts
  4. Capability of single-activation neuron - Answers ✅✅ Can represent non-linear boundaries in classification problems.
  5. ReLUs - Answers ✅✅ Tend to converge fast because, unlike the sigmoid, they do not saturate for positive inputs.
  6. Loss function - Answers ✅✅ Measures how far a network's prediction is from the ground truth. Generally decreases when plotted against the number of epochs while training a neural network.
  7. Jacobian - Answers ✅✅ A matrix filled with partial derivatives of each dimension of vector v with respect to each dimension of vector u. Thus, it's a high-dimensional gradient.
  8. Backpropagation - Answers ✅✅ An efficient way to compute gradients of the loss with respect to model parameters. First, it is a form of reverse-mode differentiation: it computes the product of intermediate Jacobians from the output of the network backwards, which reduces the cost of the matrix multiplications needed to compute gradients. Second, and more importantly, the backward pass allows us to store and reuse loss gradients as we work our way backwards through the network.
  9. Neural networks - Answers ✅✅ Universal approximators of finite size
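
The softmax description above can be sketched in a few lines of Python. This is a minimal illustration, not course code: the function name `softmax` and the sample activations are made up for the example, and subtracting the maximum before exponentiating is a common numerical-stability trick that the notes do not mention (it leaves the result unchanged).

```python
import math

def softmax(activations):
    """Exponentiate each activation and divide by the sum of the exponentials."""
    # Assumption (not in the notes): shift by the max so exp() cannot overflow.
    # The shift cancels in the ratio, so the output is the same.
    m = max(activations)
    exps = [math.exp(a - m) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical activations for a 3-class problem.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The output is a vector of the same length whose entries are positive and sum to one, with the largest activation receiving the largest probability.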
  1. Computational graph - Answers ✅✅ A directed acyclic graph with nodes corresponding to units of computation and edges corresponding to the results of these computations. If each node implements both a forward and backward (i.e. computing a Jacobian of output with respect to input) operation, then the graph can be used to calculate derivatives of the output of arbitrary combinations of operations via backpropagation.
  2. Locality assumption of convolutional neural networks - Answers ✅✅ Relevant features in an input can be found by examining local regions of the input -- e.g. windows of time in a sequence or regions of an image.
  3. Decision trees - Answers ✅✅ Tree of nodes, where internal nodes implement tests of attributes to divide datapoints and leaves determine labels to be predicted.
  4. Why should decision tree learning not terminate whenever all attributes result in zero information gain? - Answers ✅✅ Many functions (like XOR) show no information gain on any single variable at first, yet can be split usefully once multiple variables have been considered.
  5. Tasks in unsupervised learning - Answers ✅✅ Density estimation, clustering, and dimensionality reduction
  6. Single-Link function - Answers ✅✅ Able to generate very long clusters because it only considers the minimum distance between points in two clusters for merging.
  7. The height of a joint in a Dendrogram - Answers ✅✅ The distance between two merged clusters.
  8. How does multiclass classification differ from binary classification? - Answers ✅✅ Binary classification is a special case of multiclass classification in which the number of classes equals two.
  9. What is a one-vs-all classifier? - Answers ✅✅ A binary classifier that learns a decision boundary separating one class (the class of interest) from all other classes.
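
The XOR point in the decision-tree card above can be checked numerically. The sketch below is illustrative only: the helper names `entropy` and `information_gain` are assumptions made for this example, and information gain is computed as label entropy minus conditional entropy after a split.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy of the labels minus the conditional entropy after splitting on `feature`."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[feature], []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# XOR truth table: the label is x0 XOR x1.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]

print(information_gain(rows, labels, 0))          # splitting on x0 alone
print(information_gain(rows, labels, 1))          # splitting on x1 alone
print(information_gain(rows[:2], labels[:2], 1))  # splitting on x1 after fixing x0 = 0
```

Neither variable helps on its own, but once the data has been split on x0, a split on x1 separates the labels perfectly. A learner that stops at the first zero-gain step would never find this tree.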
  1. What other methods can do multiclass classification directly? - Answers ✅✅ kNN, neural networks, multiclass SVMs, multiclass perceptrons, decision trees, naive Bayes
  2. What is a neuron? - Answers ✅✅ A linear function of its input followed by a (typically) nonlinear activation function producing output.
  3. What are the hyperparameter(s) of a single neuron? - Answers ✅✅ The activation function, σ
  4. What learnable parameters does a single neuron have? - Answers ✅✅ w ∈ R^d and b ∈ R
  5. What do we do to form a neural network? - Answers ✅✅ Stack multiple layers of neurons together
  6. What type of neural network did we study in depth? - Answers ✅✅ Multilayer, feedforward neural networks
  7. What activation do we use in the output layer for regression? - Answers ✅✅ Linear activation
  8. What activation do we use in the output layer for classification? - Answers ✅✅ Sigmoid or softmax
  9. How do we compute the output for a single neuron? - Answers ✅✅ Apply the activation function to the weighted sum of the inputs plus the bias. Thus, simply compute σ(Σ_i w_i·x_i + b)
  10. How do we train a NN? - Answers ✅✅ Forward pass, backward pass, then update.
  11. What happens in a forward pass? - Answers ✅✅ For each training example, (i) compute and store all activations, then (ii) compute loss
  12. What happens in a backward pass? - Answers ✅✅ Compute gradient of the loss w.r.t. all network parameters. Perform this efficiently using backpropagation
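
As a small illustration of the single-neuron cards above, the following sketch computes σ(Σ_i w_i·x_i + b) directly. The names `neuron` and `sigmoid` and the sample weights are hypothetical and chosen only for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    """Output of one neuron: activation(sum_i w_i * x_i + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)

# Sigmoid output, as one might use for binary classification.
print(neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0))

# Linear (identity) activation, as one might use for a regression output.
print(neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=1.0, activation=lambda z: z))
```

The weights `w` and bias `b` are the learnable parameters, while the choice of `activation` is the hyperparameter.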
  1. What happens during the update step? - Answers ✅✅ A step of gradient descent is performed to minimize loss
  2. What can we use NNs for? - Answers ✅✅ Learning nonlinear functions. In theory they are universal approximators that can reach arbitrarily low error ε, and they are flexible enough to be used for classification or regression
  3. What is a loss function? - Answers ✅✅ Some function measuring how bad a network's output is relative to optimal performance
  4. What sorts of loss functions are used for regression? - Answers ✅✅ Squared error (most common), absolute error, or Huber error
  5. What sorts of loss functions are used for classification? - Answers ✅✅ Cross entropy or hinge error
  6. What observations does backpropagation use to be so effective? - Answers ✅✅ 1) NNs generally reduce dimensionality: high-->low. 2) We shouldn't recompute something we already know (memoization via DP)
  7. When do we use forward-mode differentiation? - Answers ✅✅ When the input dimension is much smaller than the output dimension
  8. When do we use reverse-mode differentiation? - Answers ✅✅ When the dimension of output is much smaller than the dimension of the input
  9. Where do we compute loss? - Answers ✅✅ At the very END of the neural network (output)
  10. What is a computational graph? - Answers ✅✅ A directed acyclic graph (DAG) with vertices corresponding to computation and edges to intermediate results
  11. What operations need to be defined for each node in backpropagation? - Answers ✅✅ The forward computation y_1,...,y_k = f(x_1,...,x_k) and the backward computation ∂y_i/∂x_j ∀ i,j
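
The forward pass, backward pass, and update step described above can be sketched with a tiny hand-built computational graph. Everything here is an assumption made for illustration: the node classes (`Multiply`, `Add`, `Sigmoid`, `SquaredError`), the scalar one-weight model, and the learning rate are not taken from the course. Each node stores what it needs in `forward` and returns gradients with respect to its inputs in `backward`.

```python
import math

class Multiply:
    def forward(self, a, b):
        self.a, self.b = a, b
        return a * b
    def backward(self, grad_out):
        # d(a*b)/da = b and d(a*b)/db = a
        return grad_out * self.b, grad_out * self.a

class Add:
    def forward(self, a, b):
        return a + b
    def backward(self, grad_out):
        return grad_out, grad_out

class Sigmoid:
    def forward(self, z):
        self.y = 1.0 / (1.0 + math.exp(-z))
        return self.y
    def backward(self, grad_out):
        return grad_out * self.y * (1.0 - self.y)

class SquaredError:
    def forward(self, y, target):
        self.diff = y - target
        return self.diff ** 2
    def backward(self):
        return 2.0 * self.diff

def loss_and_grads(w, b, x, target):
    mul, add, sig, loss = Multiply(), Add(), Sigmoid(), SquaredError()
    # Forward pass: compute and store activations, then the loss.
    value = loss.forward(sig.forward(add.forward(mul.forward(w, x), b)), target)
    # Backward pass: visit nodes in reverse order, reusing the upstream gradient.
    g = sig.backward(loss.backward())
    g_mul, g_b = add.backward(g)
    g_w, _g_x = mul.backward(g_mul)
    return value, g_w, g_b

w, b, x, target, lr = 0.5, 0.0, 2.0, 1.0, 0.1
value, g_w, g_b = loss_and_grads(w, b, x, target)
# Update step: one step of gradient descent on the parameters.
w, b = w - lr * g_w, b - lr * g_b
new_value, _, _ = loss_and_grads(w, b, x, target)
print(value, new_value)
```

The gradient at each node is computed once and passed to its predecessors, which is the reuse the notes attribute to backpropagation.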
  1. What are some benefits of decision trees? - Answers ✅✅ Their resulting models are highly human-interpretable, they can fit arbitrarily complex functions, and they can use a mix of discrete and continuous inputs
  2. How many features can DTs use? - Answers ✅✅ Up to all of the input features, although not all features must be included in the DT
  3. May a single feature appear in multiple DT branches? - Answers ✅✅ Yes, this is possible if a feature is split incompletely or on continuous values. Splitting again on a discrete feature that is already fully split does nothing, though
  4. Are DTs unique? - Answers ✅✅ No, many trees represent the same logical equation, but they may not have the same size. Thus, it's best to select the smallest possible DT with proper logical separation, although this is NP-hard
  5. Why do we want a small decision tree? - Answers ✅✅ Small trees embody the prior of simplicity, and thus they're less likely to overfit
  6. Since finding the smallest tree is NP-hard, what should we do instead? - Answers ✅✅ Build the tree greedily!
  7. What heuristic should we use when greedily constructing a decision tree? - Answers ✅✅ Information gain: how much the entropy of the labels drops when we condition on a split
  8. How do we deal with splits on continuous features? - Answers ✅✅ There are infinitely many such splits, but IG only changes when a threshold crosses a datapoint. Thus, we sort the dataset values and consider thresholds between consecutive datapoints. We compute IG for each threshold and select the threshold with maximum IG
  9. How do we avoid overfitting on DTs? - Answers ✅✅ 1) Add hyperparameters to control tree size (depth, #nodes, min_split_number), 2) Stop early based on validation performance (when validation saturates), 3) Post pruning
  10. How does post pruning work for DTs? - Answers ✅✅ Grow the full DT on the training set, then consider the impact of removing each node on validation performance. Greedily prune the node that improves validation performance the most
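
The continuous-feature procedure from the cards above (sort the values, try a threshold between each pair of consecutive datapoints, keep the one with maximum IG) might look like the sketch below. The function name `best_threshold`, the use of the midpoint as the candidate threshold, and the toy data are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, information gain) for the best split `value <= threshold`."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between two equal values
        threshold = (lo + hi) / 2.0  # assumption: use the midpoint as the candidate
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Hypothetical one-dimensional feature with two well-separated classes.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_threshold(values, labels))
```

Only n - 1 candidate thresholds need checking for n datapoints, since any two thresholds that fall between the same pair of consecutive values produce the same split.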
  1. Why are decision boundaries composed of axis-aligned segments? - Answers ✅✅ Decision trees separate based on single variables, not functions of multiple variables.
  2. What is the maximum depth of a decision tree? - Answers ✅✅ The number of unique splits
  3. When would a leaf of a fully-expanded DT have more than one datapoint in the leaf? - Answers ✅✅ If Bayes error is NOT zero; need to add another variable to split on!
  4. How are decision trees trained? - Answers ✅✅ Taking a dataset and splitting on certain variables. Variables to split on are selected with information gain (highest IG of all the given splits, selected greedily)
  5. Does the DT training algorithm find the optimal tree every time? - Answers ✅✅ NO, it's greedy though and should get a decently good one.
  6. What is entropy? - Answers ✅✅ A measure of inhomogeneity of class labels in a dataset
  7. What is conditional entropy? - Answers ✅✅ A measure of inhomogeneity of class labels in a dataset when splitting the dataset based on a variable, with respect to the variable that is being split on
  8. What is information gain? - Answers ✅✅ Entropy minus conditional entropy
  9. Why do we rank splits by information gain? - Answers ✅✅ It maximizes the likelihood of a split separating fully based on class labels
  10. How can decision trees handle continuous variables? - Answers ✅✅ Consider a split between each pair of consecutive values of the variable, and choose the split with the highest information gain
  11. What is bias? - Answers ✅✅ Error due to assumptions in the model not matching the problem (modelling error). More precisely, the error between the true function and the model averaged over all possible datasets.
  1. When do ensemble methods work well? - Answers ✅✅ When each individual member of the ensemble is (i) accurate (better than chance) and (ii) diverse (uncorrelated errors on new examples)
  2. How does bagging work? - Answers ✅✅ Tries to reduce variance in strong learners. For uncorrelated errors, expected error goes down, and on average they do better than a single classifier
  3. How does boosting work? - Answers ✅✅ Tries to reduce bias in weak learners. Each model is good at different parts of input space, and on average they do better than a single classifier
  4. How do we do bagging? - Answers ✅✅ Given a dataset of N points, sample N training points with replacement and train a model. Repeat this M times
  5. How does bagging work with regression? - Answers ✅✅ At test time, run each model and average their output
  6. How does bagging work with classification? - Answers ✅✅ Take the majority vote
  7. When should we use bagging? - Answers ✅✅ When your strong learners overfit the dataset, and when you have a reasonably-sized dataset
  8. What's a convenient implementation idea unique to bagging? - Answers ✅✅ Each ensemble member can be trained in parallel, rather than in sequence. Boosting requires sequential training
  9. How do random forests work? - Answers ✅✅ 1) Train an ensemble of decision trees with bagging. 2) During learning, at each split only consider a random subset of attributes/thresholds when selecting what to split on (√d features is a common choice). 3) Select majority vote from forest
  10. How do we re-weight training data when boosting? - Answers ✅✅ Assign importance values, inversely proportional to the weighted sum of distances to points which are not fit well as you add new models.
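
The bagging recipe above (sample N points with replacement, train a model, repeat M times, then vote or average) can be sketched as follows. The function names and the 1-nearest-neighbour base learner `train_1nn` are hypothetical stand-ins chosen to keep the example short, not the learners used in the course.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample len(data) points from data with replacement."""
    return [rng.choice(data) for _ in data]

def fit_bagged(data, train, num_models, seed=0):
    """Train num_models models, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train(bootstrap_sample(data, rng)) for _ in range(num_models)]

def predict_majority(models, x):
    """Classification: take the majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

def predict_average(models, x):
    """Regression: average the outputs of the models."""
    return sum(model(x) for model in models) / len(models)

def train_1nn(sample):
    """Hypothetical base learner: 1-nearest-neighbour on 1-D inputs."""
    def model(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return model

data = [(0.0, "a"), (1.0, "a"), (2.0, "a"), (8.0, "b"), (9.0, "b"), (10.0, "b")]
models = fit_bagged(data, train_1nn, num_models=25)
print(predict_majority(models, 0.5), predict_majority(models, 9.5))
```

Because each call to `train` depends only on its own bootstrap sample, the list comprehension in `fit_bagged` could be run in parallel, which is the implementation convenience the notes mention.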
  1. How do we weight models in the "committee" when boosting? - Answers ✅✅ Weight each model based on the error rate (really, classifier quality)
  2. Does bagging help reduce bias or variance? - Answers ✅✅ Bagging reduces variance, especially in strong learners
  3. How does the correlation between model outputs affect the performance of a bagged ensemble? - Answers ✅✅ As correlation goes down, the ensemble's performance gain goes up
  4. Why would we want to introduce additional randomness in Random Forests? - Answers ✅✅ The greedy decision tree learning algorithm will likely use the same attributes early on, despite resampling.
  5. Does boosting help reduce bias or variance? - Answers ✅✅ Boosting reduces bias, especially in weak learners
  6. How does L2 boosting work? - Answers ✅✅ We search for the models and weights that minimize error over the final model
  7. How does AdaBoost work? - Answers ✅✅ Initialize importance weights to 1. Then, for each model: train the classifier to minimize weighted exponential loss, compute the weighted error of the classifier and the classifier quality, and update the weights.
  8. What two characteristics describe clustering techniques? - Answers ✅✅ Grouping all given examples (exhaustive) into disjoint clusters (partitional) such that within-cluster examples are similar and without-cluster examples are different
  9. What are partition algorithms? - Answers ✅✅ "Flat" clusterings, whereby each point belongs to exactly one cluster. E.g.: k-means, k-medoids, Gaussian mixture models (GMMs)
  10. What are hierarchical algorithms for clustering? - Answers ✅✅ "Hilly" clusterings, whereby each point often belongs to more than one cluster, and local internal clustering is common. Bottom-up is agglomerative, and top-down is divisive
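
One round of the AdaBoost loop described above can be sketched as below. This uses one common textbook formulation, which is an assumption here since the notes do not give formulas: labels in {-1, +1}, classifier quality alpha = 0.5·ln((1 - error)/error), and weights multiplied by exp(-alpha·y·h). The function name `adaboost_round` and the toy predictions are hypothetical.

```python
import math

def adaboost_round(weights, labels, predictions):
    """One AdaBoost reweighting step for labels/predictions in {-1, +1}.

    Assumes the weighted error is strictly between 0 and 1.
    """
    total = sum(weights)
    error = sum(w for w, y, h in zip(weights, labels, predictions) if y != h) / total
    # Classifier quality: larger when the weighted error is smaller.
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified points (y * h = -1) gain weight; correct ones lose weight.
    new_weights = [w * math.exp(-alpha * y * h)
                   for w, y, h in zip(weights, labels, predictions)]
    return error, alpha, new_weights

labels = [+1, +1, -1, -1]
predictions = [+1, -1, -1, -1]       # hypothetical weak learner output, one mistake
weights = [1.0] * len(labels)        # importance weights initialised to 1
error, alpha, new_weights = adaboost_round(weights, labels, predictions)
print(error, alpha, new_weights)
```

The value `alpha` doubles as the model's weight in the final committee, so more accurate weak learners get a larger say in the vote.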
  1. What are downsides of k-means? - Answers ✅✅ Not outlier-resistant, may need to run several times to get a low SSE, and only works well on spherical clusters of similar size
  2. What is the expectation maximization algorithm? - Answers ✅✅ First, initialize probabilities of each point being in each gaussian, means of gaussians to random points, and covariances to identity matrices. In the E-step, compute fractional assignment of each point being generated from class c, and normalize to 1. In the M-step, update the parameters based on fractional assignment
  3. What is a problem with using GMMs? - Answers ✅✅ Log-likelihood can go to infinity if one Gaussian has only one point assigned. To solve this, monitor each component and, if it starts to collapse, reset it to a random value (with large covariance), OR add a prior and do MAP in the M-step of EM
  4. What is the identifiability problem for GMMs? - Answers ✅✅ Log-likelihood in GMMs has multiple identical maxima. No easy fix, but of limited harm, although it may slow convergence.
  5. How does EM behave when constructing GMMs? - Answers ✅✅ Guaranteed to converge to a local maximum in finitely many steps, although it may be slow. Not guaranteed to find the global optimum.
  6. How do we deal with nonoptimality in GMMs? - Answers ✅✅ Restart multiple times, and choose the one with the highest log-likelihood
  7. How do we interpret GMMs? - Answers ✅✅ They produce a full density model of the data, so you can sample new synthetic data or evaluate the probability at an untested point
  8. How do we get from GMMs to k-means? - Answers ✅✅ 1) Assume a hard assignment rather than fractional/probabilistic, and 2) assume all gaussians have the same isotropic covariance matrix
  9. What is clustering? - Answers ✅✅ An unsupervised method for roughly separating probable classes by groupings in input space
  10. What is k-means? - Answers ✅✅ An algorithm for performing unsupervised clustering
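
The E-step and M-step from the expectation maximization card above can be sketched for a one-dimensional mixture. Several simplifications are assumptions made for brevity: the data is 1-D (so covariances become scalar variances), the initial means are fixed rather than random so the run is reproducible, and a small variance floor is added as a simple guard against a collapsing component (the notes suggest resetting the component or using a MAP prior instead).

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_step(data, weights, means, variances):
    k = len(means)
    # E-step: fractional assignment of each point to each component, normalised to 1.
    resp = []
    for x in data:
        joint = [weights[c] * gaussian_pdf(x, means[c], variances[c]) for c in range(k)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate the parameters from the fractional assignments.
    new_weights, new_means, new_vars = [], [], []
    for c in range(k):
        n_c = sum(r[c] for r in resp)
        mean = sum(r[c] * x for r, x in zip(resp, data)) / n_c
        var = sum(r[c] * (x - mean) ** 2 for r, x in zip(resp, data)) / n_c
        new_weights.append(n_c / len(data))
        new_means.append(mean)
        new_vars.append(max(var, 1e-6))  # assumption: floor instead of reset/MAP
    return new_weights, new_means, new_vars

def log_likelihood(data, weights, means, variances):
    return sum(math.log(sum(w * gaussian_pdf(x, m, v)
                            for w, m, v in zip(weights, means, variances)))
               for x in data)

# Hypothetical data drawn around two centres.
data = [-2.1, -1.9, -2.0, -1.8, 3.0, 3.2, 2.9, 3.1]
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])  # unit variances play the role of identity covariances
for _ in range(20):
    params = em_step(data, *params)
print(params[1], log_likelihood(data, *params))
```

Replacing the fractional assignments in `resp` with a hard 0/1 assignment to the most likely component, and fixing all variances to the same value, would turn this loop into k-means, as the notes point out.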
  1. How is k-means implemented? - Answers ✅✅ Initialize k centroids. Then repeat: assign each point to its nearest centroid, and take the mean of the data assigned to each centroid as the new, updated position for that centroid c_i. Continue until there are no updates to labels.
  2. What does k-means optimize? - Answers ✅✅ Sum of squared error for a given number of clusters, k
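  3. How is coordinate descent related to the k-means algorithm? - Answers ✅✅ Coordinate descent optimizes two sets of variables by switching between them, holding one fixed while improving the other. k-means does this on the SSE: it alternates between updating the cluster assignments and updating the centroids.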
  4. Is the k-means algorithm guaranteed to converge? If so, to local or global optima? - Answers ✅✅ Yes, in finitely-many steps, but to a local minimum.
  5. How do we pick hyperparameters for k-means? - Answers ✅✅ By observing the "elbow" on the SSE-vs-k graph
  6. Is k-means sensitive to outliers? - Answers ✅✅ Yes, very much so!
  7. Is k-means sensitive to initialization? - Answers ✅✅ Yes, to find an optimal clustering, run multiple times and find the one with lowest SSE
  8. What types of clusters does k-means work best for? - Answers ✅✅ Isotropic, nonoverlapping spherical clusters
  9. What sort of model does GMM assume generated the data? - Answers ✅✅ Gaussian model! Not necessarily isotropic, and not necessarily the same size
  10. What can GMM do that k-means can't? - Answers ✅✅ Model wide/long distributions, model a collection of distributions of different sizes/radii
  11. Why is maximum marginal likelihood difficult to optimize? - Answers ✅✅ Exact solutions are only known for a small class of distributions, so numerical integration techniques are often needed
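
The k-means cards above translate into a short loop. The sketch below is a minimal version under stated assumptions: points are tuples of floats, the initial centroids are passed in explicitly (the notes do not say how to choose them), and an empty cluster simply keeps its old centroid. The names `kmeans`, `sse`, and `squared_distance` are made up for this example.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_centroids, max_iters=100):
    """Alternate assignment and mean-update steps until no label changes."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no label changed, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

def sse(points, centroids, labels):
    """Sum of squared error: the objective that k-means minimises."""
    return sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, labels = kmeans(points, initial_centroids=[(0.0, 0.0), (10.0, 9.0)])
print(labels, sse(points, centroids, labels))
```

Since the result depends on the starting centroids, one would typically call `kmeans` several times with different initialisations and keep the run with the lowest `sse`.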

between-cluster similarities. These metrics depend on the dataset and the distance measure used

  1. How do we evaluate clustering quality if we were spontaneously given labels after clustering? - Answers ✅✅ Rand index or purity.
  2. How do you compute Rand Index? - Answers ✅✅ RI = (a+d)/(a+b+c+d) (a: in the same group in both P and G (same cluster, same labels); b: in the same group in P but different in G (same cluster, different labels); c: in different groups in P but same in G (different cluster, same labels); d: in different groups in both P and G (different clusters, different labels))
  3. How is purity computed in HAC? - Answers ✅✅ Purity = fraction of points that would be labelled correctly by a majority vote per cluster where all points get the cluster label
  4. What is PCA? What sort of subspaces can it find? - Answers ✅✅ A dimensionality-reduction technique that is capable of finding linear projections
  5. How does PCA choose directions for the lower dimension representation? - Answers ✅✅ PCA keeps the directions of greatest variance, i.e. it retains the directions along which the data varies most
  6. How can PCA be solved? - Answers ✅✅ Solve the eigenvector problem for the covariance matrix: Σ_x w = λ w. Sort the eigenvalues and their eigenvectors in descending order of eigenvalue, and return the eigenvectors corresponding to the top k.
  7. How can we see what fraction of variance a dimension of PCA's output captures? - Answers ✅✅ Plot the fraction of variance explained by each principal component to judge how representative the visualization is
  8. What sort of relationships can PCA not identify? - Answers ✅✅ Parallel relationships where data points have labels
  9. Why might applying PCA as a preprocessing step lead to issues? - Answers ✅✅ It may reduce variance of interest in the dataset, because it was not examined by a person beforehand
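
The PCA cards above can be illustrated with a dependency-free sketch. Note the substitution: instead of a full eigendecomposition followed by sorting, this example uses power iteration with deflation to pull out the leading eigenvectors of the covariance matrix one at a time, which gives the same top-k directions for well-behaved data. The function names and toy data are assumptions, and the fixed starting vector is assumed not to be orthogonal to the leading eigenvector.

```python
import math

def covariance_matrix(data):
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenpair(matrix, iters=500):
    """Power iteration: (eigenvalue, unit eigenvector) for the largest eigenvalue."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d  # assumption: not orthogonal to the top eigenvector
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigenvalue, v

def pca(data, k):
    """Top-k principal directions and the fraction of variance each explains."""
    cov = covariance_matrix(data)
    d = len(cov)
    total_variance = sum(cov[i][i] for i in range(d))
    components, fractions = [], []
    for _ in range(k):
        lam, v = top_eigenpair(cov)
        components.append(v)
        fractions.append(lam / total_variance)
        # Deflation: subtract the found direction so the next pass finds the next one.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return components, fractions

# Hypothetical 2-D data lying close to the line y = x.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
components, fractions = pca(data, k=1)
print(components[0], fractions[0])
```

The `fractions` list is what one would plot to see how much variance each principal component captures. For this data nearly all of it lies along the first component, which points roughly along the diagonal.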
  1. What is a sequential decision process and how does it differ from standard machine learning problems? - Answers ✅✅ In sequential decision-making tasks, the output changes the next input that is received.
  2. What is a Markov Decision Process? - Answers ✅✅ An MDP is a decision process defined by a set of possible states S, a set of possible actions A, a reward function R: S → ℝ, and a transition function T: S×A → Ω_S. It obeys the Markov assumption
  3. What is a state in an MDP? - Answers ✅✅ A place where the learning agent can exist or occupy in its environment
  4. What is an action in an MDP? - Answers ✅✅ A possible decision the agent can take, not necessarily but frequently impacting the environment and the success or failure of the agent
  5. What is a transition function in an MDP? - Answers ✅✅ A mapping between state-action pairs and a distribution over next states
  6. What is a reward function in an MDP? - Answers ✅✅ A mapping between states and rewards
  7. What is a behavior policy in an MDP? - Answers ✅✅ A mapping from states to actions (or distributions over actions)
  8. What is model-free vs model-based reinforcement learning? - Answers ✅✅ Model-free directly learns a policy, while model-based learns the transition function and reward function from observation and then uses these to find the best policy
  9. What is the principle behind policy-gradient methods like REINFORCE? - Answers ✅✅ If trajectory reward is positive, push up the probabilities of all the actions leading to that trajectory. If negative, push down those action probabilities
  10. What is imitation learning and how does it differ from reinforcement learning? - Answers ✅✅ Rather than using feedback from the environment, IL attempts to mimic a set of expert demonstration trajectories
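
The REINFORCE principle in the cards above can be seen in a single update. The sketch below is deliberately reduced and rests on assumptions not in the notes: a softmax policy over three actions in one state, a trajectory consisting of one action, and a made-up learning rate. It relies on the fact that the gradient of log π(a) with respect to the logits of a softmax policy is one_hot(a) - π.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(logits, action, trajectory_reward, lr=0.1):
    """One REINFORCE step for a softmax policy over actions in a single state."""
    probs = softmax(logits)
    # Gradient of log pi(action) w.r.t. the logits is one_hot(action) - probs.
    grad_log_pi = [(1.0 if a == action else 0.0) - p for a, p in enumerate(probs)]
    # Gradient ascent on reward-weighted log-probability.
    return [z + lr * trajectory_reward * g for z, g in zip(logits, grad_log_pi)]

logits = [0.0, 0.0, 0.0]  # uniform policy to start
up = softmax(reinforce_update(logits, action=1, trajectory_reward=+1.0))
down = softmax(reinforce_update(logits, action=1, trajectory_reward=-1.0))
print(up[1], down[1])
```

A positive trajectory reward raises the probability of the action that was taken, and a negative reward lowers it, matching the description in the notes.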