Machine Learning - Final Collaborative Study Guide-comprehensive-2024-2025.docx

Typology: Exams

2023/2024

Note!!!
• I (who will remain anonymous) really hope everyone can benefit from this study guide!
• Please comment if you feel like some information may be wrong but you're hesitant to change it. (If you're confident it's wrong, then change it.)
• PLEASE CONTRIBUTE A LITTLE if you're going to use it
• https://www.cc.gatech.edu/~bboots3/CS4641-Fall2018/Lecture23/Final_Review.pdf
• Feel free to make new outline topics according to the ones the professor mentioned in class
• Useful comparison chart for different algorithms: https://www.dataschool.io/comparing-supervised-learning-algorithms/
• Excellent YouTube videos: search "victor lavrenko" and then a topic ("pca", "svm", etc.)
• Happy studying =) <3 <3 <3

Table of Contents
• Similarities / Differences Between Algorithms
• Supervised Learning
• Vocab
• Learning Theory
• Randomized Optimization
• Unsupervised Learning
• Clustering

Similarities / Differences Between Algorithms
• Decision tree
o Bias: prefers small decision trees
▪ High bias with small decision trees (underfitting possible)
▪ Low bias with large decision trees, but higher variance (because overfitting is possible)
o Search algorithm: greedy (always makes the choice that seems best at that moment)
o Heuristic function: information gain (choose the split with the maximum decrease in entropy)
▪ Entropy measures disorder; the best features are those that most decrease randomness, moving the class proportions away from the maximally uncertain 50/50 split.
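The entropy and information-gain heuristic described above can be sketched in a few lines. This is a minimal pure-Python illustration; the (pos, neg) count-pair representation is a convenience for the example, not the course's notation:

```python
import math

def entropy(pos, neg):
    """Shannon entropy of a binary class split, in bits."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:  # treat 0 * log2(0) as 0
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(parent, children):
    """Gain(S, A) = Entropy(S) minus the size-weighted entropy of the splits.
    `parent` and each child are (pos, neg) count pairs."""
    n = sum(parent)
    weighted = sum((p + q) / n * entropy(p, q) for p, q in children)
    return entropy(*parent) - weighted

# A 50/50 parent has entropy 1 bit; a perfectly separating split gains all of it.
print(entropy(3, 3))                               # 1.0
print(information_gain((3, 3), [(3, 0), (0, 3)]))  # 1.0
```

ID3 would compute `information_gain` for every candidate attribute and split on the largest.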
o Overfitting solution: pruning
▪ Pre-pruning: use a validation set to determine when to stop growing the tree, or just stop growing "when data split is not statistically significant" (Lect 3 Slide 8)
▪ Post-pruning: after building the decision tree, remove nodes that decrease performance on the validation set
o Shallow vs. deep pruning
o Constructing a decision tree means repeatedly finding the attribute with the highest information gain.
o Strengths include:
▪ Can model any Boolean function
▪ Fast and simple to implement
▪ Testing time is O(log n)
▪ Can convert to rules; handles noisy data
▪ Can learn non-linear relationships
▪ Fairly robust to outliers, because outliers don't impact the information gain enough to create an erroneous split that might overfit the data
o Weaknesses include:
▪ Univariate splits (partitioning using only one attribute at a time) limit the types of possible trees
▪ Restriction bias
▪ Large decision trees may be hard to understand
▪ More complex decision trees may overfit
▪ Requires fixed-length feature vectors, since adding a new feature means rebuilding the tree completely
▪ This is not true for Naive Bayes, because it's based on probabilities rather than on the structure of the dataset itself
▪ Non-incremental (i.e., a batch method), since adding a new instance means rebuilding the tree completely
• Decision stump
o Definition: a one-level decision tree with one internal node (the root), which is immediately connected to the terminal nodes (its leaves)
o Weak learner
▪ So you can use stumps in ensemble learning
o High bias (doesn't fit the training data very well)
o Low variance (doesn't overfit the training data; the model is fairly replicable)
• Linear regression (gradient descent)
o Fights overfitting by using regularization
o Heuristic function: mean squared error
o Algorithm: gradient descent, calculus, or anything that helps optimize the heuristic function.
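The gradient-descent-on-MSE idea above can be sketched as follows. A minimal pure-Python illustration; the data, learning rate, and epoch count are made up for the example:

```python
def fit_linear(xs, ys, lr=0.05, epochs=2000):
    """Fit y = w*x + b by batch gradient descent on mean squared error.
    MSE is convex in (w, b), so with a small enough learning rate this
    converges to the global minimum."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of (1/n) * sum((w*x + b - y)^2) w.r.t. w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Noiseless data on y = 2x + 1, so the fit should recover w ≈ 2, b ≈ 1.
w, b = fit_linear([0, 1, 2, 3], [1, 3, 5, 7])
```

Because the cost is convex, the starting point (0, 0) doesn't matter here; for non-convex costs it would.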
▪ Gradient descent requires a convex function to guarantee that a random starting point converges to the absolute minimum; this can be guaranteed by careful selection of the cost function
o Linear regression's cost (mean squared error) is convex, so it has a global minimum.
o Overfitting/variance: combated with regularization
o Should be the first algorithm you try when using ML.
• Boosting (AdaBoost)
o Train the model and come up with a distribution (a line or decision stump, typically) that divides the plane into positive and negative examples
o Find a weak classifier for the hypothesis of that timestep
▪ A weak classifier is a classifier that can always predict something better than just guessing. Ideally, though, we want this weak classifier to produce small error (noted by ε_t) over the d instances.
o Understanding the pseudocode:
▪ α_t = (1/2) ln((1 - ε_t) / ε_t) will always be positive, because taking the natural log of (1 - ε_t)/ε_t produces something positive whenever ε_t is between 0 and 0.5
▪ Consider the terms in the exponent of the weight update step, exp(-α_t · y_i · h_t(x_i)):
▪ Both y_i and h_t(x_i) are either -1 or 1. So, if the example is correctly classified, their product is -1·-1 or 1·1, i.e., always +1. If it is misclassified, it is -1.
▪ This just determines the sign of the exponent based on whether the example was classified correctly
▪ So, since α_t is given a negative sign, the exponent is negative (the weight shrinks) if the example is correctly classified, and positive (the weight grows) if it is misclassified.
o We then return our final hypothesis, a weighted vote of the weak classifiers, at the end
o STRENGTHS OF BOOSTING:
▪ Fast and simple to program
▪ No hyperparameters to tune (besides T)
▪ No assumptions on the weak learner
▪ Often keeps driving down the test error even after the training error reaches zero
o WHEN BOOSTING CAN FAIL:
▪ Given insufficient data
▪ Overly complex weak hypotheses
▪ If you use complex models (neural nets, etc.)
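The AdaBoost reweighting step discussed above can be sketched like this. A simplified illustration: `adaboost_round` and its correctness-flag input are conveniences for the example, not the course's pseudocode:

```python
import math

def adaboost_round(weights, correct):
    """One AdaBoost reweighting step.
    `weights` are the current example weights (summing to 1);
    `correct[i]` is True if the weak learner got example i right,
    i.e. y_i * h_t(x_i) = +1."""
    # Weighted error of the weak learner on the current distribution.
    eps = sum(w for w, c in zip(weights, correct) if not c)
    # alpha_t = (1/2) ln((1 - eps)/eps), positive whenever eps < 0.5.
    alpha = 0.5 * math.log((1 - eps) / eps)
    # exp(-alpha) shrinks correct examples; exp(+alpha) grows mistakes.
    new = [w * math.exp(-alpha if c else alpha)
           for w, c in zip(weights, correct)]
    z = sum(new)  # normalizer so the weights stay a distribution
    return alpha, [w / z for w in new]

# Four equally weighted examples; the stump misclassifies the last one.
alpha, w = adaboost_round([0.25] * 4, [True, True, True, False])
```

After the update the misclassified example carries half of the total weight, so the next weak learner is forced to pay attention to it.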
as the learner
▪ Can be susceptible to noise
▪ When there are a large number of outliers
• SVM
o Use SVMs when there are multiple data points that are close together; the SVM will help to separate the close-together points.
o If using a linear kernel, your data must be linearly separable
o Resistant to overfitting (maximizing the margin controls complexity)
o Finds the maximum-margin hyperplane
o Training time is high; not suited for large datasets
o The dual representation makes optimizing easier
▪ A linear SVM can't handle noise very well
o STRENGTHS:
1. Finds the globally best model
2. Good generalization
3. Works well with few training instances
4. Efficient algorithm
5. Amenable to the kernel trick
• Naive Bayes
o Finds the probability of a particular hypothesis given the data, using Bayes' rule
o The calculation procedure
o Advantages:
▪ Fast to train
▪ Fast to classify
▪ Not sensitive to irrelevant features
▪ Handles real and discrete data
▪ Handles streaming data
o Disadvantage: assumes independence of the features

How will these algorithms perform on various datasets?
1. Few samples, few attributes - SVM
2. Many samples, few attributes - DT
3. Few samples, many attributes - SVM
4. Many samples, many attributes - Boosting/NN

Global vs. local optima during learning?
• Gradient descent will lead to a local optimum if the loss function is not convex

Parametric methods (summarize the data with a set of parameters of fixed size):
• Logistic Regression
• Perceptron
• Naive Bayes
• Simple Neural Networks
• Linear SVM

Nonparametric methods:
• KNN
• ID3
• Kernel SVM

Decision boundaries
• Decision tree - axis-aligned
o Overfit:
• Logistic regression - linear
• Neural nets - it is hard to determine why neural nets make certain decision boundaries
• 1NN - arbitrary (Voronoi tessellation)
• KNN - arbitrary
• Ensemble, boosting - axis-aligned, because the underlying learners are decision stumps
o Kind of carves out a pixelated shape, as the professor showed us in class
• SVM - linear / kernel dependent
o Linear case: the line should look as though it's equidistant between the two "clusters"
• Naive Bayes - it does have a decision boundary (an earlier note claiming "MLE, no decision boundary" is not true); see the link below.
• Good document: http://michael.hahsler.net/SMU/EMIS7332/R/viz_classifier.html (this is brilliant)

Theory
• Training vs. testing (see Vocab)
• Generalization
o The goal of machine learning
• Overfitting, underfitting (see Vocab)
• Bias vs. variance
o https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
• Cross validation
o Using several splits of the data in order to measure error
• Shattering, VC dimension (see Vocab)

Supervised Learning
• Difference between SL, UL, and RL
o SL works with labeled datasets and provides immediate feedback on whether a sample was "good" or "bad"; "optimizing the model" means it labels data well
o UL does not give any feedback on how right a classification is; "optimizing the model" means it clusters such that it scores well
o RL does not give immediate feedback; "optimizing the model" means finding a behavior that scores well
• Machine learning problems <P, T, E>
o Each machine learning problem can be precisely defined as the problem of improving some measure of performance P when executing some task T, through some type of training experience E.
• Classification vs. regression (see Vocab)
• Error, accuracy
o FP = Type I error
o FN = Type II error
o Accuracy = (TP + TN) / (TP + TN + FP + FN) (over all classes)
o Error = 1 - accuracy
• Confusion matrix: divided into TP (true positive), TN (true negative), FP (false positive), FN (false negative)

  Confusion matrix | Pred. false | Pred. true
  Actual false     | TN          | FP
  Actual true      | FN          | TP

• Precision = TP / (TP + FP)
o Memory tip: Precision starts with P, prediction also starts with P; you're looking at TP vs. all predicted positives.
• Recall = TP / (TP + FN)
o Memory tip: Recall starts with R, Real starts with R, so you're looking at TP vs.
the "real" positives
• F1 = 2PR / (P + R)
o Use this score because accuracy fails on unbalanced datasets
o Yes, this is (2 × precision × recall) / (precision + recall)

Perceptron, Linear Regression, Logistic Regression
• Regularization (relation to bias and variance)
o Regularization here works the same as linear regression regularization

Instance-Based Learning - KNN
• The curse of dimensionality: as the number of features/dimensions increases, the amount of data needed to generalize accurately grows exponentially

Neural Net
• Structure (layers)
o Input layer → hidden layer(s) → output layer (contains the prediction h(x))
o Each layer has a number of nodes
▪ Each node has an activation a_i^(j): unit i in layer j
▪ Each layer also has a weight matrix Θ^(j) controlling the function mapping from layer j to layer j + 1
• Activation functions (linear, logistic)
o Usually a nonlinear function
o The logistic activation function is the sigmoid, which activates past a threshold, similar to logistic regression
• Backpropagation
o If the output of the network is correct, no changes are made
o If there is an error, weights are adjusted to reduce the error
o The trick is to assess the blame for the error and divide it among the contributing weights
o Each hidden node j is "responsible" for some fraction of the error δ_j^(l) in each of the output nodes to which it connects
o δ_j^(l) is divided according to the strength of the connection between the hidden node and the output node
o Then the "blame" is propagated back to provide the error values for the hidden layer
• Restriction bias
o Restriction bias is the representational power of an algorithm, or the set of hypotheses our algorithm will consider. So, in other words, restriction bias tells us what our model is able to represent.
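The layer structure and sigmoid activation described above can be sketched as a minimal forward pass. The weights and inputs are illustrative, not from the course:

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    """Forward pass through a fully connected network.
    `layers` is a list of (weights, biases) per layer; weights[i][j]
    maps input j of that layer to unit i. Every unit applies the
    sigmoid, so each layer output is a list of activations."""
    a = x
    for weights, biases in layers:
        a = [sigmoid(sum(w * v for w, v in zip(row, a)) + b)
             for row, b in zip(weights, biases)]
    return a

# One hidden layer with 2 units, then a single output unit h(x).
hidden = ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])
output = ([[2.0, 2.0]], [-2.0])
h = forward([0.5, 0.25], [hidden, output])  # h[0] is the prediction
```

Backpropagation would run this pass first, then push the output error backward through the same weight matrices to apportion "blame" to each hidden unit.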
o NNs can model continuous functions with one big enough hidden layer
o NNs can model arbitrary functions with an extra hidden layer, i.e., at least 2 hidden layers
o HOWEVER, these concepts, while feasible, can cause overfitting
▪ Cross-validation can help decide the network structure and when to stop training, to combat overfitting
o NN training has the same familiar concave plot for CV error that helps us decide when to stop training (i.e., how many epochs)
• Preference bias
o Preference bias is simply which representation(s) a supervised learning algorithm prefers. For example, a decision tree algorithm might prefer shorter, less complex trees. In other words, it is our algorithm's belief about what makes a good hypothesis.
o NN weights are initialized with small, random values because:
▪ It helps avoid local minima
▪ Variability in training prevents repetition of errors (such as not halting)
▪ Large weights can lead to overfitting, because they can represent arbitrarily complex functions
o Thus, because of these properties, NNs prefer low complexity: a simpler explanation means a simpler NN structure
▪ Think Occam's Razor
▪ Why use complex when simple do trick

Decision Tree (working knowledge)
• Splitting
o Pick the best attribute (the one whose split most reduces entropy)
▪ ID3 uses maximum information gain
▪ Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) · Entropy(S_v)
▪ We want entropy to go down as much as possible when choosing an attribute, meaning maximum information gain
• Entropy / conditional entropy
o Measures the impurity of S
▪ S = a subset of training examples
o Remember H(S) = -Σ p log₂ p over the class proportions p
o How many bits are needed to tell whether item X is positive or negative (true/false)
o Entropy for a 50/50 split = 1
▪ Ex: 3 yes / 3 no ⇒ H(S) = -(3/6) log₂(3/6) - (3/6) log₂(3/6) = 1 bit
o Entropy for a 100/0 (completely certain) split = 0
▪ Ex: 4 yes / 0 no ⇒ H(S) = -(4/4) log₂(4/4) - (0/4) log₂(0/4) = 0 bits
o A good training set for learning has entropy close to 1 bit
• Information gain
o The expected reduction in entropy of target variable Y for data
sample S
o Which attribute is most useful for discriminating between the classes to be learned
o Used to decide the ordering of attributes in the nodes of a decision tree
• Pruning (deep vs. shallow / stumps)
o Done by replacing a whole subtree with a leaf node
o Replacement is done if the expected error rate in the subtree is greater than in the single leaf
• Overfitting
o Consider possible noise
▪ Two examples may have the same attributes but different classifications
▪ Instances may be labelled incorrectly
o Some attributes may be irrelevant, leading to overfitting
▪ Meaningless regularity in the data
o How to avoid overfitting?
▪ Stop growing when the data split is not statistically significant
▪ Get more training data
▪ Remove irrelevant attributes
▪ Post-prune the tree

Ensembles
• Bagging

Vocab
• Training set - labeled input/output examples used to train the ML algorithm
• Testing set - labeled input/output examples used to test how well your ML algorithm generalizes
• Validation set - labeled input/output examples used to gain an understanding of where the model starts to overfit
• Supervised learning - take examples of inputs and outputs; now, given a new input, predict an output
• Bias - how close your predictions are to the true values (the difference between the expected value of the estimator and the true value)
o High bias - underfitting; the model makes multiple errors when trying to predict
o Low bias - the model makes correct predictions and more closely fits the data
• Variance - how much your prediction models differ with changes in the input data (how sensitive the model is to changes in the input data)
o High variance - overfitting; the model is sensitive to the data, fits to noise, and is unable to generalize
o Low variance - the model is able to generalize across inputs and not fit to noise
• Variance vs. bias tradeoff - we want our models to have low bias and low variance; however, these normally trade off against each other.
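The train/validation/test split described in the Vocab entries above can be sketched as follows; the fractions and seed are arbitrary choices for the example:

```python
import random

def split_dataset(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle once, then cut into train/validation/test.
    The split must happen before any training so that the test set
    never leaks into model fitting or model selection."""
    data = list(data)
    random.Random(seed).shuffle(data)  # fixed seed for reproducibility
    n = len(data)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (data[:n_train],                     # fit the model here
            data[n_train:n_train + n_val],      # tune / detect overfitting here
            data[n_train + n_val:])             # report generalization here

train, val, test = split_dataset(range(100))
```

The validation set is where the "where does the model start to overfit" question gets answered; the test set is touched exactly once, at the end.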
• Regularization - a penalty given to complex models; increases bias and decreases variance
• Shatter - a model class can shatter a set of points if, for every possible labeling over those points, there exists a model in that class that obtains zero training error
• VC dimension - the maximum number of points that can be arranged so that the class of models can "shatter" those points; used as a measure of the power of a particular class of models.
o For example, the VC dimension of a hyperplane in 2D is 3 (in d dimensions it is d + 1). A sine wave has infinite VC dimension.
o A problem is PAC-learnable iff its VC dimension is finite
• Weak learner - a classifier that is only slightly better than random guessing. Has high bias
• Linearly separable - the data can be separated by a line, plane, or hyperplane
• Lazy learner - does no "training," but rather does all the computation when trying to predict
• Precision - of the samples you predicted true, how many were actually true
o True positives / (true positives + false positives)
• Recall - of the samples that are actually true, how many did you predict to be true
o True positives / (true positives + false negatives)
• No Free Lunch - if you average the performance of a model across all possible learning problems, it will do no better than guessing
• Occam's Razor - why use many words when few words do trick
o Why use complex when simple do trick
o Minimal complexity produces the best result (prevents overfitting)

Experimental Protocol
• Never mix your training and testing sets - splitting the data into a training and testing set should be the FIRST step
• You need a healthy ratio of labels
o It is problematic if you have a lot of one label and not a lot of another.

Learning Theory
Often, there is a tradeoff between bias and variance.
If our model is too "simple" and has very few parameters, then it may have large bias (but small variance); if it is too "complex" and has very many parameters, then it may suffer from large variance (but have smaller bias).

Randomized Optimization
• Local search / gradient descent
o Local search: use a single current state and move to neighboring states
▪ Find or approximate the best state according to some objective function
▪ Only optimal if the space to be searched is convex
o Idea: start with an initial guess at a solution, and incrementally improve it until it is a solution.
o Hill-climbing search: "like climbing Everest in thick fog with amnesia"
▪ Move in the direction of the increasing evaluation function until you can't (you've reached a peak, a local optimum)
▪ Iteratively trying to maximize the fitness function
▪ s_next = argmax_s f(s)
▪ Greedy local search

Unsupervised Learning
• Clustering vs. feature selection vs. dimensionality reduction
o Clustering
o Feature Selection
o Dimensionality Reduction

Clustering
• K-means clustering
o Iteratively re-assign points to the nearest cluster center
o K-Means(k, X) algorithm:
▪ Randomly choose k cluster center locations (centroids)
▪ Loop until convergence:
▪ Assign each point to the cluster of the closest centroid
▪ Re-estimate the cluster centroids based on the data assigned to each cluster
o Finds a local optimum of argmin_S Σ_{i=1}^{k} Σ_{x ∈ S_i} ||x - μ_i||²
▪ S = {S₁, ..., S_k} is a partitioning over X = {x₁, ..., x_n} s.t. X = ∪_{i=1}^{k} S_i and μ_i = mean(S_i)
o Pros:
▪ Finds cluster centers that minimize conditional variance (a good representation of the data)
▪ Easy to implement
o Cons:
▪ Need to choose k (the number of clusters)
▪ Sensitive to outliers
▪ Prone to local minima
▪ All clusters have the same parameters (e.g.
the distance measure is non-adaptive)
▪ Hard clustering (each instance is assigned to exactly one cluster)
o Very sensitive to the initial points chosen
▪ Run K-Means many times, each with different initial centroids
▪ Can also seed the centroids using a better method than choosing them randomly (e.g., farthest-first sampling)
o K-Medoids
▪ Represent the cluster by one of its members, rather than by the mean of its members
▪ Choose the member (data point) that minimizes cluster dissimilarity
o Importance of the similarity measure
• Agglomerative / hierarchical clustering
o Agglomerative: start with each point as its own cluster and iteratively merge the closest clusters
o Cluster similarity can be based on the average distance between points, the maximum distance, the minimum distance, the distance between means, or the distance between medoids
o Can dynamically find the number of clusters (k) without prior domain knowledge
▪ The number of clusters is found via a threshold based on the max number of clusters or on the distance between merges
o Dendrograms
▪ Dendrograms are tree diagrams. Applied here, they can be used to show the hierarchy of clusters through different iterations.
▪ With agglomerative clustering, the "bottom-up" approach, the dendrogram has many clusters at the bottom. As you go up through progressive iterations, the distance between clusters increases, as each cluster's "definition" starts including more points and actual clusters become defined.
▪ With divisive clustering, the dendrogram looks the opposite. As a "top-down" approach, it starts from a single cluster and splits downward until each data point stands alone.
• Hierarchical vs. agglomerative clustering
o It looks like the slides didn't define hierarchical clustering, so here's a quick note about it.
o Hierarchical clustering tries to build a hierarchy of clusters, meaning that we want a spectrum of clusterings ranging from many small clusters to a small number of large clusters.
o Agglomerative clustering is one form of hierarchical clustering where we iteratively merge clusters into larger ones, starting with each data point being its own cluster.
o The other approach, divisive clustering, starts with one large cluster and iteratively splits the clusters until each data point is its own cluster.
• Gaussian Mixture Models and EM
o Assumptions
▪ There are k components; the i-th component is ω_i
▪ Component ω_i has an associated mean vector μ_i
▪ GMM assumption: each component generates data from a Gaussian with mean μ_i and covariance matrix σ²I
▪ General GMM assumption: each component generates data from a Gaussian with mean μ_i and covariance matrix Σ_i
o Relationship to K-means (soft clustering)
▪ Clustering typically assumes that each instance is given a "hard" assignment to exactly one cluster, which doesn't allow for uncertainty in class membership or for an instance to belong to more than one cluster
▪ K-means is hard clustering
▪ GMM is soft clustering: it gives probabilities that an instance belongs to each of a set of clusters
▪ Each instance is assigned a probability distribution across a set of discovered categories
▪ When you have to choose, take the cluster with the highest probability
o You should have working knowledge of fitting a GMM; to generate a data point from one:
1. Pick a component at random: choose component ω_i with probability P(ω_i)
2.
Sample the datapoint ~ N(μ_i, σ²I) for a GMM, or ~ N(μ_i, Σ_i) for a general GMM
o Mixture models
▪ A weighted sum of a number of pdfs (probability density functions), where the weights are determined by a distribution π
o GMM
▪ The weighted sum of a number of Gaussians, where the weights are determined by a distribution π with Σ_{i=0}^{k} π_i = 1

Feature Selection
o Sequential Backward Selection (SBS)
▪ 3) Each feature among the remaining n-1 is deleted one at a time, and the worst feature is discarded to form a subset of n-1 features
▪ 4) The procedure continues until a predefined number of features are left
o Bidirectional Search (BDS)
▪ Applies SFS and SBS simultaneously
▪ SFS (sequential forward selection) is performed from the empty set
▪ SBS is performed from the full set
▪ To guarantee that SFS and SBS converge to the same solution:
▪ Features already selected by SFS are not removed by SBS
▪ Features already removed by SBS are not added by SFS

Dimensionality Reduction

PCA
• An orthogonal projection of the data onto a lower-dimensional linear space that maximizes the variance of the projected data and minimizes the mean squared distance between each data point and its projection
• Algorithm:
o Given data {x₁, ..., x_n}, compute the covariance matrix Σ
▪ X is the n × d data matrix
▪ Compute the data mean (average over all rows of X)
▪ Subtract the mean from each row of X (centering the data)
▪ Compute the d × d covariance matrix Σ = XᵀX
o The PCA basis vectors are given by the eigenvectors of Σ
▪ {q_i, λ_i}, i = 1...n: the eigenvectors and eigenvalues of Σ, ordered λ₁ ≥ λ₂ ≥ ... ≥ λ_n
▪ Each column of Q gives the weights for a linear combination of the original features
▪ E.g., 0.34·feature1 + 0.04·feature2 - 0.64·feature3 + ...
• You can ignore the components of lesser significance: choose the first k eigenvectors based on their eigenvalues
o The final data set has only k dimensions
o You lose some information, but if the discarded eigenvalues are small enough, you don't lose much
• The re-projected data matrix is given by X̃ = XQ
• This provides the best reconstruction through the orthogonal projection: it provides the smallest L2 error.
o Reconstruction maps back from the M-dimensional projection to the original N dimensions
• A disadvantage of using PCA is that the discriminative information that distinguishes one class from another might lie in the low-variance components, so using PCA can make performance worse.
• PCA assumes that features that present high variance are more likely to give a good split between classes
• It doesn't discard less relevant features, but linearly transforms them into new attributes
• A principal axis with eigenvalue 0 corresponds to an irrelevant feature, but it might still be useful (relevance vs. usefulness)
o Relevance refers to information gain in an ideal learner (the Bayes optimal classifier)
o Usefulness refers to the decrease in error in a particular learner
• PCA looks for properties that show as much variation across classes as possible to build the principal component space. The algorithm uses the variance matrix, covariance matrix, and eigenvector/eigenvalue pairs to perform PCA, providing a set of eigenvectors and their respective eigenvalues as a result.
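As a sketch of the PCA machinery above, the first principal component of 2-D data can be found by power iteration on the covariance matrix. Pure Python, and power iteration here stands in for the full eigendecomposition the slides describe:

```python
def first_pc(points, iters=200):
    """First principal component of 2-D data: center the data, build the
    2x2 covariance matrix, then power-iterate to its dominant eigenvector."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Covariance matrix entries (X^T X / n on the centered data).
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    v = (1.0, 0.0)
    for _ in range(iters):
        # Multiply by the covariance matrix, then renormalize.
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = (w[0] / norm, w[1] / norm)
    return v

# Data spread mostly along y = x, so the first PC is close to (0.707, 0.707).
pc = first_pc([(0, 0), (1, 1), (2, 2), (3, 3.1)])
```

Projecting each centered point onto `pc` would give the 1-D representation with the smallest L2 reconstruction error, matching the "maximize variance / minimize squared error" view.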
Understand the meaning of principal components
• Principal component #1 points in the direction of largest variance
o PCA also finds directions mutually orthogonal to the first component found
o Each subsequent principal component is orthogonal to the previous ones and points in the direction of the largest variance of the residual subspace
• Orthogonal projection onto a lower-dimensional space
• Maximize variance / minimize squared error
• Use for preprocessing

Applications
• Visualization
o Visualizing the clusters of the handwritten digit data set
▪ From a high dimension down to 2 dimensions (though only 16% of the variance is explained in the 2 dimensions)
• Facial recognition
o Eigenfaces
▪ The eigenvectors of the covariance matrix of the probability distribution of the vector space of human faces
▪ The standardized face ingredients derived from statistical analysis of many pictures of human faces
▪ A human face may be considered to be a combination of these standard face ingredients
o To generate a set of eigenfaces:
▪ A large set of digitized images of human faces is taken under the same lighting conditions
▪ The images are normalized to line up the eyes and mouths
▪ The eigenvectors of the covariance matrix of the statistical distribution of face image vectors are then extracted
▪ These eigenvectors are the eigenfaces

Markov Decision Processes
• Definition: A Markov decision process (MDP) is a discrete-time stochastic (random) control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying optimization problems solved via dynamic programming and reinforcement learning.
• Defined by:
o A set of states
o A set of actions
o A transition function T(s, a, s')
▪ The probability that a from s leads to s': P(s'|s, a)
▪ Aka the transition model or dynamics
o A reward function R(s, a, s')
o A start state
o Maybe a terminal state
• What does the "Markov assumption" mean?
• The first-order Markov assumption means that the next state depends only on the current state and the action taken there, not on the earlier history.
o Given the present state, the future and the past are independent
• Policies:
o When you're in a state, you can reason about which action to take next without having to consider your previous states
o A policy is a mapping from states to actions (S → A): given that the agent is at some state, what action should it take?
o π: S → A
o π*(s) = the optimal action from state s
▪ The optimal policy maximizes the expected utility if followed (we want to find this)
o An explicit policy defines a reactive (reflex) agent
• MDP search trees:
o A discrete set of actions can be taken at any state
o Node = state, edge = action, child = next state
o When you take action a (go down one edge), you can end up at any of the children, because the transition function is stochastic
o (s, a, s') is a transition
• Reward vs. utility vs. value vs. Q-function:
o Value = utility
o V*(s) = the expected utility of starting in s and acting optimally (the average sum of discounted rewards)
▪ If we know the value at every state, we know the optimal policy
▪ With value functions, a greedy policy will be optimal no matter the state
o Q*(s, a) = the expected utility of starting out having taken action a from state s and thereafter acting optimally
o Bellman equations (with discount factor γ):
▪ V*(s) = max_a Σ_s' T(s, a, s') [R(s, a, s') + γ V*(s')]
▪ Q*(s, a) = Σ_s' T(s, a, s') [R(s, a, s') + γ max_a' Q*(s', a')]
▪ A definition of "optimal" utility via an expectimax recurrence

Passive Reinforcement Learning
• Policy evaluation
o The policy is fixed
o We don't know the transition model or reward function
o Goal: learn the state values
o The learner is "along for the ride"
▪ No choice in the actions to take
▪ Just execute the policy and learn from experience
▪ This is NOT offline planning - you actually take actions
o Done for every state - takes a lot of data and time
• Direct evaluation
o Goal: compute values for each state under the policy
o Average together the observed sample values
▪ Act according to the policy
▪ Every time you visit a state, write down what the sum of discounted rewards
turned out to be
▪ Average those samples
o Pros:
▪ Easy to understand
▪ Doesn't require knowledge of T, R
▪ Eventually computes the correct average values, using just sample transitions
o Cons:
▪ Each state must be learned separately
o We can't use policy iteration, because it needs T and R

Temporal Difference Learning
• A MODEL-FREE way to do policy evaluation
• Update V(s) each time we experience a transition (s, a, s', r)
• Likely outcomes s' will contribute updates more often
• The policy is fixed - we are doing evaluation
• Sample estimate of V(s): sample = R(s, π(s), s') + γ V^π(s')
• Update to V(s): V^π(s) ← (1 - α) V^π(s) + α · sample
o Equivalently: V^π(s) ← V^π(s) + α (sample - V^π(s))
• Problems:
o If we want to turn the values into a new policy, we're sunk (that would require T and R)
• On-policy vs. off-policy
o Off-policy methods converge to the optimal policy even when acting suboptimally
• Exploration vs. exploitation
o Epsilon-greedy search: choose a random action with a low probability epsilon to help explore.
• How is the Q-value calculated?
o Q(s, a) ← (1 - α) Q(s, a) + α [R(s, a, s') + γ max_a' Q(s', a')]
o Alpha closer to 1 - more weight on the new sample
o Alpha closer to 0 - more weight on the past
• Approximate Q-learning with linear Q-functions: related to least squares regression

Information Theory (don't think we need this?)

Midterm Solutions/Notes
How this works:
• I take notes by writing down the question and the answer to the problem in one sentence.
• If it's a true-or-false question or some equivalent, bolded words indicate what the question was
• I need help explaining some solutions. I use the words "also" and "explain" to indicate information in addition to the question. If you see something wrong, just edit it!

4. Linear decision boundaries: perceptron. Non-linear decision boundaries: polynomial regression, decision tree of depth 2, KNN. Also, I think logistic regression, SVMs with linear kernels, and naive Bayes produce linear decision boundaries.
5. ID3 can produce suboptimal decision trees.
6.
The max depth of a decision tree can be > the # of attributes. Explain: you can make more complex boundaries for the same set of attributes. This is wrong - the max depth will be n - 1 (n = # of attributes) if you have a fully complex DT that classifies only 1 sample at a time.
7. The max depth of a decision tree must be less than the number of training instances. Explain: I know ID3 requires instances on either side of a split, which might be what this is referencing, but I don't know why you couldn't just make a massive, randomly optimized decision tree.
8. Splits in lower parts of decision trees are more likely to be modelling noise.
9. Gradient descent may converge to a local, non-global optimum.
10. Decision trees and logistic regression can produce the same decision boundary. Also, I think any pair of classification algorithms is capable of producing the same decision boundaries given special initializations and data. Maybe perceptron won't? I dunno.
11-15: Do the following algorithms guarantee global optima or merely local optima?
46. Training a naive Bayes classifier with infinite training examples would not guarantee zero training or test error. Help - is there anything magical along these lines anyone knows? https://www.cs.cmu.edu/~tom/10701_sp11/midterm_sol.pdf See the solution to problem 1.1. Basic idea: it's a probabilistic approach, so nothing is guaranteed.
47. KNN and NB don't both assume conditional independence. Help - can someone explain? lol http://www.cs.virginia.edu/~hw5x/Course/TextMining-2018Spring/_site/docs/PDFs/kNN%20&%20Naive%20Bayes.pdf On page 6, it showed that kNN only uses Bayes' rule. On pages 27-29, it showed NB also assumes conditional independence. → I mean, KNN doesn't really use independence in its algorithm or its decisions; it's just looking at the k nearest neighbors. NB uses conditional independence for sure. The conditional independence for NB is on the features. KNN kind of uses conditional independence, but it's on samples, not on features.
Additional material