CS-7643 Quiz 4 Exam – Deep Learning Optimization & Regularization Study Guide, Exams of Advanced Education

2025/2026

Available from 06/23/2026

Embedding - ANSWER-A learned map from entities to vectors that encodes similarity

Graph Embedding - ANSWER-Optimize the objective that connected nodes have more similar embeddings than unconnected nodes.
Task: convert nodes to vectors
  • effectively unsupervised learning where nearest neighbors are similar
  • these learned vectors are useful for downstream tasks

Multi-layer Perceptron (MLP) pain points for NLP - ANSWER-
  • Cannot easily support variable-sized sequences as inputs or outputs
  • No inherent temporal structure
  • No practical way of holding state
  • The size of the network grows with the maximum allowed size of the input or output sequences
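
As a rough illustration of the Embedding card above, the sketch below keeps a tiny embedding table and looks up the most similar entity by cosine similarity. The entity names, the vector values, and the helper names (`EMBEDDINGS`, `cosine_similarity`, `nearest_neighbor`) are made up for illustration; real embeddings are learned from data, not written by hand.

```python
import math

# Hypothetical 3-dimensional embeddings (assumed values, not learned ones).
EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbor(entity, table):
    """Return the other entity whose vector is most similar to `entity`."""
    query = table[entity]
    others = (name for name in table if name != entity)
    return max(others, key=lambda name: cosine_similarity(query, table[name]))
```

With these assumed vectors, `nearest_neighbor("cat", EMBEDDINGS)` picks "dog", which is the sense in which nearest neighbors in the vector space are similar.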

Truncated Backpropagation through time - ANSWER-
  • Only backpropagate an RNN through T time steps

Recurrent Neural Networks (RNN) - ANSWER-
h(t) = activation(U * input(t) + V * h(t-1) + bias)
y(t) = activation(W * h(t) + bias)
  • activation is typically the logistic function or tanh
  • outputs can also simply be h(t)
  • family of NN architectures for modeling sequences

Training Vanilla RNN's difficulties - ANSWER-
  • Vanishing gradients
  • In the scalar linear case h(t) = w * h(t-1), each step contributes dh(t)/dh(t-1) = w, so dh(t)/dh(0) = w^t
  • if |w| > 1: exploding gradients
  • if |w| < 1: vanishing gradients

Long Short-Term Memory Network Gates and States - ANSWER-
  • f(t) = forget gate
  • i(t) = input gate
  • u(t) = candidate update gate
  • o(t) = output gate
  • c(t) = cell state
  • c(t) = f(t) * c(t - 1) + i(t) * u(t)
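
The gates and the cell-state update listed above can be sketched as a single scalar LSTM step. This is a minimal sketch under stated assumptions: the `params` layout and the function names are hypothetical, every quantity is a scalar rather than a vector, and the hidden-state line `h = o * tanh(c)` is the standard LSTM output rule, which the card itself does not list.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step.

    params maps a gate name ("f", "i", "u", "o") to a tuple (w_x, w_h, b);
    this layout is an assumption made for the sketch.
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre_activation("f"))    # forget gate
    i = sigmoid(pre_activation("i"))    # input gate
    u = math.tanh(pre_activation("u"))  # candidate update
    o = sigmoid(pre_activation("o"))    # output gate
    c = f * c_prev + i * u              # cell state update from the card
    h = o * math.tanh(c)                # standard output rule (not on the card)
    return h, c
```

Note that the previous cell state enters the new one through an addition scaled by `f`, not through a repeated squashing nonlinearity, which is what lets information persist across many steps.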

L(dist) = cross-entropy (CE) between student and teacher predictions
L(student) = CE between predicted output and actual labels
L = alpha * L(dist) + beta * L(student)
Advantages:
  • may work well because of the soft predictions of the teacher model
  • if we don't have enough labeled text we can still train the student model to align its predictions with the teacher's

Collobert and Weston Vector Idea - ANSWER-a word and its context is a positive training sample; a random word in that same context gives a negative training sample

Word2vec Overview - ANSWER-Word2vec - a framework for learning word vectors
Idea:
  • we have a large corpus of text
  • every word in a fixed vocabulary is represented by a vector
  • Go through each position t in the text, which has a center word c and context words o
  • Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
  • Keep adjusting the word vectors to maximize this probability

Word2vec Variants - ANSWER-
Skip-Gram: Predict context words given center word
Continuous Bag of Words: Predict center word from (bag of) context words

Word2vec Objective Function - ANSWER-
  • Likelihood L(theta): product over all center-word positions t
  • and, for each t, product over all words in the context window
  • of P( w(t+j) | w(t); theta )
  • J(theta) = - 1 / T * log( L(theta) )

Word2vec P( w(t+j) | w(t) ) - ANSWER-
  • Two sets of vectors for each word in vocabulary
  1. u(w) for when w is the center word
  2. v(w) for when w is a context word
P( w(t+j) | w(t) ) = softmax( u(w(t)) · v(w(t+j)) ), normalized over the context vectors of every word in the vocabulary

Word2vec Expensive to Compute Solutions - ANSWER-
  1. Hierarchical Softmax
  2. Negative Sampling

Negative Sampling Intuition - ANSWER-
  • For each (w, c) pair, sample k negative pairs (w, c')
  • maximize the probability that the real word appears and minimize the probability that a random word appears
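
The full-softmax probability and the negative-sampling alternative described in the Word2vec cards above can be compared in a short sketch. It follows this guide's naming (u for center vectors, v for context vectors). The function names are hypothetical, and the sigmoid-based loss is the usual negative-sampling formulation, which is an assumption here because the card only states the intuition.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax_prob(center_vec, context_vecs, target):
    """P(target | center) using a softmax over every context vector.

    The cost grows with the vocabulary size, which is why it is expensive.
    """
    scores = {w: dot(center_vec, v) for w, v in context_vecs.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    return exps[target] / sum(exps.values())

def negative_sampling_loss(center_vec, true_context_vec, negative_vecs):
    """-log sigmoid(u . v_true) - sum over k of log sigmoid(-u . v_neg_k).

    Only k + 1 dot products are needed instead of one per vocabulary word.
    """
    loss = -math.log(sigmoid(dot(center_vec, true_context_vec)))
    for v_neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(center_vec, v_neg)))
    return loss
```

The loss is small when the center vector points toward the real context vector and away from the sampled negatives, which matches the "maximize real, minimize random" intuition.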
  • the score of an edge f(e) is a similarity (such as a dot product or cosine) between the source embedding and a transformed version of the destination embedding
  • f(e) = cos( theta(s) , theta(d) + theta(r) )

Graph Embedding is Slow: Reason and Solution - ANSWER-
  • Training time dominated by computing scores for "fake edges"
  • Corrupt a sub-batch of edges with the same set of random nodes

Debiasing word2vec - ANSWER-
  • identify gender subspace with gendered words
  • project all words onto this subspace
  • subtract those projections from the original word vectors
Problem: Not that effective, and bias pervades the word embedding space

t-SNE things to remember - ANSWER-
  1. Run until it stabilizes
  2. Set perplexity between 2 and N
     • perplexity loosely measures # neighbors
     • balances between local and global aspects of nodes
  3. Re-run t-SNE multiple times to ensure we get the same shape
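
To make "perplexity loosely measures # neighbors" concrete, the sketch below computes the perplexity of a neighbor distribution as 2 raised to its entropy in bits. The card does not give this formula; it is the standard t-SNE definition, and the function names and distance values are assumptions for illustration.

```python
import math

def perplexity(probs):
    """Perplexity = 2 ** (Shannon entropy in bits) of a distribution."""
    entropy_bits = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy_bits

def gaussian_neighbor_probs(sq_distances, sigma):
    """Neighbor probabilities p(j|i) from squared distances to point i,
    using a Gaussian kernel with bandwidth sigma."""
    weights = [math.exp(-d / (2 * sigma ** 2)) for d in sq_distances]
    total = sum(weights)
    return [w / total for w in weights]
```

A uniform distribution over 5 neighbors has perplexity 5, so the value reads as an effective neighbor count. A narrow kernel (small `sigma`) concentrates on the closest point and gives a perplexity near 1, while a wide kernel spreads weight over all points, which is the local versus global trade-off the card mentions.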

t-SNE general concept - ANSWER-
  • Maps inputs from high dimensional space to lower dimensions for visualization
  • iteratively moves similar points closer and distant points further apart
  • expands dense clusters and contracts sparse clusters

Teacher Forcing - ANSWER-
  • next input to model is not predicted value, but the actual value from the training data
  • allows model to train effectively even if a mistake was made
  • if used instead of hidden-to-hidden recurrence nodes, can allow for parallelization, but model becomes less powerful
  • emerges from MLE
  • issues may arise if network is later going to be used in "closed-loop" mode where output is fed back as input

Skip-Gram Model: Loss/Objective Function - ANSWER-
Loss - for each position t, we try to predict the context words within a fixed window size given the center word
  • multiply these probabilities to get a likelihood
  • L(theta) = product over t ( product over j ( P(w_(t+j) | w_(t) ; theta) ) )
  • Objective function: J(theta) = - 1/T log(L(theta))

Skip-Gram Model: Calculate P(w_(t+j) | w_(t) ; theta) - ANSWER-
  • Two vectors for each word:
  1. u_w when w is center word
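
The skip-gram loss described above walks over every position t and every offset j inside the window. A minimal sketch of that loop, and of the objective J(theta) as an average negative log-probability, is given below. The function names, the `prob` callback, and the toy sentence in the usage note are hypothetical.

```python
import math

def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for each position t and offset j != 0."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue  # skip the center itself and out-of-range positions
            pairs.append((center, tokens[t + j]))
    return pairs

def objective(pairs, prob, num_positions):
    """J(theta) = -(1/T) * log L(theta).

    The log of the product of probabilities is computed as a sum of logs.
    `prob(center, context)` stands in for P(context | center; theta).
    """
    log_likelihood = sum(math.log(prob(c, o)) for c, o in pairs)
    return -log_likelihood / num_positions
```

For example, `skipgram_pairs(["the", "cat", "sat"], 1)` yields four pairs, one for each neighboring word on either side of each center word.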
  • a negative sample edge is found by taking a real edge and replacing either the source or destination node
  • negative sampling is a bottleneck since there are many more negative edges than real edges

PyTorch-BigGraph: Edge Scores - ANSWER-
  • f(e) = cos(theta_s, theta_d + theta_r)
  • theta_s = source vertex embedding
  • theta_d = destination vertex embedding
  • theta_r = relation vector

Structured Representation: State, Neighborhood, Propagation of Info - ANSWER-
  • State: compactly represents all data we've seen (nodes)
  • Neighborhood: what other elements to incorporate / how two nodes are connected (edges)
  • Propagation: how to combine structured data to get new state/vector representations

Non-Local Neural Network - ANSWER-
  • Allows it to learn its own connectivity pattern
  • Does so in a data-dependent way
  • it's called non-local because you don't have a specific local receptive field
  • y_i = 1/C(x) * sum over j ( f(x_i, x_j) * g(x_j) )
  • f is a similarity function
  • g computes the features of the input at position j

Skip Gram Model: What is conditioned on what? - ANSWER-
  • Probability of a context word given a center word

RNNs and LSTMs, how their update rules differ, and what problems they each have or solve (vanishing and/or exploding gradients) - ANSWER-
  • Solving Vanishing Gradients: The LSTM does so by keeping an internal memory (cell) state to which the processed input is simply added, which greatly reduces the multiplicative effect of small gradients. The time dependence and effects of previous inputs are controlled by a forget gate, which determines which states are remembered or forgotten.
  • Input gate determines the extent to which the current time step's input should be used
  • Forget gate determines the extent to which the previous time step's state should be kept
  • Output gate determines the output of the current time step

RNN Advantages - ANSWER-
  1. Allows generalization to sequence lengths not seen in the training set
  2. Model can be estimated with fewer parameters (weights are shared across time steps)
  3. Allows sequential modeling

RNN Disadvantages - ANSWER-
  • Runtime is O(T) and cannot be reduced by parallelization since propagation is sequential
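
The contrast drawn above between vanilla RNN and LSTM gradients can be checked numerically. This is a simplified sketch: the RNN is the scalar linear recurrence h(t) = w * h(t-1), and the LSTM gradient follows only the direct cell-state path while treating gate values as constants. Both simplifications, and the function names, are assumptions for illustration.

```python
def rnn_gradient(w, steps):
    """dh(T)/dh(0) for the scalar linear recurrence h(t) = w * h(t-1)."""
    grad = 1.0
    for _ in range(steps):  # sequential: one multiplication per time step
        grad *= w
    return grad

def lstm_cell_gradient(forget_gates):
    """dc(T)/dc(0) along c(t) = f(t) * c(t-1) + i(t) * u(t),
    with the gate values held constant (a simplification)."""
    grad = 1.0
    for f in forget_gates:
        grad *= f
    return grad
```

Over 50 steps, `rnn_gradient(0.5, 50)` collapses toward zero and `rnn_gradient(1.5, 50)` blows up, while `lstm_cell_gradient([0.99] * 50)` stays at a usable magnitude because a forget gate near 1 passes the gradient through almost unchanged. The loops also show why runtime is O(T): each step depends on the previous one.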