CS-7643 QUIZ 4 DEEP LEARNING OPTIMIZATION REGULARIZATION PRACTICE SCRIPT UPDATED 2026 TEST, Exams of Advanced Algorithms

CS-7643 QUIZ 4 DEEP LEARNING OPTIMIZATION REGULARIZATION PRACTICE SCRIPT UPDATED 2026 TESTED SOLUTIONS

Typology: Exams

2025/2026

Available from 01/27/2026

alcorbgeneralstore


⫸ Graph Embedding Answer: Optimize the objective that connected nodes have more similar embeddings than unconnected nodes.
Task: convert nodes to vectors

  • effectively unsupervised learning where nearest neighbors are similar
  • these learned vectors are useful for downstream tasks

⫸ Multi-layer Perceptron (MLP) pain points for NLP Answer:

  • Cannot easily support variable-sized sequences as inputs or outputs
  • No inherent temporal structure
  • No practical way of holding state
  • The size of the network grows with the maximum allowed size of the input or output sequences

⫸ Truncated Backpropagation through time Answer:

  • Only backpropagate an RNN through T time steps
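A minimal sketch of what truncating backpropagation through time means, assuming a scalar linear recurrence h(t) = w * h(t-1) + x(t) in place of a full RNN. The function names `forward` and `grad_wrt_w` and the toy inputs are made up for illustration. The gradient of the final state with respect to w is accumulated backwards over only the last k steps; earlier steps are ignored.

```python
def forward(w, xs, h0=0.0):
    """Run h(t) = w * h(t-1) + x(t) and return every state (hs[0] is h0)."""
    hs = [h0]
    for x in xs:
        hs.append(w * hs[-1] + x)
    return hs


def grad_wrt_w(w, xs, k=None):
    """d h(T) / d w, backpropagating through only the last k steps.

    k=None backpropagates through the whole sequence (full BPTT).
    """
    hs = forward(w, xs)
    T = len(xs)
    k = T if k is None else min(k, T)
    grad = 0.0
    upstream = 1.0  # d h(T) / d h(t), starting at t = T
    for t in range(T, T - k, -1):
        grad += upstream * hs[t - 1]  # direct effect of w at step t
        upstream *= w                 # push the gradient one step further back
    return grad


xs = [1.0, 1.0, 1.0, 1.0]
full = grad_wrt_w(0.5, xs)            # all 4 steps
truncated = grad_wrt_w(0.5, xs, k=2)  # only the last 2 steps
print(full, truncated)
```

The truncated value is cheaper to compute but drops the contribution of early time steps, which is the trade-off the card describes.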

⫸ Recurrent Neural Networks (RNN) Answer:
h(t) = activation(U * input(t) + V * h(t-1) + bias)
y(t) = activation(W * h(t) + bias)

  • activation is typically the logistic function or tanh
  • outputs can also simply be h(t)
  • family of NN architectures for modeling sequences

⫸ Training Vanilla RNN's difficulties Answer:

  • Vanishing and exploding gradients
  • For a scalar linear recurrence x(t) = w * x(t-1): dx(t)/dx(t-1) = w, so dx(t)/dx(0) = w^t
  • if |w| > 1: exploding gradients
  • if |w| < 1: vanishing gradients

⫸ Long Short-Term Memory Network Gates and States Answer:

  • f(t) = forget gate
  • i(t) = input gate
  • u(t) = candidate update
  • o(t) = output gate
  • c(t) = cell state
  • c(t) = f(t) * c(t - 1) + i(t) * u(t)
  • h(t) = hidden state
  • h(t) = o(t) * tanh(c(t))
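The gate and state equations above can be sketched as one step of a scalar LSTM. This is an illustrative toy, not a reference implementation: the `params` layout (gate name mapped to an input weight, a recurrent weight, and a bias) is an assumption, and real LSTMs use weight matrices and vectors rather than scalars.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step. params maps gate name -> (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))    # forget gate
    i = sigmoid(pre("i"))    # input gate
    u = math.tanh(pre("u"))  # candidate update
    o = sigmoid(pre("o"))    # output gate
    c = f * c_prev + i * u   # c(t) = f(t) * c(t-1) + i(t) * u(t)
    h = o * math.tanh(c)     # h(t) = o(t) * tanh(c(t))
    return h, c


# With all-zero weights every sigmoid gate is 0.5 and the candidate is 0,
# so the cell state is simply halved at each step.
params = {name: (0.0, 0.0, 0.0) for name in "fiuo"}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=params)
print(h, c)
```

Note that the cell state is updated additively (old state scaled by the forget gate plus gated new content), which is the part that differs from the vanilla RNN update.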

L(dist) = CE b/w student and teacher predictions
L(student) = CE b/w predicted output and actual
L = alpha * L(dist) + beta * L(student)
Advantages:

  • may work well b/c of soft predictions of teacher model
  • if we don't have enough labeled text we can still train student model to align predictions

⫸ Collobert and Weston Vector Idea Answer: a word and its context is a positive training sample; a random word in that same context gives a negative training sample

⫸ Word2vec Overview Answer: Word2vec - a framework for learning word vectors
Idea:

  • we have a large corpus of text
  • every word in a fixed vocabulary is represented by a vector
  • Go through each position t in the text, which has a center word c and context words o
  • Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
  • Keep adjusting the word vectors to maximize this probability
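As a rough illustration of "use the similarity of the word vectors for c and o to calculate the probability of o given c", the sketch below scores every candidate context word by a dot product with the center word's vector and normalizes with a softmax. The three-word vocabulary, the vector values, and the names `center_vecs`, `context_vecs` and `prob_context_given_center` are invented for the example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def prob_context_given_center(center, center_vecs, context_vecs):
    """Return {word: P(word | center)} via a softmax over dot-product scores."""
    scores = {w: dot(center_vecs[center], vec) for w, vec in context_vecs.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


# Toy vectors: one set used when a word is the center, one when it is context.
center_vecs = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
context_vecs = {"cat": [1.0, 0.0], "dog": [1.0, 0.2], "car": [-1.0, 1.0]}

probs = prob_context_given_center("cat", center_vecs, context_vecs)
print(probs)
```

Training would then nudge the vectors so that observed (center, context) pairs receive higher probability. The sum over the whole vocabulary in the denominator is what makes the exact softmax expensive for large vocabularies.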

⫸ Word2vec Variants Answer:
Skip-Gram: Predict context words given center word
Continuous Bag of Words: Predict center word from (bag of) context words

⫸ Word2vec Objective Function Answer:

  • Likelihood L(theta): product over all center word positions t = 1..T
  • and, for each position, product over all words in the context window (offsets j, j != 0)
  • of P( w(t+j) | w(t); theta )
  • J(theta) = - 1 / T * log L(theta)

⫸ Word2vec P( w(t+j) | w(t) ) Answer:

  • Two sets of vectors for each word w in the vocabulary
  1. u(w) for when w is the center word
  2. v(w) for when w is a context word
P( w(t+j) | w(t) ) = softmax( u(w(t)) · v(w(t+j)) ), i.e. exp( u(c) · v(o) ) divided by the sum of exp( u(c) · v(w) ) over all words w in the vocabulary

⫸ Word2vec Expensive to Compute Solutions Answer:
  1. Hierarchical Softmax
  2. Negative Sampling
  • Nearest Neighbors are semantically meaningful

⫸ Graph Embeddings Loss Function Answer:

  • Margin loss between the score of an edge f(e) and a negative sampled edge f(e')
  • Negative sampled edges are constructed by taking a real edge and replacing either the source or destination vertex with a random node
  • the score of an edge f(e) is a similarity (e.g. dot product or cosine) between the source embedding and a transformed version of the destination embedding
  • f(e) = cos( theta(s) , theta(d) + theta(r) )

⫸ Graph Embedding is Slow: Reason and Solution Answer:

  • Training time dominated by computing scores for "fake edges"
  • Corrupt a sub-batch of edges with the same set of random nodes

⫸ Debiasing word2vec Answer:

  • identify gender subspace with gendered words
  • project all words onto this subspace
  • subtract those projections from the original word vectors
Problem: Not that effective and bias pervades the word embedding space

⫸ t-SNE things to remember Answer:
  1. Run until it stabilizes
  2. Set perplexity b/w 2 and N
  • perplexity loosely measures # neighbors
  • balances b/w local and global aspects of nodes
  3. Re-run t-SNE multiple times to ensure we get the same shape

⫸ t-SNE general concept Answer:

  • Maps inputs from high dimensional space to lower dimensions for visualization
  • iteratively moves similar points closer and distant points farther apart
  • expands dense clusters and contracts sparse clusters

⫸ Teacher Forcing Answer:

  • next input to model is not predicted value, but the actual value from the training data
  • allows model to train effectively even if a mistake was made
  • if used instead of hidden-to-hidden recurrence connections, can allow for parallelization, but model becomes less powerful
  • emerges from MLE
  • issues may arise if network is later going to be used in "closed-loop" mode where output is fed back as input

⫸ Skip-Gram Model: Loss/Objective Function Answer: Loss - for each position t, we try to predict the context words within a fixed window size given the center word

  • multiply these probabilities to get a likelihood
  • for negative sampling, choose a distribution that makes less frequent words more likely to be sampled

⫸ Word Embeddings as a graph Answer:

  • each word is a node with edge connections to context words

⫸ Pytorch Big Graph: Idea Answer:

  • Start with multi-relation graph (different edge types that encode different relations)
  • minimize margin loss b/w an edge score and a negative sampled edge
  • a negative sample edge is found by taking a real edge and replacing either the source or destination node with a random node
  • negative sampling is a bottleneck since there are many more negative edges than real edges

⫸ Pytorch Big Graph: Edge Scores Answer:

  • f(e) = cos(theta_s, theta_d + theta_r)
  • theta_s = source vertex embedding
  • theta_d = destination vertex embedding
  • theta_r = relation vector

⫸ Structured Representation: State, Neighborhood, Propagation of Info Answer:
State - compactly represents all data we've seen (nodes)
Neighborhood - What other elements to incorporate/how two nodes are connected (edges)

Propagation - how to combine structured data to get new state/vector representations

⫸ Non-Local Neural Network Answer:

  • Allows the network to learn its own connectivity pattern
  • Does so in a data-dependent way
  • it's called non-local because you don't have a specific local receptive field
  • y_i = 1/C(x) * sum over j of f(x_i, x_j) * g(x_j)
  • f is a similarity function
  • g computes the features of x_j that get aggregated

⫸ Skip Gram Model: What is conditioned on what? Answer:

  • Probability of a context word given a center word

⫸ RNNs and LSTMs, how their update rules differ, and what problems each have or solve (vanishing and/or exploding gradients) Answer:

  • Solving Vanishing Gradients: The LSTM does so by creating an internal memory state that is simply added to the processed input, which greatly reduces the multiplicative effect of small gradients. The time dependence and effects of previous inputs are controlled by a forget gate, which determines which states are remembered or forgotten.
  • Input gate determines the extent to which the current time step's input should be used
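A small numeric sketch of why the additive memory state helps, under simplifying assumptions: the vanilla RNN is reduced to a scalar linear recurrence with weight w, and for the LSTM only the direct cell-state path is considered, where the gradient from c(t) back to c(t-1) is the forget gate value. The function names and the values 0.5 and 0.99 are arbitrary choices for illustration.

```python
def rnn_gradient_factor(w, steps):
    """Gradient scale after `steps` steps of the scalar recurrence x(t) = w * x(t-1)."""
    return w ** steps


def lstm_cell_gradient_factor(forget_gates):
    """Product of forget-gate values along the direct cell-state path."""
    factor = 1.0
    for f in forget_gates:
        factor *= f
    return factor


steps = 50
rnn_factor = rnn_gradient_factor(0.5, steps)              # shrinks towards zero
lstm_factor = lstm_cell_gradient_factor([0.99] * steps)   # stays usable
print(rnn_factor, lstm_factor)
```

When the forget gate stays close to 1, the gradient along the cell state decays slowly, so information from many steps back can still influence learning. If the forget gate is small, the signal is dropped on purpose.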