CS-7643 Quiz 4 Exam – Deep Learning Optimization & Regularization Study Guide, Exams of Advanced Education

Harvard University Advanced Education


Typology: Exams

2025/2026

Available from 06/23/2026


Embedding - ANSWER-A learned map from entities to vectors that encodes similarity

Graph Embedding - ANSWER-Optimize the objective that connected nodes have more similar embeddings than unconnected nodes.
Task: convert nodes to vectors
- effectively unsupervised learning where nearest neighbors are similar
- these learned vectors are useful for downstream tasks
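
The objective in the Graph Embedding card can be illustrated with a small sketch. The margin (hinge) ranking loss, the dot-product similarity, and the toy embeddings below are all assumptions chosen for illustration; the card itself only states that connected nodes should end up more similar than unconnected ones.

```python
def dot(a, b):
    """Dot-product similarity between two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


def margin_loss(emb, edge, non_edge, margin=1.0):
    """Hinge loss that reaches zero once the connected pair outscores
    the unconnected pair by at least `margin` (assumed loss form)."""
    pos = dot(emb[edge[0]], emb[edge[1]])
    neg = dot(emb[non_edge[0]], emb[non_edge[1]])
    return max(0.0, margin - pos + neg)


# Hypothetical 2-d embeddings: a and b point the same way, c points away.
emb = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [-1.0, 0.0]}

# Treating (a, b) as the real edge is already satisfied by these vectors.
loss_when_consistent = margin_loss(emb, ("a", "b"), ("a", "c"))
# Treating (a, c) as the real edge is heavily penalized.
loss_when_inconsistent = margin_loss(emb, ("a", "c"), ("a", "b"))
```

Minimizing a loss of this shape over many (edge, non-edge) pairs pushes neighbors together, which is why nearest neighbors in the learned space tend to be similar nodes.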

Multi-layer Perceptron (MLP) pain points for NLP - ANSWER-
- Cannot easily support variable-sized sequences as inputs or outputs
- No inherent temporal structure
- No practical way of holding state
- The size of the network grows with the maximum allowed size of the input or output sequences


Truncated Backpropagation through time - ANSWER-
- Only backpropagate an RNN through T time steps

Recurrent Neural Networks (RNN) - ANSWER-
h(t) = activation(U*input + V*h(t-1) + bias)
y(t) = activation(W*h(t) + bias)
- activation is typically the logistic function or tanh
- outputs can also simply be h(t)
- family of NN architectures for modeling sequences

Training Vanilla RNNs: difficulties - ANSWER-
- Vanishing gradients
- For a scalar linear recurrence x(t) = w * x(t-1), dx(t)/dx(0) = w^t
- if |w| > 1: exploding gradients
- if |w| < 1: vanishing gradients

Long Short-Term Memory Network Gates and States - ANSWER-
- f(t) = forget gate
- i(t) = input gate
- u(t) = candidate update gate
- o(t) = output gate
- c(t) = cell state
- c(t) = f(t) * c(t - 1) + i(t) * u(t)
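
The cell-state update above can be sketched for a single scalar unit. The sigmoid gate activations, the tanh candidate, and the pre-activation argument names are assumptions (they are the usual choices, but the card does not state them), and the weights that would produce the pre-activations are left out.

```python
import math


def sigmoid(z):
    """Logistic function, squashing a pre-activation into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_cell_state(c_prev, f_pre, i_pre, u_pre):
    """One scalar step of c(t) = f(t) * c(t-1) + i(t) * u(t).

    f_pre, i_pre, u_pre are hypothetical pre-activations of the forget
    gate, input gate, and candidate update.
    """
    f = sigmoid(f_pre)    # how much of the old cell state to keep
    i = sigmoid(i_pre)    # how much of the candidate to write
    u = math.tanh(u_pre)  # candidate update
    return f * c_prev + i * u


# Forget gate open, input gate closed: the old state is carried through.
kept = lstm_cell_state(c_prev=2.0, f_pre=50.0, i_pre=-50.0, u_pre=1.0)
# Forget gate closed, input gate open: the state is overwritten by the candidate.
overwritten = lstm_cell_state(c_prev=2.0, f_pre=-50.0, i_pre=50.0, u_pre=1.0)
```

Because the old state enters additively and is scaled only by f(t), a forget gate close to 1 lets information (and gradient) pass across many steps largely unchanged.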

L(dist) = CE between student and teacher predictions
L(student) = CE between predicted output and actual
L = alpha * L(dist) + beta * L(student)
Advantages:
- may work well because of the soft predictions of the teacher model
- if we don't have enough labeled text we can still train the student model to align predictions

Collobert and Weston Vector Idea - ANSWER-a word and its context is a positive training sample; a random word in that same context gives a negative training sample

Word2vec Overview - ANSWER-Word2vec - a framework for learning word vectors
Idea:
- we have a large corpus of text
- every word in a fixed vocabulary is represented by a vector
- Go through each position t in the text, which has a center word c and context words o
- Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
- Keep adjusting the word vectors to maximize this probability

Word2vec Variants - ANSWER-
- Skip-Gram: Predict context words given center word
- Continuous Bag of Words: Predict center word from (bag of) context words

Word2vec Objective Function - ANSWER-
- L(theta) = product over all center-word positions t, and over all words w(t+j) in the context window, of P( w(t+j) | w(t); theta )
- J(theta) = -(1 / T) * log L(theta)

Word2vec P( w(t+j) | w(t) ) - ANSWER-
- Two sets of vectors for each word in vocabulary
- u(w) for when w is the center word
- v(w) for when w is a context word
- P( w(t+j) | w(t) ) = softmax( u(w(t)) . v(w(t+j)) ), i.e. exp( u(w(t)) . v(w(t+j)) ) divided by the sum of exp( u(w(t)) . v(w') ) over all words w' in the vocabulary

Word2vec Expensive to Compute Solutions - ANSWER-
1. Hierarchical Softmax
2. Negative Sampling

Negative Sampling Intuition - ANSWER-
- For each (w, c) pair, sample k negative pairs (w, c')
- maximize the probability that the real word appears and minimize the probability that a random word appears
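
The negative sampling intuition can be written out as a loss for one (w, c) pair. The log-sigmoid form below is the commonly used skip-gram negative sampling objective and is an assumption here, since the card gives only the intuition; the vectors are hypothetical toy values.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def negative_sampling_loss(center, context, negatives):
    """Loss for one real (w, c) pair plus k sampled negatives (w, c').

    Low when the real context scores high and the random contexts score low.
    """
    loss = -log_sigmoid(dot(center, context))   # pull the real pair together
    for neg in negatives:
        loss -= log_sigmoid(-dot(center, neg))  # push random pairs apart
    return loss


center = [1.0, 0.0]
# Real context aligned with the center word, negative pointing away.
low = negative_sampling_loss(center, [2.0, 0.0], [[-2.0, 0.0]])
# The reverse arrangement is penalized much more.
high = negative_sampling_loss(center, [-2.0, 0.0], [[2.0, 0.0]])
```

The cost per pair grows with k rather than with the vocabulary size, which is what makes this cheaper than a full softmax.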

- the score of an edge f(e) is a similarity (dot product) between the source embedding and a transformed version of the destination embedding
- f(e) = cos( theta(s) , theta(d) + theta(r) )

Graph Embedding is Slow: Reason and Solution - ANSWER-
- Training time dominated by computing scores for "fake edges"
- Corrupt a sub-batch of edges with the same set of random nodes

Debiasing word2vec - ANSWER-
- identify gender subspace with gendered words
- project all words onto this subspace
- subtract those projections from the original word
Problem: Not that effective, and bias pervades the word embedding space

t-SNE things to remember - ANSWER-
1. Run until it stabilizes
2. Set perplexity between 2 and N
   - perplexity loosely measures # neighbors
   - balances between local and global aspects of nodes
3. Re-run t-SNE multiple times to ensure we get the same shape
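
Why perplexity "loosely measures # neighbors" can be seen from its definition as 2 raised to the entropy (in bits) of a point's neighbor distribution. The sketch below computes only that quantity for hand-written distributions; it is not an implementation of t-SNE, and the example probabilities are made up.

```python
import math


def perplexity(probs):
    """2 ** entropy (in bits) of a discrete neighbor distribution."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy


# Four equally likely neighbors behave like "4 effective neighbors".
uniform_four = perplexity([0.25, 0.25, 0.25, 0.25])
# One dominant neighbor behaves like barely more than 1 effective neighbor.
peaked = perplexity([0.97, 0.01, 0.01, 0.01])
```

A low perplexity setting therefore emphasizes very local structure, while a high one makes each point attend to many neighbors.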

t-SNE general concept - ANSWER-
- Maps inputs from high dimensional space to lower dimensions for visualization
- iteratively moves similar points closer and distant points farther apart
- expands dense clusters and contracts sparse clusters

Teacher Forcing - ANSWER-
- next input to model is not predicted value, but the actual value from the training data
- allows model to train effectively even if a mistake was made
- if used instead of hidden-to-hidden recurrence, can allow for parallelization, but model becomes less powerful
- emerges from MLE
- issues may arise if network is later going to be used in "closed-loop" mode where output is fed back as input

Skip-Gram Model: Loss/Objective Function - ANSWER-
- Loss: for each position t, we try to predict the context words within a fixed window size given the center word
- multiply these probabilities to get a likelihood
- L(theta) = product over positions t of the product over window offsets j of P(w_(t+j) | w_(t) ; theta)
- Objective function: J(theta) = - 1/T log(L(theta))

Skip-Gram Model: Calculate P(w_(t+j) | w_(t) ; theta) - ANSWER-
- Two vectors for each word:
- u_w when w is center word
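
A minimal sketch of how such a probability could be computed, assuming the second vector per word is a context vector (written v here, following the Word2vec card earlier in this guide) and that the probability is a softmax over dot products. The three-word vocabulary and its vectors are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Softmax with the maximum subtracted for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_probs(center_word, u, v):
    """P(o | center_word) for every candidate context word o."""
    words = sorted(v)
    scores = [dot(u[center_word], v[o]) for o in words]
    return dict(zip(words, softmax(scores)))


# Hypothetical center vectors (u) and context vectors (v).
u = {"cat": [1.0, 0.0]}
v = {"purrs": [2.0, 0.0], "barks": [-2.0, 0.0], "the": [0.0, 0.0]}

probs = context_probs("cat", u, v)
most_likely = max(probs, key=probs.get)
```

The normalizing sum runs over the whole vocabulary, which is the expensive part that hierarchical softmax and negative sampling are meant to avoid.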

- a negative sample edge is found by taking a real edge and replacing either the source or destination node
- negative sampling is a bottleneck since there are many more negative edges than real edges

PyTorch-BigGraph: Edge Scores - ANSWER-
- f(e) = cos(theta_s, theta_d + theta_r)
- theta_s = source vertex
- theta_d = destination vertex
- theta_r = relation vector

Structured Representation: State, Neighborhood, Propagation of Info - ANSWER-
- State: compactly represents all data we've seen (nodes)
- Neighborhood: what other elements to incorporate/how two nodes are connected (edges)
- Propagation: how to combine structured data to get new state/vector representations

Non-Local Neural Network - ANSWER-
- Allows the network to learn its own connectivity pattern
- Does so in a data-dependent way
- it's called non-local because you don't have a specific local receptive field
- y_i = (1/C) * sum over j of f(x_i, x_j) * g(x_j)
- f is the similarity function
- g computes the feature representation of x_j

Skip Gram Model: What is conditioned on what? - ANSWER-
- Probability of a context word given a center word

RNNs and LSTMs, how their update rules differ, and what problems each have or solve (vanishing and/or exploding gradients) - ANSWER-
- Solving Vanishing Gradients: the LSTM creates an internal memory (cell) state that is simply added to the processed input, which greatly reduces the multiplicative effect of small gradients. The time dependence and effects of previous inputs are controlled by the forget gate, which determines which states are remembered or forgotten.
- Input gate determines the extent to which the current time step's input should be used
- Forget gate determines the extent to which the previous time step's state should be used
- Output gate determines the output of the current time step.

RNN Advantages - ANSWER-
1. Allows generalization to sequence lengths not seen in the training set
2. Model can be estimated with fewer parameters
3. Allows sequential modeling

RNN Disadvantages - ANSWER-
- Runtime is O(T) and cannot be reduced by parallelization since propagation is sequential
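
Both the parameter-sharing advantage and the O(T) cost show up in a scalar version of the RNN recurrence h(t) = activation(U*input + V*h(t-1) + bias). The tanh activation, the parameter values, and the zero initial state below are assumptions for illustration.

```python
import math


def rnn_forward(inputs, u, v, b, h0=0.0):
    """Run a scalar RNN over a sequence of any length.

    The same three parameters (u, v, b) are reused at every step, and each
    step needs the previous hidden state, so the loop cannot be parallelized
    across time.
    """
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(u * x + v * h + b)  # h(t) depends on h(t-1)
        states.append(h)
    return states


# Hypothetical parameters; the same ones handle sequences of different lengths.
params = dict(u=0.5, v=0.9, b=0.0)
short = rnn_forward([1.0, 2.0, 3.0], **params)
longer = rnn_forward([1.0, 2.0, 3.0, 4.0, 5.0], **params)
```

The loop body runs once per time step, so runtime grows linearly with sequence length T no matter how much hardware is available.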
