







CS 7643 Quiz 4 – Concepts | 2026 Update with complete solutions | Georgia Institute of Technology | Notes
Typology: Exams








Covers Structured Representations (Lesson 11), Language Models (Lesson 12), and Embeddings (Lesson 13). Conceptual Questions:
transduction. Other setups are known as Encoder-Decoder.
OCR: given an image of text, split it into individual characters and try to recognize each one.
One to many: one input, a sequence as output (e.g. an image captioning model).
One to one: no sequence involved, as in typical regression problems.
RNNs solve the problems that MLPs (Multilayer Perceptrons) have when used to model sequences.
Problem of the vanilla RNN: vanishing gradients.
Solution: the LSTM architecture.
LSTM (Long Short-Term Memory) introduces the concept of gates: values computed from the inputs to the cell that are multiplied elementwise with other quantities to control how much of them passes through.
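The gating idea can be sketched as a single LSTM step. This is a minimal sketch under stated assumptions, not the course's reference implementation: it uses the standard LSTM formulation with a scalar input and scalar hidden state, and the names `lstm_step`, `params`, and the gate keys are hypothetical. The line to notice is the cell-state update, which adds to the previous cell state instead of repeatedly squashing it.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step with scalar input and state (toy sizes, for illustration).

    params maps a gate name to (w_x, w_h, b); these names are assumptions.
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre_activation("forget"))    # how much of the old cell state to keep
    i = sigmoid(pre_activation("input"))     # how much of the candidate to write
    o = sigmoid(pre_activation("output"))    # how much of the cell state to expose
    g = math.tanh(pre_activation("cand"))    # candidate values

    c = f * c_prev + i * g                   # additive cell-state update
    h = o * math.tanh(c)                     # new hidden state
    return h, c


# With the forget gate near 1 and the input gate near 0,
# the cell state is carried forward almost unchanged.
example_params = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "output": (0.0, 0.0, 0.0),
    "cand": (1.0, 1.0, 0.0),
}
h_new, c_new = lstm_step(0.5, 0.1, 2.0, example_params)
```

Because `c` is a sum of a gated copy of `c_prev` and a gated candidate, the gradient with respect to `c_prev` is scaled by `f` only, which is what lets information survive over many time steps when the forget gate stays close to 1.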
Update rule for the LSTM: the update for the cell state c_t has an additive element, which mitigates the vanishing gradients problem.
Conditional language models and how to train them (teacher/student forcing), language metrics (how to calculate them), how knowledge distillation works.
Conditional Language Models
Per-word cross-entropy is the average of the cross-entropy over all words in the sequence.
Perplexity: the geometric mean of the inverse probability of a sequence of words. As an evaluation metric, the lower the perplexity, the better the model. The perplexity of a discrete uniform distribution over K events is K (a fair coin toss has a perplexity of 2, a fair die toss has a perplexity of 6).
Training
Feed the words one by one. After each step, project into a high-dimensional space, turn the result into a probability distribution, and calculate the loss using cross-entropy. Once the whole sentence has been fed, compute the overall loss as the average of the per-word losses, and do backpropagation.
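The two metrics above can be checked numerically. The sketch below assumes natural logarithms and takes as input the probability the model assigned to each correct word; the function names are illustrative. Perplexity is computed as the exponential of the per-word cross-entropy, which equals the geometric mean of the inverse probabilities.

```python
import math


def per_word_cross_entropy(token_probs):
    """Average negative log-probability assigned to each correct word."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """exp(per-word cross-entropy), i.e. geometric mean of 1 / p."""
    return math.exp(per_word_cross_entropy(token_probs))


# A uniform distribution over K events has perplexity K.
coin_perplexity = perplexity([1 / 2] * 10)
die_perplexity = perplexity([1 / 6] * 10)

# A model that is more confident in the correct words scores lower.
confident = perplexity([0.9, 0.8, 0.95])
unsure = perplexity([0.2, 0.1, 0.3])
```

The coin and die cases reproduce the values 2 and 6 stated above, up to floating-point error.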
Teacher forcing: at the following time step we input the actual word from the training data, not the model's previous prediction. This allows the model to keep learning effectively even if it made a mistake previously.
Knowledge distillation: the teacher model works well but is too slow or expensive to run. The Student and the Teacher both make a prediction. We still use the target to compute the Student loss. We also encourage the Student model to align its (soft) predictions with those of the Teacher (distillation loss) – tells us that we also want to rank wolf and fox
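The combination of Student loss and distillation loss can be sketched as follows. This is an illustration under assumptions, not the formulation from the lectures: the weighting `alpha`, the softmax `temperature`, the function names, and the example logits are all hypothetical, and the sketch omits the temperature-squared scaling that some formulations apply to the distillation term.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, predicted_probs) if t > 0)


def distillation_objective(student_logits, teacher_logits, target_index,
                           alpha=0.5, temperature=2.0):
    """alpha * Student loss (hard target) + (1 - alpha) * distillation loss."""
    # Student loss: cross-entropy against the true target.
    student_probs = softmax(student_logits)
    student_loss = -math.log(student_probs[target_index])

    # Distillation loss: match the Teacher's softened distribution.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    distill_loss = cross_entropy(teacher_soft, student_soft)

    return alpha * student_loss + (1 - alpha) * distill_loss


# Hypothetical four-class example; index 0 is the true target.
teacher = [4.0, 2.0, 1.5, -3.0]
student_agrees = [4.0, 2.0, 1.5, -3.0]       # same ranking of the other classes
student_differs = [4.0, -3.0, -3.0, 2.0]     # right top-1, different ranking

loss_agrees = distillation_objective(student_agrees, teacher, 0)
loss_differs = distillation_objective(student_differs, teacher, 0)
```

Both students put the most probability on the correct class, but the one whose ranking of the remaining classes disagrees with the Teacher receives a higher total loss. That extra signal is what the soft predictions contribute beyond the hard target.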
Word2vec
Probability equation
Intrinsic/extrinsic evaluation
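The notes list the Word2vec probability equation without reproducing it. Assuming the standard skip-gram softmax form is the one meant, P(o | c) = exp(u_o · v_c) / Σ_w exp(u_w · v_c), where v_c is the center-word vector and u_w are the outside-word vectors, it can be sketched as below. The function name and the toy vectors are hypothetical.

```python
import math


def skipgram_distribution(center_vec, outside_vecs):
    """Softmax over dot products between the center vector and each outside vector."""
    scores = [sum(u_i * v_i for u_i, v_i in zip(u, center_vec))
              for u in outside_vecs]
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-dimensional embeddings for a 3-word vocabulary (made-up values).
v_center = [1.0, 0.0]
u_outside = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
probs = skipgram_distribution(v_center, u_outside)
```

Outside words whose vectors have a larger dot product with the center vector receive higher probability, which is how training pulls the vectors of co-occurring words together.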