











































































































































































Authors: Francois Chaubard, Michael Fang, Guillaume Genthial, Rohit
Keyphrases: Natural Language Processing. Word Vectors. Singular Value Decomposition. Skip-gram. Continuous Bag of Words (CBOW). Negative Sampling. Hierarchical Softmax. Word2Vec.
This set of notes begins by introducing the concept of Natural Language Processing (NLP) and the problems NLP faces today. We then move forward to discuss the concept of representing words as numeric vectors. Lastly, we discuss popular approaches to designing word vectors.
We begin with a general discussion of what is NLP.
What’s so special about human (natural) language? Human language is a system specifically constructed to convey meaning, and is not produced by a physical manifestation of any kind. In that way, it is very different from vision or any other machine learning task. Most words are just symbols for an extra-linguistic entity: the word is a signifier that maps to a signified (idea or thing). For instance, the word "rocket" refers to the concept of a rocket, and by extension can designate an instance of a rocket. There are some exceptions, when we use words and letters for expressive signaling, like in "Whooompaa". On top of this, the symbols of language can be encoded in several modalities: voice, gesture, writing, etc. that are transmitted via continuous signals to the brain, which itself appears to encode things in a continuous manner. (A lot of work in philosophy of language and linguistics has been done to conceptualize human language and distinguish words from their references, meanings, etc. Among others, see works by Wittgenstein, Frege, Russell and Mill.)
Natural language is a discrete/symbolic/categorical system.
cs224n: natural language processing with deep learning lecture notes: part i 2
There are different levels of tasks in NLP, from speech processing to semantic interpretation and discourse processing. The goal of NLP is to be able to design algorithms to allow computers to "understand" natural language in order to perform some task. Example tasks come in varying levels of difficulty:
Easy
Medium
Hard
The first and arguably most important common denominator across all NLP tasks is how we represent words as input to any of our models. Much of the earlier NLP work that we will not cover treats words as atomic symbols. To perform well on most NLP tasks we first need to have some notion of similarity and difference between words. With word vectors, we can quite easily encode this ability in the vectors themselves (using distance measures such as Jaccard, Cosine, Euclidean, etc).
2 Word Vectors
There are an estimated 13 million tokens for the English language but are they all completely unrelated? Feline to cat, hotel to motel? I think not. Thus, we want to encode word tokens each into some vector that represents a point in some sort of "word" space. This is paramount for a number of reasons but the most intuitive reason is that perhaps there actually exists some N-dimensional space (such that $N \ll 13$ million) that is sufficient to encode all semantics of our language. Each dimension would encode some meaning that we transfer using speech. For instance, semantic dimensions might
Using Word-Word Co-occurrence Matrix:
The same kind of logic applies here; however, the matrix X stores co-occurrences of words, thereby becoming an affinity matrix. In this method we count the number of times each word appears inside a window of a particular size around the word of interest. We calculate this count for all the words in the corpus. We display an example below. Let our corpus contain just three sentences and the window size be 1:
I enjoy flying.
I like NLP.
I like deep learning.
The resulting counts matrix will then be:
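$$
X = \begin{array}{c|cccccccc}
 & \text{I} & \text{like} & \text{enjoy} & \text{deep} & \text{learning} & \text{NLP} & \text{flying} & \text{.} \\ \hline
\text{I} & 0 & 2 & 1 & 0 & 0 & 0 & 0 & 0 \\
\text{like} & 2 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
\text{enjoy} & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
\text{deep} & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 \\
\text{learning} & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
\text{NLP} & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\
\text{flying} & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 \\
\text{.} & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0
\end{array}
$$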
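The counting procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the notes' own code: the function name `cooccurrence_matrix`, the pre-tokenized corpus, and the choice to treat "." as a token (as the matrix above does) are assumptions made for the example.

```python
import numpy as np

def cooccurrence_matrix(sentences, vocab, window=1):
    """Count how often each pair of tokens appears within `window` positions."""
    index = {word: i for i, word in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)), dtype=int)
    for tokens in sentences:
        for pos, word in enumerate(tokens):
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for ctx in range(lo, hi):
                if ctx != pos:  # a word is not its own context
                    X[index[word], index[tokens[ctx]]] += 1
    return X

# The three-sentence corpus from the text, with "." kept as a token.
corpus = [
    ["I", "enjoy", "flying", "."],
    ["I", "like", "NLP", "."],
    ["I", "like", "deep", "learning", "."],
]
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]

X = cooccurrence_matrix(corpus, vocab, window=1)
print(X)
```

Because the window is symmetric, the resulting matrix is symmetric as well, which is a quick sanity check on any implementation.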
We now perform SVD on X, observe the singular values (the diagonal entries in the resulting S matrix), and cut them off at some index k based on the desired percentage variance captured:
$$
\frac{\sum_{i=1}^{k} \sigma_i}{\sum_{i=1}^{|V|} \sigma_i}
$$
We then take the submatrix of $U_{1:|V|,\,1:k}$ to be our word embedding matrix. This would thus give us a k-dimensional representation of every word in the vocabulary.
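Applying SVD to X:
$$
\underbrace{X}_{|V| \times |V|}
=
\underbrace{\begin{bmatrix} | & | & \\ u_1 & u_2 & \cdots \\ | & | & \end{bmatrix}}_{|V| \times |V|}
\underbrace{\begin{bmatrix} \sigma_1 & 0 & \cdots \\ 0 & \sigma_2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}}_{|V| \times |V|}
\underbrace{\begin{bmatrix} - & v_1 & - \\ - & v_2 & - \\ & \vdots & \end{bmatrix}}_{|V| \times |V|}
$$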
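Reducing dimensionality by selecting the first k singular vectors:
$$
\underbrace{\hat{X}}_{|V| \times |V|}
=
\underbrace{\begin{bmatrix} | & | & \\ u_1 & u_2 & \cdots \\ | & | & \end{bmatrix}}_{|V| \times k}
\underbrace{\begin{bmatrix} \sigma_1 & 0 & \cdots \\ 0 & \sigma_2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}}_{k \times k}
\underbrace{\begin{bmatrix} - & v_1 & - \\ - & v_2 & - \\ & \vdots & \end{bmatrix}}_{k \times |V|}
$$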
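The truncation step can be tried out on the toy co-occurrence matrix from the example. The sketch below uses `numpy.linalg.svd`; the helper name `choose_k` and the 0.9 threshold are assumptions made for illustration, since the notes do not fix a particular cut-off.

```python
import numpy as np

# Co-occurrence matrix for the three-sentence example (window size 1).
X = np.array([
    [0, 2, 1, 0, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
], dtype=float)

# S holds the singular values in descending order.
U, S, Vt = np.linalg.svd(X)

def choose_k(singular_values, target=0.9):
    """Smallest k whose leading singular values reach `target` of the total."""
    ratio = np.cumsum(singular_values) / np.sum(singular_values)
    k = int(np.searchsorted(ratio, target)) + 1
    return min(k, len(singular_values))

k = choose_k(S, target=0.9)   # 0.9 is an arbitrary illustrative threshold
embeddings = U[:, :k]         # row i is the k-dimensional vector for word i
print(k, embeddings.shape)
```

Each row of `embeddings` corresponds to one vocabulary word, in the same order as the rows of X.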
Both of these methods give us word vectors that are more than sufficient to encode semantic and syntactic (part of speech) information but are associated with many other problems:
As we see in the next section, iteration based methods solve many of these issues in a far more elegant manner.
4 Iteration Based Methods - Word2vec
For an overview of Word2vec, a note map can be found here: https://myndbook.com/view/
A detailed summary of word2vec models can also be found here [Rong, 2014]
Iteration-based methods capture co-occurrence of words one at a time instead of capturing all co-occurrence counts directly like in SVD methods.
Let us step back and try a new approach. Instead of computing and storing global information about some huge dataset (which might be billions of sentences), we can try to create a model that will be able to learn one iteration at a time and eventually be able to encode the probability of a word given its context. The idea is to design a model whose parameters are the word vectors. Then, train the model on a certain objective. At every iteration we run our model, evaluate the errors, and follow an update rule that has some notion of penalizing the model parameters that caused the error. Thus, we learn our word vectors. This idea is a very old
probability of a word in the sequence and the word next to it. We call this the bigram model and represent it as:
$$
P(w_1, w_2, \cdots, w_n) = \prod_{i=2}^{n} P(w_i \mid w_{i-1})
$$
Again, this is certainly a bit naive since we are only concerning ourselves with pairs of neighboring words rather than evaluating a whole sentence, but as we will see, this representation gets us pretty far along. Note that with the Word-Word Matrix and a context of size 1, we can basically learn these pairwise probabilities. But again, this would require computing and storing global information about a massive dataset. Now that we understand how we can think about a sequence of tokens having a probability, let us observe some example models that could learn these probabilities.
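To make the bigram product concrete, the sketch below estimates the pairwise probabilities from raw counts on a toy corpus. The count-based (maximum likelihood) estimate, the corpus, and the function names `bigram_prob` and `sequence_prob` are assumptions chosen for illustration; the notes do not prescribe an estimator here.

```python
from collections import Counter

# Toy corpus (hypothetical), already tokenized, with "." as a token.
corpus = [
    ["I", "enjoy", "flying", "."],
    ["I", "like", "NLP", "."],
    ["I", "like", "deep", "learning", "."],
]

predecessor_counts = Counter()
bigram_counts = Counter()
for tokens in corpus:
    predecessor_counts.update(tokens[:-1])           # tokens that have a successor
    bigram_counts.update(zip(tokens, tokens[1:]))    # adjacent (previous, next) pairs

def bigram_prob(prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    if predecessor_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / predecessor_counts[prev]

def sequence_prob(tokens):
    """Product over i >= 2 of P(w_i | w_{i-1}), as in the bigram model."""
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, word)
    return p

print(bigram_prob("I", "like"))
print(sequence_prob(["I", "like", "NLP", "."]))
```

Any pair that never occurs in the corpus gets probability zero under this estimate, which is one reason count-based models need very large datasets.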
One approach is to treat {"The", "cat", "over", "the", "puddle"} as a context and, from these words, be able to predict or generate the center word "jumped". We call this type of model a Continuous Bag of Words (CBOW) Model.
CBOW Model: Predicting a center word from the surrounding context
For each word, we want to learn 2 vectors
Let’s discuss the CBOW Model above in greater detail. First, we set up our known parameters. Let the known parameters in our model be the sentence represented by one-hot word vectors. We represent the input one-hot vectors, or context, with $x^{(c)}$ and the output with $y^{(c)}$. In the CBOW model we have only one output, so we simply call it $y$, the one-hot vector of the known center word. Now let’s define the unknowns in our model.
Notation for CBOW Model:
We create two matrices, $V \in \mathbb{R}^{n \times |V|}$ and $U \in \mathbb{R}^{|V| \times n}$, where n is an arbitrary size which defines the size of our embedding space. V is the input word matrix such that the i-th column of V is the n-dimensional embedded vector for word $w_i$ when it is an input to this model. We denote this $n \times 1$ vector as $v_i$. Similarly, U is the output word matrix. The j-th row of U is an n-dimensional embedded vector for word $w_j$ when it is an output of the model. We denote this row of U as $u_j$. Note that we do in fact learn two vectors for every word $w_i$ (i.e. input word vector $v_i$ and output word vector $u_i$).
We break down the way this model works in these steps:
We get our embedded word vectors for the context ($v_{c-m} = V x^{(c-m)}$, $v_{c-m+1} = V x^{(c-m+1)}$, $\ldots$, $v_{c+m} = V x^{(c+m)} \in \mathbb{R}^{n}$)
Average these vectors to get $\hat{v} = \frac{v_{c-m} + v_{c-m+1} + \cdots + v_{c+m}}{2m} \in \mathbb{R}^{n}$
Generate a score vector $z = U\hat{v} \in \mathbb{R}^{|V|}$. As the dot product of similar vectors is higher, it will push similar words close to each other in order to achieve a high score.
Turn the scores into probabilities $\hat{y} = \mathrm{softmax}(z) \in \mathbb{R}^{|V|}$. The softmax is an operator that we’ll use very frequently. It transforms a vector $z$ into a vector whose i-th component is $\frac{e^{z_i}}{\sum_{k=1}^{|V|} e^{z_k}}$.
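The steps above can be sketched as a single forward pass in NumPy. This is an illustrative sketch under stated assumptions: the function names `softmax` and `cbow_forward`, the random initialization, and the toy sizes are not from the notes, and context words are passed as integer indices rather than explicit one-hot vectors (selecting column i of V is equivalent to multiplying V by the i-th one-hot vector).

```python
import numpy as np

def softmax(z):
    """Map a score vector to probabilities; subtracting the max avoids overflow."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cbow_forward(V, U, context_ids):
    """V has shape (n, |V|) (input vectors as columns); U has shape (|V|, n)."""
    context_vectors = V[:, context_ids]     # v_{c-m}, ..., v_{c+m}
    v_hat = context_vectors.mean(axis=1)    # average of the 2m context vectors
    z = U @ v_hat                           # one score per vocabulary word
    return softmax(z)                       # y_hat, a distribution over the vocabulary

# Toy sizes and random parameters, purely for illustration.
rng = np.random.default_rng(0)
vocab_size, n = 8, 4
V = rng.normal(scale=0.1, size=(n, vocab_size))
U = rng.normal(scale=0.1, size=(vocab_size, n))

y_hat = cbow_forward(V, U, context_ids=[0, 1, 3, 4])
print(y_hat)
```

With untrained random matrices the output is close to uniform; training is what moves probability mass toward the true center word.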
Figure 1: This image demonstrates how CBOW works and how we must learn the transfer matrices
So now that we have an understanding of how our model would work if we had a V and U, how would we learn these two matrices? Well, we need to create an objective function. Very often when we are trying to learn a probability from some true probability, we look to information theory to give us a measure of the distance between two distributions. Here, we use a popular choice of distance/loss measure, cross entropy $H(\hat{y}, y)$. The intuition for the use of cross-entropy in the discrete case can be derived from the formulation of the loss function:
$$
H(\hat{y}, y) = -\sum_{j=1}^{|V|} y_j \log(\hat{y}_j)
$$
$\hat{y} \mapsto H(\hat{y}, y)$ is minimum when $\hat{y} = y$. Then, if we found a $\hat{y}$ such that $H(\hat{y}, y)$ is close to the minimum, we have $\hat{y} \approx y$. This means that our model is very good at predicting the center word!
Let us concern ourselves with the case at hand, which is that y is a one-hot vector. Thus we know that the above loss simplifies to:
$$
H(\hat{y}, y) = -y_c \log(\hat{y}_c)
$$
In this formulation, c is the index where the correct word’s one-hot vector is 1. We can now consider the case where our prediction was perfect and thus $\hat{y}_c = 1$. We can then calculate $H(\hat{y}, y) = -1 \log(1) = 0$. Thus, for a perfect prediction, we face no penalty or loss. Now let us consider the opposite case where our prediction was very bad and thus $\hat{y}_c = 0.01$. As before, we can calculate our loss to be $H(\hat{y}, y) = -1 \log(0.01) \approx 4.605$. We can thus see that for probability distributions, cross entropy provides us with a good measure of distance. We thus formulate our optimization objective as:
$$
\text{minimize } J = -\log P(w_c \mid w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m}) = -\log \hat{y}_c = -u_c^T \hat{v} + \log \sum_{j=1}^{|V|} \exp(u_j^T \hat{v})
$$
To learn the vectors (the matrices U and V), CBOW defines a cost that measures how good it is at predicting the center word. Then, we optimize this cost by updating the matrices U and V using stochastic gradient descent.
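The two cross-entropy values worked out above can be checked numerically. The sketch below is a minimal implementation for illustration; the function name `cross_entropy` and the example probability vectors are assumptions, and the natural logarithm is used, which matches the value 4.605 in the text.

```python
import numpy as np

def cross_entropy(y_hat, y):
    """H(y_hat, y) = -sum_j y_j * log(y_hat_j); entries with y_j = 0 contribute nothing."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = y > 0                      # skip zero targets so log(0) is never evaluated
    return float(-np.sum(y[mask] * np.log(y_hat[mask])))

y = [0, 0, 1, 0]                      # one-hot target, correct index c = 2

perfect = [0.0, 0.0, 1.0, 0.0]        # y_hat_c = 1
poor = [0.33, 0.33, 0.01, 0.33]       # y_hat_c = 0.01

print(cross_entropy(perfect, y))
print(cross_entropy(poor, y))
```

For a one-hot target, only the predicted probability of the correct word matters, so the loss reduces to the negative log of that single entry.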
independence assumption. In other words, given the center word, all output words are completely independent.
$$
\begin{aligned}
\text{minimize } J &= -\log P(w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m} \mid w_c) \\
&= -\log \prod_{j=0,\, j \neq m}^{2m} P(w_{c-m+j} \mid w_c) \\
&= -\log \prod_{j=0,\, j \neq m}^{2m} P(u_{c-m+j} \mid v_c) \\
&= -\log \prod_{j=0,\, j \neq m}^{2m} \frac{\exp(u_{c-m+j}^T v_c)}{\sum_{k=1}^{|V|} \exp(u_k^T v_c)} \\
&= -\sum_{j=0,\, j \neq m}^{2m} u_{c-m+j}^T v_c + 2m \log \sum_{k=1}^{|V|} \exp(u_k^T v_c)
\end{aligned}
$$
With this objective function, we can compute the gradients with respect to the unknown parameters and at each iteration update them via Stochastic Gradient Descent.
Only one probability vector $\hat{y}$ is computed. Skip-gram treats each context word equally: the model computes the probability for each word of appearing in the context independently of its distance to the center word.
Note that
$$
J = -\sum_{j=0,\, j \neq m}^{2m} \log P(u_{c-m+j} \mid v_c) = \sum_{j=0,\, j \neq m}^{2m} H(\hat{y}, y_{c-m+j})
$$
where $H(\hat{y}, y_{c-m+j})$ is the cross-entropy between the probability vector $\hat{y}$ and the one-hot vector $y_{c-m+j}$.
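The equivalence between the closed-form Skip-gram objective and the sum of cross-entropies can be verified numerically. The sketch below is illustrative only: the function names, the random parameters, and the toy sizes are assumptions, and context words are given as integer indices into the rows of U.

```python
import numpy as np

def skipgram_loss(U, v_c, context_ids):
    """J = -sum_j u_{c-m+j}^T v_c + 2m * log sum_k exp(u_k^T v_c)."""
    scores = U @ v_c                                   # u_k^T v_c for every word k
    shift = scores.max()                               # shift for numerical stability
    log_norm = shift + np.log(np.sum(np.exp(scores - shift)))
    return float(-scores[context_ids].sum() + len(context_ids) * log_norm)

def skipgram_loss_as_cross_entropy(U, v_c, context_ids):
    """Same J, written as a sum of cross-entropies against one-hot context targets."""
    scores = U @ v_c
    y_hat = np.exp(scores - scores.max())
    y_hat /= y_hat.sum()                               # softmax over the vocabulary
    return float(-np.log(y_hat[context_ids]).sum())   # -log y_hat at each context index

# Toy parameters (random, for illustration): |V| = 8, n = 4, 2m = 4 context words.
rng = np.random.default_rng(1)
U = rng.normal(size=(8, 4))
v_c = rng.normal(size=4)
context_ids = [1, 2, 4, 5]

J_closed = skipgram_loss(U, v_c, context_ids)
J_xent = skipgram_loss_as_cross_entropy(U, v_c, context_ids)
print(J_closed, J_xent)
```

Both functions touch every row of U to compute the normalizer, which is the cost that motivates cheaper approximations.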
Loss functions J for CBOW and Skip-Gram are expensive to compute because of the softmax normalization, where we sum over all |V| scores!
Let's take a second to look at the objective function. Note that the summation over |V| is computationally huge! Any update we do or evaluation of the objective function would take O(|V|) time, which, if we recall, is in the millions. A simple idea is that we could instead just approximate it. For every training step, instead of looping over the entire vocabulary, we can just sample several negative examples! We "sample" from a noise distribution ($P_n(w)$) whose probabilities match the ordering of the frequency of the vocabulary. To augment our formulation of the problem to incorporate Negative Sampling, all we need to do is update the:
Mikolov et al. present Negative Sampling in Distributed Representations of Words and Phrases and their Compositionality. While negative sampling is based on the Skip-Gram model, it is in fact optimizing a different objective. Consider a pair (w, c) of word and context. Did this pair come from the training data? Let’s denote by P(D = 1 | w, c) the probability that (w, c) came from the corpus data. Correspondingly, P(D = 0 | w, c) will be the probability that (w, c) did not come from the corpus data. First, let’s model P(D = 1 | w, c) with the sigmoid function:
$$
P(D = 1 \mid w, c, \theta) = \sigma(v_c^T v_w) = \frac{1}{1 + e^{(-v_c^T v_w)}}
$$
The sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the 1D version of the softmax and can be used to model a probability.
Figure 3: Sigmoid function
Now, we build a new objective function that tries to maximize the probability of a word and context being in the corpus data if it indeed is, and maximize the probability of a word and context not being in the corpus data if it indeed is not. We take a simple maximum likelihood approach of these two probabilities. (Here we take $\theta$ to be the parameters of the model, and in our case it is V and U.)
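$$
\begin{aligned}
\theta &= \underset{\theta}{\operatorname{argmax}} \prod_{(w,c) \in D} P(D = 1 \mid w, c, \theta) \prod_{(w,c) \in \tilde{D}} P(D = 0 \mid w, c, \theta) \\
&= \underset{\theta}{\operatorname{argmax}} \prod_{(w,c) \in D} P(D = 1 \mid w, c, \theta) \prod_{(w,c) \in \tilde{D}} \left(1 - P(D = 1 \mid w, c, \theta)\right) \\
&= \underset{\theta}{\operatorname{argmax}} \sum_{(w,c) \in D} \log P(D = 1 \mid w, c, \theta) + \sum_{(w,c) \in \tilde{D}} \log\left(1 - P(D = 1 \mid w, c, \theta)\right) \\
&= \underset{\theta}{\operatorname{argmax}} \sum_{(w,c) \in D} \log \frac{1}{1 + \exp(-u_w^T v_c)} + \sum_{(w,c) \in \tilde{D}} \log\left(1 - \frac{1}{1 + \exp(-u_w^T v_c)}\right) \\
&= \underset{\theta}{\operatorname{argmax}} \sum_{(w,c) \in D} \log \frac{1}{1 + \exp(-u_w^T v_c)} + \sum_{(w,c) \in \tilde{D}} \log\left(\frac{1}{1 + \exp(u_w^T v_c)}\right)
\end{aligned}
$$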
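The final line of the derivation can be evaluated directly for a small set of pairs. The sketch below is a minimal illustration under stated assumptions: the function names, the hand-picked positive and negative index pairs, and the random parameters are hypothetical, with u_w taken as row w of U and v_c as column c of V to match the matrix shapes defined earlier.

```python
import numpy as np

def sigmoid(x):
    """sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_log_likelihood(U, V, positive_pairs, negative_pairs):
    """Sum over D of log sigma(u_w^T v_c) plus sum over D~ of log sigma(-u_w^T v_c)."""
    total = 0.0
    for w, c in positive_pairs:                 # pairs observed in the corpus (D)
        total += np.log(sigmoid(U[w] @ V[:, c]))
    for w, c in negative_pairs:                 # sampled "noise" pairs (D~)
        total += np.log(sigmoid(-(U[w] @ V[:, c])))
    return float(total)

# Toy parameters and pairs, chosen only for illustration.
rng = np.random.default_rng(2)
vocab_size, n = 8, 4
U = rng.normal(scale=0.1, size=(vocab_size, n))   # output vectors as rows
V = rng.normal(scale=0.1, size=(n, vocab_size))   # input vectors as columns

positive_pairs = [(1, 0), (2, 0)]
negative_pairs = [(5, 0), (6, 0), (7, 0)]

log_likelihood = neg_sampling_log_likelihood(U, V, positive_pairs, negative_pairs)
print(log_likelihood)
```

The second sum relies on the identity 1 - sigma(x) = sigma(-x), which is exactly the simplification made between the last two lines of the derivation. Only the sampled pairs are visited, so the cost no longer grows with the vocabulary size.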
Note that maximizing the likelihood is the same as minimizing the negative log likelihood
( w,c ) 2 D
log 1 1 + exp ( u Tw v (^) c )