
CS224n: Natural Language Processing with Deep Learning

Lecture Notes: Part I

Winter 2017

Course Instructors: Christopher Manning, Richard Socher

Authors: Francois Chaubard, Michael Fang, Guillaume Genthial, Rohit Mundra, Richard Socher

Keyphrases: Natural Language Processing. Word Vectors. Singular Value Decomposition. Skip-gram. Continuous Bag of Words (CBOW). Negative Sampling. Hierarchical Softmax. Word2Vec.

This set of notes begins by introducing the concept of Natural Language Processing (NLP) and the problems NLP faces today. We then move forward to discuss the concept of representing words as numeric vectors. Lastly, we discuss popular approaches to designing word vectors.

1 Introduction to Natural Language Processing

We begin with a general discussion of what NLP is.

1.1 What is so special about NLP?

What’s so special about human (natural) language? Human language is a system specifically constructed to convey meaning, and is not produced by a physical manifestation of any kind. In that way, it is very different from vision or any other machine learning task.

Natural language is a discrete/symbolic/categorical system.

Most words are just symbols for an extra-linguistic entity: the word is a signifier that maps to a signified (idea or thing). For instance, the word "rocket" refers to the concept of a rocket, and by extension can designate an instance of a rocket. There are some exceptions, when we use words and letters for expressive signaling, like in "Whooompaa". On top of this, the symbols of language can be encoded in several modalities (voice, gesture, writing, etc.) that are transmitted via continuous signals to the brain, which itself appears to encode things in a continuous manner. (A lot of work in philosophy of language and linguistics has been done to conceptualize human language and distinguish words from their references, meanings, etc. Among others, see works by Wittgenstein, Frege, Russell and Mill.)

1.2 Examples of tasks

There are different levels of tasks in NLP, from speech processing to semantic interpretation and discourse processing. The goal of NLP is to be able to design algorithms to allow computers to "understand" natural language in order to perform some task. Example tasks come in varying levels of difficulty:

Easy

  • Spell Checking
  • Keyword Search
  • Finding Synonyms

Medium

  • Parsing information from websites, documents, etc.

Hard

  • Machine Translation (e.g. Translate Chinese text to English)
  • Semantic Analysis (What is the meaning of a query statement?)
  • Coreference (e.g. What does "he" or "it" refer to given a document?)
  • Question Answering (e.g. Answering Jeopardy questions).

1.3 How to represent words?

The first and arguably most important common denominator across all NLP tasks is how we represent words as input to any of our models. Much of the earlier NLP work that we will not cover treats words as atomic symbols. To perform well on most NLP tasks we first need to have some notion of similarity and difference between words. With word vectors, we can quite easily encode this ability in the vectors themselves (using distance measures such as Jaccard, Cosine, Euclidean, etc.).
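As a rough illustration of how distance measures act on word vectors, the sketch below computes cosine similarity and Euclidean distance in plain Python. The three vectors are hypothetical values chosen by hand, not learned embeddings, and the function names are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 3-dimensional word vectors (assumed values, for illustration).
vectors = {
    "hotel": [0.9, 0.1, 0.3],
    "motel": [0.8, 0.2, 0.3],
    "cat":   [0.1, 0.9, 0.2],
}

sim_hotel_motel = cosine_similarity(vectors["hotel"], vectors["motel"])
sim_hotel_cat = cosine_similarity(vectors["hotel"], vectors["cat"])
print(sim_hotel_motel > sim_hotel_cat)
```

With these made-up values, "hotel" and "motel" point in nearly the same direction, so their cosine similarity is higher than that of "hotel" and "cat".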

2 Word Vectors

There are an estimated 13 million tokens for the English language, but are they all completely unrelated? Feline to cat, hotel to motel? I think not. Thus, we want to encode word tokens each into some vector that represents a point in some sort of "word" space. This is paramount for a number of reasons but the most intuitive reason is that perhaps there actually exists some N-dimensional space (such that $N \ll 13$ million) that is sufficient to encode all semantics of our language. Each dimension would encode some meaning that we transfer using speech. For instance, semantic dimensions might


3.2 Window based Co-occurrence Matrix

The same kind of logic applies here; however, the matrix X stores co-occurrences of words, thereby becoming an affinity matrix. In this method we count the number of times each word appears inside a window of a particular size around the word of interest. We calculate this count for all the words in the corpus. We display an example below. Let our corpus contain just three sentences and the window size be 1:

  1. I enjoy flying.
  2. I like NLP.
  3. I like deep learning.

The resulting counts matrix will then be:

$$
X = \begin{array}{c|cccccccc}
 & I & like & enjoy & deep & learning & NLP & flying & . \\ \hline
I        & 0 & 2 & 1 & 0 & 0 & 0 & 0 & 0 \\
like     & 2 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
enjoy    & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
deep     & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 \\
learning & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
NLP      & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\
flying   & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 \\
.        & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0
\end{array}
$$

Using Word-Word Co-occurrence Matrix:

  • Generate $|V| \times |V|$ co-occurrence matrix, X.
  • Apply SVD on X to get $X = USV^T$.
  • Select the first k columns of U to get k-dimensional word vectors.
  • $\frac{\sum_{i=1}^{k} \sigma_i}{\sum_{i=1}^{|V|} \sigma_i}$ indicates the amount of variance captured by the first k dimensions.
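The counts in the three-sentence example can be reproduced with a short script. This is a minimal sketch: it assumes the sentences are already tokenized with the final period as its own token, and it fixes the vocabulary order to match the matrix in the notes.

```python
corpus = [
    ["I", "enjoy", "flying", "."],
    ["I", "like", "NLP", "."],
    ["I", "like", "deep", "learning", "."],
]
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
index = {word: i for i, word in enumerate(vocab)}
window = 1  # number of neighbors counted on each side of a word

# X[i][j] counts how often word j appears within the window around word i.
X = [[0] * len(vocab) for _ in vocab]
for sentence in corpus:
    for pos, word in enumerate(sentence):
        lo = max(0, pos - window)
        hi = min(len(sentence), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                X[index[word]][index[sentence[ctx_pos]]] += 1

for word, row in zip(vocab, X):
    print(f"{word:>8} {row}")
```

Because the window is symmetric, the resulting matrix is symmetric as well.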

3.3 Applying SVD to the co-occurrence matrix

We now perform SVD on X, observe the singular values (the diagonal entries in the resulting S matrix), and cut them off at some index k based on the desired percentage variance captured:

$$\frac{\sum_{i=1}^{k} \sigma_i}{\sum_{i=1}^{|V|} \sigma_i}$$

We then take the submatrix $U_{1:|V|,\,1:k}$ to be our word embedding matrix. This would thus give us a k-dimensional representation of every word in the vocabulary.

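Applying SVD to X:

$$
X_{|V| \times |V|} =
\begin{bmatrix} | & | & \\ u_1 & u_2 & \cdots \\ | & | & \end{bmatrix}_{|V| \times |V|}
\begin{bmatrix} \sigma_1 & 0 & \cdots \\ 0 & \sigma_2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}_{|V| \times |V|}
\begin{bmatrix} - & v_1 & - \\ - & v_2 & - \\ & \vdots & \end{bmatrix}_{|V| \times |V|}
$$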


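Reducing dimensionality by selecting the first k singular vectors:

$$
\hat{X}_{|V| \times |V|} =
\begin{bmatrix} | & | & \\ u_1 & u_2 & \cdots \\ | & | & \end{bmatrix}_{|V| \times k}
\begin{bmatrix} \sigma_1 & 0 & \cdots \\ 0 & \sigma_2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}_{k \times k}
\begin{bmatrix} - & v_1 & - \\ - & v_2 & - \\ & \vdots & \end{bmatrix}_{k \times |V|}
$$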
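The truncation step can be sketched with NumPy's `np.linalg.svd`. The matrix is the three-sentence example from Section 3.2; the 80% threshold is an arbitrary value assumed here for illustration, not one recommended by the notes.

```python
import numpy as np

# Co-occurrence matrix from the three-sentence example
# (rows/columns: I, like, enjoy, deep, learning, NLP, flying, ".").
X = np.array([
    [0, 2, 1, 0, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
], dtype=float)

# X = U @ diag(s) @ Vt, with singular values s sorted in descending order.
U, s, Vt = np.linalg.svd(X)

# Ratio sum_{i<=k} sigma_i / sum_i sigma_i for every possible k.
captured = np.cumsum(s) / np.sum(s)
target = 0.8  # assumed threshold, for illustration only
k = int(np.searchsorted(captured, target) + 1)  # smallest k reaching the target

word_vectors = U[:, :k]  # row i is the k-dimensional vector of word i
print(k, word_vectors.shape)
```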

Both of these methods give us word vectors that are more than sufficient to encode semantic and syntactic (part of speech) information but are associated with many other problems:

  • The dimensions of the matrix change very often (new words are added very frequently and corpus changes in size).
  • The matrix is extremely sparse since most words do not co-occur.
  • The matrix is very high dimensional in general ($\approx 10^6 \times 10^6$).
  • Quadratic cost to train (i.e. to perform SVD).
  • Requires the incorporation of some hacks on X to account for the drastic imbalance in word frequency.

SVD based methods do not scale well for big matrices and it is hard to incorporate new words or documents. Computational cost for an $m \times n$ matrix is $O(mn^2)$.

However, count-based methods make an efficient use of the statistics.

Some solutions exist to resolve some of the issues discussed above:

  • Ignore function words such as "the", "he", "has", etc.
  • Apply a ramp window, i.e. weight the co-occurrence count based on distance between the words in the document.
  • Use Pearson correlation and set negative counts to 0 instead of using just raw count.

As we see in the next section, iteration based methods solve many of these issues in a far more elegant manner.

4 Iteration Based Methods - Word2vec

For an overview of Word2vec, a note map can be found here: https://myndbook.com/view/

A detailed summary of word2vec models can also be found here [Rong, 2014].

Iteration-based methods capture co-occurrence of words one at a time instead of capturing all co-occurrence counts directly like in SVD methods.

Let us step back and try a new approach. Instead of computing and storing global information about some huge dataset (which might be billions of sentences), we can try to create a model that will be able to learn one iteration at a time and eventually be able to encode the probability of a word given its context. The idea is to design a model whose parameters are the word vectors. Then, train the model on a certain objective. At every iteration we run our model, evaluate the errors, and follow an update rule that has some notion of penalizing the model parameters that caused the error. Thus, we learn our word vectors. This idea is a very old


probability of a word in the sequence and the word next to it. We call this the bigram model and represent it as:

$$P(w_1, w_2, \cdots, w_n) = \prod_{i=2}^{n} P(w_i \mid w_{i-1})$$

Again this is certainly a bit naive since we are only concerning ourselves with pairs of neighboring words rather than evaluating a whole sentence, but as we will see, this representation gets us pretty far along. Note that with the Word-Word Matrix and a context of size 1, we can basically learn these pairwise probabilities. But again, this would require computing and storing global information about a massive dataset. Now that we understand how we can think about a sequence of tokens having a probability, let us observe some example models that could learn these probabilities.
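To make the bigram model concrete, the sketch below estimates the pairwise probabilities from raw counts and multiplies them along a sequence. Estimating $P(w_i \mid w_{i-1})$ by relative frequency is an assumption made here for illustration (the notes do not prescribe an estimator), and the two-sentence corpus is hypothetical.

```python
from collections import Counter

# Tiny hypothetical corpus; real estimates need far more text.
corpus = [
    ["the", "cat", "jumped", "over", "the", "puddle"],
    ["the", "cat", "sat"],
]

prev_counts = Counter()    # how often a word occurs with a successor
bigram_counts = Counter()  # how often (previous word, word) occurs
for sentence in corpus:
    prev_counts.update(sentence[:-1])
    bigram_counts.update(zip(sentence, sentence[1:]))

def bigram_prob(prev, word):
    """Relative-frequency estimate of P(word | prev); 0.0 if prev is unseen."""
    if prev_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / prev_counts[prev]

def sequence_prob(words):
    """Product over i >= 2 of P(w_i | w_{i-1}), as in the bigram model."""
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(sequence_prob(["the", "cat", "sat"]))
```

Any sequence containing an unseen pair gets probability zero under this estimator, which is one reason raw counts alone are a fragile basis for a language model.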

4.2 Continuous Bag of Words Model (CBOW)

One approach is to treat {"The", "cat", "over", "the", "puddle"} as a context and from these words, be able to predict or generate the center word "jumped". This type of model we call a Continuous Bag of Words (CBOW) Model.

CBOW Model: Predicting a center word from the surrounding context.

For each word, we want to learn 2 vectors:

  • v: (input vector) when the word is in the context
  • u: (output vector) when the word is in the center

Let’s discuss the CBOW Model above in greater detail. First, we set up our known parameters. Let the known parameters in our model be the sentence represented by one-hot word vectors. We represent the input one-hot vectors (the context) by $x^{(c)}$ and the output by $y^{(c)}$. Since the CBOW model has only one output, we simply call it $y$, the one-hot vector of the known center word. Now let’s define the unknowns in our model.

Notation for CBOW Model:

  • $w_i$: Word i from vocabulary V
  • $\mathcal{V} \in \mathbb{R}^{n \times |V|}$: Input word matrix
  • $v_i$: i-th column of $\mathcal{V}$, the input vector representation of word $w_i$
  • $\mathcal{U} \in \mathbb{R}^{|V| \times n}$: Output word matrix
  • $u_i$: i-th row of $\mathcal{U}$, the output vector representation of word $w_i$

We create two matrices, $\mathcal{V} \in \mathbb{R}^{n \times |V|}$ and $\mathcal{U} \in \mathbb{R}^{|V| \times n}$, where n is an arbitrary size which defines the size of our embedding space. $\mathcal{V}$ is the input word matrix such that the i-th column of $\mathcal{V}$ is the n-dimensional embedded vector for word $w_i$ when it is an input to this model. We denote this $n \times 1$ vector as $v_i$. Similarly, $\mathcal{U}$ is the output word matrix. The j-th row of $\mathcal{U}$ is an n-dimensional embedded vector for word $w_j$ when it is an output of the model. We denote this row of $\mathcal{U}$ as $u_j$. Note that we do in fact learn two vectors for every word $w_i$ (i.e. input word vector $v_i$ and output word vector $u_i$).

We break down the way this model works in these steps:

  1. We generate our one hot word vectors for the input context of size m: $(x^{(c-m)}, \ldots, x^{(c-1)}, x^{(c+1)}, \ldots, x^{(c+m)} \in \mathbb{R}^{|V|})$.

  2. We get our embedded word vectors for the context ($v_{c-m} = \mathcal{V}x^{(c-m)}$, $v_{c-m+1} = \mathcal{V}x^{(c-m+1)}$, $\ldots$, $v_{c+m} = \mathcal{V}x^{(c+m)} \in \mathbb{R}^{n}$).

  3. Average these vectors to get $\hat{v} = \frac{v_{c-m} + v_{c-m+1} + \ldots + v_{c+m}}{2m} \in \mathbb{R}^{n}$.

  4. Generate a score vector $z = \mathcal{U}\hat{v} \in \mathbb{R}^{|V|}$. As the dot product of similar vectors is higher, it will push similar words close to each other in order to achieve a high score.

  5. Turn the scores into probabilities $\hat{y} = \mathrm{softmax}(z) \in \mathbb{R}^{|V|}$.

  6. We desire our probabilities generated, $\hat{y} \in \mathbb{R}^{|V|}$, to match the true probabilities, $y \in \mathbb{R}^{|V|}$, which also happens to be the one hot vector of the actual word.

The softmax is an operator that we’ll use very frequently. It transforms a vector into a vector whose i-th component is $\frac{e^{\hat{y}_i}}{\sum_{k=1}^{|V|} e^{\hat{y}_k}}$.

  • Exponentiate to make positive.
  • Dividing by $\sum_{k=1}^{|V|} e^{\hat{y}_k}$ normalizes the vector ($\sum_{k=1}^{n} \hat{y}_k = 1$) to give probability.
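The forward pass described in the steps above can be sketched in NumPy. The vocabulary size, embedding size, random initialization, and the names `V_in`, `U_out`, and `cbow_forward` are all assumptions made for illustration; only the sequence of operations follows the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n = 6, 4                      # |V| and embedding size (toy values)
V_in = rng.normal(size=(n, vocab_size))   # input word matrix, columns are v_i
U_out = rng.normal(size=(vocab_size, n))  # output word matrix, rows are u_j

def one_hot(i, size):
    x = np.zeros(size)
    x[i] = 1.0
    return x

def softmax(z):
    e = np.exp(z - np.max(z))  # shifting by max(z) avoids overflow
    return e / e.sum()

def cbow_forward(context_ids):
    xs = [one_hot(i, vocab_size) for i in context_ids]  # one-hot context vectors
    vs = [V_in @ x for x in xs]                          # embedded context vectors
    v_hat = np.mean(vs, axis=0)                          # average over the 2m words
    z = U_out @ v_hat                                    # score vector
    return softmax(z)                                    # predicted distribution

context_ids = [0, 1, 3, 4]  # 2m = 4 context words around the center word
center_id = 2
y_hat = cbow_forward(context_ids)
loss = -np.log(y_hat[center_id])  # cross-entropy against the one-hot target
print(y_hat.sum(), loss)
```

Multiplying by a one-hot vector just selects a column of `V_in`, so a practical implementation would index the matrix directly instead of building the one-hot vectors.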

Figure 1 : This image demonstrates how CBOW works and how we must learn the transfer matrices

So now that we have an understanding of how our model would work if we had a $\mathcal{V}$ and $\mathcal{U}$, how would we learn these two matrices? Well, we need to create an objective function. Very often when we are trying to learn a probability from some true probability, we look to information theory to give us a measure of the distance between two distributions. Here, we use a popular choice of distance/loss measure, cross entropy $H(\hat{y}, y)$. The intuition for the use of cross-entropy in the discrete case can be derived from the formulation of the loss function:

$$H(\hat{y}, y) = -\sum_{j=1}^{|V|} y_j \log(\hat{y}_j)$$

$\hat{y} \mapsto H(\hat{y}, y)$ is minimum when $\hat{y} = y$. Then, if we found a $\hat{y}$ such that $H(\hat{y}, y)$ is close to the minimum, we have $\hat{y} \approx y$. This means that our model is very good at predicting the center word!

Let us concern ourselves with the case at hand, which is that y is a one-hot vector. Thus we know that the above loss simplifies to simply:

$$H(\hat{y}, y) = -y_c \log(\hat{y}_c)$$

In this formulation, c is the index where the correct word’s one hot vector is 1. We can now consider the case where our prediction was perfect and thus $\hat{y}_c = 1$. We can then calculate $H(\hat{y}, y) = -1 \log(1) = 0$. Thus, for a perfect prediction, we face no penalty or loss. Now let us consider the opposite case where our prediction was very bad and thus $\hat{y}_c = 0.01$. As before, we can calculate our loss to be $H(\hat{y}, y) = -1 \log(0.01) \approx 4.605$. We can thus see that for probability distributions, cross entropy provides us with a good measure of distance. We thus formulate our optimization objective as:

To learn the vectors (the matrices $\mathcal{U}$ and $\mathcal{V}$), CBOW defines a cost that measures how good it is at predicting the center word. Then, we optimize this cost by updating the matrices $\mathcal{U}$ and $\mathcal{V}$ thanks to stochastic gradient descent.
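The two loss values worked out above (a perfect prediction and a prediction of 0.01 for the correct word) can be checked numerically. This is a small sketch using the natural logarithm; the three-word distributions are made up for the check.

```python
import math

def cross_entropy(y_hat, y):
    """H(y_hat, y) = -sum_j y_j * log(y_hat_j); terms with y_j = 0 are skipped."""
    return -sum(t * math.log(p) for p, t in zip(y_hat, y) if t > 0)

y = [0.0, 1.0, 0.0]  # one-hot target, correct index c = 1

perfect = cross_entropy([0.0, 1.0, 0.0], y)  # y_hat_c = 1
poor = cross_entropy([0.5, 0.01, 0.49], y)   # y_hat_c = 0.01
print(abs(perfect), round(poor, 3))
```

Only the probability assigned to the correct word enters the loss; how the remaining mass is spread over the other words does not matter for a one-hot target.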


independence assumption. In other words, given the center word, all output words are completely independent.

$$
\begin{aligned}
\text{minimize } J &= -\log P(w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m} \mid w_c) \\
&= -\log \prod_{j=0,\, j \neq m}^{2m} P(w_{c-m+j} \mid w_c) \\
&= -\log \prod_{j=0,\, j \neq m}^{2m} P(u_{c-m+j} \mid v_c) \\
&= -\log \prod_{j=0,\, j \neq m}^{2m} \frac{\exp(u_{c-m+j}^T v_c)}{\sum_{k=1}^{|V|} \exp(u_k^T v_c)} \\
&= -\sum_{j=0,\, j \neq m}^{2m} u_{c-m+j}^T v_c + 2m \log \sum_{k=1}^{|V|} \exp(u_k^T v_c)
\end{aligned}
$$

With this objective function, we can compute the gradients with respect to the unknown parameters and at each iteration update them via Stochastic Gradient Descent.

Only one probability vector $\hat{y}$ is computed. Skip-gram treats each context word equally: the model computes the probability for each word of appearing in the context independently of its distance to the center word.

Note that

$$J = -\sum_{j=0,\, j \neq m}^{2m} \log P(u_{c-m+j} \mid v_c) = \sum_{j=0,\, j \neq m}^{2m} H(\hat{y}, y_{c-m+j})$$

where $H(\hat{y}, y_{c-m+j})$ is the cross-entropy between the probability vector $\hat{y}$ and the one-hot vector $y_{c-m+j}$.
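As a numerical check of the two forms of the Skip-Gram objective, the sketch below evaluates the closed-form expression and the sum of cross-entropies on random toy vectors. The sizes, the random initialization, and the names `V_in` and `U_out` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, n = 7, 3
V_in = rng.normal(size=(n, vocab_size))   # columns are input vectors v
U_out = rng.normal(size=(vocab_size, n))  # rows are output vectors u

center_id = 3
context_ids = [1, 2, 4, 5]  # the 2m words around the center word

v_c = V_in[:, center_id]
scores = U_out @ v_c                       # u_k^T v_c for every word k
log_norm = np.log(np.sum(np.exp(scores)))  # log sum_k exp(u_k^T v_c)

# Closed form: -sum_j u_{c-m+j}^T v_c + 2m * log sum_k exp(u_k^T v_c)
J_closed = -np.sum(scores[context_ids]) + len(context_ids) * log_norm

# Same quantity as a sum of cross-entropies against one-hot context targets.
y_hat = np.exp(scores - log_norm)  # the single softmax vector
J_cross_entropy = sum(-np.log(y_hat[j]) for j in context_ids)

print(np.isclose(J_closed, J_cross_entropy))
```

The softmax vector is computed once and reused for every context position, which mirrors the remark that only one probability vector is computed.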

4.4 Negative Sampling

Loss functions J for CBOW and Skip-Gram are expensive to compute because of the softmax normalization, where we sum over all $|V|$ scores!

Let’s take a second to look at the objective function. Note that the summation over $|V|$ is computationally huge! Any update we do or evaluation of the objective function would take $O(|V|)$ time, which if we recall is in the millions. A simple idea is we could instead just approximate it.

For every training step, instead of looping over the entire vocabulary, we can just sample several negative examples! We "sample" from a noise distribution ($P_n(w)$) whose probabilities match the ordering of the frequency of the vocabulary. To augment our formulation of the problem to incorporate Negative Sampling, all we need to do is update the:

  • objective function
  • gradients
  • update rules

Mikolov et al. present Negative Sampling in Distributed Representations of Words and Phrases and their Compositionality. While negative sampling is based on the Skip-Gram model, it is in fact optimizing a different objective. Consider a pair $(w, c)$ of word and context. Did this pair come from the training data? Let’s denote by $P(D = 1 \mid w, c)$ the probability that $(w, c)$ came from the corpus data. Correspondingly, $P(D = 0 \mid w, c)$ will be the probability that $(w, c)$ did not come from the corpus data. First, let’s model $P(D = 1 \mid w, c)$ with the sigmoid function:

$$P(D = 1 \mid w, c, \theta) = \sigma(v_c^T v_w) = \frac{1}{1 + e^{(-v_c^T v_w)}}$$

The sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the 1D version of the softmax and can be used to model a probability.

Figure 3: Sigmoid function

Now, we build a new objective function that tries to maximize the probability of a word and context being in the corpus data if it indeed is, and maximize the probability of a word and context not being in the corpus data if it indeed is not. We take a simple maximum likelihood approach of these two probabilities. (Here we take $\theta$ to be the parameters of the model, and in our case it is $\mathcal{V}$ and $\mathcal{U}$.)

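$$
\begin{aligned}
\theta &= \operatorname*{argmax}_{\theta} \prod_{(w,c) \in D} P(D = 1 \mid w, c, \theta) \prod_{(w,c) \in \tilde{D}} P(D = 0 \mid w, c, \theta) \\
&= \operatorname*{argmax}_{\theta} \prod_{(w,c) \in D} P(D = 1 \mid w, c, \theta) \prod_{(w,c) \in \tilde{D}} \left(1 - P(D = 1 \mid w, c, \theta)\right) \\
&= \operatorname*{argmax}_{\theta} \sum_{(w,c) \in D} \log P(D = 1 \mid w, c, \theta) + \sum_{(w,c) \in \tilde{D}} \log\left(1 - P(D = 1 \mid w, c, \theta)\right) \\
&= \operatorname*{argmax}_{\theta} \sum_{(w,c) \in D} \log \frac{1}{1 + \exp(-u_w^T v_c)} + \sum_{(w,c) \in \tilde{D}} \log\left(1 - \frac{1}{1 + \exp(-u_w^T v_c)}\right) \\
&= \operatorname*{argmax}_{\theta} \sum_{(w,c) \in D} \log \frac{1}{1 + \exp(-u_w^T v_c)} + \sum_{(w,c) \in \tilde{D}} \log\left(\frac{1}{1 + \exp(u_w^T v_c)}\right)
\end{aligned}
$$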

Note that maximizing the likelihood is the same as minimizing the negative log likelihood

$$J = -\sum_{(w,c) \in D} \log \frac{1}{1 + \exp(-u_w^T v_c)} - \sum_{(w,c) \in \tilde{D}} \log\left(\frac{1}{1 + \exp(u_w^T v_c)}\right)$$

Note that $\tilde{D}$ is a "false" or "negative" corpus, where we would have sentences like "stock boil fish is toy": unnatural sentences that should get a low probability of ever occurring. We can generate $\tilde{D}$ on the fly by randomly sampling this negative from the word bank.
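The negative-sampling objective J can be evaluated directly once a positive and a negative set of pairs are given. In the sketch below the vectors are random toy values stored as rows, the pair indices are hypothetical, and the negatives are drawn uniformly; uniform sampling is a simplification assumed here, not the noise distribution $P_n(w)$ described above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
vocab_size, n = 10, 5
V_in = rng.normal(scale=0.1, size=(vocab_size, n))   # row c holds v_c
U_out = rng.normal(scale=0.1, size=(vocab_size, n))  # row w holds u_w

def negative_sampling_loss(positive_pairs, negative_pairs):
    """J = -sum_D log sigma(u_w^T v_c) - sum_D~ log sigma(-u_w^T v_c)."""
    loss = 0.0
    for w, c in positive_pairs:
        loss -= np.log(sigmoid(U_out[w] @ V_in[c]))
    for w, c in negative_pairs:
        # 1 / (1 + exp(x)) is the same as sigma(-x)
        loss -= np.log(sigmoid(-(U_out[w] @ V_in[c])))
    return loss

positive_pairs = [(1, 2), (3, 2)]  # hypothetical (w, c) pairs from the corpus D
# Negative corpus built on the fly; uniform sampling is an assumption here.
negative_pairs = [(int(w), 2) for w in rng.integers(0, vocab_size, size=4)]

J = negative_sampling_loss(positive_pairs, negative_pairs)
print(J)
```

Each training step now touches only the sampled pairs instead of all $|V|$ output vectors.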


vector $v_{n(w,i)}$. So $n(w, 1)$ is the root, while $n(w, L(w))$ is the father of $w$. Now for each inner node $n$, we arbitrarily choose one of its children and call it $ch(n)$ (e.g. always the left node). Then, we can compute the probability as

$$P(w \mid w_i) = \prod_{j=1}^{L(w)-1} \sigma\left([n(w, j+1) = ch(n(w, j))] \cdot v_{n(w,j)}^T v_{w_i}\right)$$

where

$$[x] = \begin{cases} 1 & \text{if } x \text{ is true} \\ -1 & \text{otherwise} \end{cases}$$

and $\sigma(\cdot)$ is the sigmoid function.

This formula is fairly dense, so let’s examine it more closely. First, we are computing a product of terms based on the shape of the path from the root ($n(w, 1)$) to the leaf ($w$). If we assume $ch(n)$ is always the left node of $n$, then the term $[n(w, j+1) = ch(n(w, j))]$ returns 1 when the path goes left, and -1 if right.

Furthermore, the term $[n(w, j+1) = ch(n(w, j))]$ provides normalization. At a node $n$, if we sum the probabilities for going to the left and right node, you can check that for any value of $v_n^T v_{w_i}$,

$$\sigma(v_n^T v_{w_i}) + \sigma(-v_n^T v_{w_i}) = 1$$

The normalization also ensures that $\sum_{w=1}^{|V|} P(w \mid w_i) = 1$, just as in the original softmax.

Finally, we compare the similarity of our input vector $v_{w_i}$ to each inner node vector $v_{n(w,j)}^T$ using a dot product. Let’s run through an example. Taking $w_2$ in Figure 4, we must take two left edges and then a right edge to reach $w_2$ from the root, so

$$
\begin{aligned}
P(w_2 \mid w_i) &= p(n(w_2, 1), \text{left}) \cdot p(n(w_2, 2), \text{left}) \cdot p(n(w_2, 3), \text{right}) \\
&= \sigma(v_{n(w_2,1)}^T v_{w_i}) \cdot \sigma(v_{n(w_2,2)}^T v_{w_i}) \cdot \sigma(-v_{n(w_2,3)}^T v_{w_i})
\end{aligned}
$$

To train the model, our goal is still to minimize the negative log likelihood $-\log P(w \mid w_i)$. But instead of updating output vectors per word, we update the vectors of the nodes in the binary tree that are in the path from root to leaf node.

The speed of this method is determined by the way in which the binary tree is constructed and words are assigned to leaf nodes. Mikolov et al. use a binary Huffman tree, which assigns frequent words shorter paths in the tree.
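The path-product formula can be tried on a small tree. The sketch below uses a hypothetical balanced tree over four words (not the tree of Figure 4), with hand-picked inner-node vectors and input vector, and checks that the leaf probabilities sum to 1.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical balanced tree over 4 words with 3 inner nodes:
# node 0 is the root, node 1 its left child, node 2 its right child.
inner_vectors = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]

# Path of each word as (inner node, sign): +1 for a left edge, -1 for a right edge.
paths = {
    "w1": [(0, +1), (1, +1)],
    "w2": [(0, +1), (1, -1)],
    "w3": [(0, -1), (2, +1)],
    "w4": [(0, -1), (2, -1)],
}

def hs_probability(word, v_input):
    """P(word | w_i) as a product of sigmoids along the root-to-leaf path."""
    prob = 1.0
    for node, sign in paths[word]:
        prob *= sigmoid(sign * dot(inner_vectors[node], v_input))
    return prob

v_input = [0.7, -0.3]  # hypothetical input vector v_{w_i}
total = sum(hs_probability(w, v_input) for w in paths)
print(total)
```

Evaluating one word touches only the inner nodes on its path, which is where the speed-up over the full softmax comes from.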

References

[Bengio et al., 2003] Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155.

[Collobert et al., 2011] Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. P. (2011). Natural language processing (almost) from scratch. CoRR, abs/1103.0398.

[Mikolov et al., 2013] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

[Rong, 2014] Rong, X. (2014). word2vec parameter learning explained. CoRR, abs/1411.2738.

[Rumelhart et al., 1988] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988). Neurocomputing: Foundations of research. chapter Learning Representations by Back-propagating Errors, pages 696–699. MIT Press, Cambridge, MA, USA.

CS224n: Natural Language Processing with Deep Learning

Lecture Notes: Part II

1.2 Co-occurrence Matrix

Co-occurrence Matrix:

  • $X$: word-word co-occurrence matrix
  • $X_{ij}$: number of times word j occurs in the context of word i
  • $X_i = \sum_k X_{ik}$: the number of times any word k appears in the context of word i
  • $P_{ij} = P(w_j \mid w_i) = \frac{X_{ij}}{X_i}$: the probability of j appearing in the context of word i

Let $X$ denote the word-word co-occurrence matrix, where $X_{ij}$ indicates the number of times word j occurs in the context of word i. Let $X_i = \sum_k X_{ik}$ be the number of times any word k appears in the context of word i. Finally, let $P_{ij} = P(w_j \mid w_i) = \frac{X_{ij}}{X_i}$ be the probability of j appearing in the context of word i.

Populating this matrix requires a single pass through the entire corpus to collect the statistics. For large corpora, this pass can be computationally expensive, but it is a one-time up-front cost.
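The definitions of $X_i$ and $P_{ij}$ amount to normalizing each row of the count matrix. A minimal sketch on a hypothetical three-word count matrix (the counts are made up for illustration):

```python
# Hypothetical counts: X[i][j] is how often word j appears in the context of word i.
X = [
    [0, 2, 1],
    [2, 0, 3],
    [1, 3, 0],
]

# X_i = sum_k X_ik: total number of context words observed for word i.
row_totals = [sum(row) for row in X]

# P_ij = X_ij / X_i: probability of word j appearing in the context of word i.
P = [[x_ij / x_i for x_ij in row] for row, x_i in zip(X, row_totals)]

for row in P:
    print([round(p, 3) for p in row])
```

Each row of `P` is a probability distribution over context words, so it sums to 1.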

1.3 Least Squares Objective

Recall that for the skip-gram model, we use softmax to compute the probability that word j appears in the context of word i:

$$Q_{ij} = \frac{\exp(\vec{u}_j^T \vec{v}_i)}{\sum_{w=1}^{W} \exp(\vec{u}_w^T \vec{v}_i)}$$

Training proceeds in an on-line, stochastic fashion, but the implied global cross-entropy loss can be calculated as:

$$J = -\sum_{i \in \text{corpus}} \sum_{j \in \text{context}(i)} \log Q_{ij}$$

As the same words i and j can appear multiple times in the corpus, it is more efficient to first group together the same values for i and j:

$$J = -\sum_{i=1}^{W} \sum_{j=1}^{W} X_{ij} \log Q_{ij}$$

where the value of co-occurring frequency is given by the co-occurrence matrix X. One significant drawback of the cross-entropy loss is that it requires the distribution Q to be properly normalized, which involves the expensive summation over the entire vocabulary. Instead, we use a least squares objective in which the normalization factors in P and Q are discarded:

$$\hat{J} = \sum_{i=1}^{W} \sum_{j=1}^{W} X_i (\hat{P}_{ij} - \hat{Q}_{ij})^2$$

where $\hat{P}_{ij} = X_{ij}$ and $\hat{Q}_{ij} = \exp(\vec{u}_j^T \vec{v}_i)$ are the unnormalized distributions. This formulation introduces a new problem: $X_{ij}$ often takes on very large values and makes the optimization difficult. An effective change is to minimize the squared error of the logarithms of $\hat{P}$ and $\hat{Q}$:

$$
\begin{aligned}
\hat{J} &= \sum_{i=1}^{W} \sum_{j=1}^{W} X_i \left(\log(\hat{P})_{ij} - \log(\hat{Q}_{ij})\right)^2 \\
&= \sum_{i=1}^{W} \sum_{j=1}^{W} X_i \left(\vec{u}_j^T \vec{v}_i - \log X_{ij}\right)^2
\end{aligned}
$$

Another observation is that the weighting factor $X_i$ is not guaranteed to be optimal. Instead, we introduce a more general weighting function, which we are free to take to depend on the context word as well:

$$\hat{J} = \sum_{i=1}^{W} \sum_{j=1}^{W} f(X_{ij}) \left(\vec{u}_j^T \vec{v}_i - \log X_{ij}\right)^2$$
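The weighted least squares objective can be sketched in NumPy. The notes leave the weighting function $f$ unspecified, so the clipped power-law form below and its constants (`x_max=100`, `alpha=0.75`) are assumptions chosen for illustration; the sum is restricted to nonzero counts so that $\log X_{ij}$ is defined. The matrices and the function names `weight` and `glove_loss` are likewise hypothetical.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(X_ij); this particular form and its constants are assumptions."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, U, V):
    """sum_ij f(X_ij) * (u_j^T v_i - log X_ij)^2 over the nonzero X_ij."""
    i_idx, j_idx = np.nonzero(X)
    x = X[i_idx, j_idx]
    pred = np.sum(U[j_idx] * V[i_idx], axis=1)  # u_j^T v_i for each nonzero pair
    return float(np.sum(weight(x) * (pred - np.log(x)) ** 2))

rng = np.random.default_rng(3)
X = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])           # hypothetical co-occurrence counts
U = rng.normal(scale=0.1, size=(3, 4))    # row j holds the output vector u_j
V = rng.normal(scale=0.1, size=(3, 4))    # row i holds the input vector v_i

loss = glove_loss(X, U, V)
print(loss)
```

No normalization over the vocabulary appears anywhere in the loss, which is the point of dropping the softmax in favor of a least squares fit to $\log X_{ij}$.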

1.4 Conclusion

In conclusion, the GloVe model efficiently leverages global statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, and produces a vector space with meaningful sub-structure. It consistently outperforms word2vec on the word analogy task, given the same corpus, vocabulary, window size, and training time. It achieves better results faster, and also obtains the best results irrespective of speed.

2 Evaluation of Word Vectors

So far, we have discussed methods such as the Word2Vec and GloVe methods to train and discover latent vector representations of natural language words in a semantic space. In this section, we discuss how we can quantitatively evaluate the quality of word vectors produced by such techniques.

2.1 Intrinsic Evaluation

Intrinsic evaluation of word vectors is the evaluation of a set of word vectors generated by an embedding technique (such as Word2Vec or GloVe) on specific intermediate subtasks (such as analogy completion). These subtasks are typically simple and fast to compute and thereby help us understand the system used to generate the word vectors. An intrinsic evaluation should typically return to us a number that indicates the performance of those word vectors on the evaluation subtask.

Figure 1 : The left subsystem (red) being expensive to train is modified by substituting with a simpler subsystem (green) for intrinsic evaluation.

Motivation: Let us consider an example where our final goal is to create a question answering system which uses word vectors


a : b :: c : ?

The intrinsic evaluation system then identifies the word vector which maximizes the cosine similarity:

$$d = \operatorname*{argmax}_{i} \frac{(x_b - x_a + x_c)^T x_i}{\|x_b - x_a + x_c\|}$$
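The analogy search can be sketched in plain Python. The five vectors are hand-made toy values, not trained embeddings. The sketch departs from the formula in two ways, both assumptions: it divides by $\|x_i\|$ as well (identical to the formula when all word vectors have unit length), and it excludes the three input words from the candidates.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical hand-made vectors; real embeddings come from Word2Vec or GloVe.
vectors = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.0, 1.0],
}

def solve_analogy(a, b, c):
    """Return the word d maximizing cosine(x_b - x_a + x_c, x_d)."""
    query = [xb - xa + xc
             for xa, xb, xc in zip(vectors[a], vectors[b], vectors[c])]
    # Assumption: the three input words are not allowed as answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

answer = solve_analogy("man", "woman", "king")
print(answer)
```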

This metric has an intuitive interpretation. Ideally, we want $x_b - x_a = x_d - x_c$ (for instance, queen − king = actress − actor). This implies that we want $x_b - x_a + x_c = x_d$. Thus we identify the vector $x_d$ which maximizes the normalized dot-product between the two word vectors (i.e. cosine similarity).

Intrinsic evaluation techniques such as word-vector analogies should be handled with care (keeping in mind various aspects of the corpus used for pre-training). For instance, consider analogies of the form:

City 1 : State containing City 1 :: City 2 : State containing City 2

Input                                  Result Produced
Chicago : Illinois :: Houston          Texas
Chicago : Illinois :: Philadelphia     Pennsylvania
Chicago : Illinois :: Phoenix          Arizona
Chicago : Illinois :: Dallas           Texas
Chicago : Illinois :: Jacksonville     Florida
Chicago : Illinois :: Indianapolis     Indiana
Chicago : Illinois :: Austin           Texas
Chicago : Illinois :: Detroit          Michigan
Chicago : Illinois :: Memphis          Tennessee
Chicago : Illinois :: Boston           Massachusetts

Table 1 : Here are semantic word vector analogies (intrinsic evaluation) that may suffer from different cities having the same name

In many cases above, there are multiple cities/towns/villages with the same name across the US. Thus, many states would qualify as the right answer. For instance, there are at least 10 places in the US called Phoenix and thus, Arizona need not be the only correct response. Let us now consider analogies of the form:

Capital City 1 : Country 1 :: Capital City 2 : Country 2

In many of the cases in Table 2, the resulting city produced by this task has only been the capital in the recent past. For instance, prior to 1997 the capital of Kazakhstan was Almaty. Thus, we can anticipate other issues if our corpus is dated.

The previous two examples demonstrated semantic testing using word vectors. We can also test syntax using word vector analogies. The intrinsic evaluation in Table 3 tests the word vectors’ ability to capture the notion of superlative adjectives. Similarly, the intrinsic evaluation shown below tests the word vectors’ ability to capture the notion of past tense:


Input                                  Result Produced
Abuja : Nigeria :: Accra               Ghana
Abuja : Nigeria :: Algiers             Algeria
Abuja : Nigeria :: Amman               Jordan
Abuja : Nigeria :: Ankara              Turkey
Abuja : Nigeria :: Antananarivo        Madagascar
Abuja : Nigeria :: Apia                Samoa
Abuja : Nigeria :: Ashgabat            Turkmenistan
Abuja : Nigeria :: Asmara              Eritrea
Abuja : Nigeria :: Astana              Kazakhstan

Table 2 : Here are semantic word vector analogies (intrinsic evaluation) that may suffer from countries having different capitals at different points in time

Input                                  Result Produced
bad : worst :: big                     biggest
bad : worst :: bright                  brightest
bad : worst :: cold                    coldest
bad : worst :: cool                    coolest
bad : worst :: dark                    darkest
bad : worst :: easy                    easiest
bad : worst :: fast                    fastest
bad : worst :: good                    best
bad : worst :: great                   greatest

Table 3 : Here are syntactic word vector analogies (intrinsic evaluation) that test the notion of superlative adjectives

2.4 Intrinsic Evaluation Tuning Example: Analogy Evaluations

Some parameters we might consider tuning for a word embedding technique on intrinsic evaluation tasks are:

  • Dimension of word vectors
  • Corpus size
  • Corpus source/type
  • Context window size
  • Context symmetry

Can you think of other hyperparameters tunable at this stage?

We now explore some of the hyperparameters in word vector embedding techniques (such as Word2Vec and GloVe) that can be tuned using an intrinsic evaluation system (such as an analogy completion system). Let us first see how different methods for creating word-vector embeddings have performed (in recent research work) under the same hyperparameters on an analogy evaluation task: