Vector Semantics
Introduction
Why vector models of meaning? Computing the similarity between words
• "fast" is similar to "rapid"
• "tall" is similar to "height"
• Question answering:
  Q: "How tall is Mt. Everest?"
  Candidate A: "The official height of Mount Everest is 29029 feet"

Problems with thesaurus-based meaning
• We don't have a thesaurus for every language
• We can't have a thesaurus for every year
  • For historical linguistics, we need to compare word meanings in year t to year t+1
• Thesauruses have problems with recall
  • Many words and phrases are missing
  • Thesauri work less well for verbs and adjectives

Distributional models of meaning = vector-space models of meaning = vector semantics
Intuitions:
• Zellig Harris (1954): "oculist and eye-doctor … occur in almost the same environments"; "If A and B have almost identical environments we say that they are synonyms."
• Firth (1957): "You shall know a word by the company it keeps!"

Intuition of distributional word similarity
• Nida example: Suppose I asked you, what is tesgüino?
  A bottle of tesgüino is on the table
  Everybody likes tesgüino
  Tesgüino makes you drunk
  We make tesgüino out of corn.
• From the context words humans can guess that tesgüino means an alcoholic beverage like beer
• Intuition for the algorithm: two words are similar if they have similar word contexts.
Vector Semantics
Words and co-occurrence vectors
Co-occurrence matrices
• We represent how often a word occurs in a document
  • term-document matrix
• Or how often a word occurs with another word
  • term-term matrix (also called the word-word co-occurrence matrix or word-context matrix)

Term-document matrix
• Each cell: count of word w in a document d
• Each document is a count vector in ℕ^|V|: a column below

            As You Like It   Twelfth Night   Julius Caesar   Henry V
  battle          1                1                8           15
  soldier         2                2               12           36
  fool           37               58                1            5
  clown           6              117                0            0

The words in a term-document matrix
• Two words are similar if their vectors (the rows of the matrix above) are similar

The word-word or word-context matrix
• Instead of entire documents, use smaller contexts
  • a paragraph, or a window of ±4 words
• A word is now defined by a vector over counts of context words
• Instead of each vector being of length D, each vector is now of length |V|
• The word-word matrix is |V| × |V|

Word-word matrix (sample contexts: ±7 words)

               aardvark  computer  data  pinch  result  sugar  ...
  apricot          0         0      0      1      0       1
  pineapple        0         0      0      1      0       1
  digital          0         2      1      0      1       0
  information      0         1      6      0      4       0

… vectors of numbers representing the terms (words) that occur within the collection (Salton, 1971). In information retrieval these numbers are called the term weight, a function of the term's frequency in the document.

More generally, the term-document matrix X has V rows (one for each word type in the vocabulary) and D columns (one for each document in the collection). Each column represents a document. A query is also represented by a vector q of length |V|. We go about finding the most relevant document to a query by finding the document whose vector is most similar to the query; later in the chapter we'll introduce some of the components of this process: the tf-idf term weighting, and the cosine similarity metric.

But now let's turn to the insight of vector semantics for representing the meaning of words. The idea is that we can also represent each word by a vector, now a row vector representing the counts of the word's occurrence in each document. Thus the vectors for fool [37,58,1,5] and clown [6,117,0,0] are more similar to each other (occurring more in the comedies) while battle [1,1,8,15] and soldier [2,2,12,36] are more similar to each other (occurring less in the comedies).

More commonly used for vector semantics than this term-document matrix is an alternative formulation, the term-term matrix, more commonly called the word-word matrix or the term-context matrix, in which the columns are labeled by words rather than documents. This matrix is thus of dimensionality |V| × |V| and each cell records the number of times the row (target) word and the column (context) word co-occur in some context in some training corpus. The context could be the document, in which case the cell represents the number of times the two words appear in the same document. It is most common, however, to use smaller contexts, such as a window around the word, for example of 4 words to the left and 4 words to the right, in which case the cell represents the number of times (in some training corpus) the column word occurs in such a ±4 word window around the row word.
For example here are 7-word windows surrounding four sample words from the Brown corpus (just one example of each word):

  sugar, a sliced lemon, a tablespoonful of apricot preserve or jam, a pinch each of,
  their enjoyment. Cautiously she sampled her first pineapple and another fruit whose taste she likened
  well suited to programming on the digital computer. In finding the optimal R-stage policy from
  for the purpose of gathering data and information necessary for the study authorized in the

For each word we collect the counts (from the windows around each occurrence) of the occurrences of context words. Fig. 19.2 shows a selection from the word-word co-occurrence matrix computed from the Brown corpus for these four words.

               aardvark  ...  computer  data  pinch  result  sugar  ...
  apricot          0     ...      0       0     1      0       1
  pineapple        0     ...      0       0     1      0       1
  digital          0     ...      2       1     0      1       0
  information      0     ...      1       6     0      4       0
Figure 19.2: Co-occurrence vectors for four words, computed from the Brown corpus, showing only six of the dimensions (hand-picked for pedagogical purposes). Note that a real vector would be vastly more sparse.

Fig. 19.2 makes clear the intuition that the two words apricot and pineapple are more similar (both pinch and sugar tend to occur in their window) while digital and information are more similar. Note that |V|, the length of the vector, is generally the size of the vocabulary, usually between 10,000 and 50,000 words (using the most frequent words in the training corpus).
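As a minimal sketch of the window-based counting just described (the toy corpus, window size, and function name are illustrative assumptions, not part of the lecture), the word-word matrix can be built like this:

```python
from collections import defaultdict

def cooccurrence_matrix(sentences, window=4):
    """Count how often each context word appears within +/- `window`
    positions of each target word (a word-word co-occurrence matrix)."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

# Toy usage: two tokenized "sentences"
corpus = [
    "a tablespoonful of apricot preserve or jam".split(),
    "she sampled her first pineapple and another fruit".split(),
]
matrix = cooccurrence_matrix(corpus, window=4)
print(matrix["apricot"])   # e.g. counts like {'of': 1, 'preserve': 1, ...}
```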
Vector Semantics
Positive Pointwise Mutual Information (PPMI)
Problem with raw counts
• Raw word frequency is not a great measure of association between words
  • It's very skewed: "the" and "of" are very frequent, but maybe not the most discriminative
• We'd rather have a measure that asks whether a context word is particularly informative about the target word.
  • Positive Pointwise Mutual Information (PPMI)

Pointwise Mutual Information
• Pointwise mutual information: do events x and y co-occur more than if they were independent?
  $\text{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$
• PMI between two words (Church & Hanks 1989): do words word1 and word2 co-occur more than if they were independent?
  $\text{PMI}(\text{word}_1, \text{word}_2) = \log_2 \frac{P(\text{word}_1, \text{word}_2)}{P(\text{word}_1)\,P(\text{word}_2)}$

From the counts $f_{ij}$ (word $w_i$ with context $c_j$) we estimate:
$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}, \qquad p(w_i) = \frac{\sum_{j=1}^{C} f_{ij}}{N}, \qquad p(c_j) = \frac{\sum_{i=1}^{W} f_{ij}}{N}$

For the running example:
  p(w = information, c = data) = 6/19 = .32
  p(w = information) = 11/19 = .58
  p(c = data) = 7/19 = .37

p(w, context) and p(w):
                computer  data  pinch  result  sugar   p(w)
  apricot         0.00    0.00   0.05   0.00    0.05   0.11
  pineapple       0.00    0.00   0.05   0.00    0.05   0.11
  digital         0.11    0.05   0.00   0.05    0.00   0.21
  information     0.05    0.32   0.00   0.21    0.00   0.58
  p(context)      0.16    0.37   0.11   0.26    0.11

$\text{pmi}_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}}$

pmi(information, data) = log2( .32 / (.37 × .58) ) = .58   (.57 using full precision)

PPMI(w, context) (cells with a zero count, where PMI is undefined, are shown as "-"):
                computer  data  pinch  result  sugar
  apricot          -       -    2.25     -     2.25
  pineapple        -       -    2.25     -     2.25
  digital         1.66    0.00    -     0.00     -
  information     0.00    0.57    -     0.47     -

Weighting PMI
• PMI is biased toward infrequent events: very rare words have very high PMI values
• Two solutions:
  • Give rare words slightly higher probabilities
  • Use add-one smoothing (which has a similar effect)

Add-2 smoothed count(w, context):
                computer  data  pinch  result  sugar
  apricot          2       2      3      2       3
  pineapple        2       2      3      2       3
  digital          4       3      2      3       2
  information      3       8      2      6       2

p(w, context) [add-2] and p(w):
                computer  data  pinch  result  sugar   p(w)
  apricot         0.03    0.03   0.05   0.03    0.05   0.20
  pineapple       0.03    0.03   0.05   0.03    0.05   0.20
  digital         0.07    0.05   0.03   0.05    0.03   0.24
  information     0.05    0.14   0.03   0.10    0.03   0.36
  p(context)      0.19    0.25   0.17   0.22    0.17

PPMI versus add-2 smoothed PPMI

PPMI(w, context) [add-2]:
                computer  data  pinch  result  sugar
  apricot         0.00    0.00   0.56   0.00    0.56
  pineapple       0.00    0.00   0.56   0.00    0.56
  digital         0.62    0.00   0.00   0.00    0.00
  information     0.00    0.58   0.00   0.37    0.00

PPMI(w, context):
                computer  data  pinch  result  sugar
  apricot          -       -    2.25     -     2.25
  pineapple        -       -    2.25     -     2.25
  digital         1.66    0.00    -     0.00     -
  information     0.00    0.57    -     0.47     -
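A minimal NumPy sketch of this computation, reusing the count table above; the function name and the add-k option (standing in for the add-2 smoothing shown in the tables) are illustrative assumptions:

```python
import numpy as np

# Raw co-occurrence counts for the four running-example words (rows)
# and five context words (columns), as in the lecture tables.
words    = ["apricot", "pineapple", "digital", "information"]
contexts = ["computer", "data", "pinch", "result", "sugar"]
F = np.array([[0, 0, 1, 0, 1],
              [0, 0, 1, 0, 1],
              [2, 1, 0, 1, 0],
              [1, 6, 0, 4, 0]], dtype=float)

def ppmi(F, k=0.0):
    """Positive PMI from a count matrix F; k > 0 applies add-k smoothing."""
    F = F + k
    total = F.sum()
    p_wc = F / total                              # joint probabilities p(w, c)
    p_w  = F.sum(axis=1, keepdims=True) / total   # p(w)
    p_c  = F.sum(axis=0, keepdims=True) / total   # p(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0.0)                   # clip negative (and -inf) PMI to 0

print(np.round(ppmi(F), 2))        # raw PPMI (zero-count cells come out as 0.0)
print(np.round(ppmi(F, k=2), 2))   # add-2 smoothed PPMI, matching the table above
```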
Vector Semantics
Measuring similarity: the cosine
Solution: the cosine
• Just divide the dot product by the lengths of the two vectors!
• This turns out to be the cosine of the angle between them!

               computer  data  pinch  result  sugar
  apricot         0        0    0.56     0     0.56
  pineapple       0        0    0.56     0     0.56
  digital        0.62      0      0      0       0
  information     0       0.58    0     0.37     0
Figure 19.6: The add-2 Laplace smoothed PPMI matrix from the add-2 smoothing counts in Fig. 19.5.

The cosine, like most measures for vector similarity used in NLP, is based on the dot product operator from linear algebra, also called the inner product:

$\text{dot-product}(\vec v, \vec w) = \vec v \cdot \vec w = \sum_{i=1}^{N} v_i w_i = v_1 w_1 + v_2 w_2 + \dots + v_N w_N \qquad (19.10)$

Intuitively, the dot product acts as a similarity metric because it will tend to be high just when the two vectors have large values in the same dimensions. Alternatively, vectors that have zeros in different dimensions (orthogonal vectors) will be very dissimilar, with a dot product of 0.

This raw dot product, however, has a problem as a similarity metric: it favors long vectors. The vector length is defined as

$|\vec v| = \sqrt{\sum_{i=1}^{N} v_i^2} \qquad (19.11)$

The dot product is higher if a vector is longer, with higher values in each dimension. More frequent words have longer vectors, since they tend to co-occur with more words and have higher co-occurrence values with each of them. Raw dot product thus will be higher for frequent words. But this is a problem; we'd like a similarity metric that tells us how similar two words are regardless of their frequency.

The simplest way to modify the dot product to normalize for the vector length is to divide the dot product by the lengths of each of the two vectors. This normalized dot product turns out to be the same as the cosine of the angle between the two vectors, following from the definition of the dot product between two vectors $\vec a$ and $\vec b$:

$\vec a \cdot \vec b = |\vec a||\vec b|\cos\theta \qquad\Rightarrow\qquad \frac{\vec a \cdot \vec b}{|\vec a||\vec b|} = \cos\theta \qquad (19.12)$

The cosine similarity metric between two vectors $\vec v$ and $\vec w$ thus can be computed as:

$\text{cosine}(\vec v, \vec w) = \frac{\vec v \cdot \vec w}{|\vec v||\vec w|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\;\sqrt{\sum_{i=1}^{N} w_i^2}} \qquad (19.13)$

For some applications we pre-normalize each vector, by dividing it by its length, creating a unit vector of length 1. Thus we could compute a unit vector from $\vec a$ by dividing it by $|\vec a|$.
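A quick sketch of Eq. 19.13 in Python, reusing PPMI-style vectors from the earlier tables (the specific vectors and function name are illustrative):

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# PPMI-style vectors over the contexts [computer, data, pinch, result, sugar]
apricot   = np.array([0.00, 0.00, 2.25, 0.00, 2.25])
pineapple = np.array([0.00, 0.00, 2.25, 0.00, 2.25])
digital   = np.array([1.66, 0.00, 0.00, 0.00, 0.00])

print(cosine(apricot, pineapple))  # 1.0: identical context profiles
print(cosine(apricot, digital))    # 0.0: no overlapping contexts
```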
Cosine for computing similarity

$\cos(\vec v, \vec w) = \frac{\vec v \cdot \vec w}{|\vec v||\vec w|} = \frac{\vec v}{|\vec v|} \cdot \frac{\vec w}{|\vec w|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\;\sqrt{\sum_{i=1}^{N} w_i^2}}$

Here $v_i$ is the PPMI value for word v in context i, $w_i$ is the PPMI value for word w in context i, and cos(v, w) is the cosine similarity of v and w.

Cosine as a similarity metric
• -1: vectors point in opposite directions
• +1: vectors point in the same direction
• 0: vectors are orthogonal
• Raw frequency or PPMI values are non-negative, so the cosine ranges from 0 to 1

Clustering vectors to visualize similarity in co-occurrence matrices
[Figure from Rohde, Gonnerman, and Plaut, "Modeling Word Meaning Using Lexical Co-Occurrence" (Rohde et al., 2006): Figure 8 shows a multidimensional scaling and Figure 9 a hierarchical clustering (using distances based on vector correlations) of three noun classes: body parts (HEAD, HAND, EYE, ...), animals (DOG, CAT, LION, ...), and place names (AMERICA, CHICAGO, FRANCE, ...).]
Other possible similarity measures

$\text{sim}_{\text{cosine}}(\vec v, \vec w) = \frac{\vec v \cdot \vec w}{|\vec v||\vec w|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\;\sqrt{\sum_{i=1}^{N} w_i^2}}$

$\text{sim}_{\text{Jaccard}}(\vec v, \vec w) = \frac{\sum_{i=1}^{N} \min(v_i, w_i)}{\sum_{i=1}^{N} \max(v_i, w_i)}$

$\text{sim}_{\text{Dice}}(\vec v, \vec w) = \frac{2 \times \sum_{i=1}^{N} \min(v_i, w_i)}{\sum_{i=1}^{N} (v_i + w_i)}$

$\text{sim}_{\text{JS}}(\vec v \,\|\, \vec w) = D\!\left(\vec v \,\Big\|\, \frac{\vec v + \vec w}{2}\right) + D\!\left(\vec w \,\Big\|\, \frac{\vec v + \vec w}{2}\right)$
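A small sketch of the Jaccard and Dice variants for non-negative weight vectors (again an illustrative sketch, with made-up vectors):

```python
import numpy as np

def sim_jaccard(v, w):
    """Generalized Jaccard similarity for non-negative weight vectors."""
    return np.minimum(v, w).sum() / np.maximum(v, w).sum()

def sim_dice(v, w):
    """Dice similarity for non-negative weight vectors."""
    return 2 * np.minimum(v, w).sum() / (v + w).sum()

v = np.array([0.00, 0.00, 2.25, 0.00, 2.25])
w = np.array([1.66, 0.00, 0.00, 0.00, 0.00])
print(sim_jaccard(v, w), sim_dice(v, w))   # both 0.0: the vectors share no non-zero dimension
```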
Vector Semantics
Measuring similarity: the cosine
Co-occurrence vectors based on syntactic dependencies (Dekang Lin, 1998, "Automatic Retrieval and Clustering of Similar Words")
• Each dimension: a context word in one of R grammatical relations
  • e.g. subject-of "absorb"
• Instead of a vector of |V| features, a vector of R|V|
• Example: counts for the word cell

Syntactic dependencies for dimensions
• Alternative (Padó and Lapata 2007):
  • Instead of having a |V| × R|V| matrix, have a |V| × |V| matrix
  • But the co-occurrence counts aren't just counts of words in a window; they are counts of words that occur in one of R dependencies (subject, object, etc.)
  • So M("cell","absorb") = count(subj(cell,absorb)) + count(obj(cell,absorb)) + count(pobj(cell,absorb)), etc. (see the counting sketch after the table below)

PMI applied to dependency relations (Hindle, Don. 1990. "Noun Classification from Predicate-Argument Structure." ACL)
• "Drink it" is more common than "drink wine"
• But "wine" is a better "drinkable" thing than "it"

  Object of "drink"   Count   PMI
  it                    3      1.3
  anything              3      5.2
  wine                  2      9.3
  tea                   2     11.8
  liquid                2     10.5
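A minimal sketch of the Padó-and-Lapata-style counting described above, assuming a hypothetical list of parsed (word, relation, context word) triples; the triples and counts here are illustrative only:

```python
from collections import defaultdict

# Hypothetical parsed dependency triples: (word, relation, context_word).
triples = [
    ("cell", "subj", "absorb"),
    ("cell", "obj",  "absorb"),
    ("cell", "pobj", "absorb"),
    ("wine", "obj",  "drink"),
]

# M[word][context] sums the counts over all R relations,
# giving a |V| x |V| matrix rather than a |V| x R|V| one.
M = defaultdict(lambda: defaultdict(int))
for word, rel, context in triples:
    M[word][context] += 1

print(M["cell"]["absorb"])   # 3 = subj + obj + pobj counts
```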
Vector Semantics
Dense Vectors
Sparse versus dense vectors
• PPMI vectors are
  • long (length |V| = 20,000 to 50,000)
  • sparse (most elements are zero)
• Alternative: learn vectors which are
  • short (length 200-1000)
  • dense (most elements are non-zero)

Why dense vectors?
• Short vectors may be easier to use as features in machine learning (fewer weights to tune)
• Dense vectors may generalize better than storing explicit counts
• They may do better at capturing synonymy:
  • car and automobile are synonyms, but they are represented as distinct dimensions; this fails to capture the similarity between a word with car as a neighbor and a word with automobile as a neighbor

Intuition
• Approximate an N-dimensional dataset using fewer dimensions
  • by first rotating the axes into a new space
  • in which the highest order dimension captures the most variance in the original dataset
  • and the next dimension captures the next most variance, etc.
• Many such (related) methods:
  • PCA (principal components analysis)
  • Factor Analysis
  • SVD

[Figure: dimensionality reduction; points plotted on the original axes are re-described along PCA dimension 1 and PCA dimension 2.]

Singular Value Decomposition
Any rectangular w × c matrix X equals the product of 3 matrices:
• W: rows corresponding to the original rows, but each of the m columns now represents a dimension in a new latent space, such that
  • the m column vectors are orthogonal to each other
  • the columns are ordered by the amount of variance in the dataset each new dimension accounts for
• S: diagonal m × m matrix of singular values expressing the importance of each dimension
• C: columns corresponding to the original columns, but the m rows correspond to the singular values

LSA more details
• 300 dimensions are commonly used
• The cells are commonly weighted by a product of two weights
  • Local weight: log term frequency
  • Global weight: either idf or an entropy measure

Let's return to PPMI word-word matrices
• Can we apply SVD to them?

SVD applied to the term-term matrix

Singular Value Decomposition (SVD) is a method for finding the most important dimensions of a data set, those dimensions along which the data varies the most. It can be applied to any rectangular matrix, and in language processing it was first applied to the task of generating embeddings from term-document matrices by Deerwester et al. (1988) in a model called Latent Semantic Indexing. In this section let's look just at its application to a square term-context matrix M with |V| rows (one for each word) and |V| columns (one for each context word).

SVD factorizes M into the product of three square |V| × |V| matrices W, S, and Cᵀ. In W each row still represents a word, but the columns do not; each column now represents a dimension in a latent space, such that the |V| column vectors are orthogonal to each other and the columns are ordered by the amount of variance in the original dataset each accounts for. S is a diagonal |V| × |V| matrix, with singular values along the diagonal, expressing the importance of each dimension. The |V| × |V| matrix Cᵀ still represents contexts, but the rows now represent the new latent dimensions and the |V| row vectors are orthogonal to each other.

By using only the first k dimensions of W, S, and C instead of all |V| dimensions, the product of these 3 matrices becomes a least-squares approximation to the original M.
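A minimal NumPy sketch of this truncation; the tiny matrix reuses the PPMI numbers from earlier and is only illustrative (a real X would be |V| × |V|):

```python
import numpy as np

# X stands in for a (tiny) word-context PPMI matrix.
X = np.array([[0.00, 0.00, 2.25, 0.00, 2.25],
              [0.00, 0.00, 2.25, 0.00, 2.25],
              [1.66, 0.00, 0.00, 0.00, 0.00],
              [0.00, 0.57, 0.00, 0.47, 0.00]])

# Full SVD: X = W @ diag(s) @ C, with singular values sorted by importance.
W, s, C = np.linalg.svd(X, full_matrices=False)

k = 2                          # keep only the top-k latent dimensions
W_k = W[:, :k]                 # one k-dimensional row ("embedding") per word
X_approx = W_k @ np.diag(s[:k]) @ C[:k, :]   # least-squares rank-k approximation

print(np.round(W_k, 2))        # dense embeddings
print(np.round(X_approx, 2))   # reconstruction of X from the top-k dimensions
```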
Since the first dimensions encode the most variance, one way to view the reconstruction is thus as modeling the most important information in the original dataset.

SVD applied to the co-occurrence matrix X:

$X_{(|V| \times |V|)} = W_{(|V| \times |V|)}\;\mathrm{diag}(\sigma_1, \dots, \sigma_V)_{(|V| \times |V|)}\;C_{(|V| \times |V|)}$

Taking only the top k dimensions after SVD is applied to the co-occurrence matrix X:

$X_{(|V| \times |V|)} \approx W_{(|V| \times k)}\;\mathrm{diag}(\sigma_1, \dots, \sigma_k)_{(k \times k)}\;C_{(k \times |V|)}$

(I'm simplifying here by assuming the matrix has rank |V|.)

Figure 19.11: SVD factors a matrix X into a product of three matrices, W, S, and C. Taking the first k dimensions gives a |V| × k matrix Wk that has one k-dimensioned row per word that can be used as an embedding.

Using only the top k dimensions (corresponding to the k most important singular values) leads to a reduced |V| × k matrix Wk, with one k-dimensioned row per word. This row now acts as a dense k-dimensional vector (embedding) representing that word, substituting for the very high-dimensional rows of the original M. (Note that early systems often instead weighted Wk by the singular values, using the product Wk·Sk as an embedding instead of just the matrix Wk, but this weighting leads to significantly worse embeddings (Levy et al., 2015).)

Embeddings versus sparse vectors
• Dense SVD embeddings sometimes work better than sparse PPMI matrices at tasks like word similarity
  • Denoising: low-order dimensions may represent unimportant information
  • Truncation may help the models generalize better to unseen data
  • Having a smaller number of dimensions may make it easier for classifiers to properly weight the dimensions for the task
  • Dense models may do better at capturing higher-order co-occurrence

Vector Semantics
Embeddings inspired by neural language models: skip-grams and CBOW

Prediction-based models: an alternative way to get dense vectors
• Skip-gram (Mikolov et al. 2013a), CBOW (Mikolov et al. 2013b)
• Learn embeddings as part of the process of word prediction:
  • Train a neural network to predict neighboring words
  • Inspired by neural net language models
  • In so doing, learn dense embeddings for the words in the training corpus
• Advantages:
  • Fast, easy to train (much faster than SVD)
  • Available online in the word2vec package, including sets of pretrained embeddings!

Setup
• Walking through the corpus, we point at word w(t), whose index in the vocabulary is j, so we'll call it $w_j$ ($1 \le j \le |V|$).
• Let's predict w(t+1), whose index in the vocabulary is k ($1 \le k \le |V|$). Hence our task is to compute $P(w_k \mid w_j)$.
One-hot vectors
• A vector of length |V|
• 1 for the target word and 0 for other words
• So if "popsicle" is vocabulary word 5, the one-hot vector is [0,0,0,0,1,0,0,0,0, …, 0]

[Figure: the skip-gram architecture. A 1×|V| one-hot input vector x for word w_t is multiplied by the |V|×d input matrix W to give the 1×d projection layer (the embedding for w_t); copies of the d×|V| output matrix W′ then produce the output probabilities of the context words w_{t−1} and w_{t+1}.]

Turning outputs into probabilities
• $o_k = v'_k \cdot v_j$
• We use the softmax to turn the scores into probabilities

Figure 19.12: The skip-gram model (Mikolov et al. 2013, Mikolov et al. 2013a).

We begin with an input vector x, which is a one-hot vector for the current word $w_j$ (hence $x_j = 1$ and $x_i = 0$ for all $i \neq j$). We then predict the probability of each of the 2C output words (in Fig. 19.12 that means the two output words $w_{t-1}$ and $w_{t+1}$) in 3 steps:

1. x is multiplied by W, the input matrix, to give the hidden or projection layer. Since each row of the input matrix W is just an embedding for one word, and the input is a one-hot vector for $w_j$, the projection layer for input x will be $h = v_j$, the input embedding for $w_j$.
2. For each of the 2C context words we now multiply the projection vector h by the output matrix W′. The result for each context word, o = W′h, is a 1 × |V| dimensional output vector giving a score for each of the |V| vocabulary words. In doing so, the element $o_k$ was computed by multiplying h by the output embedding for word $w_k$: $o_k = v'_k h$.
3. Finally, for each context word we normalize this score vector, turning the score for each element $o_k$ into a probability by using the softmax function:

$p(w_k \mid w_j) = \frac{\exp(v'_k \cdot v_j)}{\sum_{w' \in V} \exp(v'_{w'} \cdot v_j)} \qquad (19.24)$

The next section explores how the embeddings, the matrices W and W′, are learned. Once they are learned, we'll have two embeddings for each word $w_i$: $v_i$ and $v'_i$. We can just choose to use the input embedding $v_i$ from W, or we can add the two and use the embedding $v_i + v'_i$ as the new d-dimensional embedding, or we can concatenate them into an embedding of dimensionality 2d.

As with the simple count-based methods like PPMI, the context window size C affects the performance of skip-gram embeddings, and experiments often tune the parameter C on a dev set. As with PPMI, window sizing leads to qualitative differences: smaller windows capture more syntactic information, larger ones more semantic and relational information. One difference from the count-based methods is that for skip-grams, the larger the window size the more computation the algorithm requires for training (more neighboring words must be predicted). See the end of the chapter for a pointer to surveys which have explored parameterizations like window size for different tasks.

Embeddings from W and W′
• Since we have two embeddings, $v_j$ and $v'_j$, for each word $w_j$, we can either:
  • just use $v_j$,
  • sum them, or
  • concatenate them to make a double-length embedding

But wait: how do we learn the embeddings?

19.4.1 Learning the input and output embeddings

There are various ways to learn skip-grams; we'll sketch here just the outline of a simple version based on Eq. 19.24.
The goal of the model is to learn representations (the embedding matrices W and W′; we'll refer to them collectively as the parameters θ) that do well at predicting the context words, maximizing the log likelihood of the corpus, Text.

$\operatorname*{argmax}_{\theta} \log p(\text{Text}) \qquad (19.25)$

We'll first make the naive Bayes assumption that the input word at time t is independent of the other input words:

$\operatorname*{argmax}_{\theta} \log \prod_{t=1}^{T} p(w^{(t-C)}, \dots, w^{(t-1)}, w^{(t+1)}, \dots, w^{(t+C)}) \qquad (19.26)$

We'll also assume that the probability of each context (output) word is independent of the other outputs:

$\operatorname*{argmax}_{\theta} \sum_{-C \le j \le C,\, j \neq 0} \log p(w^{(t+j)} \mid w^{(t)}) \qquad (19.27)$

We now substitute in Eq. 19.24:

$= \operatorname*{argmax}_{\theta} \sum_{t=1}^{T} \sum_{-C \le j \le C,\, j \neq 0} \log \frac{\exp(v'^{(t+j)} \cdot v^{(t)})}{\sum_{w \in V} \exp(v'_w \cdot v^{(t)})} \qquad (19.28)$

With some rearranging:

$= \operatorname*{argmax}_{\theta} \sum_{t=1}^{T} \sum_{-C \le j \le C,\, j \neq 0} \left[ v'^{(t+j)} \cdot v^{(t)} - \log \sum_{w \in V} \exp(v'_w \cdot v^{(t)}) \right] \qquad (19.29)$

Eq. 19.29 shows that we are looking to set the parameters θ (the embedding matrices W and W′) in a way that maximizes the similarity between each word $w^{(t)}$ and its nearby context words $w^{(t+j)}$, while minimizing the similarity between word $w^{(t)}$ and all the words in the vocabulary.

The actual training objective for skip-gram, the negative sampling approach, is somewhat different; because it's so time-consuming to sum over all the words in the vocabulary V, the algorithm merely chooses a few negative samples to minimize rather than every word. The training proceeds by stochastic gradient descent, using error backpropagation as described in Chapter 5 (Mikolov et al., 2013a).

There is an interesting relationship between skip-grams, SVD/LSA, and PPMI. If we multiply the two context matrices W·W′ᵀ, we produce a |V| × |V| matrix X, each entry $m_{ij}$ corresponding to some association between input word i and output word j. Levy and Goldberg (2014b) show that skip-gram's optimal value occurs when this product is a shifted version of the PMI matrix.
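An illustrative sketch of Eqs. 19.24 and 19.29 with tiny random embeddings (not the word2vec implementation; real training would update W and W′ by SGD with negative sampling as described above):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                       # toy vocabulary size and embedding dimension
W     = rng.normal(size=(V, d))   # input embeddings  v_j  (one row per word)
W_out = rng.normal(size=(V, d))   # output embeddings v'_k (one row per word)

def p_context_given_target(j):
    """Eq. 19.24: softmax over the dot products v'_k . v_j for all k."""
    scores = W_out @ W[j]                  # o_k = v'_k . v_j for every k
    exp = np.exp(scores - scores.max())    # subtract max for numerical stability
    return exp / exp.sum()

def log_objective_term(j, k):
    """One term of Eq. 19.29: v'_k . v_j - log sum_w exp(v'_w . v_j)."""
    scores = W_out @ W[j]
    return scores[k] - np.log(np.sum(np.exp(scores)))

probs = p_context_given_target(j=2)
print(probs, probs.sum())              # a proper distribution over the vocabulary
print(log_objective_term(j=2, k=5))    # equals log p(w_5 | w_2)
```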
Properties of embeddings
• Nearest words to some embeddings (Mikolov et al. 2013a)

… matrix is repeated between each one-hot input and the projection layer h. For the case of C = 1, these two embeddings must be combined into the projection layer, which is done by multiplying each one-hot context vector x by W to give us two input vectors (let's say $v_i$ and $v_j$). We then average these vectors:

$h = W \cdot \frac{1}{2C} \sum_{-C \le j \le C,\, j \neq 0} v^{(j)} \qquad (19.31)$

As with skip-grams, the projection vector h is multiplied by the output matrix W′. The result o = W′h is a 1 × |V| dimensional output vector giving a score for each of the |V| words. In doing so, the element $o_k$ was computed by multiplying h by the output embedding for word $w_k$: $o_k = v'_k h$. Finally we normalize this score vector, turning the score for each element $o_k$ into a probability by using the softmax function.

19.5 Properties of embeddings

We'll discuss in Section 19.8 how to evaluate the quality of different embeddings. But it is also sometimes helpful to visualize them. Fig. 19.14 shows the words/phrases that are most similar to some sample words using the phrase-based version of the skip-gram algorithm (Mikolov et al., 2013a).

  target:  Redmond             Havel                     ninjutsu       graffiti      capitulate
           Redmond Wash.       Vaclav Havel              ninja          spray paint   capitulation
           Redmond Washington  president Vaclav Havel    martial arts   grafitti      capitulated
           Microsoft           Velvet Revolution         swordsmanship  taggers       capitulating
Figure 19.14: Examples of the closest tokens to some target words using a phrase-based extension of the skip-gram algorithm (Mikolov et al., 2013a).

One semantic property of various kinds of embeddings that may play a role in their usefulness is their ability to capture relational meanings. Mikolov et al. (2013b) demonstrates that the offsets between vector embeddings can capture some relations between words, for example that the result of the expression vector('king') - vector('man') + vector('woman') is a vector close to vector('queen'); the left panel in Fig. 19.15 visualizes this by projecting a representation down into 2 dimensions. Similarly, they found that the expression vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'). Levy and Goldberg (2014a) shows that various other kinds of embeddings also seem to have this property. We return in the next section to these relational properties of embeddings and how they relate to meaning compositionality: the way the meaning of a phrase is built up out of the meaning of the individual vectors.

19.6 Compositionality in Vector Models of Meaning

To be written.

Embeddings capture relational meaning!
vector('king') - vector('man') + vector('woman') ≈ vector('queen')
vector('Paris') - vector('France') + vector('Italy') ≈ vector('Rome')

Figure 19.15: Vector offsets showing relational properties of the vector space, shown by projecting vectors onto two dimensions using PCA. In the left panel, 'king' - 'man' + 'woman' is close to 'queen'. In the right, we see the way offsets seem to capture grammatical number (Mikolov et al., 2013b).
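A toy sketch of this offset arithmetic; the embeddings below are made-up numbers chosen so the analogy works, whereas real analogies require vectors trained on a large corpus:

```python
import numpy as np

# Hypothetical toy embeddings (illustrative only).
emb = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.9]),
    "queen": np.array([0.8, 0.9, 0.9]),
}

def nearest(target_vec, emb, exclude=()):
    """Return the vocabulary word whose embedding has the highest cosine with target_vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = {w: cos(target_vec, v) for w, v in emb.items() if w not in exclude}
    return max(candidates, key=candidates.get)

v = emb["king"] - emb["man"] + emb["woman"]
print(nearest(v, emb, exclude={"king", "man", "woman"}))   # -> "queen" in this toy setup
```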
Vector Semantics
Brown clustering
Brown clusters as vectors
• By tracing the order in which clusters are merged, the model builds a binary tree from bottom to top.
• Each word is represented by a binary string = the path from the root to its leaf
• Each intermediate node is a cluster
• chairman is 0010, "months" is 01, and verbs are 1

Brown algorithm
• Words are merged according to contextual similarity
• Clusters are equivalent to bit-string prefixes
• Prefix length determines the granularity of the clustering

[Figure 19.16: Brown clustering as a binary tree whose leaves are the words CEO, chairman, president, October, November, run, sprint, walk. A full binary string represents a word; each binary prefix represents a larger class to which the word belongs and can be used as a vector representation for the word. After Koo et al. (2008).]

Brown cluster examples

After clustering, a word can be represented by the binary string that corresponds to its path from the root node: 0 for left, 1 for right, at each choice point in the binary tree. For example in Fig. 19.16, the word chairman is the vector 0010 and October is 011. Since Brown clustering is a hard clustering algorithm (each word has only one cluster), there is just one string per word.

Now we can extract useful features by taking the binary prefixes of this bit string; each prefix represents a cluster to which the word belongs. For example the string 01 in the figure represents the cluster of month names {November, October}, the string 001 the names of common nouns for corporate executives {chairman, president}, 1 is verbs {run, sprint, walk}, and 0 is nouns. These prefixes can then be used as a vector representation for the word; the shorter the prefix, the more abstract the cluster. The length of the vector representation can thus be adjusted to fit the needs of the particular task. Koo et al. (2008) improved parsing by using multiple features: a 4-6 bit prefix to capture part-of-speech information and a full bit string to represent words. Spitkovsky et al. (2011) show that vectors made of the first 8 or 9 bits of a Brown clustering perform well at grammar induction.

Because they are based on immediately neighboring words, Brown clusters are most commonly used for representing the syntactic properties of words, and hence are commonly used as a feature in parsers. Nonetheless, the clusters do represent some semantic properties as well. Fig. 19.17 shows some examples from a large clustering from Brown et al. (1992).
  Friday Monday Thursday Wednesday Tuesday Saturday Sunday weekends Sundays Saturdays
  June March July April January December October November September August
  pressure temperature permeability density porosity stress velocity viscosity gravity tension
  anyone someone anybody somebody
  had hadn't hath would've could've should've must've might've
  asking telling wondering instructing informing kidding reminding bothering thanking deposing
  mother wife father son husband brother daughter sister boss uncle
  great big vast sudden mere sheer gigantic lifelong scant colossal
  down backwards ashore sideways southward northward overboard aloft downwards adrift
Figure 19.17: Some sample Brown clusters from a 260,741-word vocabulary trained on 366 million words of running text (Brown et al., 1992). Note the mixed syntactic-semantic nature of the clusters.

Class-based language model
• Suppose each word was in some class $c_i$:

19.7 Brown Clustering

Brown clustering (Brown et al., 1992) is an agglomerative clustering algorithm for deriving vector representations of words by clustering words based on their associations with the preceding or following words.

The algorithm makes use of the class-based language model (Brown et al., 1992), a model in which each word $w \in V$ belongs to a class $c \in C$ with a probability $P(w \mid c)$. Class-based LMs assign a probability to a pair of words $w_{i-1}$ and $w_i$ by modeling the transition between classes rather than between words:

$P(w_i \mid w_{i-1}) = P(c_i \mid c_{i-1})\,P(w_i \mid c_i) \qquad (19.32)$

The class-based LM can be used to assign a probability to an entire corpus given a particular clustering C as follows:

$P(\text{corpus} \mid C) = \prod_{i=1}^{n} P(c_i \mid c_{i-1})\,P(w_i \mid c_i) \qquad (19.33)$

Class-based language models are generally not used as a language model for applications like machine translation or speech recognition because they don't work as well as standard n-gram or neural language models. But they are an important component in Brown clustering.

Brown clustering is a hierarchical clustering algorithm. Let's consider a naive (albeit inefficient) version of the algorithm:
1. Each word is initially assigned to its own cluster.
2. We now consider merging each pair of clusters. The pair whose merger results in the smallest decrease in the likelihood of the corpus (according to the class-based language model) is merged.
3. Clustering proceeds until all words are in one big cluster.

Two words are thus most likely to be clustered if they have similar probabilities for preceding and following words, leading to more coherent clusters. The result is that words will be merged if they are contextually similar.
By tracing the order in which clusters are merged, the model builds a binary tree from bottom to top, in which the leaves are the words in the vocabulary and each intermediate node in the tree represents the cluster that is formed by merging its children. Fig. 19.16 shows a schematic view of a part of a tree.

Note that the naive version of the Brown clustering algorithm described above is extremely inefficient, O(n^5), because it has to consider every possible pair of merges: at each of the n iterations the algorithm considers each of the n^2 candidate merges, and for each merge it computes the value of the clustering by summing over n^2 terms. In practice we use more efficient O(n^3) algorithms that use tables to pre-compute the values for each merge (Brown et al. 1992, Liang 2005).
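As a small illustration of using bit-string prefixes as features in the Koo et al. style, assuming a hypothetical word-to-path table (only chairman = 0010 and October = 011 come from the figure; the other path is made up):

```python
# Hypothetical word -> Brown bit-string mapping, in the spirit of Fig. 19.16.
paths = {
    "chairman": "0010",
    "october":  "011",
    "run":      "10",     # invented path for illustration
}

def prefix_features(word, lengths=(2, 4)):
    """Return the bit-string prefixes of a word's Brown path, one per requested length
    (shorter prefixes = more abstract clusters), plus the full path itself."""
    bits = paths[word]
    feats = [f"brown{n}={bits[:n]}" for n in lengths if len(bits) >= n]
    feats.append(f"brown_full={bits}")
    return feats

print(prefix_features("chairman"))   # ['brown2=00', 'brown4=0010', 'brown_full=0010']
```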