Vector Semantics, Lecture notes of Vector Analysis

These notes cover the use of vector models of meaning for computing the similarity between words. They highlight the problems with thesaurus-based meaning, introduce distributional models of meaning and the intuition behind distributional word similarity, and describe the use of co-occurrence matrices, including the term-document matrix and the word-word (word-context) matrix.


Vector Semantics: Introduction (lecture notes based on slides by Dan Jurafsky)

Why vector models of meaning? Computing the similarity between words:
• "fast" is similar to "rapid"
• "tall" is similar to "height"
• Question answering: Q: "How tall is Mt. Everest?" Candidate A: "The official height of Mount Everest is 29,029 feet."

Problems with thesaurus-based meaning:
• We don't have a thesaurus for every language.
• We can't have a thesaurus for every year; for historical linguistics, we need to compare word meanings in year t to year t+1.
• Thesauruses have problems with recall: many words and phrases are missing.
• Thesauri work less well for verbs and adjectives.

Distributional models of meaning = vector-space models of meaning = vector semantics.
Intuitions: Zellig Harris (1954): "oculist and eye-doctor … occur in almost the same environments"; "If A and B have almost identical environments we say that they are synonyms." Firth (1957): "You shall know a word by the company it keeps!"

Intuition of distributional word similarity (Nida's example): suppose I asked you, what is tesgüino?
  A bottle of tesgüino is on the table.
  Everybody likes tesgüino.
  Tesgüino makes you drunk.
  We make tesgüino out of corn.
From the context words, humans can guess that tesgüino means an alcoholic beverage like beer. Intuition for the algorithm: two words are similar if they have similar word contexts.
Vector Semantics: Words and co-occurrence vectors

Co-occurrence matrices. We represent either how often a word occurs in a document (the term-document matrix) or how often a word occurs with another word (the term-term matrix, also called the word-word co-occurrence matrix or word-context matrix).

Term-document matrix. Each cell is the count of word w in document d; each document is a count vector in N^|V| (a column below):

              As You Like It   Twelfth Night   Julius Caesar   Henry V
  battle            1                1               8            15
  soldier           2                2              12            36
  fool             37               58               1             5
  clown             6              117               0             0

The words in a term-document matrix: two words are similar if their vectors are similar.

The word-word or word-context matrix. Instead of entire documents, use smaller contexts: a paragraph, or a window of ±4 words. A word is now defined by a vector over counts of context words. Instead of each vector being of length D, each vector is now of length |V|, and the word-word matrix is |V| × |V|.

Word-word matrix, sample contexts of ±7 words:

              aardvark  computer  data  pinch  result  sugar  …
  apricot        0         0       0      1      0       1
  pineapple      0         0       0      1      0       1
  digital        0         2       1      0      1       0
  information    0         1       6      0      4       0

[Textbook, Section 19.1, Words and Vectors:] … vectors of numbers representing the terms (words) that occur within the collection (Salton, 1971). In information retrieval these numbers are called the term weight, a function of the term's frequency in the document. More generally, the term-document matrix X has |V| rows (one for each word type in the vocabulary) and D columns (one for each document in the collection). Each column represents a document. A query is also represented by a vector q of length |V|. We find the most relevant document to a query by finding the document whose vector is most similar to the query; later in the chapter we'll introduce some of the components of this process: the tf-idf term weighting and the cosine similarity metric.

But now let's turn to the insight of vector semantics for representing the meaning of words. The idea is that we can also represent each word by a vector, now a row vector representing the counts of the word's occurrence in each document. Thus the vectors for fool [37, 58, 1, 5] and clown [6, 117, 0, 0] are more similar to each other (occurring more in the comedies), while battle [1, 1, 8, 15] and soldier [2, 2, 12, 36] are more similar to each other (occurring less in the comedies).

More commonly used for vector semantics than this term-document matrix is an alternative formulation, the term-term matrix, more commonly called the word-word matrix or the term-context matrix, in which the columns are labeled by words rather than documents. This matrix is thus of dimensionality |V| × |V|, and each cell records the number of times the row (target) word and the column (context) word co-occur in some context in some training corpus. The context could be the document, in which case the cell represents the number of times the two words appear in the same document. It is most common, however, to use smaller contexts, such as a window around the word, for example of 4 words to the left and 4 words to the right, in which case the cell represents the number of times (in some training corpus) the column word occurs in such a ±4 word window around the row word.
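Before the concrete Brown-corpus example below, here is a minimal sketch of building such a window-based word-word co-occurrence count table. This is illustrative code, not part of the original notes; the toy corpus and the function name are assumptions.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(sentences, window=4):
    """Count how often each context word appears within +/-window of each target word."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

# Toy corpus (illustrative only)
corpus = [
    "we make tesguino out of corn".split(),
    "everybody likes tesguino".split(),
    "a bottle of tesguino is on the table".split(),
]
counts = cooccurrence_counts(corpus, window=4)
print(counts["tesguino"])  # e.g. Counter({'of': 2, 'we': 1, 'make': 1, ...})
```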
For example, here are ±7-word windows surrounding four sample words from the Brown corpus (just one example of each word):

  sugar, a sliced lemon, a tablespoonful of   apricot   preserve or jam, a pinch each of,
  their enjoyment. Cautiously she sampled her first   pineapple   and another fruit whose taste she likened
  well suited to programming on the   digital   computer. In finding the optimal R-stage policy from
  for the purpose of gathering data and   information   necessary for the study authorized in the

For each word we collect the counts (from the windows around each occurrence) of the occurrences of context words. Fig. 19.2 shows a selection from the word-word co-occurrence matrix computed from the Brown corpus for these four words.

              aardvark  …  computer  data  pinch  result  sugar  …
  apricot        0      …     0       0      1      0       1
  pineapple      0      …     0       0      1      0       1
  digital        0      …     2       1      0      1       0
  information    0      …     1       6      0      4       0

Figure 19.2: Co-occurrence vectors for four words, computed from the Brown corpus, showing only six of the dimensions (hand-picked for pedagogical purposes). Note that a real vector would be vastly more sparse.

The shading in Fig. 19.2 makes clear the intuition that the two words apricot and pineapple are more similar to each other (both pinch and sugar tend to occur in their windows), while digital and information are more similar to each other. Note that |V|, the length of the vector, is generally the size of the vocabulary, usually between 10,000 and 50,000 words (using the most frequent words in the …).

Vector Semantics: Positive Pointwise Mutual Information (PPMI)

Problem with raw counts: raw word frequency is not a great measure of association between words. It is very skewed: "the" and "of" are very frequent, but maybe not the most discriminative. We'd rather have a measure that asks whether a context word is particularly informative about the target word: Positive Pointwise Mutual Information (PPMI).

Pointwise mutual information asks: do events x and y co-occur more often than if they were independent? PMI between two words (Church & Hanks 1989) asks: do words x and y co-occur more often than if they were independent?
  \text{PMI}(word_1, word_2) = \log_2 \frac{P(word_1, word_2)}{P(word_1)\,P(word_2)}
  \text{PMI}(X, Y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}

With counts f_ij of word w_i with context c_j (N total counts):

  p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}, \qquad p(w_i) = \frac{\sum_{j=1}^{C} f_{ij}}{N}, \qquad p(c_j) = \frac{\sum_{i=1}^{W} f_{ij}}{N}

p(w, context) and marginals for the running example:

              computer  data  pinch  result  sugar  | p(w)
  apricot       0.00    0.00   0.05   0.00    0.05  | 0.11
  pineapple     0.00    0.00   0.05   0.00    0.05  | 0.11
  digital       0.11    0.05   0.00   0.05    0.00  | 0.21
  information   0.05    0.32   0.00   0.21    0.00  | 0.58
  p(context)    0.16    0.37   0.11   0.26    0.11  |

Worked example:
  p(w = information, c = data) = 6/19 = .32
  p(w = information) = 11/19 = .58
  p(c = data) = 7/19 = .37

  \text{PMI}_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}}

  pmi(information, data) = \log_2\big(.32 / (.37 \times .58)\big) = .58 \ (\text{.57 using full precision})

PPMI(w, context) (a dash marks cells whose co-occurrence count is zero, so the PMI is negative infinity):

              computer  data  pinch  result  sugar
  apricot        -        -    2.25    -      2.25
  pineapple      -        -    2.25    -      2.25
  digital       1.66     0.00   -     0.00     -
  information   0.00     0.57   -     0.47     -

Weighting PMI: PMI is biased toward infrequent events; very rare words have very high PMI values. Two solutions: give rare words slightly higher probabilities, or use add-one smoothing (which has a similar effect).

Add-2 smoothed count(w, context):

              computer  data  pinch  result  sugar
  apricot        2        2     3      2       3
  pineapple      2        2     3      2       3
  digital        4        3     2      3       2
  information    3        8     2      6       2

p(w, context) [add-2] and marginals:

              computer  data  pinch  result  sugar  | p(w)
  apricot       0.03    0.03   0.05   0.03    0.05  | 0.20
  pineapple     0.03    0.03   0.05   0.03    0.05  | 0.20
  digital       0.07    0.05   0.03   0.05    0.03  | 0.24
  information   0.05    0.14   0.03   0.10    0.03  | 0.36
  p(context)    0.19    0.25   0.17   0.22    0.17  |

PPMI versus add-2 smoothed PPMI:

  PPMI(w, context) [add-2]
              computer  data  pinch  result  sugar
  apricot       0.00    0.00   0.56   0.00    0.56
  pineapple     0.00    0.00   0.56   0.00    0.56
  digital       0.62    0.00   0.00   0.00    0.00
  information   0.00    0.58   0.00   0.37    0.00

  PPMI(w, context)
              computer  data  pinch  result  sugar
  apricot        -        -    2.25    -      2.25
  pineapple      -        -    2.25    -      2.25
  digital       1.66     0.00   -     0.00     -
  information   0.00     0.57   -     0.47     -
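The PPMI computation just illustrated can be sketched in a few lines. This is an illustrative implementation rather than code from the notes: PPMI is taken as max(PMI, 0), and the optional add-k argument mirrors the add-2 smoothing example.

```python
import numpy as np

def ppmi(counts, k=0.0):
    """Positive PMI from a word-by-context count matrix, with optional add-k smoothing."""
    counts = np.asarray(counts, dtype=float) + k
    p_wc = counts / counts.sum()               # joint probabilities p(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)      # marginal p(w)
    p_c = p_wc.sum(axis=0, keepdims=True)      # marginal p(c)
    with np.errstate(divide="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0.0)                # clip negative (and -inf) values to zero

# Counts for apricot, pineapple, digital, information
# (columns: computer, data, pinch, result, sugar)
X = [[0, 0, 1, 0, 1],
     [0, 0, 1, 0, 1],
     [2, 1, 0, 1, 0],
     [1, 6, 0, 4, 0]]
print(np.round(ppmi(X), 2))        # raw PPMI, e.g. ppmi(information, data) is about 0.57
print(np.round(ppmi(X, k=2), 2))   # add-2 smoothed PPMI, about 0.58 for the same cell
```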
Vector Semantics: Measuring similarity: the cosine

Solution: the cosine. Just divide the dot product by the lengths of the two vectors; this turns out to be the cosine of the angle between them.

[Textbook, Section 19.2, Sparse Vector Models: Positive Pointwise Mutual Information. Figure 19.6, the add-2 Laplace smoothed PPMI matrix computed from the add-2 smoothed counts, is the table shown above.]

The cosine, like most measures for vector similarity used in NLP, is based on the dot product operator from linear algebra, also called the inner product:

  \text{dot-product}(\vec{v},\vec{w}) = \vec{v}\cdot\vec{w} = \sum_{i=1}^{N} v_i w_i = v_1 w_1 + v_2 w_2 + \dots + v_N w_N     (19.10)

Intuitively, the dot product acts as a similarity metric because it will tend to be high just when the two vectors have large values in the same dimensions. Alternatively, vectors that have zeros in different dimensions (orthogonal vectors) will be very dissimilar, with a dot product of 0.

This raw dot product, however, has a problem as a similarity metric: it favors long vectors. The vector length is defined as

  |\vec{v}| = \sqrt{\sum_{i=1}^{N} v_i^2}     (19.11)

The dot product is higher if a vector is longer, with higher values in each dimension. More frequent words have longer vectors, since they tend to co-occur with more words and have higher co-occurrence values with each of them. Raw dot product thus will be higher for frequent words. But this is a problem; we'd like a similarity metric that tells us how similar two words are regardless of their frequency.

The simplest way to modify the dot product to normalize for vector length is to divide the dot product by the lengths of each of the two vectors. This normalized dot product turns out to be the same as the cosine of the angle between the two vectors, following from the definition of the dot product between two vectors a and b:

  \vec{a}\cdot\vec{b} = |\vec{a}|\,|\vec{b}|\cos\theta \qquad\Rightarrow\qquad \frac{\vec{a}\cdot\vec{b}}{|\vec{a}|\,|\vec{b}|} = \cos\theta     (19.12)

The cosine similarity metric between two vectors v and w thus can be computed as:

  \text{cosine}(\vec{v},\vec{w}) = \frac{\vec{v}\cdot\vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}     (19.13)

For some applications we pre-normalize each vector, by dividing it by its length, creating a unit vector of length 1.
Thus we could compute a unit vector from a vector \vec{a} by dividing it by its length |\vec{a}|.

Cosine for computing similarity:

  \cos(\vec{v},\vec{w}) = \frac{\vec{v}\cdot\vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\vec{v}}{|\vec{v}|}\cdot\frac{\vec{w}}{|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}

Here v_i is the PPMI value for word v in context i, w_i is the PPMI value for word w in context i, and cos(v, w) is the cosine similarity of v and w.

Cosine as a similarity metric:
• -1: vectors point in opposite directions
• +1: vectors point in the same direction
• 0: vectors are orthogonal
• Raw frequency or PPMI values are non-negative, so the cosine ranges from 0 to 1.

Clustering vectors to visualize similarity in co-occurrence matrices. [Figures from Rohde, Gonnerman, and Plaut, "Modeling Word Meaning Using Lexical Co-Occurrence" (Rohde et al. 2006): Figure 8, multidimensional scaling for three noun classes (body parts, animals, and place names), and Figure 9, hierarchical clustering for the same three noun classes using distances based on vector correlations; semantically related words cluster together.]

Other possible similarity measures:

  \text{sim}_{\text{cosine}}(\vec{v},\vec{w}) = \frac{\vec{v}\cdot\vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}

  \text{sim}_{\text{Jaccard}}(\vec{v},\vec{w}) = \frac{\sum_{i=1}^{N} \min(v_i, w_i)}{\sum_{i=1}^{N} \max(v_i, w_i)}

  \text{sim}_{\text{Dice}}(\vec{v},\vec{w}) = \frac{2\sum_{i=1}^{N} \min(v_i, w_i)}{\sum_{i=1}^{N} (v_i + w_i)}

  \text{sim}_{\text{JS}}(\vec{v}\,\|\,\vec{w}) = D\Big(\vec{v}\,\Big\|\,\frac{\vec{v}+\vec{w}}{2}\Big) + D\Big(\vec{w}\,\Big\|\,\frac{\vec{v}+\vec{w}}{2}\Big)
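A minimal sketch of cosine similarity applied to PPMI-style rows; this is illustrative code, not from the notes, and the example vectors (with zeros standing in for the undefined cells) are assumptions.

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    v, w = np.asarray(v, dtype=float), np.asarray(w, dtype=float)
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

# PPMI-style rows over contexts (computer, data, pinch, result, sugar); values illustrative
apricot     = [0.00, 0.00, 2.25, 0.00, 2.25]
pineapple   = [0.00, 0.00, 2.25, 0.00, 2.25]
information = [0.00, 0.57, 0.00, 0.47, 0.00]

print(cosine(apricot, pineapple))    # 1.0: identical directions
print(cosine(apricot, information))  # 0.0: no shared non-zero dimensions
```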
Co-occurrence vectors based on syntactic dependencies (Dekang Lin, 1998, "Automatic Retrieval and Clustering of Similar Words"):
• Each dimension is a context word in one of R grammatical relations, e.g. subject-of "absorb".
• Instead of a vector of |V| features, a word gets a vector of R|V| features.
• Example: counts for the word cell over such dependency contexts.

Syntactic dependencies for dimensions. An alternative (Padó and Lapata 2007):
• Instead of a |V| × R|V| matrix, use a |V| × |V| matrix.
• The co-occurrence counts aren't just counts of words in a window, but counts of words that occur in one of R dependencies (subject, object, etc.).
• So M("cell", "absorb") = count(subj(cell, absorb)) + count(obj(cell, absorb)) + count(pobj(cell, absorb)), etc.

PMI applied to dependency relations (Hindle, Don. 1990. "Noun Classification from Predicate-Argument Structure." ACL): "drink it" is more common than "drink wine", but "wine" is a better "drinkable" thing than "it":

  Object of "drink"   Count   PMI
  it                    3      1.3
  anything              3      5.2
  wine                  2      9.3
  tea                   2     11.8
  liquid                2     10.5

Vector Semantics: Dense Vectors

Sparse versus dense vectors. PPMI vectors are long (length |V| = 20,000 to 50,000) and sparse (most elements are zero). The alternative is to learn vectors that are short (length 200-1000) and dense (most elements are non-zero).

Why dense vectors?
• Short vectors may be easier to use as features in machine learning (fewer weights to tune).
• Dense vectors may generalize better than storing explicit counts.
• They may do better at capturing synonymy: car and automobile are synonyms, but they are represented as distinct dimensions, which fails to capture the similarity between a word with car as a neighbor and a word with automobile as a neighbor.

Intuition: approximate an N-dimensional dataset using fewer dimensions by first rotating the axes into a new space in which the highest-order dimension captures the most variance in the original dataset, the next dimension captures the next most variance, and so on. There are many such (related) methods: PCA (principal components analysis), factor analysis, and SVD.

[Figure: dimensionality reduction, projecting two-dimensional data points onto PCA dimension 1 and PCA dimension 2.]

Singular Value Decomposition. Any rectangular w × c matrix X equals the product of three matrices:
• W: rows corresponding to the original rows, but each of its m columns represents a dimension in a new latent space, such that the m column vectors are orthogonal to each other and the columns are ordered by the amount of variance in the dataset each new dimension accounts for.
• S: a diagonal m × m matrix of singular values expressing the importance of each dimension.
• C: columns corresponding to the original columns, but its m rows correspond to the singular values.

LSA, more details: 300 dimensions are commonly used, and the cells are commonly weighted by a product of two weights, a local weight (log term frequency) and a global weight (either idf or an entropy measure).

Let's return to PPMI word-word matrices: can we apply SVD to them?

SVD applied to a term-term matrix. [Textbook, Section 19.3, Dense Vectors and SVD:] Singular Value Decomposition (SVD) is a method for finding the most important dimensions of a data set, those dimensions along which the data varies the most. It can be applied to any rectangular matrix, and in language processing it was first applied to the task of generating embeddings from term-document matrices by Deerwester et al. (1988) in a model called Latent Semantic Indexing. In this section let's look just at its application to a square term-context matrix M with |V| rows (one for each word) and |V| columns (one for each context word). SVD factorizes M into the product of three square |V| × |V| matrices W, S, and C^T. In W each row still represents a word, but the columns do not; each column now represents a dimension in a latent space, such that the |V| column vectors are orthogonal to each other and the columns are ordered by the amount of variance in the original dataset each accounts for.
S is a diagonal |V| × |V| matrix, with singular values along the diagonal, expressing the importance of each dimension. The |V| × |V| matrix C^T still represents contexts, but the rows now represent the new latent dimensions and the |V| row vectors are orthogonal to each other.

By using only the first k dimensions of W, S, and C instead of all |V| dimensions, the product of these three matrices becomes a least-squares approximation to the original M. Since the first dimensions encode the most variance, one way to view the reconstruction is as modeling the most important information in the original dataset.

SVD applied to the co-occurrence matrix X:

  X_{|V| \times |V|} = W_{|V| \times |V|} \; S_{|V| \times |V|} \; C_{|V| \times |V|}, \quad \text{with } S = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_{|V|})

Taking only the top k dimensions after the SVD:

  X_{|V| \times |V|} \approx W_{|V| \times k} \; S_{k \times k} \; C_{k \times |V|}, \quad \text{with } S = \mathrm{diag}(\sigma_1, \ldots, \sigma_k)

Figure 19.11: SVD factors a matrix X into a product of three matrices, W, S, and C. Taking the first k dimensions gives a |V| × k matrix W_k that has one k-dimensional row per word, which can be used as an embedding.

Using only the top k dimensions (corresponding to the k most important singular values) leads to a reduced |V| × k matrix W_k, with one k-dimensional row per word. This row now acts as a dense k-dimensional vector (embedding) representing that word, substituting for the very high-dimensional rows of the original M. (Note that early systems often instead weighted W_k by the singular values, using the product W_k · S_k as an embedding instead of just the matrix W_k, but this weighting leads to significantly worse embeddings; Levy et al., 2015. I'm also simplifying here by assuming the matrix has rank |V|.)

Embeddings versus sparse vectors. Dense SVD embeddings sometimes work better than sparse PPMI matrices at tasks like word similarity:
• Denoising: low-order dimensions may represent unimportant information.
• Truncation may help the models generalize better to unseen data.
• Having a smaller number of dimensions may make it easier for classifiers to properly weight the dimensions for the task.
• Dense models may do better at capturing higher-order co-occurrence.
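As a concrete illustration (not from the original notes), here is a sketch of producing dense embeddings by truncated SVD of a PPMI matrix, using NumPy's SVD; the input matrix and the choice of k are assumptions for the example.

```python
import numpy as np

def svd_embeddings(ppmi_matrix, k=2):
    """Factor X = W S C^T and keep the top-k columns of W as dense word embeddings."""
    X = np.asarray(ppmi_matrix, dtype=float)
    W, S, Ct = np.linalg.svd(X, full_matrices=False)   # singular values arrive sorted, largest first
    return W[:, :k]                                    # W_k: one k-dimensional row per word

# Tiny PPMI-like matrix (rows: words, columns: contexts); values are illustrative
X = np.array([[0.00, 0.00, 2.25, 0.00, 2.25],
              [0.00, 0.00, 2.25, 0.00, 2.25],
              [1.66, 0.00, 0.00, 0.00, 0.00],
              [0.00, 0.57, 0.00, 0.47, 0.00]])
emb = svd_embeddings(X, k=2)
print(emb.shape)   # (4, 2): a dense 2-dimensional embedding per word
```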
Vector Semantics: Embeddings inspired by neural language models: skip-grams and CBOW

Prediction-based models, an alternative way to get dense vectors:
• Skip-gram (Mikolov et al. 2013a) and CBOW (Mikolov et al. 2013b) learn embeddings as part of the process of word prediction: train a neural network to predict neighboring words.
• They are inspired by neural net language models, and in so doing they learn dense embeddings for the words in the training corpus.
• Advantages: fast and easy to train (much faster than SVD); available online in the word2vec package, including sets of pretrained embeddings.

Setup. We walk through the corpus pointing at word w(t), whose index in the vocabulary is j, so we'll call it w_j (1 < j < |V|). Let's predict w(t+1), whose index in the vocabulary is k (1 < k < |V|). Hence our task is to compute P(w_k | w_j).

One-hot vectors: a vector of length |V|, with 1 for the target word and 0 for all the other words. So if "popsicle" is vocabulary word 5, its one-hot vector is [0, 0, 0, 0, 1, 0, 0, …, 0].

[Figure 19.12: the skip-gram model. A 1 × |V| one-hot input vector for w_t is multiplied by the |V| × d input matrix W to give the 1 × d embedding for w_t; that projection is multiplied by the d × |V| output matrix W' once per context word (e.g. w_{t-1} and w_{t+1}) to give probabilities of the context words.]

Turning outputs into probabilities: o_k = v'_k · v_j, and we use the softmax to turn the scores into probabilities.

[Textbook, Section 19.4, Embeddings from Prediction: Skip-gram and CBOW:] We begin with an input vector x, which is a one-hot vector for the current word w_j (hence x_j = 1, and x_i = 0 for all i ≠ j). We then predict the probability of each of the 2C output words (in Fig. 19.12 that means the two output words w_{t-1} and w_{t+1}) in three steps:

1. x is multiplied by W, the input matrix, to give the hidden or projection layer. Since each row of the input matrix W is just the embedding for one vocabulary word, and the input is a one-hot vector for w_j, the projection layer for input x will be h = v_j, the input embedding for w_j.

2. For each of the 2C context words we now multiply the projection vector h by the output matrix W'. The result for each context word, o = W'h, is a 1 × |V| dimensional output vector giving a score for each of the |V| vocabulary words. In doing so, the element o_k was computed by multiplying h by the output embedding for word w_k: o_k = v'_k · h.

3. Finally, for each context word we normalize this score vector, turning the score for each element o_k into a probability by using the softmax function:

  p(w_k \mid w_j) = \frac{\exp(v'_k \cdot v_j)}{\sum_{w' \in V} \exp(v'_{w'} \cdot v_j)}     (19.24)

The next section explores how the embeddings, the matrices W and W', are learned. Once they are learned, we'll have two embeddings for each word w_i: v_i and v'_i. We can just choose to use the input embedding v_i from W, or we can add the two and use the embedding v_i + v'_i as the new d-dimensional embedding, or we can concatenate them into an embedding of dimensionality 2d.

As with the simple count-based methods like PPMI, the context window size C affects the performance of skip-gram embeddings, and experiments often tune the parameter C on a dev set. As with PPMI, window size leads to qualitative differences: smaller windows capture more syntactic information, larger ones more semantic and relational information. One difference from the count-based methods is that for skip-grams, the larger the window size, the more computation the algorithm requires for training (more neighboring words must be predicted). See the end of the chapter for a pointer to surveys which have explored parameterizations like window size for different tasks.

Embeddings from W and W'. Since we have two embeddings, v_j and v'_j, for each word w_j, we can either just use v_j, sum the two, or concatenate them to make a double-length embedding.
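The three steps above can be sketched directly in NumPy. This is an illustrative toy, not the word2vec implementation; the vocabulary size, dimensionality, and random matrices are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 3                      # toy vocabulary size and embedding dimensionality
W = rng.normal(size=(V, d))      # input embeddings: row j is v_j
W_out = rng.normal(size=(d, V))  # output embeddings: column k is v'_k

def skipgram_probs(j):
    """P(w_k | w_j) for every k: one-hot input, projection, output scores, softmax."""
    x = np.zeros(V)
    x[j] = 1.0                   # one-hot vector for w_j
    h = x @ W                    # step 1: projection layer h = v_j
    o = h @ W_out                # step 2: score o_k = v'_k . v_j for each vocabulary word
    e = np.exp(o - o.max())      # step 3: softmax (shifted for numerical stability)
    return e / e.sum()

p = skipgram_probs(j=2)
print(p.round(3), p.sum())       # a distribution over the |V| words; sums to 1
```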
But wait: how do we learn the embeddings?

[Textbook, Section 19.4.1, Learning the input and output embeddings:] There are various ways to learn skip-grams; we'll sketch here just the outline of a simple version based on Eq. 19.24. The goal of the model is to learn representations (the embedding matrices W and W'; we'll refer to them collectively as the parameters θ) that do well at predicting the context words, maximizing the log likelihood of the corpus, Text:

  \operatorname*{argmax}_{\theta} \; \log p(\text{Text})     (19.25)

We'll first make the naive Bayes assumption that the input word at time t is independent of the other input words:

  \operatorname*{argmax}_{\theta} \; \log \prod_{t=1}^{T} p\big(w^{(t-C)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+C)}\big)     (19.26)

We'll also assume that the probabilities of each context (output) word are independent of the other outputs:

  \operatorname*{argmax}_{\theta} \; \sum_{-C \le j \le C,\; j \ne 0} \log p\big(w^{(t+j)} \mid w^{(t)}\big)     (19.27)

We now substitute in Eq. 19.24:

  = \operatorname*{argmax}_{\theta} \; \sum_{t=1}^{T} \sum_{-C \le j \le C,\; j \ne 0} \log \frac{\exp\big(v'^{(t+j)} \cdot v^{(t)}\big)}{\sum_{w \in V} \exp\big(v'_w \cdot v^{(t)}\big)}     (19.28)

With some rearrangement:

  = \operatorname*{argmax}_{\theta} \; \sum_{t=1}^{T} \sum_{-C \le j \le C,\; j \ne 0} \Big[ v'^{(t+j)} \cdot v^{(t)} - \log \sum_{w \in V} \exp\big(v'_w \cdot v^{(t)}\big) \Big]     (19.29)

Eq. 19.29 shows that we are looking to set the parameters θ (the embedding matrices W and W') in a way that maximizes the similarity between each word w(t) and its nearby context words w(t+j), while minimizing the similarity between word w(t) and all the words in the vocabulary.

The actual training objective for skip-gram, the negative sampling approach, is somewhat different: because it's so time-consuming to sum over all the words in the vocabulary V, the algorithm merely chooses a few negative samples to minimize rather than every word. The training proceeds by stochastic gradient descent, using error backpropagation as described in Chapter 5 (Mikolov et al., 2013a).

There is an interesting relationship between skip-grams, SVD/LSA, and PPMI. If we multiply the two context matrices W · W'^T, we produce a |V| × |V| matrix X, each entry m_ij corresponding to some association between input word i and output word j. Levy and Goldberg (2014b) show that skip-gram's optimal value occurs …
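The notes mention above that word2vec's actual objective replaces the full softmax sum with negative sampling. The following sketch of a skip-gram-with-negative-sampling (SGNS) loss for one (target, context) pair follows the standard formulation from Mikolov et al. (2013a); it is an illustrative assumption, not code from the notes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_target, v_context, v_negatives):
    """Negative-sampling loss for one target embedding, one true context embedding,
    and embeddings for a handful of randomly drawn negative-sample words."""
    loss = -np.log(sigmoid(v_context @ v_target))        # pull the true context closer
    for v_neg in v_negatives:
        loss -= np.log(sigmoid(-(v_neg @ v_target)))     # push each negative sample away
    return float(loss)

rng = np.random.default_rng(1)
d = 3
v_t, v_c = rng.normal(size=d), rng.normal(size=d)
negatives = [rng.normal(size=d) for _ in range(5)]       # k = 5 negative samples
print(sgns_loss(v_t, v_c, negatives))                    # minimized by stochastic gradient descent
```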
[Textbook, on the CBOW variant (fragment):] … the input matrix is repeated between each one-hot input and the projection layer h. For the case of C = 1, these two embeddings must be combined into the projection layer, which is done by multiplying each one-hot context vector x by W to give us two input vectors (say v_i and v_j). We then average these vectors:

  h = W \cdot \frac{1}{2C} \sum_{-C \le j \le C,\; j \ne 0} v^{(j)}     (19.31)

As with skip-grams, the projection vector h is multiplied by the output matrix W'. The result o = W'h is a 1 × |V| dimensional output vector giving a score for each of the |V| words; the element o_k is computed by multiplying h by the output embedding for word w_k: o_k = v'_k · h. Finally we normalize this score vector, turning the score for each element o_k into a probability by using the softmax function.

Properties of embeddings. Nearest words to some embeddings (Mikolov et al. 2013a).

[Textbook, Section 19.5, Properties of Embeddings:] We'll discuss in Section 19.8 how to evaluate the quality of different embeddings. But it is also sometimes helpful to visualize them. Fig. 19.14 shows the words/phrases that are most similar to some sample words using the phrase-based version of the skip-gram algorithm (Mikolov et al., 2013a).

  target:  Redmond              Havel                   ninjutsu       graffiti     capitulate
           Redmond Wash.        Vaclav Havel            ninja          spray paint  capitulation
           Redmond Washington   president Vaclav Havel  martial arts   grafitti     capitulated
           Microsoft            Velvet Revolution       swordsmanship  taggers      capitulating

Figure 19.14: Examples of the closest tokens to some target words using a phrase-based extension of the skip-gram algorithm (Mikolov et al., 2013a).

One semantic property of various kinds of embeddings that may play a role in their usefulness is their ability to capture relational meanings. Mikolov et al. (2013b) demonstrate that the offsets between vector embeddings can capture some relations between words, for example that the result of the expression vector('king') - vector('man') + vector('woman') is a vector close to vector('queen'); the left panel in Fig. 19.15 visualizes this by projecting a representation down into 2 dimensions. Similarly, they found that the expression vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'). Levy and Goldberg (2014a) show that various other kinds of embeddings also seem to have this property.
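A small sketch of the vector-offset analogy computation described above; the toy embedding table, its dimensionality, and the helper name are assumptions made up for the example rather than real pre-trained vectors.

```python
import numpy as np

def analogy(emb, a, b, c, topn=1):
    """Return the word(s) whose vector is most cosine-similar to emb[b] - emb[a] + emb[c],
    excluding the three query words (e.g. a='man', b='king', c='woman' -> 'queen')."""
    target = emb[b] - emb[a] + emb[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    scores = {w: cos(v, target) for w, v in emb.items() if w not in (a, b, c)}
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# Toy 2-dimensional embedding table, hand-made so the offsets line up
emb = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, 0.1]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, 0.1]),
    "paris": np.array([0.5, 0.5]),
}
print(analogy(emb, "man", "king", "woman"))   # ['queen']
```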
We return in the next section to these relational properties of embeddings and how they relate to meaning compositionality: the way the meaning of a phrase is built up out of the meaning of the individual vectors. (Section 19.6, Compositionality in Vector Models of Meaning, is still to be written in this draft.)

Embeddings capture relational meaning!
  vector('king') - vector('man') + vector('woman') ≈ vector('queen')
  vector('Paris') - vector('France') + vector('Italy') ≈ vector('Rome')

Vector Semantics: Brown clustering

Brown clusters as vectors. By tracing the order in which clusters are merged, the model builds a binary tree from bottom to top. Each word is represented by a binary string, the path from the root to its leaf, and each intermediate node is a cluster. For example, chairman is 0010, "months" is 01, and verbs are 1.

[Figure: the Brown algorithm. Words are merged according to contextual similarity; clusters are equivalent to bit-string prefixes; prefix length determines the granularity of the clustering. The example tree contains leaves such as president, chairman, CEO; walk, run, sprint; and November, October, with their associated bit strings.]

Brown cluster examples. [Textbook, Section 19.7:]

Figure 19.16: Brown clustering as a binary tree. A full binary string represents a word; each binary prefix represents a larger class to which the word belongs and can be used as a vector representation for the word. After Koo et al. (2008).

After clustering, a word can be represented by the binary string that corresponds to its path from the root node: 0 for left, 1 for right, at each choice point in the binary tree. For example, in Fig. 19.16 the word chairman is the vector 0010 and October is 011. Since Brown clustering is a hard clustering algorithm (each word has only one cluster), there is just one string per word.

Now we can extract useful features by taking the binary prefixes of this bit string; each prefix represents a cluster to which the word belongs. For example, the string 01 in the figure represents the cluster of month names {November, October}, the string 0001 the names of common nouns for corporate executives {chairman, president}, 1 is verbs {run, sprint, walk}, and 0 is nouns. These prefixes can then be used as a vector representation for the word; the shorter the prefix, the more abstract the cluster. The length of the vector representation can thus be adjusted to fit the needs of the particular task. Koo et al. (2008) improved parsing by using multiple features: a 4-6 bit prefix to capture part-of-speech information and a full bit string to represent words. Spitkovsky et al. (2011) show that vectors made of the first 8 or 9 bits of a Brown clustering perform well at grammar induction.
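A quick sketch of how Brown-cluster bit strings and their prefixes can be used as word features; the word-to-path mapping below is hypothetical (loosely following the toy tree in the notes), not an actual trained clustering.

```python
# Hypothetical word -> Brown bit-string mapping (illustrative only)
brown_paths = {
    "chairman": "0010",
    "president": "0011",
    "october": "011",
    "november": "010",
    "walk": "10",
    "run": "11",
}

def brown_features(word, prefix_lengths=(2, 4)):
    """Cluster-prefix features for a word: shorter prefixes name coarser clusters."""
    path = brown_paths[word]
    feats = {f"prefix{k}": path[:k] for k in prefix_lengths}
    feats["full"] = path
    return feats

print(brown_features("chairman"))   # {'prefix2': '00', 'prefix4': '0010', 'full': '0010'}
```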
Because they are based on immediately neighboring words, Brown clusters are most commonly used for representing the syntactic properties of words, and hence are commonly used as a feature in parsers. Nonetheless, the clusters do represent some semantic properties as well. Fig. 19.17 shows some examples from a large clustering from Brown et al. (1992):

  Friday Monday Thursday Wednesday Tuesday Saturday Sunday weekends Sundays Saturdays
  June March July April January December October November September August
  pressure temperature permeability density porosity stress velocity viscosity gravity tension
  anyone someone anybody somebody
  had hadn't hath would've could've should've must've might've
  asking telling wondering instructing informing kidding reminding bothering thanking deposing
  mother wife father son husband brother daughter sister boss uncle
  great big vast sudden mere sheer gigantic lifelong scant colossal
  down backwards ashore sideways southward northward overboard aloft downwards adrift

Figure 19.17: Some sample Brown clusters from a 260,741-word vocabulary trained on 366 million words of running text (Brown et al., 1992). Note the mixed syntactic-semantic nature of the clusters.

Note that the naive version of the Brown clustering algorithm described below is extremely inefficient, O(n^5): at each of n iterations, the algorithm considers each of n^2 merges, and for each merge it computes the value of the clustering by summing over n^2 terms. In practice we use more efficient O(n^3) algorithms that use tables to pre-compute the values for each merge (Brown et al. 1992, Liang 2005).

Class-based language model. Suppose each word is in some class c_i.

[Figure 19.15: Vector offsets showing relational properties of the vector space, shown by projecting vectors onto two dimensions using PCA. In the left panel, 'king' - 'man' + 'woman' is close to 'queen'. In the right panel, the offsets seem to capture grammatical number. (Mikolov et al., 2013b)]

[Textbook, Section 19.7, Brown Clustering:] Brown clustering (Brown et al., 1992) is an agglomerative clustering algorithm for deriving vector representations of words by clustering words based on their associations with the preceding or following words.

The algorithm makes use of the class-based language model (Brown et al., 1992), a model in which each word w ∈ V belongs to a class c ∈ C with a probability P(w|c). Class-based LMs assign a probability to a pair of words w_{i-1} and w_i by modeling the transition between classes rather than between words:

  P(w_i \mid w_{i-1}) = P(c_i \mid c_{i-1})\, P(w_i \mid c_i)     (19.32)

The class-based LM can be used to assign a probability to an entire corpus given a particular clustering C as follows:

  P(\text{corpus} \mid C) = \prod_{i=1}^{n} P(c_i \mid c_{i-1})\, P(w_i \mid c_i)     (19.33)

Class-based language models are generally not used as language models for applications like machine translation or speech recognition because they don't work as well as standard n-gram or neural language models, but they are an important component in Brown clustering.

Brown clustering is a hierarchical clustering algorithm. Let's consider a naive (albeit inefficient) version of the algorithm:

1. Each word is initially assigned to its own cluster.
2. We now consider merging each pair of clusters. The pair whose merger results in the smallest decrease in the likelihood of the corpus (according to the class-based language model) is merged.
3. Clustering proceeds until all words are in one big cluster.

Two words are thus most likely to be clustered if they have similar probabilities for preceding and following words, leading to more coherent clusters. The result is that words will be merged if they are contextually similar. By tracing the order in which clusters are merged, the model builds a binary tree from bottom to top, in which the leaves are the words in the vocabulary, and each intermediate node in the tree represents the cluster that is formed by merging its children. Fig. 19.16 shows a schematic view of a part of a tree.
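To make the class-based language model of Eqs. 19.32-19.33 concrete, here is a small sketch of scoring a word sequence under a given hard clustering; the toy classes and probabilities are assumptions invented for the example, not values from the notes.

```python
import math

# Hypothetical hard clustering and toy probabilities (illustrative only)
word2class = {"walk": "VERB", "run": "VERB", "chairman": "NOUN", "president": "NOUN"}
p_class_trans = {("NOUN", "VERB"): 0.6, ("VERB", "NOUN"): 0.3,
                 ("NOUN", "NOUN"): 0.4, ("VERB", "VERB"): 0.7}
p_word_given_class = {"walk": 0.5, "run": 0.5, "chairman": 0.4, "president": 0.6}

def log_prob(words):
    """log P(corpus | C) = sum over i of log[ P(c_i | c_{i-1}) * P(w_i | c_i) ]."""
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        c_prev, c_cur = word2class[prev], word2class[cur]
        total += math.log(p_class_trans[(c_prev, c_cur)] * p_word_given_class[cur])
    return total

# A clustering that keeps contextually similar words together scores the corpus higher
print(log_prob(["chairman", "walk", "run"]))
```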