CSC2515 FALL 2008: INTRODUCTION TO MACHINE LEARNING
APPLICATIONS OF MACHINE LEARNING TO LANGUAGE MODELING

Statistical language modelling

• Goal: model the joint distribution of words in a sentence.
• Such a model can be used to
  – predict the next word given several preceding ones
  – arrange bags of words into sentences
  – assign probabilities to documents
• Applications: speech recognition, machine translation, information retrieval.
• Most statistical language models are based on the Markov assumption:
  – The distribution of the next word depends only on the n words that immediately precede it, so P(w_1, ..., w_T) ≈ ∏_t P(w_t | w_{t−n}, ..., w_{t−1}).
  – This assumption is clearly wrong but useful – it makes the task much more tractable.

Why n-gram models are hopeless for large n

• n-gram models do not take advantage of the fact that some words are used in similar ways.
• Suppose you know that the words snow and rain are used in similar ways, as are Monday and Tuesday.
• If you are told that the following sentence is probable:
  – It’s going to rain on Monday.
• Then you can infer that the following sentence is also probable:
  – It’s going to snow on Tuesday.
• n-gram models cannot generalize this way because all words are treated as arbitrary symbols, with each word being equally (dis)similar to all others (see the count-based sketch after these slides).
• Using distributed representations for words allows similarity between words to be captured.

Distributed representations

• Estimation of high-dimensional discrete distributions from data is hard:
  – the number of parameters is exponential in the context length
  – there is no a priori smoothness constraint on the parameters / probabilities
• Estimation of distributions over continuous spaces is easier due to automatic smoothing.
• Idea: map discrete inputs to continuous vectors and learn a smooth function that maps them to probability distributions.
• Used for language modelling with neural nets and Bayes nets.
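The notes contain no code, but a tiny count-based sketch may make the Markov assumption and the generalization problem above concrete (the toy corpus, the context length, and all names here are illustrative assumptions): an n-gram model estimates P(next word | preceding words) purely from counts, so a context observed with "rain" says nothing about "snow".

    from collections import Counter, defaultdict

    # Toy corpus (assumed); real models are trained on millions of words.
    corpus = "it is going to rain on monday it is going to rain on tuesday".split()
    n = 3                                       # trigram model: condition on the 2 preceding words

    counts = defaultdict(Counter)
    for i in range(n - 1, len(corpus)):
        context = tuple(corpus[i - n + 1:i])
        counts[context][corpus[i]] += 1

    def prob(word, context):
        """Maximum-likelihood estimate of P(word | context) under the Markov assumption."""
        c = counts[tuple(context)]
        return c[word] / sum(c.values()) if c else 0.0

    print(prob("rain", ["going", "to"]))        # 1.0: this context was seen with "rain"
    print(prob("snow", ["going", "to"]))        # 0.0: "snow" is just another arbitrary symbol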
Word representations embedded in 2D (I)

[Figure: learned word feature vectors plotted in 2D. Words used in similar ways appear close together; one region of the plot is labelled "Proper nouns". Most individual word labels are illegible in this copy.]
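The plot above is produced from learned word feature vectors; the notes do not say how the 2D layout was computed. A minimal sketch, assuming the learned vectors are simply projected with PCA (the vocabulary, the dimensionality, and the random matrix standing in for learned representations are all illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["monday", "tuesday", "rain", "snow", "march", "april"]  # toy vocabulary (assumed)
    D = 100                                                          # feature dimensionality
    R = rng.normal(0.0, 0.01, (len(vocab), D))   # lookup table: one feature vector per word

    # PCA via SVD: project the feature vectors onto their top two principal components.
    X = R - R.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    coords = X @ Vt[:2].T                        # len(vocab) x 2 plotting coordinates

    for word, (x, y) in zip(vocab, coords):
        print(f"{word}: ({x:.3f}, {y:.3f})")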
Neural Probabilistic Language Model (Bengio et al., 2000)

• The original and still the most popular neural language model.
• A lookup table is used to map the context words to feature vectors.
• Architecture: a neural net with one hidden layer (see the sketch below the diagram).
  – Input: the sequence of context word feature vectors.
  – Output: a distribution over the next word (a softmax over the vocabulary).
• Outperforms n-gram models on small (∼1M-word) datasets.
• For better results, the predictions of an NPLM are interpolated with those of an n-gram model.

Neural Probabilistic Language Model

[Architecture diagram: the feature vectors of the context words feed into a single hidden layer, followed by a softmax layer computing P(w = i | context words) over the vocabulary.]
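A minimal sketch of the forward pass just described (this is not the authors' code; the layer sizes, the tanh non-linearity, the concatenation of the context vectors, and all names are assumptions): look up the feature vectors of the context words, pass them through one hidden layer, and apply a softmax over the vocabulary.

    import numpy as np

    V, D, H, n = 10000, 50, 200, 4         # vocab size, feature dim, hidden units, context size (assumed)
    rng = np.random.default_rng(0)

    R = rng.normal(0, 0.01, (V, D))        # lookup table of word feature vectors
    W_h = rng.normal(0, 0.01, (H, n * D))  # input-to-hidden weights
    b_h = np.zeros(H)
    W_o = rng.normal(0, 0.01, (V, H))      # hidden-to-output weights
    b_o = np.zeros(V)

    def next_word_distribution(context_ids):
        """Return P(next word = i | context) for a list of n context word indices."""
        x = np.concatenate([R[i] for i in context_ids])  # context word feature vectors
        h = np.tanh(W_h @ x + b_h)                       # single hidden layer
        s = W_o @ h + b_o
        s -= s.max()                                     # numerical stability
        p = np.exp(s)
        return p / p.sum()                               # softmax over the vocabulary

    p = next_word_distribution([12, 7, 3, 951])          # indices of the n preceding words
    assert abs(p.sum() - 1.0) < 1e-6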
Log-bilinear model (Mnih & Hinton, 2007)

• The LBL model is similar to the NPLM, but is simpler and slightly faster:
  – it does not have non-linearities.
• Given the context w_1:n−1, the LBL model predicts the representation for the next word w_n by linearly combining the representations of the context words:

    r̂ = ∑_{i=1}^{n−1} C_i r_{w_i}

• Then the distribution for the next word is computed based on the similarity between the predicted representation and the representations of all words in the vocabulary:

    P(w_n = w | w_1:n−1) = exp(r̂ᵀ r_w) / ∑_j exp(r̂ᵀ r_j)

Approaches to tree construction

• The approach of Morin and Bengio:
  – Start with the WordNet IS-A hierarchy (which is a DAG).
  – Manually select one parent node per word.
  – Use clustering to make the resulting tree binary.
  – Use the NPLM model for making the left/right decisions.
• Drawbacks: tree construction uses expert knowledge; the resulting model does not work as well as its non-hierarchical counterpart.
• An alternative (Mnih & Hinton, 2008):
  – Construct the word tree from data alone (no experts needed).
  – Allow each word to occur more than once in the tree.
  – Use the simplified log-bilinear language model for making the left/right decisions.

Hierarchical log-bilinear model (Mnih & Hinton, 2008)

• Let d be the binary string / code that encodes the sequence of left/right decisions in the tree that lead to word w.
• Each non-leaf node in the tree is given a feature vector that captures the difference between the words in its left and right subtrees.
• The probability of taking the left branch at a particular node is given by

    P(d_i = 1 | q_i, w_1:n−1) = σ(r̂ᵀ q_i),

  where r̂ is computed as in the LBL model and q_i is the feature vector for the node.
• Then the probability of word w being the next word is simply the probability of its code d under the binary decision model:

    P(w_n = w | w_1:n−1) = ∏_i P(d_i | q_i, w_1:n−1)

  (A small numerical sketch of this factorization is given at the end of these notes.)

Data-driven tree construction

• We would like to cluster words based on the distribution of contexts in which they occur.
• This distribution is hard to estimate and work with due to the high dimensionality of the space of contexts (the same sparsity problem n-gram models suffer from).
• To avoid this problem, we represent contexts using distributed representations and cluster words based on their expected context representation.
• To construct a word tree:
  1. Train a model using a random (balanced) tree over words.
  2. For each word, compute the expected predicted representation over all of its occurrences.
  3. Perform hierarchical clustering on these expected representations.

Random vs. non-random trees

The effect of the feature dimensionality and the tree-building algorithm on the test-set perplexity of the model:

  Feature dimensionality   Perplexity (RANDOM tree)   Perplexity (BALANCED tree)   Reduction in perplexity
  25                       191.6                      162.4                        29.2
  50                       166.4                      141.7                        24.7
  75                       156.4                      134.8                        21.6
  100                      151.2                      131.3                        19.9

Model evaluation

Perplexity on the test set:

  Model   Tree-generating algorithm   Perplexity   Minutes per epoch
  HLBL    RANDOM                      151.2        4
  HLBL    BALANCED                    131.3        4
  HLBL    ADAPTIVE                    127.0        4
  HLBL    ADAPTIVE(0.25)              124.4        6
  HLBL    ADAPTIVE(0.4)               123.3        7
  HLBL    ADAPTIVE(0.4) × 2           115.7        16
  HLBL    ADAPTIVE(0.4) × 4           112.1        32
  LBL     –                           117.0        6420
  KN3     –                           129.8        –
  KN5     –                           123.2        –

• LBL and HLBL used 100D feature vectors and a context size of 5.
• KNn is a Kneser-Ney n-gram model.

Observations

• Hierarchical distributed language models can outperform non-hierarchical models when they use sufficiently well-constructed trees over words.
  – Expert knowledge is not needed for building good trees.
  – Allowing words to occur more than once in a tree is essential for good performance.
• Even when very large trees are used, the hierarchical LBL model is more than two orders of magnitude faster than the LBL model.
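As referenced above, a minimal numerical sketch of the LBL prediction and the hierarchical factorization (the toy vocabulary, the 3-node tree, and the random parameters standing in for learned ones are all assumptions): r̂ is a linear combination of the context word representations, and a word's probability is the product of sigmoid branch decisions along its code, so the probabilities of all words in the tree sum to one.

    import numpy as np

    rng = np.random.default_rng(0)
    D, n_context = 100, 5                            # feature dimensionality and context size

    vocab = ["the", "cat", "sat", "down"]            # toy vocabulary (assumed)
    R = rng.normal(0, 0.01, (len(vocab), D))         # word representations r_w
    C = rng.normal(0, 0.01, (n_context, D, D))       # context weight matrices C_i
    Q = rng.normal(0, 0.01, (3, D))                  # node feature vectors q_i (3 non-leaf nodes)

    # Toy binary tree: each word's code is a sequence of (node index, went-left?) decisions.
    codes = {"the": [(0, 1), (1, 1)], "cat": [(0, 1), (1, 0)],
             "sat": [(0, 0), (2, 1)], "down": [(0, 0), (2, 0)]}

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def predict_representation(context_ids):
        """r_hat = sum_i C_i r_{w_i}, as in the LBL model."""
        return sum(C[i] @ R[w] for i, w in enumerate(context_ids))

    def hlbl_prob(word, context_ids):
        """P(w_n = word | context): product of left/right decision probabilities along the code."""
        r_hat = predict_representation(context_ids)
        p = 1.0
        for node, went_left in codes[word]:
            p_left = sigmoid(r_hat @ Q[node])
            p *= p_left if went_left else (1.0 - p_left)
        return p

    context = [0, 1, 2, 3, 0]                        # indices of the 5 preceding words
    print(sum(hlbl_prob(w, context) for w in vocab)) # sums to 1.0 over the leaves of the tree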