CSC2515 FALL 2008
INTRODUCTION TO MACHINE LEARNING

APPLICATIONS OF MACHINE LEARNING TO LANGUAGE MODELING


Statistical language modelling

• Goal: model the joint distribution of words in a sentence.
• Such a model can be used to
  – predict the next word given several preceding ones
  – arrange bags of words into sentences
  – assign probabilities to documents
• Applications: speech recognition, machine translation, information retrieval.
• Most statistical language models are based on the Markov assumption:
  – The distribution of the next word depends only on the n words that immediately precede it.
  – This assumption is clearly wrong but useful: it makes the task much more tractable.


Why n-gram models are hopeless for large n

• n-gram models don't take advantage of the fact that some words are used in similar ways.
• Suppose you know that the words snow and rain are used in similar ways, as are Monday and Tuesday.
• If you are told that the following sentence is probable:
  – It's going to rain on Monday.
• Then you can infer that the following sentence is also probable:
  – It's going to snow on Tuesday.
• n-gram models cannot generalize this way because all words are treated as arbitrary symbols, with each word being equally (dis)similar to all others.
• Using distributed representations for words allows similarity between words to be captured.
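The two points above can be made concrete with a count-based model. The following minimal Python sketch (not part of the original slides) builds a bigram model with add-one smoothing under the Markov assumption; the toy corpus and the smoothing choice are assumptions for illustration only.

```python
from collections import Counter

# Toy corpus; in practice this would be millions of words.
corpus = ("it is going to rain on monday . "
          "it is going to rain on friday . "
          "snow fell on tuesday .").split()

# Collect unigram and bigram counts (Markov assumption with n = 2).
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus[:-1], corpus[1:]))
vocab = sorted(unigrams)

def p_next(word, prev, alpha=1.0):
    """P(word | prev) with add-alpha smoothing over the vocabulary."""
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * len(vocab))

# 'snow' is used much like 'rain', but the bigram ('to', 'snow') was never
# observed, so it only gets the smoothing floor: the counts treat every word
# as an arbitrary symbol and cannot exploit the similarity between the two.
print(p_next("rain", "to"))   # ~0.21
print(p_next("snow", "to"))   # ~0.07
```

Distributed representations, introduced next, are one way around this limitation.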
Distributed representations

• Estimation of high-dimensional discrete distributions from data is hard.
  – The number of parameters grows exponentially with the sequence length.
  – There is no a priori smoothness constraint on the parameters / probabilities.
• Estimation of distributions over continuous spaces is easier due to automatic smoothing.
• Idea: map discrete inputs to continuous vectors and learn a smooth function that maps them to probability distributions.
• Used for language modelling with neural nets and Bayes nets.


Word representations embedded in 2D (I)

[Figure: a 2D embedding of learned word representations. The scatter plot did not survive the text extraction; the only recoverable annotation is that semantically related words, such as proper nouns, cluster together.]


Neural Probabilistic Language Model (Bengio et al., 2000)

• The original and still the most popular neural language model.
• A lookup table is used to map context words to feature vectors.
• Architecture: a neural net with one hidden layer.
  – Input: the sequence of the context words' feature vectors.
  – Output: a distribution over the next word (softmax over words).
• Outperforms n-gram models on small (∼ 1M words) datasets.
• For better results, the predictions of an NPLM are interpolated with those of an n-gram model.


Neural Probabilistic Language Model

[Figure: the NPLM architecture. Context words are mapped to feature vectors by the lookup table, passed through a hidden layer, and a softmax output gives P(w = i | w_1, w_2, w_3).]


Log-bilinear model (Mnih & Hinton, 2007)

• The LBL model is similar to the NPLM, but is simpler and slightly faster.
  – It does not have non-linearities.
• Given the context w_{1:n-1}, the LBL model predicts the representation for the next word w_n by linearly combining the representations of the context words:

  \hat{r} = \sum_{i=1}^{n-1} C_i r_{w_i}

• The distribution for the next word is then computed from the similarity between the predicted representation and the representations of all words in the vocabulary:

  P(w_n = w | w_{1:n-1}) = \frac{\exp(\hat{r}^T r_w)}{\sum_j \exp(\hat{r}^T r_j)}


Approaches to tree construction

• The approach of Morin and Bengio:
  – Start with the WordNet IS-A hierarchy (which is a DAG).
  – Manually select one parent node per word.
  – Use clustering to make the resulting tree binary.
  – Use the NPLM model for making the left/right decisions.
• Drawbacks: tree construction uses expert knowledge, and the resulting model does not work as well as its non-hierarchical counterpart.
• An alternative (Mnih & Hinton, 2008):
  – Construct the word tree from data alone (no experts needed).
  – Allow each word to occur more than once in the tree.
  – Use the simplified log-bilinear language model for making the left/right decisions.


Hierarchical log-bilinear model (Mnih & Hinton, 2008)

• Let d be the binary string / code that encodes the sequence of left/right decisions in the tree that lead to word w.
• Each non-leaf node in the tree is given a feature vector that captures the difference between the words in its left and right subtrees.
• The probability of taking the left branch at a particular node is

  P(d_i = 1 | q_i, w_{1:n-1}) = \sigma(\hat{r}^T q_i),

  where \hat{r} is computed as in the LBL model and q_i is the feature vector for the node.
• The probability of word w being the next word is then simply the probability of its code d under the binary decision model:

  P(w_n = w | w_{1:n-1}) = \prod_i P(d_i | q_i, w_{1:n-1})
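To make the two HLBL formulas above concrete, here is a minimal NumPy sketch (not from the original slides) of how an HLBL-style model would score one candidate word: the predicted representation is a linear combination of the context word vectors, and the word's probability is a product of sigmoid left/right decisions along its code path. The array shapes, the encoding of the code path, and the random parameters are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

V, D, N = 8, 16, 3                        # vocab size, feature dimensionality, context size
R = rng.normal(size=(V, D))               # word feature vectors r_w
C = rng.normal(size=(N, D, D))            # position-specific combination matrices C_i
Q = rng.normal(size=(V - 1, D))           # feature vectors q_i for the non-leaf nodes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_representation(context):
    """r_hat = sum_i C_i r_{w_i}, as in the LBL model."""
    return sum(C[i] @ R[w] for i, w in enumerate(context))

def word_prob(context, code):
    """P(w | context) as a product of left/right decisions along the word's code.

    `code` is a list of (node_index, decision) pairs on the path to the word,
    with decision = 1 for a left branch and 0 for a right branch.
    """
    r_hat = predict_representation(context)
    prob = 1.0
    for node, d in code:
        p_left = sigmoid(r_hat @ Q[node])
        prob *= p_left if d == 1 else (1.0 - p_left)
    return prob

# Example: score a word whose path goes left at the root (node 0), then right at node 2.
print(word_prob(context=[1, 4, 2], code=[(0, 1), (2, 0)]))
```

Because only the nodes on one path are evaluated, scoring a word costs O(log V) decisions rather than a softmax over the whole vocabulary, which is where the speedups reported below come from.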
Data-driven tree construction

• We would like to cluster words based on the distribution of contexts in which they occur.
• This distribution is hard to estimate and work with due to the high dimensionality of the space of contexts (the same sparsity problem n-gram models suffer from).
• To avoid this problem, we represent contexts using distributed representations and cluster words based on their expected context representation.
• To construct a word tree:
  1. Train a model using a random (balanced) tree over words.
  2. Compute the expected predicted representation over all occurrences of each word.
  3. Perform hierarchical clustering on these expected representations.


Random vs. non-random trees

The effect of the feature dimensionality and the tree-building algorithm on the test set perplexity of the model.

  Feature          Perplexity using   Perplexity using   Reduction
  dimensionality   a RANDOM tree      a BALANCED tree    in perplexity
  25               191.6              162.4              29.2
  50               166.4              141.7              24.7
  75               156.4              134.8              21.6
  100              151.2              131.3              19.9


Model evaluation

Perplexity on the test set:

  Model   Tree-generating algorithm   Perplexity   Minutes per epoch
  HLBL    RANDOM                      151.2        4
  HLBL    BALANCED                    131.3        4
  HLBL    ADAPTIVE                    127.0        4
  HLBL    ADAPTIVE(0.25)              124.4        6
  HLBL    ADAPTIVE(0.4)               123.3        7
  HLBL    ADAPTIVE(0.4) × 2           115.7        16
  HLBL    ADAPTIVE(0.4) × 4           112.1        32
  LBL     –                           117.0        6420
  KN3     –                           129.8        –
  KN5     –                           123.2        –

• LBL and HLBL used 100D feature vectors and a context size of 5.
• KNn is a Kneser-Ney n-gram model.
• (A short sketch of how perplexity is computed from per-word probabilities is given at the end of these notes.)


Observations

• Hierarchical distributed language models can outperform non-hierarchical models when they use sufficiently well-constructed trees over words.
  – Expert knowledge is not needed for building good trees.
  – Allowing words to occur more than once in a tree is essential for good performance.
• Even when very large trees are used, the hierarchical LBL model is more than two orders of magnitude faster than the LBL model.
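The evaluation tables above report test-set perplexity. As a reference (not part of the original notes), the sketch below shows how perplexity is obtained from a model's per-word probabilities; `model_prob` is a hypothetical placeholder standing in for any of the scorers discussed above, and the context window of 5 mirrors the setting used in the tables.

```python
import math

def perplexity(sentences, model_prob, context_size=5):
    """Perplexity = exp(-average log-probability per predicted word).

    `model_prob(word, context)` returns P(word | context); it is a placeholder
    for an n-gram, LBL, or HLBL scorer.
    """
    total_log_prob = 0.0
    total_words = 0
    for sentence in sentences:
        for i, word in enumerate(sentence):
            context = sentence[max(0, i - context_size):i]
            total_log_prob += math.log(model_prob(word, context))
            total_words += 1
    return math.exp(-total_log_prob / total_words)
```

Lower perplexity means the model assigns higher probability to the held-out text, which is why the ADAPTIVE trees and the non-hierarchical LBL score best in the table above.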