Natural Language Processing, Study notes of Natural Language Processing (NLP)

Revised version of 2019-2020, of Mumbai University. The Topics covered are Introduction, Morphological Analysis, Syntactic Analysis, Semantic analysis, Discourse and Pragmatic Analysis

Typology: Study notes

2019/2020

Uploaded on 10/01/2020

shiva-ganapuram

NLP

Introduction

Natural Language Processing (NLP)

Natural Language Processing (NLP) is a field of research and application that explores how computers can be used to understand and manipulate natural language text or speech to do useful things. More broadly, NLP can be defined as the automatic (or semi-automatic) processing of human language. The term ‘NLP’ is sometimes used rather more narrowly than that, often excluding information retrieval and sometimes even excluding machine translation. NLP is sometimes contrasted with ‘computational linguistics’, with NLP being thought of as more applied. Nowadays, alternative terms are often preferred, like ‘Language Technology’ or ‘Language Engineering’. ‘Language’ is often used in contrast with ‘speech’ (e.g., Speech and Language Technology).

NLP is essentially multidisciplinary: it is closely related to linguistics, and it also has links to research in cognitive science, psychology, philosophy and maths (especially logic). Within CS, it relates to formal language theory, compiler techniques, theorem proving, machine learning and human-computer interaction. Of course it is also related to AI, though nowadays it’s not generally thought of as part of AI.

Components of NLP

Natural Language Understanding

Taking some spoken/typed sentence and working out what it means

  • Mapping the given natural language input into a useful representation
  • Different levels of analysis are required: morphological analysis, syntactic analysis, semantic analysis, discourse analysis

Natural Language Generation

Taking some formal representation of what you want to say and working out a way to express it in a natural (human) language (e.g., English)

  • Producing output in the natural language from some internal representation
  • Different levels of synthesis are required: deep planning (what to say), syntactic generation
  • NLG is the process of constructing natural language outputs from non-linguistic inputs
  • NLG can be viewed as the reverse process of NL understanding
  • An NLG system may have two main parts:
      • Discourse Planner: decides what will be generated and which sentences
      • Surface Realizer: realizes a sentence from its internal representation
  • Lexical Selection: selecting the correct words to describe the concepts

NL Understanding is much harder than NL Generation, but both of them are hard.

History of NLP

Here are important events in the history of Natural Language Processing:

  • 1950: NLP started when Alan Turing published an article called "Computing Machinery and Intelligence."
  • 1950s: Attempts to automate translation between Russian and English
  • 1960s: The work of Chomsky and others on formal language theory and generative syntax
  • 1990s: Probabilistic and data-driven models had become quite standard
  • 2000s: A large amount of spoken and textual data became available

Background

Solving language-related problems is the main concern of the fields known as Natural Language Processing, Computational Linguistics, and Speech Recognition and Synthesis, which together we call Speech and Language Processing (SLP). Applications of language processing:

  1. spelling correction,
  2. grammar checking,
  3. information retrieval, and
  4. machine translation.

Generic NLP

Many NLP applications can be adequately implemented with relatively shallow processing. For instance, spelling checking only requires a word list and simple morphology to be useful. The term ‘deep’ NLP is used for systems that build a meaning representation (or an elaborate syntactic representation), which is generally agreed to be required for applications such as email question answering and good MT.

The most important principle in building a successful NLP system is modularity. NLP systems are often big software engineering projects, and success requires that systems can be improved incrementally.

The input to an NLP system could be speech or text. It could also be gesture (multimodal input or perhaps a Sign Language). The output might be non-linguistic, but most systems need to give some sort of feedback to the user, even if they are simply performing some action (issuing a ticket, paying a bill, etc). However, often the feedback can be very formulaic.

There’s general agreement that the following system components can be described semi-independently, although assumptions about the detailed nature of the interfaces between them differ. Not all systems have all of these components:

  1. Phonology: deals with the interpretation of speech sounds within and across words.
  2. Morphology: the study of the way words are built up from smaller meaning-bearing units called morphemes. For example, the word ‘fox’ consists of a single morpheme, while the word ‘cats’ consists of two: the morpheme ‘cat’ and the morpheme ‘-s’, which marks the plural. A morphological lexicon is the list of stems and affixes together with basic information about them, such as whether a stem is a noun stem or a verb stem.
  3. Syntax: the study of formal relationships between words. It covers how words are clustered into classes in the form of Part-of-Speech (POS) categories, how they are grouped with their neighbors into phrases, and the way words depend on each other in a sentence.
  4. Semantics: the study of the meaning of words as associated with grammatical structure. It consists of two kinds of approaches: syntax-driven semantic analysis and semantic grammar. At the discourse level, NLP works with text longer than a sentence. Two types of discourse processing are distinguished: anaphora resolution and discourse/text structure recognition. Anaphora resolution replaces words such as pronouns with the entities they refer to. Discourse structure recognition determines the function of sentences in the text, which adds to the meaningful representation of the text.
  5. Reasoning: producing an answer to a question that is not explicitly stored in a database. A Natural Language Interface to Database (NLIDB) carries out reasoning based on data stored in the database. For example, consider a database that holds academic information about students, and a user query such as ‘Which student is likely to fail in Maths subject?’. To answer the query, the NLIDB needs a domain expert to narrow down the reasoning process.

Knowledge in language processing

By speech and language processing, we have in mind those computational techniques that process spoken and written human language, as language. As we will see, this is an inclusive definition that encompasses everything from mundane applications such as word counting and automatic hyphenation, to cutting edge applications such as automated question answering on the Web, and real-time spoken language translation.

What distinguishes these language processing applications from other data processing systems is their use of knowledge of language. Consider the Unix wc program, which is used to count the total number of bytes, words, and lines in a text file. When used to count bytes and lines, wc is an ordinary data processing application. However, when it is used to count the words in a file it requires knowledge about what it means to be a word, and thus becomes a language processing system. Of course, wc is an extremely simple system with an extremely limited and impoverished knowledge of language.

To summarize, the knowledge of language needed to engage in complex language behavior can be separated into six distinct categories:

  1. Phonetics and Phonology: the study of linguistic sounds.
  2. Morphology: the study of the meaningful components of words.
  3. Syntax: the study of the structural relationships between words.
  4. Semantics: the study of meaning.
  5. Pragmatics: the study of how language is used to accomplish goals.
  6. Discourse: the study of linguistic units larger than a single utterance.

Ambiguity in Natural language

Natural Language Processing (NLP) is an area of research and application that explores how computers can be used to understand and manipulate natural language text or speech to do useful things. Text-based NLP has been regarded as consisting of various levels. They are:

  • Lexical Analysis: analysis of word forms
  • Syntactic Analysis: structure processing
  • Semantic Analysis: meaning representation
  • Discourse Analysis: processing of interrelated sentences
  • Pragmatic Analysis: the purposeful use of sentences in situations

Ambiguity can occur at all these levels. It is a property of linguistic expressions: if an expression (word, phrase or sentence) has more than one interpretation, we refer to it as ambiguous. For example, consider the sentence “The chicken is ready to eat”. It can be interpreted as “The chicken (bird) is ready to be fed” or as “The chicken (food) is ready to be eaten”. There are different types of ambiguity:

Lexical Ambiguity: the ambiguity of a single word. A word can be ambiguous with respect to its syntactic class, e.g. book, study. For example, the word silver can be used as a noun, an adjective, or a verb:

  • She bagged two silver medals.
  • She made a silver speech.
  • His worries had silvered his hair.

Lexical ambiguity can be resolved by lexical category disambiguation, i.e. part-of-speech tagging. Because many words may belong to more than one lexical category, part-of-speech tagging is the process of assigning a part-of-speech or lexical category such as noun, verb, pronoun, preposition, adverb or adjective to each word in a sentence.

Lexical Semantic Ambiguity: the type of lexical ambiguity that occurs when a single word is associated with multiple senses, e.g. bank, pen, fast, bat, cricket. For example:

  • The tank was full of water.
  • I saw a military tank.

The occurrences of tank in both sentences belong to the syntactic category noun, but their meanings are different. Lexical semantic ambiguity is resolved using word sense disambiguation (WSD) techniques, which aim to assign the meaning of a word in its context automatically.
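As an illustration of word sense disambiguation for the two "tank" sentences above, here is a minimal sketch of a gloss-overlap heuristic in the style of the simplified Lesk approach. The sense labels, the glosses, and the stopword list are assumptions made up for this example; they do not come from these notes or from any real dictionary, and a practical system would use a proper sense inventory.

```python
# Toy sense inventory for "tank". The glosses are illustrative assumptions,
# not entries from a real dictionary.
SENSES = {
    "tank/container": "large container for holding liquid such as water or fuel",
    "tank/vehicle": "armoured military combat vehicle with a gun used by an army",
}

# Small hand-picked stopword list (assumption).
STOPWORDS = {"the", "a", "an", "of", "was", "i", "with", "for", "such", "as", "or", "by"}


def content_words(text):
    """Lower-case the text, strip simple punctuation, and drop stopwords."""
    tokens = (tok.strip(".,;:!?").lower() for tok in text.split())
    return {tok for tok in tokens if tok and tok not in STOPWORDS}


def disambiguate(sentence, senses=SENSES):
    """Return the sense whose gloss shares the most content words with the sentence.

    Ties are broken by dictionary order, which is arbitrary.
    """
    context = content_words(sentence)
    return max(senses, key=lambda sense: len(context & content_words(senses[sense])))


print(disambiguate("The tank was full of water."))  # shares "water" with the container gloss
print(disambiguate("I saw a military tank."))       # shares "military" with the vehicle gloss
```

The heuristic only counts shared words, so it fails whenever the context and the gloss describe the same sense with different vocabulary.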

Syntactic Ambiguity: structural ambiguities are syntactic ambiguities. Structural ambiguity is of two kinds: scope ambiguity and attachment ambiguity.

Scope Ambiguity: scope ambiguity involves operators and quantifiers.

Stages in NLP

Morphological and Lexical Analysis

  • The lexicon of a language is its vocabulary, which includes its words and expressions
  • Morphology covers analyzing, identifying and describing the structure of words
  • Lexical analysis involves dividing a text into paragraphs, sentences and words

Syntactic Analysis

  • Syntax concerns the proper ordering of words and its effect on meaning
  • This involves analysis of the words in a sentence to depict the grammatical structure of the sentence
  • The words are transformed into a structure that shows how the words are related to each other
  • E.g. “the girl the go to the school” would definitely be rejected by an English syntactic analyzer
  • E.g. “Ravi apple eats”

Semantic Analysis

  • Semantics concerns the (literal) meaning of words, phrases, and sentences
  • This abstracts the dictionary meaning or the exact meaning from context
  • The structures created by the syntactic analyzer are assigned meaning
  • E.g. “colorless blue idea” would be rejected by the analyzer, as colorless and blue do not make any sense together
  • E.g. “Stone eat apple”

Discourse Integration

  • Sense of the context
  • The meaning of any single sentence depends upon the sentences that precede it and also influences the meaning of the sentences that follow it
  • E.g. the word “it” in the sentence “she wanted it” depends upon the prior discourse context

Pragmatic Analysis

  • Pragmatics concerns the overall communicative and social context and its effect on interpretation
  • It means abstracting or deriving the purposeful use of the language in situations
  • It covers, importantly, those aspects of language which require world knowledge
  • The main focus is on reinterpreting what was said in terms of what it actually means
  • E.g. “close the window?” should be interpreted as a request rather than an order

Challenges of NLP

Some challenges that are faced in the machine learning process for NLP:

Breaking the sentence

Formally referred to as “sentence boundary disambiguation”, this breaking process is no longer difficult to achieve, but it is nonetheless a critical process, especially in the case of highly unstructured data that includes structured information. A breaking application should be intelligent enough to separate paragraphs into their appropriate sentence units; however, highly complex data might not always be available in easily recognizable sentence forms. This data may exist in the form of tables, graphics, notations, page breaks, etc., which need to be appropriately processed for the machine to derive meanings in the same way a human would approach interpreting text.

Tagging the parts of speech (POS) and generating dependency graphs

People understand language to a greater or lesser degree; there is no need, other than for the formal study of that language, to further understand the individual parts of speech in a conversation or reading, as these have been learned in the past. In order for a machine to learn, it must understand formally the fit of each word, i.e., how the word positions itself in the sentence, paragraph, document or corpus. In general, NLP applications employ a set of POS tagging tools that assign a POS tag to each word or symbol in a given text. Subsequently, the position of each word in a sentence is determined by a dependency graph, generated in the same procedure. Those POS tags can be further processed to create meaningful single or compound vocabulary terms.

Building the appropriate vocabulary

Using these POS tags and dependency graphs, a powerful vocabulary can be generated and subsequently interpreted by the machine in a way comparable to human understanding. Consider the following paragraph:

“All employees are responsible for the management of risk, with the ultimate accountability residing with the Board. A comprehensive risk management framework is applied throughout the Group, with governance and corresponding risk management tools. This framework is underpinned by our risk culture and reinforced by the HSBC Values.” (HSBC annual report 2017)

Sentences like these are generally simple enough to be parsed by a basic NLP program. But to be of real value, an algorithm should also be able to generate, at a minimum, the following vocabulary terms: Employees; Management of risk; Ultimate accountability; Board; Comprehensive risk management framework; Governance and corresponding risk management tools; Framework; Risk culture; HSBC values. Unfortunately, most NLP software applications do not produce such a sophisticated vocabulary.

Linking different components of vocabulary

Recently, new approaches have been developed that can extract the linkage between any two vocabulary terms generated from the document (or “corpus”). Word2vec, a vector-space based model, assigns a vector to each word in a corpus; those vectors ultimately capture each word’s relationship to closely occurring words or sets of words. But statistical

Several young companies are aiming to solve the problem of putting unstructured data into a format that can be reused for analysis. Historically, this task has been done only manually by humans. Consider the following example, which contains a named entity, an event, a financial element and its values under different time scales:

“The recent developments in technology have enabled the stock price of Apple to rise by 20% to $168 as at Feb 20, 2018 from $140 in Q3 2017.”

Think of this sentence broken down into the following structure:

This is extremely challenging through linguistics alone. Not all sentences are written in a single fashion, since authors follow their own unique styles. While linguistics is an initial approach toward extracting the data elements from a document, it doesn’t stop there. The semantic layer that understands the relationship between data elements and their values and surroundings has to be machine-trained too, so that it can suggest a modular output in a given format.

Applications of NLP

The following list is not complete, but useful systems have been built for:

  • Spelling and grammar checking
  • Optical character recognition (OCR)
  • Screen readers for blind and partially sighted users
  • Augmentative and alternative communication
  • Machine aided translation (i.e., systems which help a human translator, e.g., by storing translations of phrases and providing online dictionaries integrated with word processors, etc)
  • Lexicographers’ tools
  • Information retrieval
  • Document classification (filtering, routing)
  • Document clustering
  • Information extraction
  • Question answering
  • Summarization
  • Text segmentation
  • Report generation (possibly multilingual)
  • Machine translation
  • Natural language interfaces to databases
  • Email understanding
  • Dialogue systems
  • Categorization
  • Sentiment analysis
  • Named entity recognition

Spelling and grammar checking: All spelling checkers can flag words which aren’t in a dictionary.

  (1) * The neccesary steps are obvious.
  (2) The necessary steps are obvious.

If the user can expand the dictionary, or if the language has complex productive morphology, then a simple list of words isn’t enough to do this and some morphological processing is needed. More subtle cases involve words which are correct in isolation, but not in context. Syntax could sort some of these cases out. For instance, possessive its generally has to be immediately followed by a noun or by one or more adjectives which are immediately in front of a noun.

Information retrieval: Information retrieval involves returning a set of documents in response to a user query; Internet search engines are a form of IR. However, one change from classical IR is that Internet search now uses techniques that rank documents according to how many links there are to them (e.g., Google’s PageRank) as well as the presence of search terms.

Information extraction: Information extraction involves trying to discover specific information from a set of documents. The information required can be described as a template. For instance, for company joint ventures, the template might have slots for the companies, the dates, the products, and the amount of money involved. The slot fillers are generally strings.

Question answering: Question answering attempts to find a specific answer to a specific question from a set of documents, or at least a short piece of text that contains the answer.

  Question: What is the capital of France?
  Text containing the answer: Paris has been the French capital for many centuries.

There are some question-answering systems on the Web, but most use very basic techniques. For instance, Ask Jeeves relies on a fairly large staff of people who search the web to find pages which are answers to potential questions. The system performs very limited manipulation on the input to map to a known question. The same basic technique is used in many online help systems.

Machine translation: MT work started in the US in the early fifties, concentrating on Russian to English. A prototype system was publicly demonstrated in 1954. MT funding got drastically cut in the US in the mid-60s and MT ceased to be academically respectable in some places, but Systran was providing useful translations by the late 60s. Systran is still going. Until the 80s, the utility of general purpose MT systems was severely limited by the fact that text was not available in electronic form: Systran used teams of skilled typists to input Russian documents. Systran and similar systems are not a substitute for human translation: they are useful because they allow people to get an idea of what a document is about, and maybe decide whether it is interesting enough to get translated properly. This is much more relevant now that documents are available on the Web. Bad translation is also, apparently, good enough for chatrooms. Spoken language translation is viable for limited domains: research systems include Verbmobil, SLT and CSTAR.
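Returning to the spelling-checking application described above, the idea of flagging words that are not in a dictionary, with a little morphological processing so that inflected forms are not flagged, can be sketched as follows. The word list, the suffix list and the function names are hypothetical and chosen only for this illustration; a real checker would load a full dictionary and use proper morphological analysis.

```python
# Hypothetical mini word list; a real checker would load a full dictionary.
WORD_LIST = {"the", "necessary", "step", "are", "obvious", "walk", "dog"}

# Very rough "morphology" (assumption): suffixes that may be stripped before lookup.
SUFFIXES = ("ing", "ed", "s")


def is_known(word):
    """True if the word, or the word minus one listed suffix, is in the word list."""
    word = word.lower()
    if word in WORD_LIST:
        return True
    return any(
        word.endswith(suffix) and word[: -len(suffix)] in WORD_LIST
        for suffix in SUFFIXES
    )


def flag_unknown(sentence):
    """Return the tokens of the sentence that the checker would flag."""
    tokens = [tok.strip(".,;:!?") for tok in sentence.split()]
    return [tok for tok in tokens if tok and not is_known(tok)]


print(flag_unknown("The neccesary steps are obvious."))  # flags the misspelling only
print(flag_unknown("The necessary steps are obvious."))  # nothing to flag
```

Without the suffix check, "steps" would be flagged because only "step" is listed, which is the point made above about a plain word list not being enough. A checker like this still cannot catch words that are correct in isolation but wrong in context.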

  • First, inflectional morphemes never change the grammatical category (part of speech) of a word. For example, tall and taller are both adjectives. The inflectional morpheme -er (comparative marker) simply produces a different version of the adjective tall.
  • However, derivational morphemes often change the part of speech of a word. Thus, the verb read becomes the noun reader when we add the derivational morpheme -er.
  • Not all derivational morphemes change the category, however. Derivational prefixes such as re- and un- in English generally do not change the category of the word to which they are attached. Thus, both happy and unhappy are adjectives, and both fill and refill are verbs. The derivational suffixes -hood and -dom, as in neighborhood and kingdom, are also typical examples of derivational morphemes that do not change the grammatical category of the word to which they are attached.
  • Second, when a derivational suffix and an inflectional suffix are added to the same word, they always appear in a certain relative order within the word. That is, inflectional suffixes follow derivational suffixes. Thus, the derivational -er is added to read, then the inflectional -s is attached to produce readers.
  • Similarly, in organ - organize - organizes the inflectional -s comes after the derivational -ize. When an inflectional suffix is added to a verb, as with organizes, we cannot add any further derivational suffixes. It is impossible to have a form like organizesable, with derivational -able after inflectional -s, because inflectional morphemes occur outside derivational morphemes, which attach to the base or stem.
  • A third point worth emphasizing is that certain derivational morphemes serve to create new base forms or new stems to which we can attach other derivational or inflectional affixes. For example, we use the derivational -atic to create adjectives from nouns, as in words like systematic and problematic.

Inflectional affixes always have a regular meaning. Derivational affixes may have irregular meaning. If we consider an inflectional affix like the plural -s in word-forms like bicycles, dogs, shoes, tins, trees, and so on, the difference in meaning between the base and the affixed form is always the same: 'more than one'. If, however, we consider the change in meaning caused by a derivational affix like -age in words like bandage, peerage, shortage, spillage, and so on, it is difficult to find any fixed change in meaning, or even a small set of meaning changes.

Approaches to Morphology

There are three principal approaches to morphology:

  • Morpheme based morphology
  • Lexeme based morphology
  • Word based morphology

Lemmatization

“When we are running a search, we want to find relevant results not only for the exact expression we typed on the search bar, but also for the other possible forms of the words we used. For example, it’s very likely we will want to see results containing the form “shoot” if we have typed “shoots” in the search bar.”

This can be achieved through two possible methods: stemming and lemmatization. The aim of both processes is the same: reducing the inflectional forms of each word to a common base or root. However, these two methods are not exactly the same.

Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This indiscriminate cutting can be successful on some occasions, but not always, and that is why this approach has some limitations. Lemmatization, on the other hand, takes into consideration the morphological analysis of the words. To do so, it is necessary to have detailed dictionaries which the algorithm can look through to link the form back to its lemma.

How do they work?

  • Stemming: there are different algorithms that can be used in the stemming process, but the most common in English is the Porter stemmer. The rules contained in this algorithm are divided into five different phases numbered from 1 to 5. The purpose of these rules is to reduce the words to the root.
  • Lemmatization: the key to this methodology is linguistics. To extract the proper lemma, it is necessary to look at the morphological analysis of each word. This requires having dictionaries for every language to provide that kind of analysis.

Regular expression

The regular expression is the standard notation for characterizing text sequences. Regular expressions are used for specifying text strings in situations like this web-search example and in other information retrieval applications, but they also play an important role in word processing, computation of frequencies from corpora, and other such tasks. Once defined, regular expressions can be implemented via finite-state automata. The finite-state automaton is not only the mathematical device used to implement regular expressions, but also one of the most significant tools of computational linguistics.
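The contrast between stemming and lemmatization described above can be sketched with a handful of suffix rules written as regular expressions, set against a dictionary lookup. This is only a toy under stated assumptions: the four rules are a tiny plural-suffix subset in the spirit of the Porter stemmer, not the full algorithm, and `LEMMA_DICT` is a made-up dictionary for the example.

```python
import re

# Toy suffix rules, tried in order so that the longest suffix wins.
# A small illustrative subset, NOT the complete Porter stemmer.
SUFFIX_RULES = [
    (re.compile(r"sses$"), "ss"),
    (re.compile(r"ies$"), "i"),
    (re.compile(r"ss$"), "ss"),  # protects words such as "caress"
    (re.compile(r"s$"), ""),
]

# Hypothetical lemma dictionary; a real lemmatizer needs a full lexicon.
LEMMA_DICT = {"shoots": "shoot", "ties": "tie", "possesses": "possess", "was": "be"}


def stem(word):
    """Apply the first matching suffix rule; no dictionary is consulted."""
    for pattern, replacement in SUFFIX_RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word


def lemmatize(word):
    """Look the form up in the dictionary; unknown forms are returned unchanged."""
    return LEMMA_DICT.get(word, word)


for form in ["shoots", "ties", "possesses", "was"]:
    print(form, "->", stem(form), "(stem),", lemmatize(form), "(lemma)")
```

For "shoots" both methods give "shoot", but the blind suffix cut turns "ties" into "ti" and "was" into "wa", while the dictionary lookup returns "tie" and "be". The price is that the lemmatizer knows nothing about forms missing from its dictionary.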
Variations of automata such as finite-state transducers, Hidden Markov Models, and N-gram grammars

Finite automata

Suppose we want to recognise dates (just day and month pairs) written in the format day/month. The day and the month may each be expressed as one or two digits (e.g. 11/2, 1/12 etc). This format corresponds to the following simple FSA, where each character corresponds to one transition:

Accept states are shown with a double circle. This is a non-deterministic FSA: for instance, an input starting with the digit 3 will move the FSA to both state 2 and state 3. This corresponds to a local ambiguity, i.e., one that will be resolved by subsequent context. By convention, there must be no ‘left over’ characters when the system is in the final state.

To make this a bit more interesting, suppose we want to recognise a comma-separated list of such dates. The FSA, shown below, now has a cycle and can accept a sequence of indefinite length (note that this is iteration and not full recursion, however).

Both these FSAs will accept sequences which are not valid dates, such as 37/00. Conversely, if we use them to generate (random) dates, we will get some invalid output. In general, a system which generates output which is invalid is said to overgenerate. In fact, in many language applications, some amount of overgeneration can be tolerated, especially if we are only concerned with analysis.

Finite state transducers (FST)

FSAs can be used to recognise particular patterns, but don't, by themselves, allow for any analysis of word forms. Hence for morphology, we use finite state transducers (FSTs), which allow the surface structure to be mapped into the list of morphemes. FSTs are useful for both analysis and generation, since the mapping is bidirectional. This approach is known as two-level morphology.

To illustrate two-level morphology, consider the following FST, which recognises the affix -s allowing for environments corresponding to the e-insertion spelling rule.

Transducers map between two representations, so each transition corresponds to a pair of characters. As with the spelling rule, we use the special character ‘ε’ to correspond to the empty character and ‘^’ to correspond to an affix boundary. The abbreviation ‘other : other’ means that any character not mentioned specifically in the FST maps to itself. As with the FSA example, we assume that the FST only accepts an input if the end of the input corresponds to an accept state (i.e., no ‘left-over’ characters are allowed).

For instance, with this FST, ‘d o g s’ maps to ‘d o g ^ s’, ‘f o x e s’ maps to ‘f o x ^ s’ and ‘b u z z e s’ maps to ‘b u z z ^ s’. When the transducer is run in analysis mode, this means the system can detect an affix boundary. In generation mode, it can construct the correct string. This FST is non-deterministic.

Similar FSTs can be written for the other spelling rules for English. Morphology systems are usually implemented so that there is one FST per spelling rule and these operate in parallel. One issue with this use of FSTs is that they do not allow for any internal structure of the word form. For instance, we can produce a set of FSTs which will result in unionised being mapped into un ^ ion ^ ise ^ ed, but as we've seen, the affixes actually have to be applied in the right order and this isn't modelled by the FSTs.

Morphological parsing with FST

An FST is a type of FSA which maps between two sets of symbols. It is a two-tape automaton that recognizes or generates pairs of strings, one from each tape.

  • An FST defines relations between sets of strings.
  • Finite-state automata have input labels, i.e. one input tape.
  • Finite-state transducers have input:output pairs on labels, i.e. two tapes: input and output.
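To make the bidirectional mapping concrete, here is a minimal sketch of a small non-deterministic transducer for the -s affix with e-insertion after s, x or z. The state numbering, the arc table and the function names are assumptions made for this illustration; they are not taken from the FST figure these notes refer to. The same arcs are used for analysis (surface to lexical) and for generation (lexical to surface).

```python
SIBILANTS = set("sxz")
BOUNDARY = "^"   # affix boundary on the lexical side
ACCEPT = {3}     # reached only after the affix -s has been read

# Arcs that are not plain copies: (state, surface, lexical, next_state).
# "" stands for the empty string (epsilon).
SPECIAL_ARCS = [
    (0, "", BOUNDARY, 2),   # ordinary stem: the boundary has no surface character
    (1, "e", BOUNDARY, 2),  # stem ending in s/x/z: surface "e" realises the boundary
]


def copy_arc(state, ch):
    """Next state for an identity pair ch:ch, or None if there is no such arc."""
    if ch == BOUNDARY:
        return None
    if state in (0, 1):
        return 1 if ch in SIBILANTS else 0
    if state == 2 and ch == "s":
        return 3
    return None


def transduce(text, analysis=True):
    """Map surface -> lexical (analysis) or lexical -> surface (generation)."""
    results = set()
    stack = [(0, 0, "")]  # (position in text, state, output so far)
    while stack:
        pos, state, out = stack.pop()
        if pos == len(text) and state in ACCEPT:
            results.add(out)
        for src, surface, lexical, nxt in SPECIAL_ARCS:
            read, write = (surface, lexical) if analysis else (lexical, surface)
            if src == state and text.startswith(read, pos):
                stack.append((pos + len(read), nxt, out + write))
        if pos < len(text):
            nxt = copy_arc(state, text[pos])
            if nxt is not None:
                stack.append((pos + 1, nxt, out + text[pos]))
    return results


print(transduce("dogs"))                   # analysis
print(transduce("foxes"))                  # analysis, more than one candidate
print(transduce("fox^s", analysis=False))  # generation
```

Because the machine is non-deterministic, analysing "foxes" yields both "fox^s" and the spurious "foxe^s". In a full system a lexicon lookup would discard the second candidate, since "foxe" is not a stem.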

  • Within each set, if more than one of the rules can apply, only the one with the longest matching suffix (S1) is followed.

  1a. Plural nouns / third person singular verbs (4 rules)
        sses → ss        possesses → possess
        ies → i          ties → ti
  1b. Verbal past tense and progressives (3 rules)
        (v) ed → null    walked → walk
        plus cleanup rules to remove double letters and add back e’s
        at → ate         conflat(ed) → conflate
  1c. (v) y → i          e.g. happy → happi
  2. Derivational morphology I: multiple suffixes
        ator → ate       operator → operate
        fulness → ful    gratefulness → grateful
  3. Derivational morphology II: more multiple suffixes
        ful → null       grateful → grate
  4. Derivational morphology III: single suffixes
        ous → null       analogous → analog
  5. Cleanup (3 rules)
        (m>1) e → null   probate → probat; rate → rate
        dropping double letters   controll → control

N-Grams: N-gram language model

Language Models

  • Formal grammars (e.g. regular, context free) give a hard “binary” model of the legal sentences in a language.
  • For NLP, a probabilistic model of a language that gives a probability that a string is a member of a language is more useful.
  • To specify a correct probability distribution, the probabilities of all sentences in a language must sum to 1.
  • A language model also supports predicting the completion of a sentence.
  • Please turn off your cell _____
  • Your program does not ______

N-Gram Models

  • Estimate probability of each word given prior context.
  • P(phone | Please turn off your cell)
  • Number of parameters required grows exponentially with the number of words of prior context.
  • An N-gram model uses only N-1 words of prior context.
  • Unigram: P(phone)
  • Bigram: P(phone | cell)
  • Trigram: P(phone | your cell)
  • The Markov assumption is the presumption that the future behavior of a dynamical system depends only on its recent history. In particular, in a kth-order Markov model, the next state depends only on the k most recent states; an N-gram model is therefore an (N-1)th-order Markov model.
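As a rough sketch of the bigram case in the list above, the conditional probability of a word given the previous word can be estimated by relative frequency: count how often the pair occurs and divide by how often the previous word occurs as a context. The corpus here is the three-sentence mini-corpus these notes use for their worked example; the `<s>` and `</s>` boundary marker names and the function names are assumptions of this sketch.

```python
from collections import Counter

CORPUS = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]


def bigram_model(sentences):
    """Return a function prob(word, prev) estimated by relative frequency."""
    context_counts = Counter()  # how often each word occurs as a context
    bigram_counts = Counter()   # how often each (prev, word) pair occurs
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1

    def prob(word, prev):
        if context_counts[prev] == 0:
            return 0.0  # context never seen; a real model would smooth instead
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob


p = bigram_model(CORPUS)
print(p("am", "I"))     # "I" occurs 3 times as a context, twice followed by "am"
print(p("Sam", "<s>"))  # 1 of the 3 sentences starts with "Sam"
```

Unseen bigrams get probability zero under this estimate, which is why practical N-gram models add some form of smoothing.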

N-Gram Model Formulas

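  • Word sequences: $w_1^n = w_1 \ldots w_n$
  • Chain rule of probability: $P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$
  • Bigram approximation: $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$
  • N-gram approximation: $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$

Estimating Probabilities
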
  • N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences.
  • To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every sentence and treat these as additional words.
  • An N-gram model can be seen as a probabilistic automaton for generating sentences:
      Initialize the sentence with N-1 <s> symbols.
      Until </s> is generated do:
          Stochastically pick the next word based on the conditional probability of each word given the previous N-1 words.

Example: Let’s work through an example using a mini-corpus of three sentences:

    <s> I am Sam </s>
    <s> Sam I am </s>
    <s> I do not like green eggs and ham </s>

Here are the calculations for some of the bigram probabilities from this corpus:

    P(I | <s>) = 2/3       P(Sam | <s>) = 1/3     P(am | I) = 2/3
    P(</s> | Sam) = 1/2    P(Sam | am) = 1/2      P(do | I) = 1/3

N-gram for spelling correction