









































Complete Notes. Enough for exams.
Typology: Lecture notes










































Natural language processing (NLP) could be defined as the attempt to get computers to process human languages in a textual form in a way that utilises knowledge of language in order to perform some useful task. NLP deals with human (natural) languages only, not artificial ones (e.g., C, predicate calculus, etc.), and only when in a textual or graphical form; the related field of speech processing deals with spoken language. NLP concerns itself with tasks that require the application of some knowledge of human language; simpler tasks, such as string matching or generation of pre-stored text, are not usually considered to be a part of NLP.
There are two motivations for studying NLP: technological or engineering, that is, to build computer systems that can perform tasks that require understanding of textual language; and linguistic or cognitive science, that is, to better understand how humans communicate using language.
Some applications of NLP include machine translation, natural language database interfaces, information retrieval and extraction, automatic summarisation, question answering, text categorisation/clustering, plagiarism/authorship detection, natural language generation and dialogue systems. Information extraction, or text mining, is the recognition of entities and relations in text; this recognised information is typically marked up (e.g., using XML) or mapped into a separate structured representation called a template. The extracted information can be used to generate summaries, for document search, navigation and browsing (i.e., the "semantic web"), or for mining to discover trends or patterns. Another application is natural language generation (e.g., adaptive hypermedia), which uses single, abstract (i.e., non-natural language) descriptions of objects and the relations between them; from such a representation, different textual descriptions are generated which take account of various factors, such as target audience or target language.
To fully describe a language, we must consider the sounds which make up each word, the order in which the words occur, the meaning of these words and the types of situations in which a sentence can be uttered. Linguists typically make the assumption that language can be described on a number of levels, and that these levels can largely be studied independently. There is no full agreement as to what
these levels are, or how they are related, but the generally agreed levels of description are: phonetics, phonology, graphetics, graphology, morphology, syntax, semantics, discourse and pragmatics.
Phonetics is the study of how to describe and classify speech sounds, typically using some articulatory description (e.g., referring to how the sounds are produced in the human vocal tract). Phonetics is independent of the language of the speaker, and is concerned with the total range of vocal sounds humans can produce and detect. Phonology is the study of the principles that govern which sounds are used in actual human languages, and of explaining those patterns and variations. Phonology seeks to find minimal sound units, called phonemes, which, if changed in a word, alter the word's meaning.
By analogy to phonetics and phonology, graphetics is the study of the physical properties of the symbols making up writing systems, including study of the means of production of symbols, and graphology is the study of the characteristics of the symbol system used in human languages. Graphemes, analogous with phonemes, are the smallest units in the writing system whose change affects meaning, e.g., the letter c printed in two different typefaces is still the same grapheme.
Although the physical faculties associated with spoken language (Broca's area, a region of the brain shown to be linked to language, and the vocal apparatus) appear to have arisen in Homo sapiens over 300,000 years ago, written language emerged only more recently, about 5,000 years ago. There is some dispute over whether earlier evidence of writing constitutes a writing system. Some people define a writing system as "a system of more or less permanent marks used to represent an utterance in such a way that it can be recovered more or less exactly without the intervention of the utterer".
Writing systems can be classified as logographic, where symbols represent words (e.g., Hanzi or Kanji); syllabic, where symbols represent syllables; and alphabetic, where symbols represent phonemes. Alphabetic systems can be broken up into abugidas, which are syllabic, but diacritics or other modifications are used to indicate which vowel is used; abjads, which are consonants only and the speaker inserts an appropriate vowel, with diacritics used to indicate non-standard vowels; and alphabets.
Morphology is the study of the structure of words. The smallest meaningful elements into which words can be decomposed are called morphemes. Morphology is typically divided into the two subfields of inflectional and derivational morphology. Inflectional morphology is concerned with the differing forms one word takes to signal differing grammatical roles (e.g., boy (singular) vs. boys (plural)); inflected forms are of the same syntactic category. Derivational morphology is concerned with how new words may be constructed from component morphemes (e.g., institute to institution to institutional, etc.).
Syntax is the study of the structure of sentences; however, there is no consensus on the definition of a sentence. For NLP, we proceed by analysing linguistic constructions that actually occur and identifying those that can stand on their own and possess syntactic structure. Traditionally, a hierarchy of sentence structure is recognised: sentence → clauses → phrases → words →
The exact tokenisation rules that should be applied depend on the orthographic conventions adopted in the writing system or script being processed. A script is the system of graphical marks employed to record the expressions of a language in a visible form. There is a many-to-many mapping between languages and scripts, as many languages may share one script (e.g., Western European languages and the Latin script), and some languages may have multiple scripts (e.g., Japanese, with kanji and hiragana).
Scripts may be classified along a number of dimensions: directionality (e.g., left-to-right, right-to-left, etc.); historical derivation; and the relationship between the symbols and sounds. The latter is the most common way to classify scripts, but there is no consensus as to subclassifications.
Non-phonological scripts can be: pictographic, where the graphemes (called pictographs or pictograms) provide recognisable pictures of entities in the world; ideographic, where the graphemes (called ideographs or ideograms) no longer directly represent things in the world, but have an abstract or conventional meaning, and may contain linguistic or phonological elements; hieroglyphic, which contains ideograms and phonograms to create a rebus where the ideogram for a word is made up of phonograms for the consonants in the word (e.g., "bee" + "r" = "beer") and determinatives to disambiguate between multiple senses; logographic, where the graphemes (called logographs or logograms) represent words or parts of words. Some characters derive from earlier ideograms, whereas others are phonetic elements, or modified versions of base characters to indicate words of related meaning.
Phonological scripts can be: syllabic, where graphemes correspond to spoken syllables, typically a vowel-consonant pair; alphabetic, where there is a direct correspondence between the graphemes and phonemes of the language.
An ideal situation would be a one-to-one grapheme-phoneme mapping, but this is uncommon in natural language as the written form often does not keep pace with pronunciation. A character, or grapheme, is the smallest component of a written language that has a semantic value. A glyph is a representation of a character or characters as they are rendered or displayed. A repertoire of glyphs makes up a font. The relationship between glyphs and characters can be complex: for any character, there may be many glyphs (different fonts) representing it, a single glyph may sometimes correspond to a number of characters, and there is perhaps a degree of arbitrariness, especially with diacritics.
When deciding on what set of characters is used for representing a language in a computer, it is important to separate out characters from glyphs: the underlying representation of a text should contain the character sequence only, and the final appearance of the text is the responsibility of the rendering process. Character encoding is necessary for the representation of texts in digital form, and standards are required if such data is to be widely shared; many standards have emerged over the years. The development of a single multilingual encoding standard will permit processing of multiple languages in one document, and the reuse of software for products dealing with multiple languages. Unicode is emerging as such a standard, but not without contention. Some of the key features of Unicode include the use of 16-bit codes, extensible to 24 bits, with an 8-bit encoding (UTF-8) for backwards compatibility. Unicode has a commitment to encode plain text only, and attempts to unify characters across languages, whilst still supporting existing standards. Character property tables are used to associate additional information with characters, and the Unicode encoding model clearly separates the abstract character repertoire, its mapping to a set of integers, the encoding forms and the byte serialisation.
Regular expressions are often used to represent the rules to tokenise text, with finite state automata (FSAs) used for parsing. Finite state transducers (FSTs) are simple modifications to FSAs in which output symbols are associated with arcs, and for each transition this output symbol is emitted (it can be the empty symbol ε).
Computational Morphology
Morphology is the study of the forms of words, specifically how they might be built up from or analysed into components which bear meaning, called morphemes. Morphology is productive, and applies to all words of a given class.
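Returning to the tokenisation rules mentioned above, the following is a minimal sketch of how such rules might be written as a single regular expression. The pattern is an illustrative assumption for English text in Latin script, not a rule set taken from these notes, and the names `TOKEN_PATTERN` and `tokenise` are hypothetical.

```python
import re

# Hypothetical tokenisation rules for English text in Latin script.
# Alternatives are tried left to right, so numbers are tested before words.
TOKEN_PATTERN = re.compile(
    r"""
    \d+(?:[.,]\d+)*      # numbers such as 5000 or 300,000
  | \w+(?:[-']\w+)*      # words, allowing internal hyphens and apostrophes
  | [^\w\s]              # any other single non-space symbol (punctuation)
    """,
    re.VERBOSE,
)


def tokenise(text):
    """Return the tokens matched left to right in `text`."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenise("Don't pay 300,000 for a state-of-the-art parser!"))
```

A different script or orthographic convention would need a different pattern, which is the point made earlier about tokenisation depending on the writing system.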
It is inefficient to store all word forms in the lexicon, as much of this information is predictable; relying on morphological rules also makes it easier to add new words that behave as expected. For more morphologically complex languages, such as Turkish, it is impossible to store all forms in the lexicon. For example, Turkish verbs have 40,000 forms, not counting derivational suffixes, which allow the number of forms to become theoretically infinite. Words are constructed from morphemes, which can be classified into two broad classes: stems (sometimes called the root form or lexeme), which are the main morpheme of a word, supplying the main meaning; and affixes, which supply additional meaning, or modify the meaning of the stem. Affixes can be further divided into prefixes, which precede the stem; suffixes, which follow the stem; infixes, which are inserted into the stem; and circumfixes, which both precede and follow the stem.
These morphotactic FSAs can be combined with relevant sub-lexicons, expanding arcs like reg-noun with all the morphemes that make up a class, to build morphological recognisers: programs that recognise strings that make up legitimate English words. However, we don't want to just recognise surface forms; we want to parse them, yielding the stem and morphological features. This can be done using an FST, which views parsing and generation as a bi-directional mapping between the lexical level, representing the concatenation of morphemes (stem + morphological features), and the surface level (the final spelling of the word). This two-level morphology model was first proposed by Koskenniemi in 1983.
We can think of an FST as mapping between strings on a two-tape automaton, or as a two-tape automaton that recognises pairs of strings. In the two-level morphology model, the input symbol on an FST arc corresponds to a character on the lexical level, and the output symbol to a character on the surface level. Symbols commonly map to themselves directly, e.g., a:a; these are called default pairs, and are often compressed down to just the single letter "a". The FSAs and lexicons discussed above for morphological recognition can be extended for parsing by converting the morphotactic FSAs to FSTs, adding a lexical level tape and morphological features, and by allowing the lexical entries to have two levels. We often include ε here, as well as the special symbols ^ to indicate a morpheme boundary and # to indicate a word boundary. Additionally, @ is a wildcard symbol, with @:@ meaning "any feasible pair", allowing any possible suffixes.
However, we also need to consider orthographic rules, where regular nouns undergo spelling changes at morpheme boundaries. We can model these rules by using FSTs that work on the output of the two-level model, treating it as an intermediate level that includes the morpheme boundary symbols.
We can use rule formalisms such as a → b / c __ d, which means "rewrite a as b when it occurs between c and d". For example, the rule for e-insertion may be: ε → e / [xsz]^__s#.
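To make the e-insertion rule concrete, here is a small sketch that applies it to intermediate-level strings containing the ^ and # boundary symbols. It uses a regular-expression substitution in place of a real FST, which is a simplification; the function names `e_insertion` and `to_surface` are hypothetical.

```python
import re


def e_insertion(intermediate):
    """Apply  ε → e / [xsz]^__s#  to an intermediate-level string.

    An 'e' is inserted after the morpheme boundary ^ when the boundary
    follows x, s or z and is itself followed by 's#'.
    """
    return re.sub(r"([xsz])\^(?=s#)", r"\1^e", intermediate)


def to_surface(intermediate):
    """Apply the spelling rule, then delete the boundary symbols ^ and #."""
    return e_insertion(intermediate).replace("^", "").replace("#", "")


if __name__ == "__main__":
    for form in ["fox^s#", "cat^s#", "fox#"]:
        print(form, "->", to_surface(form))
```

Only the generation direction (intermediate to surface) is shown; a transducer would also support the reverse mapping used in parsing.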
This yields a three-level model: the lexical level (stem + morphological information, e.g., fox +N +PL), the intermediate level (including boundary information, e.g., fox^s#) and the surface level (the final word, e.g., foxes). The lexical FSTs can be combined with the orthographic rule FSTs to create a single parsing/generation system. These FSTs can be cascaded or combined into a single machine using composition and intersection operations.
Language Modelling
Natural language texts are not bags of words: word order is clearly essential in conveying meaning (although more important in some languages than others). Various approaches can be taken to studying, modelling and exploiting word order in language.
One approach is to observe that words can be placed into classes (so-called word classes, or parts-of-speech) that predict in what order they can be combined with words in other classes to form meaningful expressions. These allowable word orderings can be specified via grammar rules which state permissible sequences of word classes. Sets of such rules are called a grammar, and the study of what word classes there are, what rules there are, and how these rules should be expressed is referred to as the study of syntax.
Another approach is to build statistical models of the sequences of words or wordforms that actually occur, avoiding any abstraction into word classes and any attempt to capture allowable word class sequences in grammar rules. Such word-based models are frequently termed language models, though grammars are also models of language.
The debate between advocates of statistical word sequence modelling and advocates of syntax-based modelling has been at the centre of computational linguistics since the 1950s. The current view is that both should be exploited for their respective strengths.
If we pause at a given word in a sentence, it is impossible to know exactly which word will follow, but we can say that some words are more likely than others. Word sequence models allow us to assign probabilities to which words are likely to follow others in a sentence. These types of models are useful for speech recognition (deriving word sequences from phone sequences), augmentative communication systems, context-sensitive spelling correction, and word completion in limited input devices such as mobile phone keypads.
Setting $k = 1$ yields a bigram model, and $k = 2$ a trigram model. If we take the bigram case and substitute the approximation into the equation for computing sentence probability for $n$ words, we get: $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$
To train a bigram model using a corpus, we need for each bigram to count the number of times it occurs, and then divide it by the number of bigrams that share the first word, or the number of times the first word in the bigram occurs. For N-grams of any length, the more general form can be used: $P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1} w_n)}{C(w_{n-N+1}^{n-1})}$
This approach estimates the N-gram probability by dividing the observed frequency of a sequence by the observed frequency of its prefix; this ratio is called a relative frequency. Using relative frequencies to estimate probabilities is one example of Maximum Likelihood Estimation (MLE). The resulting parameter estimates maximise the likelihood of the corpus C, given the model M (i.e., they maximise $P(C \mid M)$).
Computing the products of bigram probabilities in a sentence frequently results in underflow. One solution is to take the log of each probability (the logprob), which can then be summed and the anti-log taken.
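The bigram training procedure and the logprob trick can be sketched as follows. This is a minimal illustration, not a full language model: the sentence boundary markers `<s>` and `</s>` and the function names are assumptions introduced here, and no smoothing is applied.

```python
import math
from collections import Counter


def train_bigram_mle(sentences):
    """Return a function prob(word, prev) giving the MLE of P(word | prev)."""
    bigram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]  # assumed boundary markers
        for prev, word in zip(tokens, tokens[1:]):
            bigram_counts[(prev, word)] += 1
            context_counts[prev] += 1           # bigrams that start with prev

    def prob(word, prev):
        if context_counts[prev] == 0:
            return 0.0
        # Relative frequency: C(prev word) / C(prev)
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob


def sentence_logprob(prob, sentence):
    """Sum log probabilities instead of multiplying, to avoid underflow."""
    tokens = ["<s>"] + sentence + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = prob(word, prev)
        if p == 0.0:
            return float("-inf")  # an unseen bigram zeroes the whole sentence
        total += math.log(p)
    return total


if __name__ == "__main__":
    corpus = [["i", "am", "sam"], ["sam", "i", "am"],
              ["i", "do", "not", "like", "green", "eggs"]]
    prob = train_bigram_mle(corpus)
    print(prob("am", "i"))
    print(math.exp(sentence_logprob(prob, ["i", "am", "sam"])))
```

The `-inf` result for any sentence containing an unseen bigram is exactly the sparsity problem that smoothing is meant to address.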
One issue with the bigram approach is that of data sparsity, where certain plausible bigrams do not occur within our training corpus, giving a probability for a particular bigram, and subsequently of the entire sentence, of 0. As something that has not yet been observed does not necessarily have a probability of 0, we need some way of estimating low probability events that does not assign them 0 probability. Smoothing is one method to overcome zero-occurrence bigrams. These approaches work by assigning some probability to bigrams unseen in the training data, sometimes including those that are seen very few times (and therefore have unreliable estimates). In this case, we must also reduce the probabilities of seen bigrams so that the total probability mass remains constant.
Add-one Smoothing
One simple approach is to add one to all the counts in the bigram matrix before normalising them into probabilities. This is known as add-one, or Laplace, smoothing. It does not work very well, but is useful to introduce smoothing concepts and provides a baseline. We can do this in the unigram case by modifying the unsmoothed MLE of the unigram probability for a word $w_i$. The modified count $c_i^*$ is computed from the original count $c_i$ as $c_i^* = (c_i + 1)\frac{N}{N + V}$, where $N$ is the size of the corpus and $V$ is the size of the vocabulary (distinct word types).
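A small sketch of the unigram adjusted count, assuming a vocabulary fixed in advance that contains every token in the corpus. The function name `add_one_unigram` and the toy data are hypothetical; dividing each adjusted count by N turns it into a probability.

```python
from collections import Counter


def add_one_unigram(tokens, vocabulary):
    """Return (adjusted_counts, probabilities) under add-one smoothing.

    Assumes every token in `tokens` is a member of `vocabulary`.
    """
    counts = Counter(tokens)
    n = len(tokens)      # N: corpus size in word tokens
    v = len(vocabulary)  # V: number of distinct word types
    adjusted = {w: (counts[w] + 1) * n / (n + v) for w in vocabulary}
    probs = {w: adjusted[w] / n for w in vocabulary}
    return adjusted, probs


if __name__ == "__main__":
    adjusted, probs = add_one_unigram("a a a b".split(), {"a", "b", "c"})
    print(adjusted)  # the unseen word "c" now has a non-zero count
    print(sum(adjusted.values()), sum(probs.values()))
```

The adjusted counts still sum to N, which shows that mass has been moved from seen words to the unseen one rather than created.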
These new counts can be turned into the smoothed probabilities $p_i^*$ by normalising by $N$, or alternatively directly from the counts: $p_i^* = \frac{c_i + 1}{N + V}$. Another way to think about computing it is to define smoothing in terms of a discount $d_c$, which is the ratio of discounted counts to the original counts: $d_c = c^* / c$. The intuition here is that non-zero counts need to be discounted in order to free up the probability mass that is re-assigned to the zero counts. For bigram smoothing, we can use a similar system: $P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$.
Good-Turing Discounting
There are a number of much better discounting algorithms than add-one smoothing. Several of them rely on the idea that you can use the count of things you've seen once to help estimate the count of things you've never seen. The Good-Turing algorithm is based on the insight that the amount of probability to be assigned to zero-occurrence n-grams can be re-estimated by looking at the number of n-grams that occurred once. A word or n-gram (or indeed, any event) occurring once in a corpus is called a singleton or hapax legomenon.
Good-Turing relies on determining $N_c$, the number of n-grams with frequency $c$, or the frequency of frequency $c$. Formally, we can say $N_c = \sum_{x : \mathrm{count}(x) = c} 1$, i.e., when applied to bigrams, $N_0$ is the number of bigrams with count 0, $N_1$ the number of bigrams with count 1, etc. We can now replace the MLE count $c$ of the n-grams in $N_c$ with the Good-Turing smoothed count: $c^* = (c + 1)\frac{N_{c+1}}{N_c}$. This count can be used to replace the MLE counts for all $N_c$ with $c > 0$. Rather than using it to re-estimate a smoothed count for $N_0$, we can compute the missing mass (things with zero count) as $P_{GT}(\text{things seen with frequency 0 in training}) = N_1 / N$, where $N_1$ is the count of things only seen once in training and $N$ is the count of all things seen in training. For things with frequency $c > 0$: $P_{GT}(\text{things seen with frequency } c \text{ in training}) = c^* / N$.
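The Good-Turing quantities above can be computed directly from a table of counts. The sketch below uses made-up bigram counts and a hypothetical function name, `good_turing`; where $N_{c+1}$ is zero the formula gives no usable estimate, so the sketch records `None` instead of guessing.

```python
from collections import Counter


def good_turing(ngram_counts):
    """Return (missing_mass, smoothed) for a dict mapping n-gram -> count.

    `smoothed` maps each observed count c to c* = (c + 1) * N_{c+1} / N_c,
    or to None when N_{c+1} is zero.
    """
    n_total = sum(ngram_counts.values())           # N: all things seen
    freq_of_freq = Counter(ngram_counts.values())  # N_c for each count c
    missing_mass = freq_of_freq[1] / n_total       # N_1 / N
    smoothed = {}
    for c, n_c in freq_of_freq.items():
        n_next = freq_of_freq.get(c + 1, 0)
        smoothed[c] = (c + 1) * n_next / n_c if n_next > 0 else None
    return missing_mass, smoothed


if __name__ == "__main__":
    # Toy counts invented for illustration only.
    counts = {"a b": 3, "b c": 2, "f g": 2, "c d": 1, "d e": 1, "e f": 1}
    missing, smoothed = good_turing(counts)
    print(missing)   # N_1 / N
    print(smoothed)  # c* for each observed count
```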
Estimating how much probability to assign to each unseen n-gram assumes that $N_0$ is known. For bigrams, we can compute $N_0 = V^2 - T$, where $V$ is the size of the vocabulary, and $T$ is the number of bigram types observed in the corpus. Another issue is that the re-estimate for $c^*$ is undefined when $N_{c+1} = 0$. One solution is to use Simple Good-Turing. $N_c$ is computed as before, but before computing $c^*$, any zero $N_c$ counts are replaced with a value computed from a linear regression fit to map $N_c$ into log space: $\log(N_c)$
Discounting, which provides a way to assign probabilities to unseen events, is an integral part of the algorithm. In earlier discussions of Good-Turing, we assumed that all unseen events are equally probable and distributed probability mass available from discounting equally among them. In Katz backoff, Good-Turing discounting is used to calculate the total probability mass for unseen events, and backoff is used to determine how to distribute this amongst unseen events. Discounted probabilities P* and weights α must be used, otherwise the results of these equations are not true probabilities (that is, where the sum of all probabilities = 1). P* is used to compute the amount of probability mass to save for lower order n -grams, and the α is used to ensure the amount of probability mass contributed by the lower-order model sums to the amount saved by discounting higher order N-grams.
The closed vocabulary assumption is the assumption that we know the vocabulary size V in advance. Language models are then trained and tested on corpora containing only words from the closed vocabulary. This assumption is unrealistic, as real world text always contains previously unseen words (new proper names, neologisms, new terminology, etc). These previously unseen words are called unknown, or out-of-vocabulary (OOV) words. The percentage of OOV words in a test set is the OOV rate. Generally, an open vocabulary system is more desirable, as it can handle OOV words. A typical approach to model such unknown words in the test set is by adding a pseudo-word "UNK" to the vocabulary, and training the probabilities to model it by fixing a vocabulary in advance, and converting an OOV word in the training set to UNK prior to training. The counts for UNK can then be estimated just as for any other word in the training data.
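The UNK approach and the OOV rate can be sketched in a few lines. The function names and toy sentences are hypothetical; the only assumption beyond the text is that sentences are represented as lists of word strings.

```python
UNK = "UNK"  # the pseudo-word added to the vocabulary


def replace_oov(sentences, vocabulary):
    """Convert every word outside the fixed vocabulary to UNK."""
    return [[w if w in vocabulary else UNK for w in s] for s in sentences]


def oov_rate(test_sentences, vocabulary):
    """Percentage of test tokens that are not in the vocabulary."""
    tokens = [w for s in test_sentences for w in s]
    oov = sum(1 for w in tokens if w not in vocabulary)
    return 100.0 * oov / len(tokens)


if __name__ == "__main__":
    vocabulary = {"the", "cat", "sat"}
    training = [["the", "cat", "sat"], ["the", "dog", "sat"]]
    print(replace_oov(training, vocabulary))
    print(oov_rate([["the", "zebra", "sat", "down"]], vocabulary))
```

After the replacement, UNK is counted like any other word, so an n-gram model trained on the converted corpus assigns it a probability in the usual way.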
The best way to evaluate an n-gram language model is to embed the model in an application (e.g., a speech recogniser) and see if it makes a difference. However, such evaluation can be difficult. An alternative, more direct, measure is perplexity. Although an improvement in perplexity does not guarantee improvements in applications, it is often correlated with such improvements. The intuition underlying perplexity as an evaluation measure is that it is a measure of how well a model fits some test data. This measure of fit is the probability of the test data, given some model, and the model that better fits the data is the one that assigns it a higher probability. For a test set $W = w_1 w_2 \ldots w_N$, perplexity (PP) is formally defined as the inverse probability of the test set, normalised by the number of words: $PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N}$
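As a rough illustration of the definition, the sketch below computes perplexity from the probabilities a model assigns to each word of a test set, assuming those per-word probabilities multiply to give the probability of the whole sequence. It works in log space to avoid underflow; the function name `perplexity` is hypothetical.

```python
import math


def perplexity(word_probs):
    """Perplexity of a test set given the model probability of each word.

    Equivalent to (p_1 * p_2 * ... * p_N) ** (-1 / N), computed with logs.
    """
    n = len(word_probs)
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / n)


if __name__ == "__main__":
    # A model that always spreads its mass evenly over 10 choices.
    print(perplexity([0.1] * 10))
    print(perplexity([0.5, 0.25]))
```

The first call illustrates the branching-factor reading of perplexity: a model that is equally unsure among ten words at every step has a perplexity of about ten.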
If we make our conditional independence assumption and use the bigram case, then the chain rule of probability allows us to expand this to: $PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$
Note that perplexity is inversely related to the conditional probability of the word sequence (i.e., minimising perplexity maximises the test set probability, according to the language model). In evaluating a language model, it is usual to use the entire word sequence of some test corpus. Another way to think of perplexity is as the average weighted branching factor of a language, where the branching factor is the number of possible next words for a given word. The plain branching factor treats each following word as equally likely, whereas the weighted branching factor takes into account the likelihood of each following word. When computing the perplexity, it is important that the test set is distinct from the training set, as knowledge of the test set (e.g., of the vocabulary) can artificially lower the perplexity.
Perplexity can also be defined in terms of cross-entropy. Recall that the entropy of a random variable $X$ over a set $\chi$ (e.g., letters or words) is defined by: $H(X) = -\sum_{x \in \chi} p(x) \log_2 p(x)$, where $p(x) = P(X = x)$. However, we are interested in computing the entropy of finite word sequences of a language $L$, so our $X$ is a word sequence, and $\chi$ the language. From this, we can define the entropy rate, or per-word entropy: $\frac{1}{n} H(W_1^n) = -\frac{1}{n} \sum_{W_1^n \in L} p(W_1^n) \log_2 p(W_1^n)$
To measure the true entropy of a language, we need to think about sequences of infinite length. If we consider language as a stochastic process, $L$, that generates a sequence of words: $H(L) = -\lim_{n \to \infty} \frac{1}{n} \sum_{W \in L} p(w_1, w_2, \ldots, w_n) \log p(w_1, w_2, \ldots, w_n)$
The Shannon-McMillan-Breiman theorem states that if the language is regular (stationary and ergodic): $H(L) = \lim_{n \to \infty} -\frac{1}{n} \log p(w_1, w_2, \ldots, w_n)$, i.e., we can take a single sequence that is long enough instead of summing over all possible sequences.
The intuition is that a long enough sequence of words will contain in it many other shorter sequences, and that each of these shorter sequences will reoccur in the longer sequence according to their probabilities.
Cross-entropy is useful when we don't know the true probability distribution $p$ that generated some data, but do have a model $m$ that approximates it. For a distribution $p$, and our attempt to model it, $m$, the cross-entropy of $m$ on $p$ is defined as: $H(p, m) = -\lim_{n \to \infty} \frac{1}{n} \sum_{W \in L} p(w_1, w_2, \ldots, w_n) \log m(w_1, w_2, \ldots, w_n)$, i.e., we draw sequences according to $p$, but sum the log of their probabilities according to $m$. If we apply the Shannon-McMillan-Breiman theorem for stationary ergodic processes, $H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \log m(w_1, w_2, \ldots, w_n)$, which removes the dependency on the unknown true distribution $p$. For any model $m$, $H(p) \le H(p, m)$, i.e., cross-entropy is an upper bound on entropy. The better $m$ is, the closer the cross-entropy is to the true entropy $H(p)$, so cross-entropy can be used to compare two models, and the better model is the one with the lower cross-entropy.
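In practice the limit is approximated on a finite but long sample. The sketch below estimates the cross-entropy in bits per word from the probabilities a model assigns to each word of one sample, and uses it to compare two models. The function name `cross_entropy` and the probability lists are hypothetical stand-ins for real model output.

```python
import math


def cross_entropy(model_probs):
    """Approximate H(p, m) in bits per word on a single sample.

    `model_probs` holds the probability that model m assigns to each word
    of the sample, so their product is m(w_1 ... w_n).
    """
    n = len(model_probs)
    return -sum(math.log2(q) for q in model_probs) / n


if __name__ == "__main__":
    # Invented per-word probabilities from two models on the same sample.
    model_a = [0.5, 0.5, 0.5, 0.5]
    model_b = [0.25, 0.25, 0.25, 0.25]
    print(cross_entropy(model_a), cross_entropy(model_b))
    # The model with the lower cross-entropy fits the sample better.
```

Raising 2 to the cross-entropy measured in bits gives the perplexity of the same sample, which is why the two measures always rank models identically.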
constancy (local measures of entropy which ignore previous discourse context should increase with sentence number) and correlation has been demonstrated between sentence entropy and processing effort it causes in comprehension, as measured by reading times in eye-tracking data.
Another approach is to observe that words can be placed into classes (word classes, or parts-of-speech) that predict in what order they can combine with words in other classes to form meaningful expressions. Allowable word orderings can be specified via grammar rules which state permissible sequences of word classes. Sets of these rules are called a grammar, and the study of what word classes there are, what rules there are and how these rules should be expressed is referred to as the study of syntax. Word classes are familiar concepts to most, and have been used in the study of language since the time of the Ancient Greeks. Modern catalogues of word classes range from 50 to 150 classes, depending on the level of granularity. The current view is that both statistical language modelling and syntax should be exploited for their respective strengths.
An important distinction between classes is that of open and closed word classes. Closed classes are those for which new words seldom appear (e.g., prepositions, articles, etc.) and open classes are those for which new words frequently appear (e.g., nouns, verbs, adjectives, etc.). Word classes may be defined in terms of syntactic and morphological properties (words that occur in the same constructions or take the same affixes are in the same class) or semantic properties (words that perform the same semantic functions are in the same class), and syntactic or morphological distributional regularities tend to be favoured over semantic regularities as a basis for defining word classes. If we consider semantic function, then words fall into the same class if they perform the same semantic function, e.g., computer may be a noun as it performs the semantic function of identifying an object, but in the phrase "Computer Science", computer qualifies science (like an adjective). For distributional regularities, we place words into the same class if they can appear in the same constructions, e.g., it was very [new/black].
Most nouns can function as modifiers in some context, but there are contexts in which adjectives can occur but where nouns cannot. Further, some adjectives (like black) can function as nouns when they name a quality directly, but many cannot do so. Word classes are useful because they give us information about a word and its neighbours: a word class may determine pronunciation, and suggests which morphological affixes the word can take. Certain word classes are more useful than others in suggesting a document's content (e.g., nouns).
Shallow parsing can be accomplished using patterns of POS tags (e.g., ADJ NOUN NOUN for terminology extraction, or Mr. PNOUN PNOUN for person name recognisers). Similar patterns are useful for extracting other named entities, dates and times. Word sense disambiguation is assisted by knowing the POS, and the word class of the current word helps predict the word class of the next (e.g., possessive pronouns are likely to be followed by a noun, and personal pronouns by a verb). This can be exploited in language models for speech recognition. Parsers typically require word class information to discover constituency structure, and POS-tagged corpora are useful in linguistic research (e.g., to find frequencies of particular constructions).
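The tag-pattern idea can be sketched as a sliding window over a tagged sentence. The function name `match_tag_pattern`, the (word, tag) tuple representation and the DET and VERB tag names are assumptions made for this example; ADJ and NOUN follow the pattern given above.

```python
def match_tag_pattern(tagged, pattern):
    """Return every word sequence whose tags equal `pattern` exactly.

    `tagged` is a list of (word, tag) pairs; `pattern` is a sequence of tags.
    """
    width = len(pattern)
    matches = []
    for start in range(len(tagged) - width + 1):
        window = tagged[start:start + width]
        if [tag for _, tag in window] == list(pattern):
            matches.append(" ".join(word for word, _ in window))
    return matches


if __name__ == "__main__":
    sentence = [("the", "DET"), ("natural", "ADJ"), ("language", "NOUN"),
                ("interface", "NOUN"), ("works", "VERB")]
    print(match_tag_pattern(sentence, ("ADJ", "NOUN", "NOUN")))
```

A pattern such as Mr. PNOUN PNOUN mixes a literal word with tags, so it would need a slightly richer matcher than this tag-only version.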
Part-of-speech tagging is critically dependent on the inventory of POS tags, and a number of POS tagsets have been developed, typically in conjunction with corpora, e.g., the Brown tagset (87 tags) for the Brown corpus; the Penn Treebank tagset (45 tags) for use with the Penn Treebank; the C5 tagset (61 tags) and C7 tagset (146 tags) used for the British National Corpus. POS tagging would be easy if there was a one-to-one mapping from words to POS tags; however, this is not the case, as many words have multiple POS tags (i.e., they are ambiguous with respect to POS tag). In some cases, context makes it easy to decide which tag is appropriate for an ambiguous word; in other cases, it is not so easy. Some well-known problem cases include preposition/particle/adverb overlap (words like around can be all three); labelling noun modifiers; and simple past/past participle/adjective overlap, where verb forms ending in -ed can function to indicate past occurrence of an event (either simple past or past participle) or to express a property.
POS tagging is difficult because, although most words in English only have one tag, the most common words have multiple tags, e.g., in the Brown corpus, only 11.5% of word types are ambiguous, but >40% of word tokens are. For automated POS tagging, the input is a sequence of words, tokenised and split into sentences, together with a tagset; the expected output is the same sequence of words with a single best tag added to each word.
Tagging algorithms may be classified along a number of dimensions: rule-based vs. stochastic/probabilistic, and hand-crafted vs. machine learning. Rule-based algorithms apply a set of rules to tag the current word; rule antecedents typically test the word and tag/possible tags of the current word, using a window of surrounding words. Stochastic/probabilistic approaches apply a probabilistic model to compute the probability of the current word having a given tag in the current context.
Hand-crafted approaches use rules written
Transformation-based tagging
Transformation-based POS tagging is an instance of transformation-based learning (TBL), a general machine learning approach introduced by Eric Brill. Rules are used to assign tags to words, but rather than these rules being manually written, they are induced from a corpus of labelled examples - a supervised learning approach. With TBL, first all words are tagged with their most probable tags (derived using MLE on the corpus), and then a sequence of corrective rules is applied, which has been learnt by minimising error over a training corpus. A metaphor for this is painting a picture - you start with the background and then "correct" it by progressively overwriting it with smaller features.
The TBL algorithm has three major stages. It is given an unlabelled copy of a corpus for which a known correct tagging exists ("truth"). First, every word is labelled with its most likely tag. Second, the best transformation rule is chosen from all possible transformation rules. Third, the data is re-labelled according to this best transformation rule. Steps 2 and 3 are repeated until no significant improvement results. The "best" transformation is the one that most increases the number of correct taggings (by comparison with "truth"). To constrain the number of possible transformations, transformations must be instantiations of one of a small number of abstracted transformations, called templates, e.g., "Change tag a to b when the preceding/following word is tagged z". The variables a, b, z and w range over all POS tags. Using these constraints, all possible instantiations of all possible templates are generated to consider which performs "best".
Evaluation of taggers is typically done against some hand-annotated gold standard using 10-fold cross-validation. Most current taggers can achieve 96-97% accuracy using simple tagsets, a similar performance to human annotators.
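The three TBL stages described above can be sketched in Python as follows. This is a simplified illustration, not Brill's implementation: it uses a single template ("change tag a to b when the preceding word is tagged z"), treats the corpus as one flat sequence with no sentence boundaries, only instantiates the template at positions that are currently wrong (no other instantiation can give a net gain), and stops when no rule improves the count of correct tags. All function names and the toy corpus are assumptions made for the example.

```python
from collections import Counter, defaultdict


def most_likely_tags(words, truth):
    """Stage 1: map each word to the tag it carries most often in the corpus."""
    counts = defaultdict(Counter)
    for word, tag in zip(words, truth):
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def apply_rule(tags, rule):
    """Apply (from_tag, to_tag, prev_tag): change from_tag to to_tag after prev_tag."""
    from_tag, to_tag, prev_tag = rule
    return [
        to_tag if i > 0 and tag == from_tag and tags[i - 1] == prev_tag else tag
        for i, tag in enumerate(tags)
    ]


def num_correct(tags, truth):
    return sum(t == g for t, g in zip(tags, truth))


def learn_rules(words, truth, max_rules=10):
    lexicon = most_likely_tags(words, truth)
    tags = [lexicon[w] for w in words]
    rules = []
    for _ in range(max_rules):
        # Stage 2: instantiate the template at every current error.
        candidates = {
            (tags[i], truth[i], tags[i - 1])
            for i in range(1, len(tags))
            if tags[i] != truth[i]
        }
        best_rule, best_gain = None, 0
        for rule in sorted(candidates):
            gain = num_correct(apply_rule(tags, rule), truth) - num_correct(tags, truth)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:
            break  # no rule improves the tagging any further
        # Stage 3: re-label the data with the best rule, then repeat.
        tags = apply_rule(tags, best_rule)
        rules.append(best_rule)
    return lexicon, rules


def tag(words, lexicon, rules, default="NN"):
    """Tag new text: most likely tags first, then the learnt rules in order."""
    tags = [lexicon.get(w, default) for w in words]
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags


# Toy training corpus (illustrative only).
words = ["the", "race", "ended", "they", "want", "to", "race", "the", "race", "started"]
truth = ["DT", "NN", "VBD", "PRP", "VBP", "TO", "VB", "DT", "NN", "VBD"]

lexicon, rules = learn_rules(words, truth)
print(rules)  # [('NN', 'VB', 'TO')]
print(tag(["they", "want", "to", "race"], lexicon, rules))  # ['PRP', 'VBP', 'TO', 'VB']
```

On this toy data the initial most-likely-tag labelling gets "race" wrong after "to", and the single learnt rule corrects it. Measuring accuracy on the training data, as happens implicitly here, is not a proper evaluation; held-out data or cross-validation would be needed for that.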
Human-level accuracy is considered by some to be an upper bound (ceiling) on tagger performance. A baseline can also be established, which is usually the unigram most-likely tag. The baseline accuracy is typically around 90-91%.
Syntax
We recognise that some word strings are well-formed, or grammatical, and others are not. Additionally, a sentence may be grammatical, but not acceptable, and acceptability is affected by performance factors (e.g., multiple centre embeddings, or so-called garden path sentences). Grammaticality is independent of semantic appropriateness - some sentences, such as "colourless green ideas sleep furiously", are grammatical, even though nonsensical.
The task of explaining why some strings are grammatical and others are not is addressed by the study of syntactic theory, which is "the study of the principles governing how words are combined to form sentences". In the generative paradigm of syntactic research, this goal is pursued via the specification of formal grammars. These formal, or generative, grammars are abstract systems of rules and principles that formally characterise the properties of the well-formed sentences of a language, distinguishing them from ungrammatical strings. Such a grammar is said to generate the language. Syntax plays a vital role in making language useful for communication. Syntax not only limits the possible sentence forms, but also limits the possible sentence forms for a given meaning. For example, word order in English plays a crucial role in allowing us to determine the precise relationship between word meanings; other languages use different mechanisms for signalling relations, but some mechanism is always required. A strong view is that the syntactic structure of any sentence provides a procedure for computing the meaning of the sentence from the meaning of its parts.
A phrase structure grammar (PSG) has two key components: a lexicon and phrase structure rules. For the lexicon, words are categorised into word classes (also known as parts-of-speech, form classes or lexical categories). The lexicon (or dictionary) is then supplied as part of a PSG and provides an assignment of lexical categories to words. For the phrase structure rules, words are grouped into larger units called phrases or constituents, which in turn may group into larger phrases. These phrases are marked with a phrasal category, and the ways in which words and phrases may group into larger phrases are stated by the phrase structure rules (also known as rewrite rules or context-free rules), e.g., NP → Det N. A sentence structure is frequently displayed in the form of a phrase structure tree:
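As a rough sketch of how a lexicon and phrase structure rules combine, the following Python fragment defines a tiny PSG and a naive top-down parser, and prints the resulting structure in labelled-bracket form, a common textual stand-in for a tree. The rules, the lexicon entries and the function names are illustrative assumptions, not taken from these notes; the sketch also assumes one category per word and a grammar with no left-recursive rules.

```python
# Phrase structure rules: each phrasal category maps to its possible expansions.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
}

# Lexicon: assigns a lexical category to each word (one per word, for simplicity).
LEXICON = {
    "the": "Det", "a": "Det",
    "dog": "N", "cat": "N",
    "chased": "V", "slept": "V",
}


def parse(symbol, words, start):
    """Yield (tree, end) for every way `symbol` can cover words[start:end]."""
    if symbol in RULES:
        for rhs in RULES[symbol]:
            yield from expand(symbol, rhs, words, start, [])
    elif start < len(words) and LEXICON.get(words[start]) == symbol:
        yield (symbol, words[start]), start + 1


def expand(symbol, rhs, words, pos, children):
    """Match the remaining right-hand-side symbols left to right from `pos`."""
    if not rhs:
        yield (symbol, *children), pos
        return
    for child, nxt in parse(rhs[0], words, pos):
        yield from expand(symbol, rhs[1:], words, nxt, children + [child])


def parse_sentence(sentence):
    """Return the first tree for S that covers the whole sentence, or None."""
    words = sentence.lower().split()
    for tree, end in parse("S", words, 0):
        if end == len(words):
            return tree
    return None


def bracket(tree):
    """Render a tree as a labelled bracketing."""
    label, *rest = tree
    if len(rest) == 1 and isinstance(rest[0], str):
        return f"[{label} {rest[0]}]"
    return f"[{label} " + " ".join(bracket(child) for child in rest) + "]"


tree = parse_sentence("the dog chased a cat")
print(bracket(tree))
# [S [NP [Det the] [N dog]] [VP [V chased] [NP [Det a] [N cat]]]]
print(parse_sentence("dog the chased"))  # None
```

A string is accepted only if some sequence of rule applications covers every word, which is the sense in which the grammar "generates" the grammatical strings and excludes the rest.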