Computational Linguistics Lecture Notes 1 - C. Chesi, Linguistics notes

Computational Linguistics lecture notes (Professor Cristiano Chesi, 2018), University of Siena, Master's degree "Language and Mind - Linguistics and Cognitive Studies". Topics: Natural Language Processing (NLP); speech recognition; corpus linguistics; annotated and unannotated corpora; lab instructions: exploring unannotated corpora (frequencies and RE).

Type: Notes

2018/2019

Uploaded on 05/08/2019

claudia-ruzza
Computational Linguistics A.Y. 2018/19 - C. Chesi
INTRODUCTION TO NATURAL LANGUAGE PROCESSING (NLP)
AND LINGUISTIC COMPUTATION
04.10.18
General Information:
Goals
●Deep understanding of what’s needed for fully describing a natural language
●What’s a corpus and how it can be used
●How linguistic data can be (semi)automatically processed
●Be independent in reading advanced papers in this field
Teaching
●Lectures (lecture notes and course information will be available at: http://www.ciscl.unisi.it/master/materials.htm)
●Labs
Evaluation
●Class participation (20% of final grade)
●Project presentation (40% of final grade)
●Oral exam (40% of final grade) on course topics (see References)
References
Essential
●Jurafsky, D. & Martin, J. H. (2009) Speech and Language Processing. Prentice‐Hall. 2nd ed.
http://www.cs.colorado.edu/~martin/slp.html chapters: 1, 2, 3, 4, 5, 12, 13, 14, 16, 17, 18, 19, 20
Extended
●Advanced readings will be presented at each lecture.
Those readings won’t be included in the oral exam, but they should help you
shape your project and better understand various aspects of NLP and LC

NLP: Natural Language Processing
You give a linguistic input and receive a reply (OK Google, Siri, etc.). The system looks for a specific pattern in the input in order to extract specific information.

PARSING
Parsing, syntax analysis, or syntactic analysis is the process of analysing a string of symbols, in natural language, computer languages or data structures, according to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech). The term has slightly different meanings in different branches of linguistics and computer science.

Traditional sentence parsing is often performed as a method of understanding the exact meaning of a sentence or word, sometimes with the aid of devices such as sentence diagrams. It usually emphasizes the importance of grammatical divisions such as subject and predicate.

Within computational linguistics the term refers to the formal analysis by a computer of a sentence or other string of words into its constituents, resulting in a parse tree that shows their syntactic relation to each other and may also contain semantic and other information.

The term is also used in psycholinguistics when describing language comprehension. In this context, parsing refers to the way that human beings analyze a sentence or phrase (in spoken language or text) "in terms of grammatical constituents, identifying the parts of speech, syntactic relations, etc." This term is especially common when discussing which linguistic cues help speakers to interpret garden-path sentences.

Within computer science, the term is used in the analysis of computer languages, referring to the syntactic analysis of the input code into its component parts in order to facilitate the writing of compilers and interpreters. The term may also be used to describe a split or separation.

The goal is to write a grammar that is completely understandable by a computer (COMPETENCE MODELS such as the X’ module).
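To make the computational sense of parsing concrete, here is a minimal Python sketch of a parser for a toy phrase-structure grammar that returns a bracketed constituent structure. The grammar (S -> NP VP, NP -> Det N, VP -> V NP), the lexicon, the example sentence and all function names are assumptions chosen for illustration; they are not taken from the course material.

```python
# Toy lexicon (hypothetical): maps each word form to a single category.
LEXICON = {"the": "Det", "an": "Det", "boy": "N", "apple": "N", "eats": "V"}


def parse_np(tokens, i):
    """Try to read 'Det N' starting at position i; return (tree, next_i) or None."""
    if (i + 1 < len(tokens)
            and LEXICON.get(tokens[i]) == "Det"
            and LEXICON.get(tokens[i + 1]) == "N"):
        return ["NP", tokens[i], tokens[i + 1]], i + 2
    return None


def parse_vp(tokens, i):
    """Try to read 'V NP' starting at position i; return (tree, next_i) or None."""
    if i < len(tokens) and LEXICON.get(tokens[i]) == "V":
        result = parse_np(tokens, i + 1)
        if result is not None:
            obj, j = result
            return ["VP", tokens[i], obj], j
    return None


def parse_sentence(sentence):
    """Return a parse tree for 'NP VP' sentences, or None if the grammar rejects them."""
    tokens = sentence.lower().split()
    subject = parse_np(tokens, 0)
    if subject is None:
        return None
    subj_tree, i = subject
    predicate = parse_vp(tokens, i)
    if predicate is None:
        return None
    pred_tree, j = predicate
    if j != len(tokens):  # leftover words: not a complete parse
        return None
    return ["S", subj_tree, pred_tree]


def bracket(tree):
    """Render a tree as a labelled bracketing."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    return "[" + label + " " + " ".join(bracket(c) for c in children) + "]"


print(bracket(parse_sentence("The boy eats an apple")))
```

A real parser would need a much larger grammar and a strategy for ambiguity; this sketch only shows how rules of a formal grammar turn a string of words into constituents.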

Wolfram Alpha (Wolfram Co., 2016)
question: “what's the size of an apple?”
WA answer: apple | maximum recorded trunk diameter > (data not available)

What HAL 9000 should have known
Speech recognition / synthesis
Analysis/production of speech signal
Formants identification
Syllabification
Word segmentation
Prosodic contours
Natural language understanding / generation
●Morphology: dogs = dog + s
●Syntax: [[the boy] [eats [an apple]]]
●Semantics: what does a word mean? And a sentence?
●Pragmatics: Is there a communicative intention beyond the literal meaning?
●Discourse: Interpreting phrases across sentences
●Information Extraction / Retrieval
●Inference …

HAL 9000 is a fictional character and the main antagonist in Arthur C. Clarke's Space Odyssey series. First appearing in 2001: A Space Odyssey, HAL (Heuristically programmed ALgorithmic computer) is a sentient computer (or artificial general intelligence) that controls the systems of the Discovery One spacecraft and interacts with the ship's astronaut crew.

Wolfram Alpha (also styled WolframAlpha, and Wolfram|Alpha) is a computational knowledge engine, or answer engine, developed by Wolfram Alpha LLC. It is an online service that answers factual queries directly by computing the answer from externally sourced "curated data", rather than providing a list of documents or web pages that might contain the answer as a search engine might.

A speech analysis device transforms the audio signal into something the computer can understand: a high frequency indicates a high pitch. In the graph, dark areas represent the height of the pitch. The yellow line stands for the melody/intonation and the blue line represents the fundamental frequency, that is, the lowest frequency of that sound. HAL 9000 does not have the ability to process the audio input correctly. The difficult part in building language processing machines is providing them with an algorithm that enables them to understand audio signals.

A naive plot of NLP applications: what can we do now?
Word Processing
●Syllabification (extremely good): Linguistics > Lin‐guis‐tics
●Spell‐checking (good): houze > house (T9, Swipe…)
●Grammar‐checking (bad): John sing > John sings
●Stylistic correction (bad): the nail gets removed from the board > the nail is removed from the board
●Synonyms, opposites: thesaurus
●Single word translation
Human Computer Interaction
●Speech recognition (Apple Siri, Microsoft Cortana, Google Now): /kasa/ > casa
●Pseudo‐comprehension: Where could I find a Chinese restaurant nearby? > [opening map with precise]
●Natural language generation: [previous context] > I’m opening your calendar…
●Information filtering/retrieval/extraction: “these days China Stock Exchange collapsed” > entity: Beijing Stock Exchange; status: lowering; time: end of October 2015
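The spell-checking example above (houze > house) can be sketched with edit distance: the checker proposes the vocabulary word that needs the fewest single-character insertions, deletions or substitutions. This is only one possible approach, not necessarily what T9 or Swipe do, and the tiny vocabulary and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical mini-vocabulary; a real checker would use a full word list.
VOCABULARY = ["house", "horse", "mouse", "hose"]


def suggest(word, vocabulary=VOCABULARY):
    """Return the vocabulary word closest to `word` (ties broken alphabetically)."""
    return min(vocabulary, key=lambda w: (edit_distance(word, w), w))


print(suggest("houze"))
```

Practical spell-checkers usually combine such a distance with word frequencies and keyboard layout, so that likely typos rank higher.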

●Machine translation (sufficiently good, but…), Google Translator:
Tanto va la gatta al lardo che ci lascia lo zampino
  So the cat goes to the fat that it leaves its handle (April 2013)
  The pitcher goes so much that it leaves its handle (February 2014, October 2015)
  The pitcher goes so that it leaves its handle (October 2016)
  “the pitcher goes so often to the well that it leaves its handle”
it's raining cats and dogs
  piove cani e gatti (April 2013, February 2014)
  piove a catinelle (October 2015)
  piove a secchiate (October 2016)
la vecchia legge la regola
  the old law rule (April 2013, February 2014)

Representation of the linguistic problem
Competence (data‐structure)
●What kind of information structure do we need?
A word can start with wo... (word) but not with wb...
The s in “sings” is different from the one in “roses”
“The roses are beautiful” vs. *“the are beautiful roses”
“The cat chases the dog” > subj: cat (agent); verb: chase (action); obj: dog (patient)
“the television chases the cat”
“the houses” vs. “some house”
●Specification of primitives and features at every level: Phonemes – distinctive features…
Speech recognition

Phillips C. et al. (2000) Auditory Cortex Accesses Phonological Categories: An MEG Mismatch Study. Journal of Cognitive Neuroscience

Speech recognition
Representation of the linguistic problem
Competence (data‐structure)
●Primitives:
Phonemes – distinctive features
Morphemes – combinatorial rules
Words – significant morpheme bundles
Phrases – natural groups of words
Thematic roles – agent, patient ...
Processing (competence in use)
●Combinatorial principles: how do we use the primitive components of competence?
Phonological level – phonotactic restrictions
Morphological level – inflectional rules (say > says / said) and derivational rules (easy > easily)
●Processing is not just performance (that is, the use of competence under resource limitations)
A historical example: Panini's sutras for Sanskrit (ca. 600–400 BC): 1700 base elements divided into classes (e.g. nouns, verbs, etc.) plus combination rules (about 4000) > description of Sanskrit.

Speech recognition is the interdisciplinary sub-field of computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the linguistics, computer science, and electrical engineering fields.

CORPUS LINGUISTICS

Essential references
●Jurafsky, D. & Martin, J. H. (2000) Speech and Language Processing. Prentice‐Hall. http://www.cs.colorado.edu/~martin/slp.html (ch. 2, 3 and 4… not directly related to this class, but useful for the next two lectures)
Extended references
●Kennedy, G. Leech, G. & Short, M. (1998) An introduction to corpus linguistics. London: Longman
●Manning & Schütze (1999) Foundations of statistical natural language processing. MIT Press.
●Lazzari, Bianchi, Cadei, Chesi & Maffei (2010) Informatica umanistica. McGraw‐Hill (chapter 4) https://www.academia.edu/1836987/Informatica_umanistica
●Lenci, Montemagni & Pirrelli (2006) Testo e computer. Carocci

In linguistics, a CORPUS (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, corpora are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.

Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora. There are two main types of parallel corpora which contain texts in two languages. In a translation corpus, the texts in one language are translations of texts in the other language. In a comparable corpus, the texts are of the same kind and cover the same content, but they are not translations of each other. To exploit a parallel text, some kind of text alignment identifying equivalent text segments (phrases or sentences) is a prerequisite for analysis. Machine translation algorithms for translating between two languages are often trained using parallel fragments comprising a first language corpus and a second language corpus which is an element-for-element translation of the first language corpus.

In order to make corpora more useful for linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.

Some corpora have further structured levels of analysis applied. In particular, a number of smaller corpora may be fully parsed. Such corpora are usually called Treebanks or Parsed Corpora. The difficulty of ensuring that the entire corpus is completely and consistently annotated means that these corpora are usually smaller, containing around one to three million words. Other levels of linguistic structured analysis are possible, including annotations for morphology, semantics and pragmatics.

Corpora are the main knowledge base in corpus linguistics. The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for part-of-speech tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching. Corpora can be considered a type of foreign language writing aid: the contextualised grammatical knowledge that non-native language users acquire through exposure to authentic texts in corpora allows learners to grasp the manner of sentence formation in the target language, enabling effective writing.

Trends / Hot Topics
●Google Trends
●Twitter word cloud
●Big Data & Machine Learning: machine learning approaches to massive linguistic corpora, also known as “big data”

Google Trends is a website by Google that analyses the popularity of top search queries in Google Search across various regions and languages. The website uses graphs to compare the search volume of different queries over time. Given a certain phrase, Google Trends shows the trend of its usage. In this way, phrases keep shared knowledge available across time.

A tag cloud (word cloud, or weighted list in visual design) is a novelty visual representation of text data, typically used to depict keyword metadata (tags) on websites, or to visualize free form text. Tags are usually single words, and the importance of each tag is shown with font size or colour. This format is useful for quickly perceiving the most prominent terms and for locating a term alphabetically to determine its relative prominence. When used as website navigation aids, the terms are hyperlinked to items associated with the tag.

Big Data includes all the data related to our browsing history, which is analysed to find patterns. Big data is a term used to refer to the study and applications of data sets that are too complex for traditional data-processing application software to adequately deal with. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data source.

Machine learning is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to "learn" (e.g., progressively improve performance on a specific task) from data, without being explicitly programmed. Machine learning explores the study and construction of algorithms that can learn from and make predictions on data: such algorithms go beyond following strictly static program instructions by making data-driven predictions or decisions, building a model from sample inputs. Within the field of data analytics, machine learning is a method used to devise complex models and algorithms that lend themselves to prediction.

Corpus Linguistics (Bloomfield, Harris)

CORPORA: WHY WE NEED THEM
  • Linguistic documentation: ecological linguistic data sources
  • Creation of dictionaries and grammars
  • Language models based on frequencies and distributions
  • Linguistic benchmark (for NLP tools)

Corpora: classification…
  • Genericity: specialist (or vertical) vs. general (horizontal). A general corpus uses standard language and is directed to a general audience (e.g. newspapers); a specialist corpus uses specific language, understandable only by certain groups of people (e.g. laws)
  • Modality: written vs. spoken vs. mixed
  • Time: synchronic vs. diachronic
  • Language: mono vs. multilingual
  • Integrity: full texts vs. partial texts
  • Coding: level of annotation

…other properties
  • Extension: the bigger a corpus is, the better it is, «there is no data like more data» (Manning & Schütze 1999) … but focusing only on size does not always pay off (Leech 1991:10)
  • Representativeness: Web corpus… (Google battles… noise…)
  • Closed corpora, monitoring corpora

Corpus linguistics is the study of language as expressed in corpora (bodies) of "real world" text. The text-corpus method is a digestive approach that derives a set of abstract rules that govern a natural language from texts in that language and explores how that language relates to other languages. Originally derived manually, corpora are now automatically derived from source texts.
  • Corpus annotation makes information retrieval and extraction easier and faster, and enables human analysts to exploit and retrieve analyses of which they are not themselves capable
  • Annotated corpora are reusable resources
  • Annotated corpora are multifunctional - they can be annotated with a purpose and be reused with another
  • Corpus annotation records a linguistic analysis explicitly
  • Corpus annotation provides a standard reference resource, a stable base of linguistic analyses, so that successive studies can be compared and contrasted on a common basis

An index tells us where the words are located in a text; indexes are therefore very useful in philological studies for finding the context and the recurrence of a word. In 1946 Roberto Busa planned the INDEX THOMISTICUS as a tool for performing text searches within the massive corpus of Aquinas's works. In 1949 he met with Thomas J. Watson, the founder of IBM, and was able to persuade him to sponsor the Index Thomisticus. The project lasted about 30 years and eventually produced, in the 1970s, the 56 printed volumes of the Index Thomisticus, which is known to be the first great publishing project to be printed with the new technology.
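The idea of an index, recording where each word occurs in a text, can be sketched in a few lines of Python. This is an illustrative sketch only: the crude regular-expression tokenization, the function name and the sample sentence are assumptions, and it is not meant to reflect how the Index Thomisticus was actually built.

```python
import re
from collections import defaultdict


def build_index(text):
    """Map each lower-cased word form to the list of its token positions in the text."""
    index = defaultdict(list)
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        index[match.group()].append(position)
    return dict(index)


index = build_index("The cat chases the dog")
print(index["the"])   # positions of every occurrence of "the"
print(len(index))     # number of distinct word forms
```

With the positions in hand, both the recurrence of a word (the length of its list) and its contexts (the neighbouring positions) can be looked up directly.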

CORPORA are finite collections of linguistic information. Linguistic analyses encoded in the corpus data itself are usually called corpus annotation, defined by Leech (1997) as “The process of adding […] interpretive, linguistic information to an electronic corpus of spoken and/or written language data”.

For example, we may wish to annotate a corpus to show parts of speech, assigning to each word a grammatical category label. So when we see the word talk in the sentence I heard John's talk and it was the same old thing, we would assign it the category noun in that context. This would often be done using some mnemonic code or tag such as N.

While the phrase corpus annotation may be unfamiliar, the basic operation it describes is not: it is just like the analyses of data that have been done using hand, eye, and pen for decades. For example, in Chomsky (1965), 24 invented sentences are analysed; in the parsed version of LOB, a million words are annotated with parse trees. So corpus annotation is largely the process of recording common analysis in a systematic and accessible form.

There are two corpus typologies: annotated and unannotated corpora. An unannotated corpus is simple plain or raw text, where the linguistic information is implicit. An annotated text is no longer just a text: it is a real repository of linguistic information, which is therefore made explicit.

Annotation can be:

  • Automatic annotation
    ⎯ Can be automated reliably for some types (POS, lemmatization)
    ⎯ Can annotate large amounts of data quickly at low cost
    ⎯ Post-editing or human correction may be necessary to improve accuracy
  • Computer-assisted annotation
    ⎯ The semi-automatic annotation process (human-machine interface) may produce more reliable results than fully automated annotation, but it is also slower and more costly
  • Manual annotation
    ⎯ Occurs where no annotation tool is available or where the accuracy of available systems is not high enough to be useful
    ⎯ Expensive and time-consuming, typically only feasible for small corpora
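The workflow of automatic annotation followed by human post-editing can be illustrated with a deliberately naive Python sketch. The tagger simply looks up one default tag per word form, so it mistags talk as a verb in I heard John's talk; a human correction then restores the noun reading. The tag lexicon, tag names and function names are all hypothetical.

```python
# Hypothetical lexicon: one default tag per word form (context is ignored).
DEFAULT_TAGS = {"i": "PRON", "heard": "V", "john's": "N", "talk": "V"}


def auto_tag(tokens):
    """Automatic step: assign each token its default tag, or UNK if unknown."""
    return [(tok, DEFAULT_TAGS.get(tok.lower(), "UNK")) for tok in tokens]


def post_edit(tagged, corrections):
    """Manual step: override tags at the token positions given in `corrections`."""
    return [(tok, corrections.get(i, tag)) for i, (tok, tag) in enumerate(tagged)]


tokens = ["I", "heard", "John's", "talk"]
automatic = auto_tag(tokens)                 # "talk" is wrongly tagged V
corrected = post_edit(automatic, {3: "N"})   # the annotator fixes position 3
print(automatic)
print(corrected)
```

The sketch shows why post-editing is needed: a context-free lookup is fast and cheap but cannot resolve forms that belong to more than one category.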

How big should a corpus be to be considered representative?

For first-generation corpora, at least 1 million tokens (words)

Type A: face-to-face conversation (e.g. home-based conversations, workplace conversations, school conversations…)
Type B: bidirectional mediated conversation (telephone conversations…)
…
●Reference: De Mauro T., Mancini F., Vedovelli M., Voghera M. (1993) Lessico di frequenza dell'italiano parlato, Milano, Etas libri

Before using a corpus…
●Text Normalization
Il sig. P. Pallino rappresentato e difeso dall'avv. Mario Rossi, notifica, ex‐art. 150 C.P.C., agli eredi e/od aventi causa di Gianni Bianchi, nato a Castelnuovo V.C. (PI) il 1° aprile 1908 e deceduto in Sassari (SA) l’11 aprile 2008, presso il Tribunale di Blablabla Sez. Distaccata, l'atto di sostituzione della locuzione 'Figura in Catasto alla partita terreni 953, foglio XI, mappale 335, are 1,59' con la seguente locuzione: 'figura in Catasto alla partita terreni 953, fogli X‐XI, mappale 325, are 00,96'.
sig. > signore (or signora?)
●Tokenization
What’s a word/token? (spaces, punctuation, quotes, subscripts, numbers...)
●Lemmatization
bello for bello, belli, bella, belle ...

Using an (un‐)annotated corpus
Ambiguities: the case of the preposition “in” in Italian (http://www.treccani.it/)
KeyWord in Context (KWIC)
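The preprocessing steps listed above (tokenization, lemmatization) and a KeyWord in Context view can be sketched as follows. This is a minimal sketch under stated assumptions: the regular expression is only one crude answer to "what is a token?", the lemma table covers just the bello example, and the function names and sample sentences are hypothetical.

```python
import re

# Hypothetical lemma table covering only the example forms of "bello".
LEMMAS = {"belli": "bello", "bella": "bello", "belle": "bello"}


def tokenize(text):
    """Split text into word tokens (allowing internal apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


def lemmatize(token):
    """Return the lemma of a token; unknown forms are simply lower-cased."""
    return LEMMAS.get(token.lower(), token.lower())


def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for every occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


tokens = tokenize("The cat, the dog.")
for left, key, right in kwic(tokens, "the", window=1):
    print(f"{left:>5} | {key} | {right}")
print([lemmatize(t) for t in tokenize("belle case")])
```

Abbreviations such as sig. or avv. show why normalization has to come first: with this tokenizer the final dot would be split off as punctuation.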

Frequency lexicon (from «Lessico Elementare», Zanichelli, 1994) (F. = frequency, D. = dispersion)

Type/Token Ratio (TTR): richness of vocabulary, calculated by dividing the number of forms (types) by the number of occurrences (tokens). The value goes from 0 (low richness) to 1 (high form variety).

Trivia: Matt Daniels' hip-hop corpus
Designer, coder, and data scientist Matt Daniels sat down to figure out once and for all which rapper has the biggest vocabulary. He decided that looking at the first 35,000 words delivered by various artists was a good indicator of whose lexicon is indeed the broadest. Shakespeare’s vocabulary covers 28,829 words across his entire body of work, suggesting he knew over 100,000 words, whereas average people have a vocabulary of 5,000 words.
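As a worked illustration of the Type/Token Ratio and of a frequency list, the following Python sketch computes both for a short token sequence. The function names and the sample sentence are assumptions made for the example.

```python
from collections import Counter


def type_token_ratio(tokens):
    """TTR = number of distinct forms (types) / number of occurrences (tokens)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def frequency_list(tokens):
    """Return (form, frequency) pairs sorted from most to least frequent."""
    return Counter(tokens).most_common()


tokens = "the cat chases the dog".split()
print(type_token_ratio(tokens))   # 4 types / 5 tokens
print(frequency_list(tokens)[0])  # the most frequent form with its count
```

Because TTR tends to drop as texts get longer, comparisons are usually made on samples of equal size, which is consistent with the fixed 35,000-word window used in the hip-hop study.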

TANL (Text Analytics and Natural Language, Attardi and Simi 2009)
The Italian Tanl Server is a Web service for analyzing Italian texts using the Tanl pipeline. The pipeline can be run up to various stages:
●Tokenize: split the input into tokens
●POS: tag input with POS
●NER: extract Named Entities
●SST: extract SuperSense Entities
●Parser: produce parse trees

POS Tagging
XML annotation
Inclusion indicates constituents:
●Parentheses [[A] [B C]]
●HTML 123 Mario Rossi
●XML 123 Mario Rossi

Using annotated corpora
●Grammar extraction
●Benchmark for POS Tagging & Parsing tools
●Linguistic studies: frequencies of forms and syntactic patterns (retrieved/counted using specific queries)

Using semi‐structured corpora
CHILDES (MacWhinney & Snow, 1985) (Child Language Data Exchange System) is an archive of transcriptions of spontaneous speech between children and adults (each transcription is about 20‐60 minutes long). http://childes.psy.cmu.edu more than 130 corpora, 1500 published articles…
CHAT coding sample

In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc. Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags. POS-tagging algorithms fall into two distinctive groups: rule-based and stochastic.
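The principle that inclusion indicates constituents, whether written with brackets or with XML elements, can be sketched with Python's standard xml.etree.ElementTree module. The element names (S, NP, VP, w), the nested-list tree format and the example sentence are hypothetical choices for illustration; they do not reproduce the markup used in the original slides.

```python
import xml.etree.ElementTree as ET


def to_xml(tree):
    """Convert a nested list [label, child, ...] into an XML element.

    String children are words and become <w> elements (a hypothetical tag name);
    list children are sub-constituents and are converted recursively.
    """
    label, *children = tree
    element = ET.Element(label)
    for child in children:
        if isinstance(child, str):
            word = ET.SubElement(element, "w")
            word.text = child
        else:
            element.append(to_xml(child))
    return element


tree = ["S", ["NP", "the", "boy"], ["VP", "eats", ["NP", "an", "apple"]]]
root = to_xml(tree)
print(ET.tostring(root, encoding="unicode"))

# A simple query on the annotated structure: how many noun phrases are there?
print(len(root.findall(".//NP")))
```

Once constituents are encoded as nested elements, counting forms or syntactic patterns reduces to path queries such as the findall call above.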