




















Computational Linguistics: lecture notes (Professor Cristiano Chesi, 2018). University of Siena, Master's degree "Language and Mind - Linguistics and Cognitive Studies". Topics: Natural Language Processing (NLP), speech recognition, corpus linguistics, annotated and unannotated corpora. Includes lab instructions: Exploring Unannotated Corpora (Frequencies and RE).





















Computational Linguistics A.Y. 2018/19 - C. Chesi
INTRODUCTION TO NATURAL LANGUAGE PROCESSING (NLP) AND LINGUISTIC COMPUTATION
04.10.
General Information
Goals
● Deep understanding of what is needed to fully describe a natural language
● What a corpus is and how it can be used
● How linguistic data can be (semi)automatically processed
● Being independent in reading advanced papers in this field
Teaching
● Lectures (lecture notes and course information will be available at: http://www.ciscl.unisi.it/master/materials.htm)
● Labs
Evaluation
● Class participation (20% of final grade)
● Project presentation (40% of final grade)
● Oral exam (40% of final grade) on course topics (see References)
References
Essential
● Jurafsky, D. & Martin, J. H. (2009) Speech and Language Processing. Prentice-Hall. 2nd ed. http://www.cs.colorado.edu/~martin/slp.html chapters: 1, 2, 3, 4, 5, 12, 13, 14, 16, 17, 18, 19, 20
Extended
● Advanced readings will be presented at each lecture. These readings will not be included in the oral exam, but they should help you shape your project and better understand various aspects of NLP and LC
NLP: Natural Language Processing
You give a linguistic input and receive a reply (OK Google, Siri, etc.).
You find a specific pattern to extract a specific piece of information.
PARSING
Parsing, syntax analysis, or syntactic analysis is the process of analysing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech).
The term has slightly different meanings in different branches of linguistics and computer science. Traditional sentence parsing is often performed as a method of understanding the exact meaning of a sentence or word, sometimes with the aid of devices such as sentence diagrams. It usually emphasizes the importance of grammatical divisions such as subject and predicate.
Within computational linguistics the term refers to the formal analysis by a computer of a sentence or other string of words into its constituents, resulting in a parse tree that shows their syntactic relation to each other and may also contain semantic and other information.
The term is also used in psycholinguistics when describing language comprehension. In this context, parsing refers to the way that human beings analyze a sentence or phrase (in spoken language or text) "in terms of grammatical constituents, identifying the parts of speech, syntactic relations, etc." This term is especially common when discussing which linguistic cues help speakers interpret garden-path sentences.
Within computer science, the term is used in the analysis of computer languages, referring to the syntactic analysis of the input code into its component parts in order to facilitate the writing of compilers and interpreters. The term may also be used to describe a split or separation.
The goal: writing a grammar that is completely understandable by a computer (COMPETENCE MODELS such as the X’ module).
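To make the idea of "analysing a string of words into its constituents" concrete, here is a minimal sketch of a recursive-descent parser. The three-rule grammar (S -> NP VP, NP -> Det N, VP -> V NP), the five-word lexicon and the function names are assumptions made for illustration; they are not part of the course material, and a real grammar would need far more rules.

```python
# Toy lexicon: word -> category (an assumption for illustration only).
LEXICON = {
    "the": "Det", "an": "Det",
    "boy": "N", "apple": "N",
    "eats": "V",
}


def parse_sentence(words):
    """Parse with S -> NP VP, NP -> Det N, VP -> V NP; return a nested tuple tree."""
    pos = 0  # index of the next word to consume

    def expect(category):
        # Consume one word of the given category or fail.
        nonlocal pos
        if pos < len(words) and LEXICON.get(words[pos]) == category:
            leaf = (category, words[pos])
            pos += 1
            return leaf
        found = words[pos] if pos < len(words) else "end of input"
        raise ValueError(f"expected {category}, found {found!r}")

    def noun_phrase():
        return ("NP", expect("Det"), expect("N"))

    def verb_phrase():
        return ("VP", expect("V"), noun_phrase())

    tree = ("S", noun_phrase(), verb_phrase())
    if pos != len(words):
        raise ValueError(f"unparsed material: {words[pos:]}")
    return tree


def brackets(tree):
    """Render a tree as unlabeled brackets."""
    if isinstance(tree[1], str):  # leaf: (category, word)
        return tree[1]
    return "[" + " ".join(brackets(child) for child in tree[1:]) + "]"


tree = parse_sentence("the boy eats an apple".split())
print(brackets(tree))  # [[the boy] [eats [an apple]]]
```

The parser rejects any word sequence that the grammar does not license, which is the sense in which a grammar must be spelled out completely before a computer can use it.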
Wolfram Alpha (Wolfram Co., 2016)
question: “what's the size of an apple?”
WA answer: apple | maximum recorded trunk diameter > (data not available)
What HAL 9000 should have known
Speech recognition / synthesis
Analysis/production of speech signal
Formants identification
Syllabification
Word segmentation
Prosodic contours
Natural language understanding / generation
● Morphology: dogs = dog + s
● Syntax: [[the boy] [eats [an apple]]]
● Semantics: what does a word mean? And a sentence?
● Pragmatics: is there a communicative intention beyond the literal meaning?
● Discourse: interpreting phrases across sentences
● Information Extraction / Retrieval
● Inference …
HAL 9000 is a fictional character and the main antagonist in Arthur C. Clarke's Space Odyssey series. First appearing in 2001: A Space Odyssey, HAL (Heuristically programmed ALgorithmic computer) is a sentient computer (or artificial general intelligence) that controls the systems of the Discovery One spacecraft and interacts with the ship's astronaut crew.
Wolfram Alpha (also styled WolframAlpha and Wolfram|Alpha) is a computational knowledge engine, or answer engine, developed by Wolfram Alpha LLC. It is an online service that answers factual queries directly by computing the answer from externally sourced "curated data", rather than providing a list of documents or web pages that might contain the answer, as a search engine might.
A speech recognition device transforms audio signals into something the computer can understand: a high frequency indicates a high pitch. In the accompanying graph, dark areas represent the height of the pitch, the yellow line stands for the melody/intonation, and the blue line represents the fundamental frequency, that is, the lowest frequency for that sound. HAL 9000 lacks the ability to process the audio input correctly. The difficult part in building language processing machines is providing them with an algorithm that enables them to understand audio signals.
A naïve plot of NLP applications
What can we do now?
Word Processing
● Syllabification (extremely good): Linguistics > Lin-guis-tics
● Spell-checking (good): houze > house (T9, Swipe…)
● Grammar-checking (bad): John sing > John sings
● Stylistic correction (bad): the nail gets removed from the board > the nail is removed from the board
● Synonyms, opposites: thesaurus
● Single word translation
Human Computer Interaction
● Speech recognition (Apple Siri, Microsoft Cortana, Google Now): /kasa/ > casa
● Pseudo-comprehension: Where could I find a Chinese restaurant nearby? > [opening map with precise]
● Natural language generation: [previous context] > I’m opening your calendar…
● Information filtering/retrieval/extraction: “these days China Stock Exchange collapsed” > entity: Beijing Stock Exchange; status: lowering; time: end of October 2015
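The spell-checking case (houze > house) can be sketched with edit distance: among the words of a dictionary, suggest the one that needs the fewest single-character changes. This is only one possible technique, not necessarily what T9 or Swipe use; the tiny word list and the function names below are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


# Hypothetical mini-dictionary; a real spell-checker would use a full lexicon.
WORDLIST = ["house", "horse", "mouse", "hose", "home"]


def suggest(word, wordlist=WORDLIST):
    """Return the dictionary word closest to the input."""
    return min(wordlist, key=lambda candidate: edit_distance(word, candidate))


print(suggest("houze"))  # house
```

Grammar-checking is harder than this because the error in "John sing" is not visible in any single word: "sing" is a perfectly good dictionary entry.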
Representation of the linguistic problem
Competence (data-structure)
● What kind of information structure do we need?
A word can start with wo... (word) but not with wb...
The s in “sings” is different from the one in “roses”
“The roses are beautiful” vs. *“the are beautiful roses”
“The cat chases the dog” > subj: cat (agent); verb: chase (action); obj: dog (patient)
“the television chases the cat”
“the houses” vs. “some house”
● Specification of primitives and features at every level: Phonemes – distinctive features…
Speech recognition
Phillips C. et al. (2000) Auditory Cortex Accesses Phonological Categories: An MEG Mismatch Study. Journal of Cognitive Neuroscience
Speech recognition
Representation of the linguistic problem
Competence (data-structure)
● Primitives:
Phonemes – distinctive features
Morphemes – combinatorial rules
Words – meaningful bundles of morphemes
Phrases – natural groups of words
Thematic roles – agent, patient ...
Processing (competence in use)
● Combinatorial principles: how do we use the primitive components of competence?
Phonological level – phonotactic restrictions
Morphological level – inflectional (say > says / said) and derivational rules (easy > easily)
● Processing is not just performance (that is, the use of competence under resource limitations)
A historical example: Panini's sutras for Sanskrit (400..600 BC): 1700 base elements divided into classes (e.g. nouns, verbs, etc.)
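The split between primitives (stored data) and combinatorial rules (how the data is used) can be sketched for the morphological level. The rules below are deliberately partial approximations of English, and the exception table and function names are assumptions introduced for this example only.

```python
# Stored primitives: irregular forms must be listed (toy exception list).
IRREGULAR_PAST = {"say": "said"}


def third_person_singular(verb):
    """Inflectional rule: say -> says. Partial: ignores cases such as try -> tries."""
    if verb.endswith(("s", "sh", "ch", "x", "z")):
        return verb + "es"
    return verb + "s"


def past_tense(verb):
    """Inflectional rule with lexical exceptions: say -> said, walk -> walked."""
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    return verb + "d" if verb.endswith("e") else verb + "ed"


def adverb(adjective):
    """Derivational rule: easy -> easily, quick -> quickly."""
    if adjective.endswith("y"):
        return adjective[:-1] + "ily"
    return adjective + "ly"


print(third_person_singular("say"), past_tense("say"), adverb("easy"))
# says said easily
```

Even this toy version shows why both components are needed: "said" cannot be derived by rule and has to be stored, while "says" and "easily" can be computed.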
Essential references
● Jurafsky, D. & Martin, J. H. (2000) Speech and Language Processing. Prentice-Hall. http://www.cs.colorado.edu/~martin/slp.html (ch. 2, 3 and 4… not directly related to this class, but useful for the next two lectures)
Extended references
● Kennedy, G., Leech, G. & Short, M. (1998) An introduction to corpus linguistics. London: Longman
● Manning & Schütze (1999) Foundations of statistical natural language processing. MIT Press.
● Lazzari, Bianchi, Cadei, Chesi & Maffei (2010) Informatica umanistica. McGraw-Hill (chapter 4) https://www.academia.edu/1836987/Informatica_umanistica
● Lenci, Montemagni & Pirrelli (2006) Testo e computer. Carocci
In linguistics, a CORPUS (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.
Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora. There are two main types of parallel corpora containing texts in two languages. In a translation corpus, the texts in one language are translations of texts in the other language. In a comparable corpus, the texts are of the same kind and cover the same content, but they are not translations of each other. To exploit a parallel text, some kind of text alignment identifying equivalent text segments (phrases or sentences) is a prerequisite for analysis. Machine translation algorithms for translating between two languages are often trained on parallel fragments comprising a first-language corpus and a second-language corpus that is an element-for-element translation of the first.
To make corpora more useful for linguistic research, they are often subjected to a process known as annotation.
An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.
Some corpora have further structured levels of analysis applied. In particular, a number of smaller corpora may be fully parsed. Such corpora are usually called Treebanks or Parsed Corpora. The difficulty of ensuring that the entire corpus is completely and consistently annotated means that these corpora are usually smaller, containing around one to three million words. Other levels of structured linguistic analysis are possible, including annotations for morphology, semantics and pragmatics.
Corpora are the main knowledge base in corpus linguistics. The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for part-of-speech tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching. Corpora can also be considered a type of foreign language writing aid: the contextualised grammatical knowledge that non-native users acquire through exposure to authentic texts allows learners to grasp how sentences are formed in the target language, enabling effective writing.
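As a rough illustration of what POS and lemma annotation adds to raw text, the sketch below stores one hand-annotated sentence as a list of records and then queries it. The simplified tag names (DET, NOUN, VERB), the record layout and the helper functions are assumptions for this example and do not correspond to any specific corpus format.

```python
from collections import namedtuple

# One record per token: surface form, part-of-speech tag, lemma.
Token = namedtuple("Token", ["form", "pos", "lemma"])

# Hand-made annotation of a single sentence (illustrative tagset).
annotated = [
    Token("The", "DET", "the"),
    Token("boy", "NOUN", "boy"),
    Token("eats", "VERB", "eat"),
    Token("an", "DET", "a"),
    Token("apple", "NOUN", "apple"),
]


def to_tagged_text(tokens):
    """Serialize annotation inline, in the common word/TAG style."""
    return " ".join(f"{t.form}/{t.pos}" for t in tokens)


def lemmas_with_pos(tokens, pos):
    """Query that is only possible because the information is explicit."""
    return [t.lemma for t in tokens if t.pos == pos]


print(to_tagged_text(annotated))
# The/DET boy/NOUN eats/VERB an/DET apple/NOUN
print(lemmas_with_pos(annotated, "NOUN"))
# ['boy', 'apple']
```

On the raw string "The boy eats an apple" the second query would require an analysis step first; in the annotated version the answer is a simple lookup.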
Trends
Hot Topics
Google Trends
Twitter word cloud
Big Data & Machine Learning
Machine learning approaches to massive linguistic corpora, also known as “big data”
Google Trends is a website by Google that analyses the popularity of top search queries in Google Search across various regions and languages. The website uses graphs to compare the search volume of different queries over time. Given a phrase, Google Trends shows how its usage changes over time. In this sense, search phrases record knowledge that is shared and available across time.
A tag cloud (word cloud, or weighted list in visual design) is a novelty visual representation of text data, typically used to depict keyword metadata (tags) on websites, or to visualize free-form text. Tags are usually single words, and the importance of each tag is shown with font size or colour. This format is useful for quickly perceiving the most prominent terms and for locating a term alphabetically to determine its relative prominence. When used as website navigation aids, the terms are hyperlinked to items associated with the tag.
Informally, big data is all the data related to our browsing history, which is mined to find patterns. More precisely, big data is a term used to refer to the study and applications of data sets that are too complex for traditional data-processing application software to deal with adequately. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data source.
Machine learning is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to "learn" (e.g., progressively improve performance on a specific task) from data, without being explicitly programmed.
Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms go beyond strictly static program instructions by making data-driven predictions or decisions, building a model from sample inputs. Within the field of data analytics, machine learning is a method used to devise complex models and algorithms that lend themselves to prediction.
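A very small sketch of "building a model from sample inputs": the program below is never told which tag a word should receive; it counts what it sees in a tiny tagged sample and predicts the most frequent tag for each word. The training data, the tag names and the default tag for unknown words are all invented for illustration, and this baseline is far simpler than the statistical models used in practice.

```python
from collections import Counter, defaultdict

# Invented sample of (word, tag) pairs standing in for an annotated corpus.
TRAINING = [
    ("the", "DET"), ("talk", "NOUN"), ("was", "VERB"), ("long", "ADJ"),
    ("they", "PRON"), ("talk", "VERB"), ("a", "DET"), ("talk", "NOUN"),
]


def train(tagged_words):
    """Learn, for every word, the tag it carries most often in the sample."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def predict(model, words, default="NOUN"):
    """Tag new words; unseen words fall back to an assumed default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]


model = train(TRAINING)
print(predict(model, ["The", "talk", "ended"]))
# [('The', 'DET'), ('talk', 'NOUN'), ('ended', 'NOUN')]
```

Changing the sample changes the predictions without touching the code, which is the sense in which the behaviour is data-driven rather than explicitly programmed. The wrong tag for "ended" also shows the limit of a model that has seen too little data.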
Corpus Linguistics (Bloomfield, Harris) CORPORA: WHY WE NEED THEM
Corpora are finite collections of linguistic information. Linguistic analyses encoded in the corpus data itself are usually called corpus annotation, defined by Leech (1997) as “The process of adding […] interpretive, linguistic information to an electronic corpus of spoken and/or written language data”.
For example, we may wish to annotate a corpus to show parts of speech, assigning to each word a grammatical category label. So when we see the word talk in the sentence I heard John's talk and it was the same old thing, we would assign it the category noun in that context. This would often be done using some mnemonic code or tag such as N.
While the phrase corpus annotation may be unfamiliar, the basic operation it describes is not: it is just like the analyses of data that have been done using hand, eye, and pen for decades. For example, in Chomsky (1965), 24 invented sentences are analysed; in the parsed version of LOB, a million words are annotated with parse trees. So corpus annotation is largely the process of recording common analysis in a systematic and accessible form.
There are two types of corpus: annotated and unannotated. An unannotated corpus is plain or raw text, where the linguistic information is implicit. An annotated text is no longer just a text: it is a real repository of linguistic information, which is made explicit. Annotation can be:
For first-generation corpora: at least 1 million tokens (words)
Type A: face-to-face conversation (e.g. home-based conversations, workplace conversations, school conversations…)
Type B: bidirectional mediated conversation (telephone conversations…)
…
● Reference: De Mauro, T., Mancini, F., Vedovelli, M., Voghera, M. (1993) Lessico di frequenza dell'italiano parlato. Milano: Etas Libri
Before using a corpus…
● Text Normalization
Example (Italian legal notice):
Il sig. P. Pallino rappresentato e difeso dall'avv. Mario Rossi, notifica, ex‐ art.150 C.P.C., aglieredie/od aventicausa di Gianni Bianchi, natoa CastelnuovoV.C. (PI) il 1°aprile1908 e deceduto in Sassari (SA) l’11 aprile2008, presso il Tribunale di Blablabla Sez. Distaccata, l'atto di sostituzione della locuzione' Figura in Catasto allapartita terreni953, foglioXI, mappale335, are 1,59' con la seguente locuzione: 'figurain Catastoallapartita terreni953, fogliX‐XI, mappale325, are 00,96'.
sig. > signore (or signora?)
● Tokenization
What’s a word/token? (spaces, punctuation, quotes, subscripts, numbers...)
● Lemmatization
Bello for bello, belli, bella, belle ...
Using an (un-)annotated corpus
Ambiguities: the case of the preposition “in” in Italian (http://www.treccani.it/)
KeyWord in Context (KWIC)
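Tokenization and KWIC can both be sketched with a regular expression and a few lines of Python. The token pattern below is one possible answer to "what's a word/token?" (it keeps apostrophe-joined forms together and treats each punctuation mark as its own token); the pattern, the window size and the function names are assumptions for illustration, not the lab's prescribed settings.

```python
import re

# A token is a run of word characters, optionally joined by apostrophes,
# or a single non-space, non-word character (punctuation).
TOKEN_PATTERN = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")


def tokenize(text):
    """Split raw text into tokens according to TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)


def kwic(tokens, keyword, window=3):
    """KeyWord in Context: each hit with `window` tokens on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


tokens = tokenize("The cat chases the dog.")
print(tokens)
# ['The', 'cat', 'chases', 'the', 'dog', '.']
for line in kwic(tokens, "the", window=2):
    print(line)
# [The] cat chases
# cat chases [the] dog .
```

Different choices in the pattern (for example splitting at the apostrophe, or keeping abbreviations such as "sig." together with their full stop) produce different token counts, so the tokenization policy should be fixed before any frequency is computed.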
Frequency lexicon (from «Lessico Elementare», Zanichelli, 1994) (F. = frequency, D. = dispersion)
Type/Token Ratio (TTR): richness of vocabulary, calculated by dividing forms (types) by occurrences (tokens). The value ranges from 0 (low richness) to 1 (high form variety).
Trivia: Matt Daniels' hip-hop corpus
Designer, coder, and data scientist Matt Daniels set out to figure out once and for all which rapper has the biggest vocabulary. He decided that looking at the first 35,000 words delivered by various artists was a good indicator of whose lexicon is indeed the broadest. Shakespeare’s vocabulary covers 28,829 words across his entire body of work, suggesting he knew over 100,000 words, whereas average people have a vocabulary of 5,000 words.
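The Type/Token Ratio and a basic frequency list can be computed in a few lines. This is a minimal sketch that assumes the text has already been tokenized and that case should be ignored; both choices, like the function names, are assumptions for the example.

```python
from collections import Counter


def type_token_ratio(tokens):
    """TTR = number of distinct forms (types) / number of occurrences (tokens)."""
    if not tokens:
        raise ValueError("empty token list")
    normalized = [t.lower() for t in tokens]
    return len(set(normalized)) / len(normalized)


def frequency_list(tokens):
    """Forms sorted by decreasing frequency, as (form, count) pairs."""
    return Counter(t.lower() for t in tokens).most_common()


tokens = "the cat chases the dog".split()
print(type_token_ratio(tokens))    # 0.8  (4 types / 5 tokens)
print(frequency_list(tokens)[0])   # ('the', 2)
```

Because TTR tends to fall as a text gets longer (frequent words keep repeating), comparisons are only meaningful between samples of the same size, which is consistent with the choice of a fixed 35,000-word sample per artist in the hip-hop comparison.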
TANL (Text Analytics and Natural Language, Attardi and Simi 2009)
The Italian Tanl Server is a Web service for analyzing Italian texts using the Tanl pipeline. The pipeline can be run up to various stages:
Tokenize: split the input into tokens
POS: tag input with POS
NER: extract Named Entities
SST: extract SuperSense Entities
Parser: produce parse trees
POS Tagging
XML annotation
Inclusion indicates constituents:
● Parentheses: [[A] [B C]]
● HTML
123 Mario Rossi
●XML
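The principle that inclusion indicates constituents carries over directly from brackets to XML: a nested bracket becomes a nested element. The sketch below builds such an annotation with Python's standard xml.etree.ElementTree module; the element names (S, NP, VP, Det, N, V) and the example sentence are assumptions for illustration and are not taken from a specific annotation scheme.

```python
import xml.etree.ElementTree as ET


def to_xml(node):
    """Convert a (label, child, ...) tuple into an XML element.

    A child is either a word (str) or another (label, child, ...) tuple.
    """
    label, *children = node
    element = ET.Element(label)
    for child in children:
        if isinstance(child, str):
            element.text = child          # terminal: the word itself
        else:
            element.append(to_xml(child)) # non-terminal: nested constituent
    return element


# Same structure as the bracketing [[the boy] [eats [an apple]]], with labels.
tree = ("S",
        ("NP", ("Det", "the"), ("N", "boy")),
        ("VP", ("V", "eats"), ("NP", ("Det", "an"), ("N", "apple"))))

xml_text = ET.tostring(to_xml(tree), encoding="unicode")
print(xml_text)
# <S><NP><Det>the</Det><N>boy</N></NP><VP><V>eats</V><NP><Det>an</Det><N>apple</N></NP></VP></S>
```

Compared with plain parentheses, the XML version names every constituent and can carry further attributes, at the cost of being more verbose.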