





































Digital Linguistics notes, academic year 2024-2025






































What is linguistics?
Linguistics is the scientific study of language and communication. A science of language is possible, but
- In what respects is linguistics scientific?
The possibility of scientific understanding depends largely on the complexity and regularity of the object of study. Human behavior → complex and not regular; nevertheless, language contrasts with other aspects of human behavior precisely in its regularity.
Linguistics deals with:
● the study of particular languages
● the search for general properties common to all languages or large groups of languages
Primary goal: to understand the nature of Language in general
What is linguistics NOT?
Linguistics is the study of language in general, not of individual languages. → Some linguists are polyglots; most aren’t. However, they are familiar with more than one language.
Modern linguistics is NOT the study of how to speak properly
● Linguistics is a descriptive discipline → errors may reveal interesting aspects of language.
→ Grammatical vs ungrammatical → rules
→ Acceptable vs unacceptable → personal judgments (depending on sociolinguistic variables):
● He doesn’t know anything vs He don’t know nothing
● Se lo avessi saputo, non sarei venuto vs Se lo sapevo, non venivo ('If I had known, I would not have come')
● Non c’è niente di cui ho bisogno vs Non c’è niente che ho bisogno ('There is nothing I need') (depending on language structure)
Interdisciplinary branches of Linguistics
Computational linguistics as an interdisciplinary field
● Theoretical and applied computer science
● Statistics
● Engineering (e.g. language engineering)
● Linguistics, cognitive sciences, psychology
● Structure, functioning and use of language and cognitive faculties
Theoretical and applied goals
- PSYCHOLINGUISTICS (What happens in people’s heads as they use language)
- NEUROLINGUISTICS (How language is processed and stored in the brain)
- SOCIOLINGUISTICS (How languages vary socially)
Fields
1. Natural Language Processing (NLP) → creation of models and algorithms for understanding and generating natural language texts. «The term Natural Language Processing describes research on automatic processing of human language for practical applications» → processing = comprehension of written and oral texts
Computational linguists → have been concerned with developing procedures for handling a useful range of natural language input.
ORIGINS OF CL
1. 1950-60 → first applications of the computer to the study of philosophical and literary texts
Origins of Corpus Linguistics
● In England
● Parallel to the spread of Generative Grammar
● Linguistic investigation based on the collection and analysis of corpora = collections of large quantities of texts belonging to a certain language variety
Armchair linguist He sits in a deep soft comfortable chair, with his eyes closed and his hands clasped behind his head. Once in a while he opens his eyes […] shouting, ‘Wow, what a neat fact!’ […] and writes something down. Then he paces around for a few hours in the excitement of having come still closer to knowing what language is really like.
These two don’t speak to each other very often, but when they do, the corpus linguist says to the armchair linguist, ‘Why should I think that what you tell me is true?’, and the armchair linguist says to the corpus linguist, ‘Why should I think that what you tell me is interesting?’
However, corpus linguistics is more than just a methodology:
Corpus linguistics as empirical linguistics (Sampson 2001: 6) ● quantitative and statistical analysis tools to explore linguistic regularities emerging from texts
The benefits of the use of computers in corpus linguistics
Corpus linguists have become more theoretically sophisticated and aware of the pros and cons of corpora.
● new focus on statistical data
● new focus on the role of probabilistic models
In cognitive science:
● the human mind is not only capable of processing rules, but also of keeping track of statistical regularities
Computational linguistics in Italy Antonio Zampolli (1937-2003) → founded and directed the Institute of Computational Linguistics of the National Research Council (CNR)
The aim of NLP → to develop computational models that try to solve linguistic tasks
Tasks → concern all possible levels of linguistic analysis
• E.g. tasks dealing with semantics → word sense disambiguation
The text = complex structure
What is a text for a computer?
What can we encode?
E.g. encoding at the morpho-syntactic level
How is the content represented?
• Digital formats
→ Plain text (high portability)
→ Doc and PDF (with formatting instructions): maximum format expressiveness but minimum portability
→ Mark-up languages: XML (eXtensible Mark-up Language)
Mark-up is used to highlight and make explicit the relationships between the text and its parts, and the interpretations associated with them (e.g. linguistic element + tag).
A text encoded with a mark-up language is still in text-only format: structural information is represented by adding mark-up labels (or tags) to the text → the most appropriate digital encoding of textual materials for CL
XML → is NOT a programming language. It is used to create customised mark-up languages
Structure:
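The "linguistic element + tag" idea can be illustrated with a minimal sketch. The element names `s` and `w` and the attributes `pos` and `lemma` are hypothetical choices made for this example, not a standard annotation scheme; the parsing uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical mark-up: <s> = sentence, <w> = word,
# pos / lemma = interpretations attached to each word.
xml_text = """
<s>
  <w pos="DET" lemma="my">My</w>
  <w pos="NOUN" lemma="dog">dog</w>
  <w pos="VERB" lemma="love">loves</w>
  <w pos="NOUN" lemma="cat">cats</w>
</s>
""".strip()

root = ET.fromstring(xml_text)

# Each <w> element pairs a linguistic unit (its text) with its tags (its attributes).
annotated = [(w.text, w.get("pos"), w.get("lemma")) for w in root.iter("w")]

# Dropping the tags gives back the plain text: the mark-up adds
# information without changing the text-only nature of the file.
plain = " ".join(w.text for w in root.iter("w"))

print(annotated)
print(plain)
```

Because the annotation lives in tags rather than in a proprietary format, the same file can be read by any tool that understands XML.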
The text
What is a word for a computer program?
Corpus Linguistics → a word is a sequence of letters
«My dog loves cats and cats love my dog»
my | dog | loves | cats | and | cats | love | my | dog
There are 9 words:
TOKEN = an individual occurrence of a linguistic unit. Any single linguistic unit, most often a word → words between white spaces or symbols
NB: A single word can be split into more than one token, e.g. he’s (he + ’s) = 2 tokens
Repeated words (words that occur more than once in the corpus) are counted each time they occur, e.g. there are 9 tokens in the sentence “my dog loves cats and cats love my dog”.
→ The total number of tokens can be used as an estimate of corpus size
AUTOMATIC TOKENIZATION (word segmentation)
Tokenization is the process of detecting words and separating punctuation from written words: the automatic process of converting a text into a list of separate tokens.
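A minimal sketch of such a tokenizer, assuming a simple regular-expression rule (real tokenizers handle many more cases, such as abbreviations, numbers and hyphenated words). The function name `tokenize` and the pattern are illustrative choices, not the procedure used by any specific tool.

```python
import re

def tokenize(text):
    """Split a text into word, clitic and punctuation tokens (simplified rule)."""
    # Three alternatives, tried in order at each position:
    #   [’']\w+   a clitic introduced by an apostrophe (’s, 're, ...)
    #   \w+       a run of letters or digits
    #   [^\w\s]   any single punctuation mark
    pattern = r"[’']\w+|\w+|[^\w\s]"
    return re.findall(pattern, text)

sentence_tokens = tokenize("My dog loves cats and cats love my dog")
clitic_tokens = tokenize("He's here.")

print(len(sentence_tokens), sentence_tokens)
print(clitic_tokens)
```

With this rule the example sentence yields 9 tokens, and "He's" is split into two tokens, in line with the he + ’s example above.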
TYPE = the abstract class of which tokens are members. The number of types is the number of distinct words, not counting repetitions (occurrences of a word are grouped together as representatives of a single type).
While the number of tokens in a corpus refers to the total number of words, the number of types refers to the total number of unique words.
E.g. the verb form eats may occur 177 times in a corpus (177 tokens), but it counts as a single type.
How many words in a corpus? The number of tokens in the corpus is an estimate of overall corpus size (= the corpus dimension). The number of types is an estimate of vocabulary size → The vocabulary size gives an idea of the lexical richness (=the lexical variety) of the corpus
Type/token ratio → the number of types (unique words) in a text, divided by the number of tokens (total number of words) and expressed as a percentage.
→ A way of measuring the amount of variation in the vocabulary of the corpus (= “the rate at which new types are introduced”)
• A high type/token ratio → the text is lexically diverse
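The type/token ratio can be worked through on the example sentence used earlier. This is a minimal sketch: it assumes whitespace tokenization and treats word forms as case-insensitive, so that "loves" and "love" still count as two different types. The function name `type_token_ratio` is an illustrative choice.

```python
def type_token_ratio(tokens):
    """Return (number of types, number of tokens, ratio as a percentage)."""
    if not tokens:
        return 0, 0, 0.0
    # A type is a distinct word form; lower-casing merges "My" and "my".
    types = {token.lower() for token in tokens}
    return len(types), len(tokens), 100 * len(types) / len(tokens)

tokens = "my dog loves cats and cats love my dog".split()
n_types, n_tokens, ttr = type_token_ratio(tokens)

print(n_types, n_tokens, round(ttr, 1))
```

Here 6 types over 9 tokens give a ratio of about 66.7%. Since longer texts repeat words more often, ratios are usually compared only between samples of similar size.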
The skills of the computational linguist The computational linguist's skill set includes:
Our goals → collection of linguistic data and representation of the data for automatic processing
The source
How to identify the sources from which to draw the necessary data for the analysis? Two main sources of evidence:
● the texts of speakers of a language → highly structured texts, transcripts of spontaneous conversations, encyclopaedias, text messages
● the speakers themselves → judgements on expressions, the time it takes to recognise words on a screen, expressions used to describe a scene
The context Based on the «naturalness» of the context in which the data are collected:
The corpus
Text corpora represent the main (though not exclusive) source of data in CL. The spread of corpora has been fostered by:
data are generally distinguished according to:
● Monolingual corpora (BNC, Brown Corpus): contain texts in a single language.
● Bilingual or multilingual corpora (C-ORAL-ROM): the correspondence among participating languages is at the level of design rather than at the level of the choice of actual texts. The same quantities of the same types of texts are assembled, but they are not translations → the number of tokens is the most important feature we must consider.
Parallel corpora (L1-L2) → contain the original text alongside its specific translation (source texts and target texts), e.g. Europarl, Canadian Hansard Corpus. In a parallel corpus we can see, for example, that there are several translations of the word field → campo, settore, in merito a. → This exemplifies that languages differ in terms of the situations in which words are used.
Multilingual corpora, collected according to the same theme: contain texts produced in different languages, but no specific translation is present.
2. MEDIUM
- Written (La Repubblica, Brown Corpus): can contain different kinds of data. The Brown Corpus contains specific text categories, depending on the general theme: press, religion, popular lore, belles lettres, biography, memoirs → each of them plays a dominant role
- Spoken (LIP, CHILDES → includes texts produced by children): spoken data are difficult to collect because of technical complications, and recording and transcribing them is very time-consuming. Speech is often under-represented in general corpora, even though most people speak more than they write and listen to others talking more than they read
- Mixed (BNC → includes spoken and written texts): a balance between spoken and written language, in order to be more representative of language in general.
The British National Corpus (BNC) is the most important mixed corpus. It started in 1991 and was run by Oxford University Press. It is a 100-million-word corpus, a balanced representation of BrE in the 1990s → it does not contain words created after this time: if we search for the word “google” we won't find it. Among the written texts, 25% are fiction and 75% non-fiction (books, periodicals, brochures and unpublished material). Spoken component: it …
→ Additional information = high-level encoding
Data = raw content
Information = data + interpretation (structure)
Knowledge = information + theory
A corpus mainly contains three types of information that can help in investigating the data:
FEATURES OF TODAY'S CORPORA
- Precision → has to do with the accuracy of positive predictions = how many of the predicted tags are correct? → e.g. 100 words are tagged as verbs but only 90 of them are actually verbs → precision = 90/100 = 90%
- Recall → how many of the relevant tokens in the corpus were correctly identified? → e.g. there are 120 verbs in the corpus and the software correctly identifies only 90 of them → recall = 90/120 = 75%
PoS tagging presents some problems:
Automatic taggers often claim a success rate of 97-98%, and a 100% success rate with common unambiguous words like the and a.
If you are working with an automatically tagged corpus which has not been manually post-edited, you must expect to find quite a few wrongly tagged items in the output, e.g. searching for the noun fine in COCA returned fifty hits. They included:
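The precision and recall arithmetic used to evaluate tagger output can be sketched as follows. The figures are the ones from the verb example in these notes; the function and parameter names are illustrative.

```python
def precision(correct, predicted):
    """Share of the items tagged by the software that are actually right."""
    return correct / predicted

def recall(correct, relevant):
    """Share of the relevant items in the corpus that the software found."""
    return correct / relevant

# 100 words tagged as verbs, 90 of them really are verbs.
p = precision(correct=90, predicted=100)

# 120 verbs exist in the corpus, 90 of them were identified.
r = recall(correct=90, relevant=120)

print(f"precision = {p:.0%}, recall = {r:.0%}")
```

The two measures answer different questions: precision penalises wrong tags, while recall penalises missed items, so a tagger can score well on one and poorly on the other.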
B. LEMMATISATION
A specific type of annotation involving the reduction of the words in a corpus to their respective lexemes. It allows the researcher to extract and examine all the variants of a particular lexeme without having to input all of them, and to produce frequency and distribution information for the lexeme.
The lemma is written in small capitals, e.g. book/books → Lemma: BOOK
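A toy sketch of how lemmatisation makes it possible to count a lexeme rather than its individual word forms. The lookup table `LEMMA_TABLE` is hypothetical and hand-written for this example; real lemmatisers rely on morphological analysis and part-of-speech information rather than a fixed list.

```python
from collections import Counter

# Hypothetical lookup table: word form -> lemma (written in capitals).
LEMMA_TABLE = {
    "book": "BOOK", "books": "BOOK",
    "eat": "EAT", "eats": "EAT", "ate": "EAT", "eaten": "EAT",
}

def lemmatise(tokens):
    """Map each token to its lemma; unknown forms fall back to themselves."""
    return [LEMMA_TABLE.get(token.lower(), token.upper()) for token in tokens]

tokens = ["book", "books", "eats", "ate", "book"]
lemma_counts = Counter(lemmatise(tokens))

print(lemma_counts)
```

Searching for the lemma BOOK retrieves both book and books in one step, and the counter gives the frequency of the lexeme as a whole.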
Assigning a syntactic analysis of the corpus
It is common to distinguish two main types of scientific method in linguistics (and in general): qualitative and quantitative.
1. QUALITATIVE ANALYSIS
- Morphological or syntactic information about words
- Syntagmatic and paradigmatic properties of a word → concordances of tokens
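A concordance shows every occurrence of a search word together with its immediate context. The following is a minimal keyword-in-context (KWIC) sketch, assuming a list of tokens and a fixed context window; the function name `concordance` and the `width` parameter are illustrative and do not reproduce any particular concordancer.

```python
def concordance(tokens, node, width=3):
    """Return one KWIC line per occurrence of `node` in `tokens`."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            # Up to `width` tokens of context on each side of the hit.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so that the node words line up.
            lines.append(f"{left:>20} [{token}] {right}")
    return lines

tokens = "my dog loves cats and cats love my dog".split()
kwic_lines = concordance(tokens, "cats")

for line in kwic_lines:
    print(line)
```

Aligning the hits in a column makes it easier to inspect what typically precedes and follows the word.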
→ frequency of occurrence
- One of the main advantages of computers is that we can easily get
Synonymy is a relationship of identity, where the meaning of two word senses is the same. Two synonyms have to be identical in their
Comparing frequencies
How do frequencies differ between corpora or subcorpora?
Comparison → used to investigate a number of phenomena, e.g. differences between the language of men and women, the development of language over time, the differences between regional varieties.
There are instances where figures can be somewhat misleading, often because the total number of tokens is too low. To draw safe conclusions from such comparisons, a certain level of statistical awareness is needed. Recently, some researchers have complained about the low level of statistical expertise in corpus studies. Some of the methods are quite complicated.
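One basic precaution when comparing corpora of different sizes is to convert raw counts into relative frequencies, for instance occurrences per million tokens. This step is not spelled out in the notes; it is a standard practice added here as an illustration, and the counts below are invented.

```python
def per_million(raw_count, corpus_size):
    """Normalise a raw frequency to occurrences per million tokens."""
    return raw_count * 1_000_000 / corpus_size

# Invented figures: the word occurs 150 times in a 1-million-token corpus
# and 450 times in a 5-million-token corpus.
freq_a = per_million(150, 1_000_000)
freq_b = per_million(450, 5_000_000)

print(freq_a, freq_b)
```

Although the raw count is three times higher in the second corpus, the word is relatively more frequent in the first one (150 vs 90 per million).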
Statistical significance
In statistics, several methods have been developed to measure the ‘statistical significance’ of quantitative findings.
NB: The degree to which corpus linguists use such methods varies.
Two kinds of statistics that everyone using corpus data should know about:
A) Significance testing → tests whether observed patterns and figures are meaningful → chi-square test
B) Measurement of the strength of lexical associations → measuring collocational strength → mutual information (MI) • log-likelihood (LL)
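As an illustration of measuring collocational strength, here is a minimal sketch of one common formulation of mutual information: the base-2 logarithm of the observed co-occurrence frequency divided by the frequency expected if the two words were independent. Corpus tools differ in details such as how the search window is taken into account, so this is an assumption rather than the formula of any specific tool; the counts are invented.

```python
import math

def mutual_information(observed, freq_node, freq_collocate, corpus_size):
    """MI = log2(observed / expected) for one node-collocate pair."""
    # Expected co-occurrences if the two words were distributed independently.
    expected = freq_node * freq_collocate / corpus_size
    return math.log2(observed / expected)

# Invented counts: node occurs 1,000 times, collocate 500 times,
# they co-occur 64 times in a corpus of 1,000,000 tokens.
mi = mutual_information(observed=64, freq_node=1000,
                        freq_collocate=500, corpus_size=1_000_000)

print(mi)
```

A score well above 0 means the pair occurs together more often than chance would predict; here the expected frequency is 0.5, so 64 observed co-occurrences give MI = 7.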
A. CHI-SQUARE TEST
With the chi-square test, you can test whether the measured difference between two groups is statistically significant or likely to be due to chance.
You need to have:
● 2 independent variables → the variables that define the groups you want to compare, e.g. gender (male / female) or language variety (British English / American English) → independent because they don’t depend on other factors
● A number of dependent variables → the variables that you measure or count within each group, e.g. frequency of a certain word, use of a grammatical construction → dependent because their values depend on the group (i.e. on the independent variable)
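A worked sketch of the test on a 2×2 table, computed by hand with the standard formula (sum of (observed − expected)² / expected). The counts are invented: rows stand for two groups of speakers, columns for "occurrences of the word" and "all other tokens". The critical value 3.841 applies to 1 degree of freedom at the 0.05 level; no continuity correction is applied in this simplified version.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Frequency expected if group and word use were unrelated.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Invented counts:     word  other tokens
table = [[30, 970],   # group 1 (independent variable, value 1)
         [60, 940]]   # group 2 (independent variable, value 2)

chi2 = chi_square(table)
CRITICAL_VALUE_DF1_P05 = 3.841  # df = (2 - 1) * (2 - 1) = 1, p < 0.05
significant = chi2 > CRITICAL_VALUE_DF1_P05

print(round(chi2, 3), significant)
```

With these figures the statistic is about 10.47, above the critical value, so the difference between the two groups would be unlikely to be due to chance alone.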
Distribution in the corpus
A word can get misleadingly high frequency figures in a corpus by being frequent in just one or a few texts or genres, while it is absent in all the others. This could be because: