Digital Linguistics Notes, Linguistics Notes

Digital Linguistics notes, academic year 2024-2025

Type: Notes

On sale since 09/04/2026

giadaricciardi
What is linguistics?
Linguistics is the scientific study of language and communication.
A science of language is possible, but:
- In what respects is linguistics scientific?
- What is meant by science in this context?
Linguists believe that their field is a science because they share the goals of scientific inquiry.
Like the biological sciences, linguistics is concerned with observing and classifying naturally occurring phenomena (speech sounds, words, languages, …):
- it moves from the observed data to theories
- linguists apply the scientific method: observations → hypotheses → theories
- they form hypotheses about the structure of language and test them by experimentation
The possibility of scientific understanding depends largely on the complexity and regularity of the object of study.
Human behavior is complex and not regular; nevertheless, language contrasts with other aspects of human behavior precisely in its regularity.
Linguistics deals with
- the study of particular languages,
- the search for general properties common to all languages or large groups of languages.
Primary goal: to understand the nature of Language in general.
What is linguistics NOT?
Linguistics is the study of language in general, not of individual languages.
Some linguists are polyglots; most aren’t. However, they are familiar with more than one language.
Modern linguistics is NOT the study of how to speak properly.
Linguistics is a descriptive discipline.
Errors may reveal interesting aspects of language.
Grammatical vs Ungrammatical → rules
Acceptable vs Unacceptable → personal judgments
(depending on sociolinguistic variables):
• He doesn’t know anything vs He don’t know nothing
• Se lo avessi saputo, non sarei venuto vs Se lo sapevo, non venivo (“If I had known, I would not have come”: standard vs colloquial Italian)
• Non c’è niente di cui ho bisogno vs Non c’è niente che ho bisogno (“There is nothing I need”: standard vs colloquial Italian)
(depending on language structure):
• *He not know vs *Not he know vs *He know not
Interdisciplinary branches of Linguistics
• Interdisciplinary studies involve two or more academic disciplines which are considered distinct.
  • The most common interdisciplinary branches of Linguistics are:
    • COMPUTATIONAL LINGUISTICS (the use of computers in the analysis of linguistic behavior and of human languages)
      → «Computational linguistics is the term used to describe research interested in answering linguistic questions using computational methodologies»
      → Linguistics is a science about natural languages
      → Linguistic research: language use
      → The use of computational methodologies to understand language and language use at a large scale
      → «Computational linguistics is the study of computer systems for understanding and generating natural language»
      • development of a computational theory of language, exploiting the notions of algorithms and data structures from Computer Science
      • “a synonym of automatic processing of natural language, since the main task of computational linguistics is just the construction of computer programs to process words and texts in natural language”

Computational linguistics as an interdisciplinary field
● Theoretical and Applied Computer Science
● Statistics
● Engineering (e.g. language engineering)
● Linguistics, cognitive sciences, psychology
● structure, functioning and use of language and cognitive faculties

Theoretical and applicative goals

  • the construction of computer programs to process words and texts in natural language
  • analysing and understanding natural human language → this enables the computer to acquire the skills needed to communicate directly in our language.

    • PSYCHOLINGUISTICS (what happens in people’s heads as they use language)
    • NEUROLINGUISTICS (how language is processed and stored in the brain)
    • SOCIOLINGUISTICS (how languages vary socially)

Fields

1. Natural Language Processing (NLP) → the creation of models and algorithms for understanding and generating natural language texts.
«The term Natural Language Processing describes research on automatic processing of human language for practical applications»
→ processing = comprehension of written and oral texts

Computational linguists have been concerned with developing procedures for handling a useful range of natural language input.

ORIGINS OF CL

1. 1950-60 → first applications of the computer to the study of philosophical and literary texts

  • Roberto Busa, Centre for the Automation of Linguistic Analysis in Gallarate
    ● first electronic corpus of the works of Thomas Aquinas (about 10 million words)
    ● software for their exploration through concordances
  • First results of CL:
    ● development of software for electronic text analysis
    ● calculation of word frequency
    ● compilation of indexes and concordances
    ● creation of electronic lexical directories

2. 1957 → ‘Syntactic Structures’ by Noam Chomsky
  → formal grammars and application of formal methods to language analysis
  → beginning of an intensive investigation into the properties of natural language
  → research in the field of Artificial Intelligence
  • First results of CL:
    ● early software for syntactic analysis and automatic semantic interpretation
    ● applications such as machine translation and human-machine interfaces in natural language

Origins of Corpus Linguistics
● in England
● parallel to the spread of Generative Grammar
● linguistic investigation based on the collection and analysis of corpora = collections of large quantities of texts belonging to a certain language variety

  • Armchair linguist: He sits in a deep soft comfortable chair, with his eyes closed and his hands clasped behind his head. Once in a while he opens his eyes […] shouting, ‘Wow, what a neat fact!’ […] and writes something down. Then he paces around for a few hours in the excitement of having come still closer to knowing what language is really like.

  • Corpus linguist: He has all of the primary facts that he needs, in the form of a corpus of approximately one zillion running words. At the moment he is busy determining the relative frequencies of the eleven parts of speech as the first word of a sentence versus as the second word of a sentence.

These two don’t speak to each other very often, but when they do, the corpus linguist says to the armchair linguist, ‘Why should I think that what you tell me is true?’, and the armchair linguist says to the corpus linguist, ‘Why should I think that what you tell me is interesting?’

  1. As ‘hyphenated branches’ of linguistics, e.g.:
    ● Sociolinguistics (the relation between language and society)
    ● Psycholinguistics (the relation between language and the mind)
    ● Neurolinguistics (the relation between language and neurological processes in the brain)
  2. Corpus linguistics is not a branch of linguistics, nor a linguistic theory
    ● The word ‘corpus’ does not tell you what is studied, but rather that a particular methodology is used.
    ● It is one of the possible ways of ‘doing’ linguistics (Biber et al. 1998: 3–4; Kennedy 1998: 7; Leech 1992; McEnery and Wilson 2001: 2; Meyer 2002: xi)

However, corpus linguistics is more than just a methodology:

  • “a new research enterprise” and “a new philosophical approach to the subject” (Leech 1991: 106)
  • corpus linguistics has a “theoretical status” (Tognini-Bonelli 2001: 1-2):
  • corpora need not simply be used to test existing theories formulated on the basis of intuitions
  • observations of language facts lead to the formulation of hypotheses → unified in a theoretical statement

Corpus linguistics as empirical linguistics (Sampson 2001: 6)
● quantitative and statistical analysis tools to explore linguistic regularities emerging from texts

  • It examines, and draws conclusions from, attested language use, rather than intuitions. → the description of language as inseparable from the analysis of its use
  • Role of intuitions in corpus linguistics? → they do not provide the data for analysis
  • It examines samples, however large, of language use, as it is typically impossible to capture the entirety of a language in a corpus.
  • Modern corpus linguistics and the use of computers
  • Corpus linguistics as ‘computer corpus linguistics’ (Leech 1992: 106)
  • Functions of computers (Hunston 2002: 20):
    → facilitating the collection and storage of large amounts of language data,
    → enabling the development of the software that is used to access and analyze the corpus data.

The benefits of the use of computers in corpus linguistics

  1. Computers and software programs have enabled researchers to collect, store and manage vast amounts of data relatively quickly and inexpensively.
  2. Data analysis and processing → fast and often automated.
  3. Automated processes allow for
    • replicability of studies
    • checking the statistical reliability of results

Corpus linguists have become more theoretically sophisticated and aware of the pros and cons of corpora.

  • corpus-based linguistics

● new focus on statistical data
● new focus on the role of probabilistic models
In cognitive science:
● the human mind is not only capable of processing rules, but also of keeping track of statistical regularities

Computational linguistics in Italy
Antonio Zampolli (1937-2003) → founded and directed the Institute of Computational Linguistics of the National Research Council (CNR)

  • first chair of Mathematical Linguistics at the University of Pisa in 1977
  • 2015 Italian Association of Computational Linguistics

NLP TASKS

The aim of NLP → to develop computational models that try to solve linguistic tasks
Tasks → concern all possible levels of linguistic analysis

  • E.g. tasks dealing with semantics → e.g. word sense disambiguation
  • E.g. tasks dealing with pragmatics → e.g. the classification of linguistic acts

Three kinds of tasks:
● Pre-processing tasks → prepare the text for further processing
● Classification tasks → label portions of text of various extensions and assign classes to these portions
  → Information Extraction tasks → extract information from text
● Generation tasks → produce natural language texts

The text = complex structure

  • contains information articulated on several levels:
    • sequence of characters
    • words
    • text units (titles, chapters and paragraphs)

What is a text for a computer?

  • A computer does not possess this knowledge and is only able to see and manipulate sequences of binary codes → 0 and 1 → bit (BInary digiT)
  • How can we set up a text so that a computer can ‘read’ it and grasp the types of information that are useful for linguistic investigation? → digital encoding
  • Encoding levels: the binary representation of a text has several levels
    1. Low level encoding → binary representation of the sequence of characters in the text
    2. High level encoding → information related to the linguistic/textual structure. It converts textual data into an explicit source of linguistic information
  • Data vs. information
      • A piece of data becomes information when it is linked to a context, e.g. 24121: is it a postcode? An account number?
What can we encode?

  • e.g. macro-structure of the text
  • e.g. morphological structure
  • e.g. syntactic structure

E.g. encoding at the morpho-syntactic level

  • the lexical category of the word
  • morphosyntactic attributes (e.g. gender, person, number, tense)
  • values that can be associated with each attribute (e.g. masculine/feminine/neuter for gender)

→ Encoding schema:
    • Set of encoding categories (attributes and possible values)
    • Set of rules to define the compatibility between categories (e.g. the adjective does not have an inherent attribute of person)

The richness and variety of attributes and values of an annotation scheme may depend on various factors:
    • linguistic theory
    • the kind of linguistic distinctions that the categories express
    • time and human resources devoted to the encoding process
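
One possible way to picture an encoding schema is as two small tables: the attributes with their allowed values, and the compatibility rules between lexical categories and attributes. The sketch below is only illustrative: the category names, attribute values and the function `is_valid` are assumptions, not part of any specific annotation standard.

```python
# Hypothetical encoding categories: attribute -> possible values.
ATTRIBUTES = {
    "gender": {"masculine", "feminine", "neuter"},
    "number": {"singular", "plural"},
    "person": {"1", "2", "3"},
    "tense": {"present", "past", "future"},
}

# Hypothetical compatibility rules: lexical category -> attributes it may carry.
COMPATIBLE = {
    "NOUN": {"gender", "number"},
    "ADJ": {"gender", "number"},  # the adjective has no inherent person attribute
    "VERB": {"person", "number", "tense"},
}


def is_valid(category: str, features: dict) -> bool:
    """Check an annotation against the schema and its compatibility rules."""
    allowed = COMPATIBLE.get(category)
    if allowed is None:
        return False  # unknown lexical category
    return all(
        attr in allowed and value in ATTRIBUTES[attr]
        for attr, value in features.items()
    )


if __name__ == "__main__":
    print(is_valid("ADJ", {"gender": "feminine", "number": "plural"}))
    print(is_valid("ADJ", {"person": "3"}))
```

A richer scheme would simply add more attributes and values, at the cost of more time and human resources for the encoding.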

How is the content represented?
  • Digital formats
    → Plain text (high portability)
    → Doc and Pdf (with formatting instructions): maximum format expressiveness but minimum portability
    → Mark-up languages: XML (eXtensible Mark-up Language)
  • Mark-up is used to highlight and make explicit the relationships between the text and its parts and the interpretations associated with them (e.g. linguistic element + tag)
  • A text encoded with a markup language is still in text-only format (structural information is represented through the addition to the text of markup labels, or tags) → the most appropriate digital encoding of textual materials for CL

XML → is NOT a programming language. It is used to create customised mark-up languages.
Structure:

  • Text unit = element
  • Each element is identified by a name (tag)
  • Each element is marked explicitly by inserting an opening <tag> delimiter at the start of the element and a closing </tag> delimiter at its end. Everything between the opening and closing delimiter represents the content of the element
  • <subtitle> Act I </subtitle>
  • <paragraph> Enter LEONATO, Governor of Messina, Hero, his daughter, Beatrice, his niece, with a Messenger </paragraph>
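
To see how a program reads such markup, the sketch below parses the two elements with Python's standard `xml.etree.ElementTree` module. The wrapping `<text>` root element is an assumption added here, since an XML document needs a single root; it does not come from the notes.

```python
import xml.etree.ElementTree as ET

# Assumption: a wrapping <text> root element, because XML requires a single root.
SAMPLE = """<text>
  <subtitle>Act I</subtitle>
  <paragraph>Enter LEONATO, Governor of Messina, Hero, his daughter, Beatrice, his niece, with a Messenger</paragraph>
</text>"""

root = ET.fromstring(SAMPLE)

# Each child is an element: its tag is the name, its text is the content.
elements = [(child.tag, child.text) for child in root]

for tag, content in elements:
    print(f"{tag}: {content}")
```

The element names carry the structural information, while the words of the text remain plain characters.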

The text

What is a word for a computer program?
Corpus Linguistics → a word is a sequence of letters
«My dog loves cats and cats love my dog»
my | dog | loves | cats | and | cats | love | my | dog
There are 9 words:

  • 9 occurrences of word-forms → «TOKENS»
  • 6 word-form types (my, dog, loves, cats, and, love) → «TYPES»
  • 5 base-forms: MY, DOG, LOVE, CAT, AND → «LEMMAS»
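
The three counts for this sentence can be checked with a few lines of Python. This is a sketch under two assumptions: splitting on white space is enough for this sentence, and the tiny lemma table `LEMMAS` is hand-written for the example only.

```python
sentence = "My dog loves cats and cats love my dog"

# Assumption: lowercasing plus whitespace splitting is enough for this sentence.
tokens = sentence.lower().split()

# Types are the distinct word-forms.
types = set(tokens)

# Hypothetical hand-written lemma table covering only this example.
LEMMAS = {"loves": "love", "cats": "cat"}
lemmas = {LEMMAS.get(token, token) for token in tokens}

print(len(tokens), len(types), len(lemmas))
```

On a real corpus the lemma table would come from a lemmatiser rather than from a hand-written dictionary.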

TOKEN = an individual occurrence of a linguistic unit. Any single linguistic unit, most often a word → words between white spaces or symbols
NB: A single word can be split into more than one token, e.g. he’s (he + ’s) = 2 tokens
Repeated words (words that occur more than once in the corpus) are counted every time they occur, e.g. there are 9 tokens in the sentence “my dog loves cats and cats love my dog”.
→ The total number of tokens can be used as an estimate of corpus size

AUTOMATIC TOKENIZATION (word segmentation)
Tokenization is the process of detecting words and separating punctuation from written words. It is the automatic process of converting a text into a list of separate tokens:

  • separating punctuation (such as commas and full stops) from words
  • removing capitalisation
  • verb contractions and the Anglo-Saxon genitive of nouns are split into their morphemes, and each morpheme is tagged separately, e.g.
    • He’s → he ’s
    • John’s → John ’s

Tokenization is usually the first stage in lemmatization or part-of-speech tagging
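
The steps listed above can be sketched with a single regular expression. This is a deliberately simplified tokenizer, not the procedure of any particular tool: it only handles the ’s clitic (not other contractions such as n’t), and the names `TOKEN_PATTERN` and `tokenize` are hypothetical.

```python
import re

# Alternatives, tried in order: the 's clitic | a run of word characters | one punctuation mark
TOKEN_PATTERN = re.compile(r"'s\b|\w+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word, clitic and punctuation tokens."""
    # Normalise the typographic apostrophe so that ’s and 's are treated alike.
    normalized = text.lower().replace("\u2019", "'")
    return TOKEN_PATTERN.findall(normalized)


if __name__ == "__main__":
    print(tokenize("He's John's friend."))
```

Real tokenizers need many more rules (abbreviations, numbers, hyphenated words), which is why the choice of tokenizer affects every later count.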

TYPE
The abstract class of which tokens are members = the number of distinct words, not counting repetitions → (grouping occurrences of a word together as representatives of a single type). While the number of tokens in a corpus refers to the total number of words, the number of types refers to the total number of unique words.
E.g. the verb form eats may occur 177 times in a corpus:

  • 177 tokens = individual occurrences of the linguistic unit “eats”
  • 1 type = word representative of the abstract class of which tokens are members

How many words in a corpus?
The number of tokens in the corpus is an estimate of overall corpus size (= the corpus dimension). The number of types is an estimate of vocabulary size → the vocabulary size gives an idea of the lexical richness (= the lexical variety) of the corpus

Type/token ratio
The number of types (unique words) in a text, divided by the number of tokens (total number of words) and expressed as a percentage. → A way of measuring the amount of variation in the vocabulary of the corpus (= “the rate at which new types are introduced”)

  • A high type/token ratio → the text is lexically diverse
  • A low type/token ratio → there is a lot of repetition of lexical items

The larger the corpus or file is, the lower the type/token ratio will be, due to the repetitive nature of function words.
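
The ratio described above is a one-line computation once the text has been tokenized. The sketch below assumes a plain list of lowercase tokens as input; the function name `type_token_ratio` is hypothetical.

```python
def type_token_ratio(tokens: list[str]) -> float:
    """Return the type/token ratio as a percentage."""
    if not tokens:
        raise ValueError("cannot compute a ratio for an empty token list")
    # Number of distinct word-forms divided by the total number of tokens.
    return 100 * len(set(tokens)) / len(tokens)


tokens = "my dog loves cats and cats love my dog".split()
print(round(type_token_ratio(tokens), 1))
```

Because the ratio falls as texts get longer, it is only meaningful to compare texts or samples of similar size.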

The skills of the computational linguist
The computational linguist's skill set includes:

  1. ability to select and collect the most appropriate linguistic data for his purposes;
  2. knowledge of formal methods (statistical, mathematical and computational methods);
  3. the mastery of computer techniques and tools with which to automatically conduct the analyses.

Our goals → collection of linguistic data and representation of the data for automatic processing

The source
How to identify the sources from which to draw the necessary data for the analysis? Two main sources of evidence:
● the texts of speakers of a language → highly structured texts, transcripts of spontaneous conversations, encyclopaedias, text messages
● the speakers themselves → judgements on expressions, the time taken to recognise words on a screen, expressions used to describe a scene

The context
Based on the «naturalness» of the context in which the data are collected:

  • Ecological data → observed and collected in their own environment = natural
  • Controlled data → obtained by testing the speakers of a language = controlled

The corpus
Text corpora represent the main (though not exclusive) source of data in CL. The spread of corpora has been fostered by:

  • the renewed interest in the use of statistical methods
  • the development of computer technology

The notion of corpus → predates computers:
  • common practice in the study of language before the emergence of Chomskyan generative grammar

Data are generally distinguished according to:

  • Truthfulness → natural data or artificial data
    ● Natural language data: text produced in the real world
    ● Artificial language: any language which is not natural (e.g. markup languages)
  • Data source → attested data, modified data (taken and modified), intuitive data
    ● Attested data (also actual or authentic data): occur naturally; have been transcribed without intervention from the researcher
    ● Modified data: based on attested data but modified in some way (for instance simplified) to exclude aspects which are extraneous
    ● Intuitive data (also introspective or invented data): based on real-life examples, intuitive data are invented to illustrate a particular linguistic point
  • Type of language they represent
    ● Medium, Genre, Register, Text type, etc.
    This distinction is not always very clearly formalised → different scholars may use distinct, but frequently also overlapping, terminology to represent similar things

NUMBER OF LANGUAGES & VARIETIES OF THE SAME LANGUAGE

1. LANGUAGES

● Monolingual corpora (BNC, Brown Corpus): contain texts in a single language.
  The correspondence among participating languages is at the level of design rather than at the level of the choice of actual texts: the same quantities of the same types of texts would be assembled → but not translations → the number of tokens is the most important feature we must consider.
● Bilingual or multilingual corpora (C-ORAL-ROM):
  - Parallel corpora (L1-L2) → contain the original text and its specific translation → source texts and target texts, e.g. Europarl, Canadian Hansard Corpus. The original text is shown with its translation alongside. For example, there are several Italian translations for the word field → campo, settore, in merito a → this exemplifies that languages differ in terms of the situations in which words are used
  - Multilingual corpora, collected according to the same theme → contain texts produced in different languages, but no specific translation is present

2. MEDIUM
- Written (La Repubblica, Brown Corpus): may contain different kinds of data. The Brown Corpus contains specific terms depending on the general theme: press, religion, popular lore, belles lettres, biography, memoirs → each of them plays a dominant role
- Spoken (LIP, CHILDES → includes texts produced by children): difficult to record because of technical complications, and very expensive in terms of the time needed to record and transcribe. Spoken language is often under-represented in general corpora, even though most people speak more than they write and listen to others talking more than they read
- Mixed (BNC → includes spoken and written texts): a balance between spoken and written language in order to be more representative of language in general.
  The British National Corpus (BNC) is the most important mixed corpus. It started in 1991 and was run by Oxford University Press. It is a 100-million-word corpus, a balanced representation of BrE in the 1990s → it does not contain words created after this time: if we search for the word “google”, we won't find it. Among the written texts, 25% are fiction and 75% non-fiction (books, periodicals, brochures and unpublished material). Spoken component: it …

  • Multimedia corpus → contains audio or video recordings and also includes the transcribed texts

3. DOMAIN

  • Specialized → covers a specific domain, e.g. EUROPARL is a specialised corpus since it represents a specific domain of a language
  • General → e.g. the BNC, since it represents different kinds of texts (90% written texts and 10% spoken texts)
  • Reference corpora → the most representative in terms of language varieties

4. TEMPORAL DIMENSION
  • Synchronic → a corpus in which all of the texts have been collected from roughly the same period, allowing a “snapshot” of language use at a particular point in time.
    ● Brown Corpus → 1961
    ● La Repubblica → 1985-
  • Diachronic → a corpus that has been carefully built in order to be representative of a language or language variety over a particular period of time
    ● OVI → origins - 1375

5. UPDATES
  • Static corpora (La Repubblica) → a sample text corpus that is intended to be of a particular size: once that target is reached, no more texts are included in it. Most corpora

→ Additional information = high level encoding

DATA ENCODING

Data = raw content
Information = data + interpretation (structure)
Knowledge = information + theory

A corpus mainly contains three types of information that can help in investigating the data:

  • Metadata → information that tells you something about the text itself, e.g. who wrote the text, year of publication, etc.
  • Textual markup → information within the text other than the words
    ● written corpora: sentence breaks or paragraph breaks
    ● spoken corpora: the speaker, their age, sex, social class
  • Linguistic annotation → linguistic information within a corpus text. It typically uses the same encoding conventions as textual markup. Annotations are applied to the text automatically by analysis software (taggers) such as part-of-speech taggers, lemmatisers, semantic taggers, syntactic taggers
    A. PoS-tagging → a Part of Speech (PoS) tag is assigned to each word
      - “eat” = VERB
      - “high” = ADJECTIVE
      - “book” = NOUN
    B. LEMMATISATION → a word is associated with its lemma, e.g. book, books → lemma = “book” (every inflected form of a word corresponds to the same lemma). Lemmatisation is a form of automatic annotation. It involves the reduction of the words in a corpus to their respective lexemes. It is useful because it allows the researcher to extract and examine all the variants of a particular lexeme without having to input all of them, and to produce frequency and distribution information for the lexeme

FEATURES OF TODAY'S CORPORA

  • MIXED (spoken/written corpora → from the web)
  • AUDIO
  • MULTILINGUAL → from the web
  • SPECIALIZED TEXTS
  • SUBCORPORA
  • OWN SOFTWARE

Main types of annotation:

A. PoS tagging → evaluated with precision and recall (measures from machine learning)
  ● they measure how successful an automatic tagger is
  ● both are expressed as percentages (ideally both precision and recall should be 100%)
  - Precision → has to do with the accuracy of positive predictions = how many of the predicted tags are correct? → e.g. the tagger labels 100 words as verbs but only 90 of them are actually verbs → precision = 90/100 = 90%
  - Recall → how many of the relevant tokens in the corpus were correctly identified? → e.g. there are 120 relevant words in the corpus and the software identifies only 90 of them → recall = 90/120 = 75%

PoS tagging presents some problems:

  • It is not always obvious what the correct tag should be
  • It is often hard to fit authentic language into categories set up in a grammatical theory → the software is not always able to recognize all the features, so most of the results are based on probability/deduction → in manual annotation, three annotators label the same text and their agreement is then checked
  • The tags must be put there in the first place = many corpora specify inflectional and degree information (e.g. comparative and superlative forms). Many corpora have a specific tag for each use of a word → e.g. Italian che can be a conjunction or a relative pronoun.

Automatic taggers often claim a 97-98% success rate, and a 100% success rate with common unambiguous words like the and a. If you are working with an automatically tagged corpus which has not been manually post-edited, you must expect to find quite a few wrongly tagged items in the output.
E.g. searching for the noun fine in COCA returned fifty hits. They included:

  • fine = Noun «Often officials pocket part or all of the ‘excess birth’ fine of $280»
  • fine = Adjective «One shows Michael all dressed up, a fine-looking man»
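
The precision and recall figures quoted for PoS tagging can be reproduced by comparing a tagger's output with a manually checked ("gold") annotation. The sketch below uses invented tag lists; combining the two examples from the notes (100 predicted verbs, 120 real verbs, 90 correct) into a single data set is an assumption made here for illustration, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(gold: list[str], predicted: list[str], tag: str) -> tuple[float, float]:
    """Return (precision, recall) in percent for one tag.

    Assumes the tag occurs at least once in both the gold and predicted lists.
    """
    true_pos = sum(g == tag and p == tag for g, p in zip(gold, predicted))
    n_predicted = predicted.count(tag)  # tokens the tagger labelled with `tag`
    n_relevant = gold.count(tag)        # tokens that really carry `tag`
    return 100 * true_pos / n_predicted, 100 * true_pos / n_relevant


# Invented data: 120 real verbs, 90 of them found by the tagger,
# plus 10 nouns wrongly labelled as verbs (100 VERB predictions in total).
gold = ["VERB"] * 120 + ["NOUN"] * 10
predicted = ["VERB"] * 90 + ["NOUN"] * 30 + ["VERB"] * 10

precision, recall = precision_recall(gold, predicted, "VERB")
print(precision, recall)
```

A tagger can score high on one measure and low on the other, which is why both are reported together.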

B. LEMMATISATION
A specific type of annotation involving the reduction of the words in a corpus to their respective lexemes. It allows the researcher to extract and examine all the variants of a particular lexeme without having to input all of them, and to produce frequency and distribution information for the lexeme. The lemma is written in small capitals, e.g. book/books → lemma: BOOK

C. PARSING

Assigning a syntactic analysis to the corpus

  • It has not yet been as successful as PoS tagging
  • It is a more difficult undertaking, not least because it is even more dependent on which linguistic theory you subscribe to.

The most common type of parsing → phrase structure analysis, where phrases and clauses are given functional labels (e.g. adverbial clause, adjective phrase, temporal noun phrase, past participle clause)
Treebank → a parsed corpus (the structures can be represented as tree diagrams even if they are normally printed as series of brackets).
Automatic parsing is especially important for a number of technical applications in the field of natural language processing (NLP)
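
The remark that tree diagrams are normally printed as series of brackets can be illustrated with a small sketch. The tree is stored as nested tuples and rendered as a labelled bracket string; the labels (S, NP, VP, Det, N, V) and the function `to_brackets` are illustrative choices, not the format of any specific treebank.

```python
# A node is (label, child, child, ...); a leaf is a plain string (the word).
# The labels are illustrative and not tied to a particular treebank scheme.
tree = ("S",
        ("NP", ("Det", "the"), ("N", "dog")),
        ("VP", ("V", "loves"), ("NP", ("N", "cats"))))


def to_brackets(node) -> str:
    """Render a nested-tuple tree as a labelled bracket string."""
    if isinstance(node, str):
        return node  # a leaf is printed as the word itself
    label, *children = node
    return "[" + label + " " + " ".join(to_brackets(child) for child in children) + "]"


print(to_brackets(tree))
```

The bracket string and the tree diagram carry exactly the same structure; the brackets are simply easier to store and search as plain text.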

It is common to distinguish two main types of scientific method in linguistics (and in general): qualitative and quantitative.

1. QUALITATIVE ANALYSIS
  - Morphological or syntactic information about words
  - Syntagmatic and paradigmatic properties of a word → concordances of tokens

  • distinct or new uses of a word we already know
  • regularities of co-occurrence of two or more words
  • regularities of co-occurrence of words and morpho-syntactic patterns
  • What are the contexts of use of a word? e.g. the verb to give: ditransitive verb (he gives a book to Mary / he gives Mary a book)
  • ditransitive verbs, transitive verbs, intransitive verbs, collocations
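
A concordance lists every occurrence of a word together with its immediate context, which is what makes the patterns above visible. Below is a minimal keyword-in-context (KWIC) sketch; the window size of three words and the function name `concordance` are assumptions for illustration.

```python
def concordance(tokens: list[str], node: str, window: int = 3) -> list[str]:
    """Return one KWIC line per occurrence of `node`, with `window` words of context."""
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


tokens = "he gives a book to mary and he gives mary a book".split()
for line in concordance(tokens, "gives"):
    print(line)
```

Reading the lines one below the other shows the two constructions of give (with and without to) side by side.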

2. QUANTITATIVE ANALYSIS

Frequency of occurrence
One of the main advantages of computers is that we can easily get frequency data from large masses of text, which would be virtually impossible to achieve by hand.
Typical observations about word frequencies in corpora:
● there are a few words with extremely high frequency
● there are many more words with extremely low frequency
Frequency can be used to check:
- how often words occur in a corpus
- the most frequent forms
- the co-occurrence of forms
Frequencies can be compared in order to describe differences between genres, geographical varieties, spoken and written language, texts from different time periods and so on.
What can corpora show us? Summarizing → the most frequent forms. Some words are the basic “skeleton” of our sentences. Language is regular and non-random
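
Frequency lists of this kind take only a few lines with Python's standard `collections.Counter`. The sketch assumes the text is already lowercased and tokenized by white space.

```python
from collections import Counter

text = "my dog loves cats and cats love my dog"

# Counter maps each word-form to its number of occurrences.
frequencies = Counter(text.split())

# most_common(n) returns the n most frequent forms with their counts.
for form, count in frequencies.most_common(3):
    print(form, count)
```

On a large corpus the same list typically starts with a handful of very frequent function words and ends with a long tail of forms that occur only once.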

Synonymy is a relationship of identity, where the meaning of two word senses is the same. Two synonyms have to be identical in their

  • semantic features
  • connotations
  • grammatical behaviour

It would therefore only be natural to expect that synonymy would also be reflected in a strong degree of collocational overlap. True synonymy is very rare, but languages have many partial synonyms, e.g. big (of considerable size or extent) and large (of considerable or relatively great size, extent, or capacity)

Comparing frequencies
How do frequencies differ between corpora or subcorpora?
Comparison → to investigate a number of phenomena, e.g. differences between the language of men and women, the development of language over time, the differences between regional varieties.
There are instances where figures can be somewhat misleading, often because the total number of tokens is too low. To draw safe conclusions from such comparisons, a certain level of statistical awareness is needed. Recently, some researchers have complained about the low level of statistical expertise in corpus studies. Some of the methods are quite complicated

Statistical significance
In statistics, several methods have been developed to measure the ‘statistical significance’ of quantitative findings.
NB: The degree to which corpus linguists use such methods varies.
Two kinds of statistics that everyone using corpus data should know about:
A) Significance testing → tests whether observed patterns and figures are meaningful → chi square test
B) Measurement of the strength of lexical associations → measuring collocational strength → mutual information (MI), log-likelihood (LL)

A. CHI SQUARE TEST
With the chi square test, you can test whether the measured difference between two groups is statistically significant or likely to be due to chance. You need to have:
● 2 independent variables → the variables that define the groups you want to compare, e.g. gender (male / female), language variety (British English / American English) → independent because they don’t depend on other factors
● A number of dependent variables → the variables that you measure or count within each group, e.g. frequency of a certain word, use of a grammatical construction → dependent because their values depend on the group (i.e. on the independent variable)
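
As a worked sketch of the test, the code below computes the Pearson chi-square statistic for a 2x2 table by hand, without continuity correction. The counts are invented for illustration (a word occurring 150 times in one 100,000-token corpus and 100 times in another), and the function name `chi_square_2x2` is hypothetical; 3.841 is the standard critical value for p < 0.05 with one degree of freedom.

```python
def chi_square_2x2(table: list[list[float]]) -> float:
    """Pearson chi-square statistic for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if group and word use were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Invented counts: rows = two corpora, columns = (target word, all other tokens).
observed = [[150, 99_850], [100, 99_900]]
CRITICAL_P05_DF1 = 3.841  # critical value for p < 0.05 with 1 degree of freedom

chi2 = chi_square_2x2(observed)
print(chi2, chi2 > CRITICAL_P05_DF1)
```

If the statistic exceeds the critical value, the difference between the two groups is unlikely to be due to chance alone at that significance level.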

Distribution in the corpus
A word can get misleadingly high frequency figures in a corpus by being frequent in just one or a few texts or genres, while it is absent in all the others. This could be because: