Corpora in Applied Linguistics: introduction

• A corpus is defined in terms of both its form and its purpose. It is a collection of naturally occurring examples of language, ranging from a few sentences to a set of written texts or tape recordings, that have been collected for linguistic study.

• Electronic corpora are usually larger than paper-based collections.

• A corpus does not contain new information about language, but corpus software offers a new perspective on the familiar. The most readily available software packages process data from a corpus in three ways: showing frequency, phraseology and collocation.

Frequency: The words in a corpus can be arranged in order of their frequency in that corpus. Corpora can also be compared in terms of their frequency lists. The usual result is that grammar words are more frequent than lexical words; for example, 'the', 'of', 'to', 'and', 'a', 'in' occupy the first six places in each corpus. The only lexical word that occurs in the top 50 words of the general Bank of English corpus is 'said'.
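
A frequency list of this kind can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_list`, the sample sentence and the very simple tokenizer (lowercase letters and apostrophes) are assumptions made for the example, not part of any particular corpus package.

```python
from collections import Counter
import re

def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    # Assumed tokenizer: lowercase the text and keep runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common()

# Hypothetical miniature "corpus" used only for illustration.
sample = "The cat sat on the mat and the dog sat on the rug."

for word, count in frequency_list(sample)[:3]:
    print(word, count)
```

Even in this tiny sample the grammar word 'the' comes out on top, which mirrors the pattern described above for real corpora.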

Some words also occur more frequently depending on the theme of the corpus: for example, 'states' and 'international' in politics, and 'electron' and 'energy' in science. The word 'this' occurs more frequently in the science and politics lists than in the general list because it is used to summarize what has been said before.

In addition, the use of given words may differ from one corpus to another. For example, there are differences in the use of 'must' and 'have to', or of 'incredibly' and 'surprisingly', across corpora of books, newspapers or spoken English. Another example of a difference in frequency involves the words 'man', 'woman', 'husband' and 'wife' (generally 'man' occurs more frequently than 'woman', but 'wife' occurs more frequently than 'husband', because a woman is usually identified in relation to the man she has married...).

More sophisticated is the comparison of frequencies across different registers: conversation, fiction, news and academic prose. (Generally, in the first and the last, the present tense occurs more frequently than the past tense.)

Phraseology: It is observed through concordances (concordance lines bring together many instances of use of a word or phrase). Phraseology can be used to clarify differences between words, for example 'interested' and 'interesting' ('interested' often occurs near 'in', while 'interesting' is often used before a noun), or 'between' and 'through' ('between' is generally found after nouns, 'through' after verbs).
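
Concordance lines can be produced with a simple keyword-in-context (KWIC) routine. The sketch below assumes a hypothetical helper called `concordance` and an invented sample text; real concordancers offer sorting, wildcards and much larger contexts.

```python
import re

def concordance(text, node, width=30):
    """Return keyword-in-context lines for every occurrence of `node`."""
    lines = []
    # Match the node as a whole word, ignoring case.
    pattern = r"\b" + re.escape(node) + r"\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the node words line up in a column.
        lines.append(f"{left:>{width}}[{match.group()}]{right}")
    return lines

# Hypothetical sample text used only for illustration.
sample = ("She is interested in corpus work. It was an interesting result. "
          "He seemed interested in the data.")

for line in concordance(sample, "interested"):
    print(line)
```

With the node words aligned in a column, the word that follows each one ('in' in both lines here) is easy to spot, which is how phraseological patterns are read off a concordance.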

Collocation: This is the statistical tendency of words to co-occur. For example, collocates of 'shed' include 'light', 'tears' and 'blood' (where 'shed' is a verb), but also 'garden' (where 'shed' is a noun). Collocation may refer to pairs of lexical items or to the association between a lexical word and its frequent grammatical environment.
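
As a rough illustration of how collocates are found, the sketch below counts the words that occur within a few tokens of a node word. The function name `collocates`, the span of two words on each side and the sample sentences are all assumptions for the example; the window is also allowed to cross sentence boundaries, which a real tool might prevent.

```python
from collections import Counter
import re

def collocates(text, node, span=4):
    """Count words occurring within `span` tokens on either side of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            # Words to the left and right of this occurrence of the node.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Hypothetical sample text used only for illustration.
sample = ("They shed tears at the news. The report shed light on the case. "
          "He keeps tools in the garden shed. New data shed light on it.")

print(collocates(sample, "shed", span=2).most_common(3))
```

Raw counts like these are only a starting point: frequent grammar words such as 'the' will appear near almost any node, which is why statistical measures of collocation are needed.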

• Corpora are used: for teaching a language (a corpus can give information about how a language works that may not be accessible to native speaker intuition); by students, encouraged by their teachers; by translators, to see how words have been translated in the past; to establish norms of frequency and usage; and to investigate cultural attitudes expressed through language.

• There are different types of corpora, depending on their purpose:

Specialized corpus: It includes texts of a particular type, such as newspapers, textbooks, articles on a particular subject, lectures, essays etc. It is used to investigate a particular type of language. Ex: Cambridge and Nottingham Corpus of Discourse in English (CANCODE).

General corpus: It includes texts of many types. It may include written language, spoken language or both. It is usually larger than a specialized corpus. Ex: Bank of English.

Comparable corpora: Two or more corpora in different languages or in different varieties of a language. They can be used by translators and learners to identify differences and equivalences in each language.

Parallel corpora: Two or more corpora in different languages, each containing texts that have been translated from one language into the other.



Learner corpus: It includes texts produced by learners of a language. Ex: International Corpus of Learner English (ICLE).

Pedagogic corpus: A corpus consisting of all the language a learner has been exposed to. The term was coined by Willis; for most learners, their pedagogic corpus does not exist in physical form.

Historical or diachronic corpus: It includes texts from different periods of time. It is used to trace the development of a language. Ex: Helsinki Corpus.

Monitor corpus: It is used to track current changes in a language. It is added to yearly, monthly or daily, so it rapidly increases in size.

• There are some key terms:

• Type: Each distinct word in a text; the number of types is found by counting each repeated item once only.

• Token: Each running word in a text; the number of tokens is the count of all the words in the text.

• Hapax: A word that occurs only once in a text.

• Lemma and word-forms: Several word-forms belong to the same lemma, which groups them together. Ex: 'eat', 'eats', 'eating' belong to the lemma EAT.

• Tag, parse and annotate: These three terms refer to adding information to the words in a corpus. The addition can be manual or automatic, the latter being faster. 'Tagging' refers to the addition of a code to each word in a corpus indicating its part of speech. 'Parsing' is the analysis of text into constituents, such as clauses and groups. 'Annotation' is the superordinate term for tagging and parsing, and also describes other kinds of information that can be added to a corpus; this is often done manually.
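
The counting terms above (token, type, hapax, lemma) can be illustrated with a short sketch. The helper names, the sample sentence and the tiny hand-made lemma table `LEMMAS` are hypothetical; a real lemmatizer would rely on a dictionary or a tagger rather than a three-entry lookup.

```python
from collections import Counter
import re

def tokenize(text):
    """Assumed tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def lexical_counts(text):
    """Return the number of tokens, the number of types and the hapaxes."""
    tokens = tokenize(text)
    freq = Counter(tokens)
    hapaxes = sorted(word for word, count in freq.items() if count == 1)
    return {"tokens": len(tokens), "types": len(freq), "hapaxes": hapaxes}

def lemma_counts(text, lemma_table):
    """Count lemmas, mapping each word-form through `lemma_table` if listed."""
    return Counter(lemma_table.get(token, token) for token in tokenize(text))

# Hypothetical lemma table and sample sentence, for illustration only.
LEMMAS = {"eat": "EAT", "eats": "EAT", "eating": "EAT"}
sample = "She eats what he eats, and eating is what they like."

print(lexical_counts(sample))
print(lemma_counts(sample, LEMMAS)["EAT"])
```

Here 'eats' counts as two tokens but one type, and the three word-forms of EAT in the sentence are grouped under a single lemma.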

• Positive and negative aspects of corpora: Corpora are important for applied linguists, but they also have some limitations. A corpus is a way of storing data and tells us what language is like. It is a more reliable guide to language use than native speaker intuition. Intuition is a poor guide: some collocations are hard to intuit, and although native speakers can often recognise whether a phraseology is unusual, articulating the nature of the atypicality may be more difficult. The limitations are that a corpus will not give information about whether something is possible or not, only whether it is frequent or not; it can show only its own contents; it can offer evidence but cannot give information; and, perhaps most seriously, it presents language out of its context.

Methods in corpus linguistics:

Concordance lines are a useful tool for investigating corpora, but they have limits, so other methods of investigation are also used.

• Frequency and key-word lists: A frequency list may be ordered by frequency, alphabetically, or by the first occurrence of each type in the corpus. Comparing the lists of two corpora can give important information: some words are characteristic of one list, occurring more frequently in it than in the other. These words are called keywords.
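
The idea of a keyword can be sketched by comparing relative frequencies in two word lists. This is a deliberately simplified illustration: the function `keywords`, the `min_ratio` threshold, the add-one smoothing and the two sample texts are assumptions for the example, and corpus software typically relies on a statistical test rather than a plain ratio.

```python
from collections import Counter
import re

def tokenize(text):
    """Assumed tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def keywords(target_text, reference_text, min_ratio=2.0):
    """Return words relatively more frequent in the target than the reference."""
    target = Counter(tokenize(target_text))
    reference = Counter(tokenize(reference_text))
    n_target = sum(target.values())
    n_reference = sum(reference.values())
    scored = []
    for word, count in target.items():
        rel_target = count / n_target
        # Add-one smoothing so words absent from the reference do not
        # cause a division by zero.
        rel_reference = (reference[word] + 1) / (n_reference + 1)
        ratio = rel_target / rel_reference
        if ratio >= min_ratio:
            scored.append((word, round(ratio, 2)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical miniature corpora used only for illustration.
science = "the electron has energy and the energy of the electron changes"
general = "the man said that the woman and the child saw the house"

print(keywords(science, general))
```

Grammar words such as 'the' are frequent in both lists and so drop out, while 'electron' and 'energy' stand out as characteristic of the science text.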

• Measurement of collocation: As noted earlier, collocation is the tendency of words to be biased in the way they co-occur. To calculate collocation, a program takes a node word and counts the words that occur within a given span of it. Three of the most commonly used measures of the significance of co-occurrence are the Mutual Information (MI) score, the T-score and the Z-score (the first and the last are more similar in terms of output, the second and the last in terms of how they are calculated). MI and T-score both depend on two quantities: the Observed number of instances of the co-occurring word found within the span of the node word, and the Expected number of instances in that span given the frequency of the co-occurring word. The T-score is calculated by subtracting Expected from Observed and dividing the result by the standard deviation. The MI score is Observed divided by Expected, converted to a base-2 logarithm. The differences between these two scores are: MI is a measure of the strength of collocation, T-score of the certainty of collocation; the MI score does not depend on the size of the corpus, whereas for the T-score the size of the corpus is important; MI scores can be compared across corpora, T-scores cannot; MI score tends to give
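
The two calculations described above (MI as the base-2 logarithm of Observed over Expected, T-score as Observed minus Expected divided by the standard deviation) can be sketched as follows. Two points are assumptions not stated in these notes: the Expected value is estimated here as node frequency times collocate frequency times span, divided by corpus size, and the standard deviation is approximated by the square root of Observed. The counts in the example are invented.

```python
import math

def expected(node_freq, collocate_freq, corpus_size, span):
    """Assumed estimate of how often the collocate would fall in the span by chance."""
    return node_freq * collocate_freq * span / corpus_size

def mi_score(observed, exp):
    """Mutual Information: Observed / Expected as a base-2 logarithm."""
    return math.log2(observed / exp)

def t_score(observed, exp):
    """T-score: (Observed - Expected) / standard deviation.

    Assumption: the standard deviation is approximated by sqrt(Observed).
    """
    return (observed - exp) / math.sqrt(observed)

# Hypothetical counts: a node word occurring 1,000 times, a collocate occurring
# 2,000 times, a corpus of 1,000,000 words, a span of 8 positions (4 on each
# side), and 64 observed co-occurrences.
exp = expected(node_freq=1000, collocate_freq=2000, corpus_size=1_000_000, span=8)
print(exp, mi_score(64, exp), t_score(64, exp))
```

Under these assumptions the T-score grows with the raw number of co-occurrences, while MI depends only on the ratio of Observed to Expected, which is consistent with the contrast drawn above between certainty and strength of collocation.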
