SUSAN HUNSTON. CORPORA IN APPLIED LINGUISTICS

CHAPTER 1.

Corpora, and the study of corpora, have revolutionized the study of language and of the applications of language. Two of the main issues this book deals with are the effect of corpus studies upon theories of language and of how languages should be described, and a critical approach to the methods used in investigating corpora, including a comparison between them.

CORPUS. A corpus is a collection of naturally occurring examples of language, consisting of anything from a few sentences to a set of written texts or tape recordings. More recently, the word has come to refer to a collection of texts stored and accessed electronically, in such a way that it can be studied non-linearly, and both quantitatively and qualitatively.

APPLIED LINGUISTICS. The field of applied linguistics was once synonymous with language teaching, but the term now refers to any application of language to the solution of real-life problems; it has tended to develop language theories of its own.

A corpus by itself cannot do anything: it is nothing more than a store of used language. However, corpus access software offers a new perspective on the familiar information about language that the corpus contains.

FREQUENCY. The words in a corpus can be arranged in order of their frequency. Frequency lists can be useful for identifying possible differences between corpora, which can then be studied in more detail. Another approach is to look at the frequency of given words, compared across corpora.
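
As a minimal sketch of the two approaches just described, the Python below builds a frequency list and compares the frequency of one word across two corpora. The tokenizer, the function names and the two tiny sample "corpora" are assumptions made for illustration; they are not taken from Hunston's book or from any particular corpus package.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first; ties are broken alphabetically."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def per_thousand(word, tokens):
    """Frequency of `word` per 1,000 tokens, so corpora of different sizes can be compared."""
    return 1000 * tokens.count(word) / len(tokens)

# Two invented mini-corpora, for illustration only.
corpus_a = tokenize("The cat sat on the mat. The dog sat too.")
corpus_b = tokenize("A dog barked. A cat ran. A bird sang and a dog slept.")

for word, count in frequency_list(corpus_a)[:3]:
    print(word, count)

print(per_thousand("dog", corpus_a), per_thousand("dog", corpus_b))
```

Normalising to a rate per 1,000 tokens is one common way of making counts from corpora of different sizes comparable; real corpora would usually be reported per million words.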

Biber and his colleagues have undertaken more sophisticated work on comparative frequencies between registers: they use software which counts not only words but also categories of linguistic items. One of the main examples among their calculations is the distribution of present and past tenses across four registers: conversation, fiction, academic and news. Each register has its own ratio; of course, if the proportion of present to past is dependent on the register, then the proportion in a large corpus will depend on the balance of registers within that corpus. We have no idea how to calculate proportions for English as a whole, and therefore equally no idea what would constitute a corpus that truly reflects English.
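
The dependence on register balance can be made concrete with a short calculation. The sketch below is an illustration only: the present-tense shares and the corpus sizes are invented numbers, not Biber's figures, and the function name is hypothetical.

```python
def overall_present_share(register_sizes, present_share):
    """Share of present-tense verbs in the whole corpus, weighted by register size."""
    total = sum(register_sizes.values())
    weighted = sum(register_sizes[r] * present_share[r] for r in register_sizes)
    return weighted / total

# Invented share of present-tense verbs in each register (illustration only).
present_share = {"conversation": 0.8, "fiction": 0.3, "academic": 0.7, "news": 0.4}

# Two hypothetical corpus designs built from the same four registers.
balanced = {register: 1_000_000 for register in present_share}
fiction_heavy = {"conversation": 500_000, "fiction": 2_500_000,
                 "academic": 500_000, "news": 500_000}

print(overall_present_share(balanced, present_share))
print(overall_present_share(fiction_heavy, present_share))
```

The same per-register ratios give a different overall proportion as soon as the mix of registers changes, which is why no single figure can be claimed for "English as a whole".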

PHRASEOLOGY. We can access a corpus through a concordancing program. Concordance lines bring together many instances of use of a word or phrase. They can be used to give an alternative view of phenomena that teachers of English are frequently called upon to explain. Kennedy, for example, is able through such observation to provide a profile of each word studied that relates each aspect of meaning to typical phraseologies, and to assign frequencies to the different meanings (or semantic functions).
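
What a concordancing program does can be sketched in a few lines of Python. This is a simplified key-word-in-context (KWIC) routine under stated assumptions: the function name `kwic`, the `width` parameter and the sample sentences are all invented for illustration and do not reproduce any real concordancer.

```python
import re

def kwic(text, node, width=3):
    """Return (left context, node, right context) for every occurrence of `node`.

    `width` is the number of words of context kept on each side.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Invented sample text, for illustration only.
sample = ("She walked through the park. He got the job through a friend. "
          "They talked through the night.")

concordance = kwic(sample, "through")
for left, node, right in concordance:
    # Right-align the left context so the node words line up in a column.
    print(f"{left:>20}  {node}  {right}")
```

Aligning the node word in a central column is what lets the reader scan down the lines and notice recurrent phraseology to its left and right.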

COLLOCATION. Collocation is the statistical tendency of words to co-occur. A list of the collocates of a given word can yield similar information to that provided by concordance lines, but it processes more information more accurately. Collocation can hold between a pair of lexical items, or between a lexical word and its frequent grammatical environment (colligation).

Corpora are used for:

LANGUAGE TEACHING. Corpora can give information about how language works that is not always accessible to native-speaker intuition; classroom language teachers also encourage students to explore corpora for themselves.

TRANSLATION. Translators use comparable corpora to compare the use of apparent translation equivalents in two languages, and parallel corpora to see how words and phrases have been translated in the past.

TO ESTABLISH NORMS. General corpora are used to establish norms of frequency and usage against which individual texts can be measured.

TO INVESTIGATE CULTURAL ATTITUDES. Corpora are used to investigate cultural attitudes expressed through language, and as a resource for critical discourse studies.

Types of corpora:

SPECIALIZED CORPUS. A corpus of texts of a particular type (newspaper editorials, geography textbooks, lectures, casual conversations, etc.). It is representative of a given type of text and is used to investigate a particular type of language. There is no limit to the degree of specialization involved, but there are parameters that limit the kind of texts included.

GENERAL CORPUS. A corpus of texts of many types: it may include written and/or spoken language, and texts produced in one country or many. It is unlikely to be representative of any particular whole. It is usually much larger than a specialized corpus. It may be used to produce reference materials for language learning or translation, and it is often used as a baseline in comparison with more specialized corpora (reference corpus).

COMPARABLE CORPORA. Two or more corpora in different languages or in different varieties of a language. They can be used by translators and by learners to identify differences and equivalences.

PARALLEL CORPORA. Two or more corpora in different languages, containing texts that have been translated from one language into the other, or texts that have been produced simultaneously in two or more languages. They are used by translators and learners to find potential equivalent expressions and to investigate differences.

LEARNER CORPUS. A collection of texts produced by learners. Its purpose is to identify in what respects learners differ from each other and from the language of native speakers.

PEDAGOGIC CORPUS. All the language a learner has been exposed to. The term was used by Willis, and such a corpus can be used to collect together for the learner all the instances of a word or phrase they have come across in different contexts, in order to raise awareness.

HISTORICAL OR DIACHRONIC CORPUS. A corpus of texts from different periods of time, used to trace the development of aspects of a language.

MONITOR CORPUS. A corpus used to track current changes in a language; it is added to annually, monthly or even daily.

TOKEN. A sequence of letters separated from others by spaces or punctuation. The number of tokens in a text is the figure given by the word-count function of a word-processing program.

TYPE. A distinct word form: however many times a word occurs in the corpus, it counts as one type.

HAPAXES. Words that occur only once.

LEMMA. A group of related word-forms counted as a single item. For example, unit and units are two word-forms of the same lemma. What is counted as a lemma depends on what use the idea is to be put to.
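
The counting units above can be illustrated with a short Python sketch. It assumes that a type is a distinct word form, that tokens are found with a simple regular expression, and that lemmas come from a hand-made mapping; the function names, the mapping and the sample sentence are hypothetical.

```python
from collections import Counter
import re

def corpus_counts(text):
    """Count tokens, types (distinct word forms) and hapaxes in a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    hapaxes = sorted(word for word, n in counts.items() if n == 1)
    return {"tokens": len(tokens), "types": len(counts), "hapaxes": hapaxes}

def lemma_frequencies(text, lemma_of):
    """Count lemmas, using `lemma_of` to map word-forms to their lemma.

    Word-forms missing from the mapping are treated as their own lemma.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(lemma_of.get(token, token) for token in tokens)

# Invented sample text and a tiny hand-made lemma mapping, for illustration only.
sample = "One unit, two units, three units. A unit is a unit."
summary = corpus_counts(sample)
lemmas = lemma_frequencies(sample, {"units": "unit"})

print(summary)
print(lemmas["unit"])
```

Whether unit and units are merged, as here, is exactly the kind of decision that depends on the use to which the lemma count is to be put.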

CHAPTER 4.

Concordance lines are a useful tool for investigating corpora, but their use is limited by the ability of the human observer to process information. They are not particularly useful for collecting information about categories of things.

FREQUENCY LIST. A list of all the types in a corpus together with the number of occurrences of each type. It can be presented in alphabetical order, in frequency order or in the order of the first occurrence of each type in the corpus.

KEYWORD. Keywords are words that are significantly more frequent in one corpus than in another. The corpus investigation package WordSmith Tools includes a program which automatically compares two corpora and lists the keywords. Scott says that keywords can be lexical items which reflect the topic of a particular text, and can also be grammatical words which convey more subtle information.

COLLOCATION. The tendency of words to be biased in the way they co-occur. It may be observed informally in any instance of language, but it is more reliable to measure it statistically, and for this a corpus is essential. It can be considered as the tendency of two words to co-occur, or as the tendency of one word to attract another. Any program which calculates collocation takes a node word and counts the instances of all words occurring within a particular span. With enough data, chance co-occurrences will be shown to be insignificant overall when compared with the meaningful ones, which is the reason for obtaining large quantities of data. In a list of raw frequencies it is impossible to attach a precise degree of importance to any of the figures. It is, however, possible to calculate the significance of each co-occurrence. Three of the most commonly used measures of significance are the Mutual Information score (MI), the t-score and the z-score.
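
The node-and-span counting that such a program performs can be sketched as follows. This is a minimal illustration, not the procedure of any particular package: the function name, the default span of four words on either side and the sample sentence are assumptions.

```python
from collections import Counter

def collocate_counts(tokens, node, span=4):
    """Count every word occurring within `span` tokens either side of each `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            # Words to the left and to the right of this occurrence of the node.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Invented sample text, for illustration only.
tokens = "strong tea and strong coffee but never powerful tea".split()
counts = collocate_counts(tokens, "tea", span=2)

print(counts.most_common())
```

The result is a list of raw co-occurrence frequencies, which is the input that the significance measures then work on.
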
MI-score and t-score depend on how many instances of the co-occurring word are found in the designated span of the node word (the Observed) and how many instances might be expected in that span, given the frequency of the co-occurring word in the corpus as a whole (the Expected). T-score uses a calculation of standard deviation, taking into account the probability of co-occurrence of the node and its collocate and the number of tokens in the span in all lines. T-score is calculated by subtracting the Expected from the Observed and dividing the result by the standard deviation. MI-score is the Observed divided by the Expected, converted to a base-2 logarithm.

MI-score indicates the strength of a collocation, comparing the actual co-occurrence of the two items with their expected co-occurrence. It measures the amount of non-randomness present when two words co-occur. However, Burrows and Stubbs point out that words do not occur randomly in any case. Ideally, the expected co-occurrence would be calculated taking into account only the grammatically possible co-occurrences; the difficulty lies in determining probabilities on the basis of grammar alone.
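
The two calculations described above can be sketched in Python. Two assumptions are made here that the notes do not spell out: the Expected is estimated as node frequency × span size × collocate frequency ÷ corpus size, and the standard deviation in the t-score is approximated by the square root of the Observed, which is a widely used simplification. All counts in the example are invented.

```python
import math

def expected(node_freq, collocate_freq, corpus_size, span_tokens):
    """Co-occurrences expected by chance within the span of all node lines.

    `span_tokens` is the number of tokens in the span per line, e.g. 8 for 4 either side.
    """
    return node_freq * span_tokens * collocate_freq / corpus_size

def mi_score(observed, exp):
    """Observed divided by Expected, converted to a base-2 logarithm."""
    return math.log2(observed / exp)

def t_score(observed, exp):
    """(Observed - Expected) / standard deviation, approximated here as sqrt(Observed)."""
    return (observed - exp) / math.sqrt(observed)

# Invented counts: a 1,000,000-token corpus, a node occurring 1,000 times, span of 8.
rare_exp = expected(1000, 50, 1_000_000, 8)          # rare collocate, 50 occurrences
frequent_exp = expected(1000, 50_000, 1_000_000, 8)  # frequent collocate, 50,000 occurrences

mi_rare, t_rare = mi_score(20, rare_exp), t_score(20, rare_exp)
mi_frequent, t_frequent = mi_score(800, frequent_exp), t_score(800, frequent_exp)

print(round(mi_rare, 2), round(t_rare, 2))
print(round(mi_frequent, 2), round(t_frequent, 2))
```

With these invented figures the rare collocate receives the higher MI-score and the frequent collocate the higher t-score, because the t-score rewards the amount of evidence as well as the gap between Observed and Expected.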

If a word occurs rarely, but in most of its few occurrences appears in the proximity of another word, the collocation between those words will obtain a high MI-score. MI-score measures how strongly two words seem to associate in a corpus. However, knowing the strength of a collocation is not always a reliable indication of meaningful association, for we also need to know how much evidence there is for it. The t-score calculation takes the amount of evidence into account.

MI-score is a measure of the strength of collocation. Because it is not particularly dependent on the size of the corpus, it can be compared across corpora even if they are of different sizes. It gives information about the lexical behavior of a word and about its more idiomatic co-occurrences. The collocates with the highest MI-scores tend to be less frequent words with restricted collocation.

T-score is a measure of the certainty of collocation. Because it depends on the size of the corpus, absolute t-scores cannot be compared across corpora of different sizes. It tends to give information about the grammatical behavior of a word. The collocates with the highest t-scores tend to be frequent words that collocate with a variety of items.

CLAUSE COLLOCATION. The tendency of one kind of clause to co-occur with another. Calculating collocation in such instances may require a wider span than is commonly used.

COLLOCATIONAL STATISTICS. Collocational statistics summarize some of the information found in concordance lines, allowing more instances of a word to be considered. They highlight the different meanings that a word has: the list of collocates gives a kind of semantic profile of the word involved. A list of collocates cannot, however, show the association of meaning and phraseology. A different method, a short-cut, of displaying collocational information can be used to obtain clues as to the dominant phraseology of a word. Collocations can also be used to obtain a profile of the semantic field of a word: a list of collocates, taken together, can be grouped into semantic areas. This grouping of words gives information not only about the meaning, but also an insight into some of the cultural ramifications of the concept. Calculations of collocation will always prioritize uses of a word that tend to be lexically fixed or restricted. It is important to recognize the importance of highly significant collocations, but not to misinterpret the information about them.

CORPUS ANNOTATION. The process of adding information to a corpus. This information is designed to interpret the corpus linguistically. Using annotations to explore a corpus is referred to as category-based methodology, because the parts of a corpus are placed into categories, which are used as the basis for corpus searches and statistical manipulations. Many people regard extensive annotation as an essential step. Leech says: "Corpus annotation is widely accepted as a crucial contribution to the benefit a corpus brings, since it enriches the corpus as a source of linguistic information for future research and development." Annotation adds value to a corpus, makes it easier to retrieve information and increases the range of investigations.

SEMANTIC ANNOTATION. The categorization of words and phrases in terms of a set of semantic fields. Each word, or multi-word item, is matched against a lexicon in which the items are assigned to a semantic field. The outcome is a string of annotations that form the basis of calculations and provide an analysis of the content of each part of the corpus.

PARTIAL ANNOTATION. Biber and Finegan describe it as a variation on semantic annotation, in that only certain categories are selected. Candidates for these categories are selected by the computer from a tagged and parsed corpus, and the human researcher can accept or modify the categorization. This kind of corpus annotation provides a basis for approaching a corpus from the point of view of meaning first, and it can be linked with a notional approach to language teaching. Local grammar attempts to describe the resources for only one set of meanings in a language, rather than for the language as a whole.

Three methods of annotating a corpus:

MANUAL.

COMPUTER-ASSISTED. The human researcher edits the computer-generated output through an interactive interface, so that annotation can be performed with minimum effort. It is slower than automatic annotation and can handle less corpus material, but it is likely to be more accurate.

AUTOMATIC. The computer works alone, following whatever rules and algorithms have been set by the programmer, so that a corpus of any size can be annotated relatively quickly. However, it is unlikely to produce results 100% in accordance with what a human researcher would produce.

Only the latter two are suitable for anything but the smallest corpora. Because most annotation processes are difficult to automate, large tagged corpora are readily available, whereas the amount of parsed corpus data publicly available is more limited. The work involved in annotation is a constraint against updating or enlarging a corpus.

METHODS. No method of working is neutral with regard to theory. Category-based and word-based methods each answer different sets of questions. For example, a plain text corpus has obvious disadvantages in that certain categories cannot easily be counted. The added value given by annotation, on the other hand, can be a double-edged sword. The annotated corpus is more useful, but less readily updated, expanded or discarded; the categories used to annotate it are typically determined before any corpus analysis is carried out, which limits the kind of questions that can be asked. Phenomena such as frames, semantic prosody and the pervasive influence of phraseology have been identified from plain text corpora and word-based studies. A reluctance to use categories that already exist in linguistics has led to a word-based practice of corpus investigation, which has in turn led to a revised theory of what language is like.

➢ However much annotation is added to a text, the basic patterning of the words alone must be observable at all times.
➢ It is important to be able to use ad hoc annotations as necessary.
➢ It is important to make the process as automatic as possible, so that the annotated corpus does not become over-valued and the annotation is consistent. The corpus needs to be large enough to minimize the importance of inaccuracies in annotation.
➢ It is important, when reading about corpus work, to be critically aware of the methods being used and of the theories that lie behind them.
