Introduction to Corpus Linguistics, Lecture Notes for English Language

This document provides an introduction to corpus linguistics, a field that has revolutionized language study in recent decades. It discusses the advantages and limitations of using corpora, the types of corpora, and the methods and techniques employed, such as concordance lines, collocation analysis, and corpus annotation. The document highlights how corpora can be used for language teaching, translation, and establishing usage norms. It also explores the technical terminology and advanced methodologies in corpus linguistics.

Type: Lecture notes

2022/2023

Uploaded on 26/05/2024

rosita-veri

CORPORA IN APPLIED LINGUISTICS

Chapter 1. INTRODUCTION TO A CORPUS IN USE

What this book is about

It is no exaggeration to say that corpora, and the study of corpora, have revolutionized the study of language, and of the applications of language, over the last few decades. The improved accessibility of computers has changed corpus study from a subject for specialists only to something that is open to all.

The aim of the book is to introduce students of applied linguistics to corpus investigation. Its topic is, for the most part, studies that have been carried out on corpora in English, and much of the focus of the book relates to corpora used in English language teaching. Other applications, such as translation and investigations of ideology, are also included. The large amount of work that has been carried out on languages other than English is not covered by this book.

Although the book deals with a range of issues, two themes run consistently through it. One is the effect of corpus studies upon theories of language and how language should be described. Corpora allow researchers not only to count categories in traditional approaches to language but also to observe categories and phenomena that have not been noticed before. The other major theme is a critical approach to the methods used in investigating corpora, and a comparison between them. Corpus findings can be seductive, and it is important to be aware of the possible pitfalls in their production.

This book is intended for people who are interested in how language, more specifically English, works, and how knowledge about language can be applied in certain real-life contexts. It is expected that the reader will wish to carry out corpus investigations for him or herself and will need to become acquainted with the range of research that has been carried out in the field.

It is worth asking two questions about the title of the book: what is a corpus, and what is applied linguistics?

A corpus is defined in terms of both its form and its purpose. Linguists have always used the word corpus to describe a collection of naturally occurring examples of language, consisting of anything from a few sentences to a set of written texts or tape recordings, which has been collected for linguistic study. Recently, the word has been reserved for collections of texts (or parts of texts) that are stored and accessed electronically. Because computers can hold and process large amounts of information, electronic corpora are usually larger than the small, paper-based collections previously used to study aspects of language. A corpus is planned, though chance may play a part in the text collection, and it is designed for some linguistic purpose. The specific purpose of the design determines the selection of texts, and the aim is other than to preserve the texts themselves because they have intrinsic value. This differentiates a corpus from a library or an electronic archive. The corpus is stored in such a way that it can be studied non-linearly, and both quantitatively and qualitatively. The purpose is not simply to access the texts in order to read them, which again distinguishes the corpus from the library and the archive.

The field of applied linguistics itself has undergone something of a revolution over the last few decades. Once, it was almost synonymous with language teaching, but now it covers any application of language to the solution of real-life problems. The difference between linguistics and applied linguistics is not simply that one deals with theory and the other with applications of those theories. Applied linguistics has tended to develop language theories of its own, ones that are more relevant to the questions applied linguistics asks than those developed by theoretical linguistics. Corpora are adding to the development of those applied views of language.

This chapter first considers what a corpus can do and how corpora are used in applied linguistics. This is followed by an account of the main types of corpora and an introduction to some of the terminology used in this book. It ends with the advantages and limitations of using corpora in language study.

What a corpus can do

Simply speaking, a corpus by itself can do nothing at all, being nothing other than a store of used language. Corpus access software can re-arrange that store so that observations of various kinds can be made. If a corpus represents, partially, a speaker’s experience of language, the access software re-orders that experience so that it can be examined in ways that are usually impossible. A corpus does not contain new information about language, but the software offers us a new perspective on the familiar. Most readily available software packages process data from a corpus in three areas: frequency, phraseology and collocation. Each of these will be exemplified in this section.

Frequency

The words in a corpus can be arranged in order of their frequency in that corpus. This is most interesting when corpora are compared in terms of frequency lists. Frequency lists from corpora can be useful for identifying possible differences between the corpora that can then be studied in more detail. Another approach is to look at the frequency of given words, compared across corpora.
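Both approaches can be sketched in a few lines of Python. This is only an illustration: the two "corpora" are invented sentences, the tokenizer is deliberately crude, and the function names `tokenize`, `frequency_list` and `per_thousand` are assumptions made for this sketch, not part of any corpus package.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common()

def per_thousand(word, text):
    """Frequency of `word` per 1,000 tokens, so corpora of different sizes can be compared."""
    tokens = tokenize(text)
    return 1000 * tokens.count(word) / len(tokens)

# Two tiny invented "corpora", for illustration only
spoken = "well I will call you and I will see you there"
written = ("the committee will review the report and "
           "the committee is going to publish the findings")

print(frequency_list(spoken)[:3])
print(round(per_thousand("will", spoken), 1), round(per_thousand("will", written), 1))
```

Normalising to a rate per 1,000 tokens is what makes the comparison across corpora meaningful, since raw counts depend on corpus size.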

Corpora nowadays have a diverse range of uses:

  • For language teaching, corpora can give information about how a language works that may not be accessible to native speaker intuition, such as the detailed phraseology mentioned above. In addition, the relative frequency of different features can be calculated. According to Mindt, nearly all future time reference in conversational English is indicated by will or other modals. The phrase BE going to accounts for about 10% of future time reference, and the present progressive for less than 5%. Information such as this is important for syllabus and materials design.
  • In language classrooms, teachers are encouraging students to explore corpora for themselves, allowing them to observe nuances of usage and to make comparisons between languages.
  • Translators use comparable corpora to compare the use of apparent translation equivalents in two languages, and parallel corpora to see how words and phrases have been translated in the past.
  • General corpora can be used to establish norms of frequency and usage against which individual texts can be measured. This has applications for work in stylistics and in clinical and forensic linguistics.
  • Corpora are also used to investigate cultural attitudes expressed through language and as a resource for critical discourse studies.

Types of corpora

A corpus is always designed for a particular purpose, and the type of corpus will depend on its purpose. Here are some commonly used corpus types:

  • Specialised corpus. A corpus of texts of a particular type, such as newspaper editorials, geography textbooks, academic articles in a particular subject, lectures, casual conversations, or essays written by students. It aims to be representative of a given type of text and is used to investigate a particular type of language. Researchers often collect their own specialised corpora to reflect the kind of language they want to investigate. There is no limit to the degree of specialisation involved, but the parameters are set to limit the kinds of texts included. A corpus might be restricted to a time frame, consisting of texts from a particular century, or to a social setting, such as conversations taking place in a bookshop, or to a given topic, such as newspaper articles dealing with the European Union.
  • General corpus. A corpus of texts of many types. It may include written or spoken language, or both, and may include texts produced in one country or many. It is unlikely to be representative of any particular “whole”, but will include as wide a spread of texts as possible. A general corpus is usually much larger than a specialised corpus. It may be used to produce reference materials for language learning or translation, and it is often used as a baseline in comparison with more specialised corpora. Because of this second function it is also sometimes called a reference corpus.
  • Comparable corpus. Two (or more) corpora in different languages or in different varieties of a language. They are designed along the same lines. For example, they will contain the same proportion of newspaper texts, novels, casual conversations, and so on. Comparable corpora of varieties of the same language can be used to compare those varieties. Comparable corpora of different languages can be used by translators and by learners to identify differences and equivalences in each language. The ICE corpora are comparable corpora of 1 million words each of different varieties of English.
  • Parallel corpora. Two (or more) corpora in different languages, each containing texts that have been translated from one language into the other or texts that have been produced simultaneously in two or more languages. They can be used by translators and by learners to find potential equivalent expressions in each language and to investigate differences between languages.
  • Learner corpus. A collection of texts (essays, for example) produced by learners of a language. The purpose of this corpus is to identify in what respects learners differ from each other and from the language of native speakers, for which a comparable corpus of native-speaker texts is required.
  • Pedagogic corpus. A corpus consisting of all the language a learner has been exposed to. For most learners, their pedagogic corpus does not exist in physical form. If a teacher does decide to collect a pedagogic corpus, it can consist of all the course books, readers and so on that a learner has used, plus any tapes they have heard. The term “pedagogic corpus” is used by Willis. A pedagogic corpus can be used to collect together for the learner all instances of a word or phrase they have come across in different contexts, for the purpose of raising awareness. It can also be compared with a corpus of naturally occurring English to check that the learner is being presented with language that is natural-sounding and useful.
  • Historical or diachronic corpus. A corpus of texts from different periods of time. It is used to trace the development of aspects of a language over time.
  • Monitor corpus. A corpus designed to track current changes in a language. A monitor corpus is added to annually, monthly or even daily, so it rapidly increases in size. The proportion of text types in the corpus remains constant, so that each year is directly comparable with every other.

Some key terms

The literature on corpora makes use of a certain amount of technical terminology. It may be helpful to explain a few of the most essential terms here. Eight terms will be explained: type, token, hapax, lemma, word-form, tag, parse and annotate.
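The first three terms lend themselves to a short sketch. These notes do not spell the definitions out, so the code assumes the usual ones: a token is each running word in a text, a type is each distinct word-form, and a hapax is a type that occurs only once. The sentence and the function name `type_token_summary` are invented for illustration.

```python
from collections import Counter

def type_token_summary(tokens):
    """Count tokens and types, and list the hapaxes (types occurring exactly once)."""
    counts = Counter(tokens)
    hapaxes = sorted(word for word, n in counts.items() if n == 1)
    return {"tokens": len(tokens), "types": len(counts), "hapaxes": hapaxes}

# An invented sentence, split on whitespace for simplicity
tokens = "the cat sat on the mat and the dog sat too".split()
summary = type_token_summary(tokens)
print(summary)
```

Here "the" contributes three tokens but only one type, which is why the number of types is always less than or equal to the number of tokens.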

Type, token, hapax

A tagger may be used to calculate proportions of word use. Corpus parsing is the analysis of the text into constituents, such as clauses and groups. A parsed corpus can be used to count with great accuracy the number of different structures in a corpus. Parsing can be done automatically, but the resulting output is often not very accurate. Accuracy can be improved by training the automatic parser; that is, by setting up the parser to learn from past examples. A small corpus is parsed and edited manually, and the resulting output is used to train the automatic parser. Parsers of this level of sophistication have been developed by Leech and his colleagues at Lancaster University and, though the process is somewhat lengthy, a high level of accuracy is achieved. Where total accuracy is required, however, for example where the parsed text is being used to teach human learners how to do grammatical parsing, manual editing is still needed.

A superordinate term for tagging and parsing is annotation. Annotation is also used to describe other kinds of information that can be added to a corpus. Annotation can also be done manually.

Why corpora? Why not?

Corpora are often described as a tool. It might be more accurate to say that corpora are a way of collecting and storing data, and that it is the corpus access programs that are the tools. A corpus tells us what language is like, and the main argument in favour of using a corpus is that it is a more reliable guide to language use than native speaker intuition is. Intuition is a poor guide to at least four aspects of language: collocation, frequency, prosody and phraseology.

  • Judgements about collocation: some collocations are easy to intuit, others are more difficult. Granger points out that some adverbs collocate with particular adjectives.
  • Judgements about frequency: it is almost impossible to be conscious of the relative frequency of words, phrases and structures except in very general terms.
  • Semantic prosody and pragmatic meaning: Channell makes the point very strongly that many instances of pragmatic meaning are beyond the reach of intuition.
  • Details of phraseology: although native speakers can often recognise that a phraseology is unusual, articulating the nature of the atypicality may be more difficult.

Although an over-reliance on intuition can be criticised, it would be incorrect to argue that intuition is not important. It is an essential tool for extrapolating important generalisations from a mass of specific information in a corpus.

Corpora also have limitations:

  1. A corpus will not give information about whether something is possible or not, only whether it is frequent or not.
  2. A corpus can show nothing more than its own contents. Although it may claim to be representative, all attempts to draw generalisations from a corpus are in fact extrapolations.
  3. A corpus can offer evidence but cannot give information.
  4. Perhaps most seriously, a corpus presents language out of its context.

Chapter 4. METHODS IN CORPUS LINGUISTICS: BEYOND THE CONCORDANCE LINE

Concordance lines are a useful tool for investigating corpora, but their use is limited by the ability of the human observer to process information. Assessments of frequency and significance are difficult to make impressionistically, particularly in the case of very frequent words. Nor are concordance lines particularly useful in collecting information about categories of things. In this chapter we look at methods of investigating corpora that go beyond concordance lines. These include statistical calculations of collocation and corpus annotation. A distinction is made between methods which are based on individual words and those which are based on categories. The final section in this chapter discusses the implications of the two approaches, making the point that what is at issue is not only methodology, but the theoretical presuppositions that lead to that methodology. The word-based and the category-based approaches are used to answer different sets of questions, and may be evaluated in terms of the perceived usefulness of the questions.

Frequency and key-word lists

A frequency list is simply a list of all the types in a corpus together with the number of occurrences of each type. The list can be displayed in frequency order, in alphabetical order, or in the order of the first occurrence of the type in the corpus. Comparing the frequency lists for two corpora can give information about the differences between the texts comprising each one. This is particularly useful when specialised corpora are being compared. Words which are significantly more frequent in one corpus than in another are sometimes known as keywords. Corpus investigation software can automatically compare two corpora (usually a smaller, more specialised one and a larger, more general one) and list the keywords for the more specialised corpus. It can be used, for example, to list the lexis that distinguishes a small set of newspaper articles from a corpus of newspaper texts. Keywords can be a useful starting point in investigating a specialised corpus.
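A keyword comparison of this kind might be sketched as follows. The notes do not say which statistic the software uses; this sketch assumes the log-likelihood measure, which is one common choice for keyness, and the function names and the two token lists are invented for illustration.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Keyness of a word occurring a times in a corpus of c tokens
    and b times in a corpus of d tokens (log-likelihood, an assumed choice)."""
    expected_1 = c * (a + b) / (c + d)
    expected_2 = d * (a + b) / (c + d)
    total = 0.0
    if a:
        total += a * math.log(a / expected_1)
    if b:
        total += b * math.log(b / expected_2)
    return 2 * total

def keywords(special_tokens, general_tokens, top=5):
    """Words relatively more frequent in the specialised corpus, highest keyness first."""
    special, general = Counter(special_tokens), Counter(general_tokens)
    c, d = len(special_tokens), len(general_tokens)
    scored = []
    for word, a in special.items():
        b = general[word]
        if a / c > b / d:  # keep only words over-represented in the specialised corpus
            scored.append((word, round(log_likelihood(a, b, c, d), 2)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]

# Invented toy data: a "specialised" and a "general" token list
special = "corpus corpus corpus tagger the of".split()
general = "the of the and of the a in the of corpus to".split()
print(keywords(special, general))
```

A word with the same relative frequency in both corpora scores zero, so the ranking reflects difference between the corpora rather than raw frequency.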

Collocation

When looking at concordance lines, it is reasonable to work with a selection from all the instances. When obtaining statistical information, however, the calculations should be made using all the available data. One use of collocational information is to highlight the different meanings that a word has. What a simple collocation list cannot show is the association of meaning and phraseology. A somewhat different method of displaying collocational information can be used to obtain clues as to the dominant phraseology of a word. This is something of a short-cut to the information that could be obtained from concordance lines. Collocations can also be used to obtain a profile of the semantic field of a word. The collocates of a word, taken together, can be grouped into semantic areas. These include:

  • Words connected with wrong-doing;
  • Words connected with money;
  • Words connected with officialdom;
  • Words connected with sport;
  • Words connected with the legal process.

Before leaving the topic of collocational information and fixed phrases, it is worth giving a warning about interpreting such information. It is tempting, when looking at a list of collocates, to draw conclusions about the overall frequency of compounds and phrases that may not be justified.
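A collocate list of the kind discussed in this section can be sketched as follows. The notes do not name a particular statistic; this sketch assumes a fixed window around the node word and the mutual information (MI) score in one common formulation. The toy token list and the function names are invented for illustration.

```python
import math
from collections import Counter

def collocates(tokens, node, span=4):
    """Count every word occurring within `span` tokens to the left or right of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

def mutual_information(tokens, node, collocate, span=4):
    """log2(observed / expected) co-occurrence, using every instance of the node."""
    freq = Counter(tokens)
    observed = collocates(tokens, node, span)[collocate]
    if observed == 0:
        return float("-inf")
    # Expected co-occurrences if the two words were distributed independently
    expected = freq[node] * freq[collocate] * 2 * span / len(tokens)
    return math.log2(observed / expected)

tokens = "strong tea and strong coffee but weak tea".split()
print(collocates(tokens, "tea", span=1))
print(mutual_information(tokens, "tea", "strong", span=1))
```

Note that the score is computed over all instances of the node, in line with the point above that statistical calculations should use all the available data.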

Tagging and parsing

Categories and annotations

Corpus annotation is the process of adding information to a corpus. As Leech points out, this information is designed to interpret the corpus linguistically, for example by indicating the word-class of each of the words in it. The term “annotation” is used to cover tagging, parsing and other forms of annotation. Using annotations to explore a corpus is referred to here as a “category-based” methodology, because the parts of a corpus (the words, or phraseological units, or clauses) are placed into categories and those categories are used as the basis for corpus searches and statistical manipulations. This is in contrast to the “word-based” methods described above.

Everyone would agree that some degree of annotation is a useful feature of a corpus, especially a large corpus. In addition, many people regard extensive annotation as an essential step in the exploitation of corpora. Leech says: “corpus annotation is widely accepted as a crucial contribution to the benefit a corpus brings, since it enriches the corpus as a source of linguistic information for future research and development”. The idea that annotation adds value to a corpus, making it easier to retrieve information and increasing the range of investigations that can be done on the corpus, will be illustrated below. It is not the concern of this book to teach the reader how to annotate a corpus, for which specific software is often necessary. Instead, the types of annotation used and the uses to which they are put are discussed.

Before beginning the discussion of corpus annotation, it is worth pointing out that corpus investigations involving categories can be carried out on a non-annotated corpus. Halliday calculates the approximate proportions of positive and negative clauses and of present and past clauses in a large unannotated corpus.

Tagging

Tagging means allocating a part-of-speech label to each word in a corpus. The tag can be chosen to give general or specific information. A tagged corpus has various uses. Looking at the concordance lines for a word with several senses can be made much simpler if the word-class can be specified. Often the collocates of a word depend on its word-class. More sophisticated uses can also be made of a tagged corpus. Perhaps most significantly, total occurrences of word-classes in a particular corpus can be counted. The word-classes, rather than the individual words, that collocate with a given item can be listed. Finally, the frequency of sequences of tags can be calculated and corpora can be compared in this respect.

Corpus tagging needs to be done automatically, that is, by a computer programmed to recognise parts of speech, otherwise the labour of adding tags by hand would outweigh the advantages of having them. Programs that assign tags (taggers) tend to work on a mixture of two principles: rules governing word-classes, and probability. When applying the rules fails to identify the word-class, many taggers use probability, based on the overall frequency of the word and word-class.

Automatic taggers are usually claimed to have an accuracy rate of over 90%. It is important, when using a tagged corpus, to remember that the tagger may be wrong, and that the human user’s judgment is more reliable in individual cases. This is particularly so when a word is being used in an unusual way. Taggers work with some kind of lexicon or dictionary, which determines what parts of speech a given word is known to be. No other tag can be assigned to that word. If the lexicon is wrong, the tagger cannot possibly be right. A further point to note is that inaccuracies produced by a tagger are not usually spread evenly throughout a corpus. The mistakes a tagger makes will be clustered around words which have several possible tags.
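The combination of lexicon, rules and probability described above can be illustrated with a toy tagger. Everything here is an assumption made for the sketch: the mini-lexicon, the tag names, the two context rules and the probabilities are invented, and real taggers are far more elaborate.

```python
# Hypothetical mini-lexicon: each word lists the only tags it may receive,
# with an invented probability for each tag.
LEXICON = {
    "the":  {"DET": 1.0},
    "to":   {"TO": 1.0},
    "i":    {"PRON": 1.0},
    "read": {"VERB": 1.0},
    "book": {"NOUN": 0.8, "VERB": 0.2},
    "play": {"VERB": 0.6, "NOUN": 0.4},
}

def tag(tokens):
    """Tag each token using context rules first, then the most probable lexicon tag."""
    tagged = []
    for word in tokens:
        options = LEXICON.get(word.lower())
        if options is None:
            choice = "UNKNOWN"  # no lexicon entry, so the toy tagger cannot guess
        else:
            previous = tagged[-1][1] if tagged else None
            if previous == "DET" and "NOUN" in options:
                choice = "NOUN"  # rule: a determiner is followed by a noun
            elif previous == "TO" and "VERB" in options:
                choice = "VERB"  # rule: infinitive "to" is followed by a verb
            else:
                choice = max(options, key=options.get)  # fall back on probability
        tagged.append((word, choice))
    return tagged

print(tag("I read the play".split()))
print(tag("to book".split()))
print(tag(["play"]))
```

Only "book" and "play" can ever be tagged wrongly here, which mirrors the point that tagging errors cluster around words with several possible tags.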

Annotation of anaphora

One of the most important features of a text is cohesion: the use of words and phrases in a text to refer to preceding or subsequent words and phrases. Some cohesive items are used to summarise, label or encapsulate chunks of discourse, thus playing a role in the organisation of text. The term anaphora is used in schemes which annotate the cohesion in texts, with the term anaphor being used for the cohesive item. Different systems of annotation can be used to analyse the anaphora in a text, but most do some or all of the following:

  • Identify an anaphor and its antecedent (the words or phrase it refers to), or establish whether an antecedent is identifiable or not;
  • Categorise the antecedent;
  • Identify the direction of connection;
  • Identify the type of anaphor;
  • Note the distance between an anaphor and its antecedent.

The most interesting aspect of this is the linking of each anaphor with its antecedent. This involves allocating a number to each antecedent, with the same number being given to the anaphors that refer to it. In this way it is possible to track the development of a text, showing which people or things are referred to most frequently and how the text is progressively chunked.
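The numbering scheme can be pictured with a small data structure. This is a sketch under stated assumptions: the `Mention` record, its field names and the example mentions are invented, and real annotation schemes record more (type of anaphor, direction, category of antecedent).

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Mention:
    position: int     # token index of the mention in the text
    text: str
    ref_id: int       # the same number for an antecedent and its anaphors
    is_anaphor: bool  # False for the antecedent itself

# Invented annotations for an imaginary short text
mentions = [
    Mention(0, "Maria", 1, False),
    Mention(4, "a report", 2, False),
    Mention(7, "she", 1, True),
    Mention(9, "it", 2, True),
    Mention(14, "her", 1, True),
]

def reference_counts(mentions):
    """How many anaphors point back to each numbered antecedent."""
    return Counter(m.ref_id for m in mentions if m.is_anaphor)

def distances(mentions):
    """Distance in tokens from each anaphor to its antecedent."""
    start = {m.ref_id: m.position for m in mentions if not m.is_anaphor}
    return [(m.text, m.position - start[m.ref_id]) for m in mentions if m.is_anaphor]

print(reference_counts(mentions))
print(distances(mentions))
```

Counting by `ref_id` shows which person or thing is referred to most often, and the distances give the last item in the list above.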

Semantic annotation

Semantic annotation refers to the categorisation of words and phrases in a corpus in terms of a set of semantic fields. Each word or multi-word item from a tagged corpus is matched against a lexicon in which the items are assigned to a semantic field. The outcome is a string of annotations which can then form the basis of calculations which in turn provide an analysis of the content of each part of the corpus. In an application of this method, Thomas and Wilson describe an analysis of interactions between doctors and patients in two clinics. Using the system of semantic classification devised, the computer calculates the most frequent meanings made by doctors, health workers and patients. Thomas and Wilson’s research again illustrates a synergy between methods.
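The lexicon-matching step and the per-speaker counts can be sketched as below. The semantic fields, the lexicon entries and the utterances are all invented for illustration and are not taken from Thomas and Wilson's study; for simplicity the sketch matches single words only, not multi-word items.

```python
from collections import Counter

# Hypothetical semantic lexicon mapping words to invented field labels
SEMANTIC_LEXICON = {
    "pain": "BODY_AND_HEALTH",
    "tablet": "BODY_AND_HEALTH",
    "worried": "EMOTION",
    "afraid": "EMOTION",
    "tomorrow": "TIME",
    "week": "TIME",
}

def annotate(tokens):
    """Pair each token with its semantic field, or UNCLASSIFIED if not in the lexicon."""
    return [(t, SEMANTIC_LEXICON.get(t.lower(), "UNCLASSIFIED")) for t in tokens]

def field_profile(utterances):
    """Count semantic fields per speaker from (speaker, text) pairs."""
    profile = {}
    for speaker, text in utterances:
        counter = profile.setdefault(speaker, Counter())
        for _, field in annotate(text.split()):
            if field != "UNCLASSIFIED":
                counter[field] += 1
    return profile

utterances = [
    ("doctor", "take one tablet tomorrow"),
    ("patient", "I am worried about the pain"),
    ("doctor", "the pain should ease in a week"),
]
profile = field_profile(utterances)
print(profile)
```

Once each speaker has a field profile, the most frequent meanings per group can be read off with `Counter.most_common`.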

“How a meaning is made”

The final kind of annotation to be mentioned here is a variation on semantic annotation. This is partial annotation, in that only certain categories are selected. Calculations can then be made in terms of how frequently a meaning is made in a number of registers, and how the meaning is most frequently made in each register.

Corpus annotation of this kind provides a basis for approaching a corpus from the point of view of meaning first and can be linked with a notional approach to language teaching. A local grammar attempts to describe the resources for only one set of meanings in a language, rather than for a language as a whole.

Issues in annotation

The accounts of various annotation projects given above indicate something of the range of applications of corpus annotation. Many research projects depend on particular kinds of annotation, and it may seem rude to suggest that annotation could be anything other than advantageous. There are, however, several interlinked issues connected with annotation which are worth considering.

There are three basic methods of annotating a corpus (manual, computer-assisted and automatic), but only the latter two are suitable for use on anything but the smallest corpora. In automatic annotation, the computer works alone, following whatever rules and algorithms the programmer has determined. Once the program has been written, a corpus of any size can be annotated relatively quickly. On the other hand, an automatic annotation program is unlikely to produce results that are 100% in accordance with what a human researcher would produce; in other words, there are likely to be errors. A computer-assisted annotation program allows the human researcher to edit the computer-generated output, or provides an interactive interface through which manual annotation can be performed with the minimum of effort. Computer-assisted annotation is slower than automatic annotation, allowing for less corpus material to be annotated, but it is likely to be more accurate.

The difficulty in automating most annotation procedures has implications for corpus design and research. Access to large, tagged corpora is easily available, but parsed corpus data is much less widely available to the public, often only within specific research projects. A heavily annotated corpus, even an edited parsed corpus, is a valuable commodity and cannot lightly be discarded. To a certain extent, the benefits to be found in corpus data (the availability of large amounts of current material) tend to be undermined by annotation.

Competing methods

This survey of methods of investigating a corpus has dealt with concordance lines, collocational statistics and corpus annotation, moving from the least to the most complex. It began with methods that are based on word-forms alone and the co-occurrence of sets of words, and moved to methods that put words into categories. The preference for prioritising words alone tends to go along with a preference for a plain text corpus, that is, one with a minimum of annotation. The