Corpus Linguistics: An Overview of Methods and Applications - Prof. Sasso | Appunti di Lingua Inglese

CHAPTER 7

DISCOURSE ANALYSIS

David Crystal

This chapter is fascinating because it shows us how corpus linguistics has changed the way we look at discourse. In the past, researchers often worked with very small amounts of data – maybe a couple of conversations or a short text – which made it difficult to generalize their findings. With corpora, things are very different. We can now work with millions of words of authentic language, systematically collected and stored electronically. This means we can observe recurring patterns, compare genres, and make much stronger claims about how language is actually used.

What exactly is a corpus?

Well, simply put, a corpus is a collection of authentic spoken or written texts. The key thing is that these texts are real, not invented for the study. And because they are stored electronically, we can search them, count occurrences, and analyze them in many different ways. For example, we can look at the collocates of a word – the words that tend to occur near it – or we can compare how often certain structures occur in spoken versus written English. But here’s something important: corpora are not perfect mirrors of the entire language. They reflect only the data they contain. So whenever we use a corpus, we need to be aware of its scope and its limitations.
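To make the idea of collocates concrete, here is a minimal sketch in Python. It counts the words that appear within one token of a chosen word. The sample sentences, the function names and the window size are all made up for illustration; they do not come from any real corpus or corpus tool.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def collocates(tokens, node, window=1):
    """Count the words found within `window` tokens of each occurrence of `node`.

    Simplification: the window ignores sentence boundaries.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts


# Invented sample text, not real corpus data.
sample = (
    "The young girl smiled. The old lady smiled. "
    "A young girl waved at the old lady."
)
tokens = tokenize(sample)
girl_collocates = collocates(tokens, "girl")
lady_collocates = collocates(tokens, "lady")

print(girl_collocates.most_common(1))  # [('young', 2)]
print(lady_collocates.most_common(1))  # [('old', 2)]
```

With a real corpus we would use a much larger text and usually a statistical association measure rather than raw counts, but the basic logic of searching and counting is the same.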

Now, there are different kinds of corpora, and this is where things get interesting. On the one hand, we have general corpora, which try to represent the language as a whole. These are broad, often very large, and they give us an overview of English in general. For instance, we could use a general corpus to compare how the words "girl" and "lady" are used, or how often hedges like "sort of" or "kind of" appear. On the other hand, we have specialized corpora, which are much more focused. These are built to represent a very specific text type, such as academic lectures, student essays, or casual conversations.

Let me give you some concrete examples of specialized corpora. One of the most famous is MICASE, the Michigan Corpus of Academic Spoken English. It includes lectures, seminars, office hours, and group discussions. Researchers have found that hedges like "sort of" and "kind of" are more common in the humanities than in the sciences, and that discourse markers like "OK", "so", and "now" often mark transitions in lectures. What I find really interesting is that MICASE is not just a research tool: it’s also used in practice, for example to train international teaching assistants or to develop oral presentation courses.
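A comparison like the one about hedges only works if we normalize the counts, because sub-corpora differ in size. The sketch below computes hedge occurrences per 1,000 words. The two short "transcripts" are invented for illustration and are not MICASE data; the function name `hedge_rate` is also just an assumption of this example.

```python
import re

HEDGES = ("sort of", "kind of")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def hedge_rate(text, hedges=HEDGES, per=1000):
    """Return the number of hedge occurrences per `per` words.

    Simplification: every match counts as a hedge, so a non-hedging
    use such as "a kind of bird" would be counted too.
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    joined = " ".join(tokens)
    hits = sum(
        len(re.findall(r"\b" + re.escape(h) + r"\b", joined)) for h in hedges
    )
    return hits / len(tokens) * per


# Invented mini-transcripts, not real corpus data.
humanities = "so this is sort of a metaphor and kind of a joke"
sciences = "the reaction rate is kind of sensitive to temperature in this range"

print(round(hedge_rate(humanities), 2))  # 166.67
print(round(hedge_rate(sciences), 2))    # 83.33
```

In real research the counts would be checked by hand, since only a human reader can decide whether a given "kind of" is really working as a hedge.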

Another corpus is BASE, the British Academic Spoken English corpus. One fascinating finding from BASE was about lecture speed and lexical density, that is, the proportion of content words in a text. Faster lectures tend to be less lexically dense, while slower lectures are more lexically dense. And that makes perfect sense: when lecturers are covering difficult material, they slow down; when they tell anecdotes, they speed up. The problem is that many textbooks for students don’t reflect this natural variation, which can make real lectures harder to follow.
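The relationship between speed and lexical density can be sketched in a few lines of Python. Lexical density is approximated here as the share of tokens that are not in a small, hand-picked list of function words; a real study would normally rely on part-of-speech tagging instead. The word list, the two sample passages and their durations are all assumptions made for this example, not BASE data.

```python
import re

# Tiny hand-picked list of function words and fillers (an assumption).
FUNCTION_WORDS = {
    "a", "an", "the", "and", "but", "or", "so", "of", "in", "on", "at",
    "to", "under", "is", "are", "was", "were", "be", "it", "i", "you",
    "we", "they", "that", "this", "well", "like",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_density(text):
    """Return the proportion of tokens that are not function words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens)


def words_per_minute(text, seconds):
    """Return the speech rate in words per minute for a timed passage."""
    return len(tokenize(text)) * 60 / seconds


# Invented passages with invented durations in seconds.
anecdote = "so i was in the lab and it was like well you know it was a mess"
explanation = (
    "enzyme kinetics describes reaction rates "
    "under varying substrate concentration"
)

print(words_per_minute(anecdote, 5))     # 204.0
print(words_per_minute(explanation, 6))  # 90.0
print(lexical_density(anecdote) < lexical_density(explanation))  # True
```

In this toy example the fast anecdote has a low density and the slow explanation has a high one, which is the pattern described above.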

There is also BAWE, the British Academic Written English corpus. This one collects student

assignments from different disciplines, and it’s very rich in metadata: we know the discipline,
