Corpus Linguistics: An Overview of Methods and Applications - Prof. Sasso, English Language Notes

An overview of corpus linguistics, detailing its methodologies, types of corpora (general and specialized), and applications in discourse analysis and academic writing. It explores how corpora are used to analyze language patterns, study academic writing, and understand conversational discourse. The document also addresses criticisms of corpus studies and emphasizes the importance of combining corpus analysis with contextual data. Examples of specialized corpora such as MICASE, BASE, BAWE, and the TOEFL 2000 Corpus are provided, illustrating the practical applications of corpus linguistics in various academic contexts. Useful for students and researchers interested in language analysis and discourse studies, offering insights into both the theoretical and practical aspects of corpus linguistics.

Type: Notes

2023/2024

Uploaded on 22/09/2025

CHAPTER 7

DISCOURSE ANALYSIS

David Crystal

This chapter is fascinating because it shows us how corpus linguistics has changed the way we look at discourse. In the past, researchers often worked with very small amounts of data – maybe a couple of conversations or a short text – which of course made it difficult to generalize their findings. With corpora, however, things are completely different. We can now work with millions of words of authentic language, systematically collected and stored electronically. This means we can observe recurring patterns, compare genres, and make much stronger claims about how language is actually used.

What exactly is a corpus? Well, simply put, a corpus is a collection of authentic spoken or written texts. The key thing is that these texts are real, not invented for the study. And because they are stored electronically, we can search them, count occurrences, and analyze them in many different ways. For example, we can look at the collocates of a word – the words that tend to appear next to it – or we can compare how often certain structures occur in spoken versus written English. But here’s something important: corpora are not perfect mirrors of the entire language. They reflect only the data they contain. So whenever we use a corpus, we always need to be aware of its scope and its limitations.
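To make "searching and counting" concrete, here is a minimal Python sketch of a collocate count. It is only an illustration under stated assumptions: the sample text is invented (so it is not authentic corpus data), the tokenizer is a crude regular expression, and the function names `tokenize` and `collocates` are hypothetical rather than taken from any corpus tool.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z']+", text.lower())


def collocates(tokens, node, window=1):
    """Count the words found within `window` positions of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts


# Invented toy text, for illustration only; a real study would load corpus files.
sample = (
    "The little girl smiled. The old lady smiled back. "
    "A little girl and an old lady sat together."
)
tokens = tokenize(sample)
girl_collocates = collocates(tokens, "girl")
lady_collocates = collocates(tokens, "lady")

print(girl_collocates.most_common(3))
print(lady_collocates.most_common(3))
```

With a window of one word, the sketch only sees immediate neighbours. Widening `window` picks up looser associations, at the cost of more noise.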

Now, there are different kinds of corpora, and this is where things get interesting. On the one hand, we have general corpora, which try to represent the language as a whole. These are broad, often very large, and they give us an overview of English in general. For instance, we could use a general corpus to compare how the words girl and lady are used, or how often hedges like sort of or kind of appear. On the other hand, we have specialized corpora, which are much more focused. These are built to represent a very specific text type, such as academic lectures, student essays, or even casual conversations.

Let me give you some concrete examples of specialized corpora. One of the most famous is MICASE, the Michigan Corpus of Academic Spoken English. It includes lectures, seminars, office hours, and group discussions. Researchers have found that hedges like sort of and kind of are more common in the humanities than in the sciences, and that discourse markers like OK, so, and now often mark transitions in lectures. What I find really interesting is that MICASE is not just a research tool: it’s also used in practice, for example to train international teaching assistants or to develop oral presentation courses.
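A comparison like "more common in the humanities than in the sciences" only makes sense if the counts are normalized, because subcorpora differ in size. The sketch below shows one common way to do that, a rate per 1,000 words. The two sample strings are invented and are not MICASE data, and `hedge_rate` is a hypothetical helper name.

```python
import re

# Two-word hedges mentioned in the notes.
HEDGES = ("sort of", "kind of")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def hedge_rate(text, hedges=HEDGES, per=1000):
    """Return the number of hedge occurrences per `per` words of running text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    # Build two-word sequences so that multi-word hedges can be matched.
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    hits = sum(bigrams.count(h) for h in hedges)
    return hits * per / len(tokens)


# Invented samples standing in for two subcorpora (not real lecture data).
humanities = "so this is sort of kind of a difficult question"
sciences = "the reaction is sort of slow at this low temperature"

print(hedge_rate(humanities))
print(hedge_rate(sciences))
```

Note that plain string matching cannot tell a hedge ("it's sort of odd") from a literal use ("this sort of problem"), so a real study would check concordance lines by hand or use annotation.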

Another corpus is BASE, the British Academic Spoken English corpus. One fascinating finding from BASE was about lecture speed and lexical density. Faster lectures tend to be less dense, while slower lectures are more lexically dense. And that makes perfect sense: when lecturers are covering difficult material, they slow down; when they tell anecdotes, they speed up. The problem is that many textbooks for students don’t reflect this natural variation, which can make real lectures harder to follow.
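The two measures behind this finding can be sketched quite simply. Lexical density is usually taken as the share of content words in a text, and speed as words per minute. The code below is a rough approximation under loud assumptions: the function-word list is tiny and hand-picked (real studies normally rely on part-of-speech tagging), and the two samples and their durations are invented, not BASE data.

```python
import re

# A small, hand-picked set of function words; an assumption for illustration only.
FUNCTION_WORDS = {
    "the", "a", "an", "of", "in", "on", "to", "and", "but", "or",
    "is", "are", "was", "were", "it", "this", "that", "we", "you",
    "i", "so", "well",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_density(text):
    """Return the proportion of tokens that are not in FUNCTION_WORDS."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens)


def words_per_minute(text, seconds):
    """Return the speech rate, given the duration of the stretch in seconds."""
    return len(tokenize(text)) * 60 / seconds


# Invented stretches of lecture talk with invented durations.
technical = "Enzymes lower activation energy"
anecdote = "so well it was a cat and it was on the table"

print(lexical_density(technical), words_per_minute(technical, 4))
print(lexical_density(anecdote), words_per_minute(anecdote, 4))
```

In this toy case the slow, technical stretch is fully lexical while the fast anecdote is mostly function words, which mirrors the direction of the finding but of course proves nothing about real lectures.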

There is also BAWE, the British Academic Written English corpus. This one collects student assignments from different disciplines, and it’s very rich in metadata: we know the discipline, the year of study, the gender of the student, and even the grade awarded. This means we can study how writing practices change across levels and subjects.

And then we have the TOEFL 2000 Corpus, which includes both spoken and written academic English from US universities. One important discovery was that spoken academic genres – for example, classroom teaching – are much closer to conversation than to written academic prose. In other words, lectures are not just “spoken writing”: they have their own interactive, conversational flavor.

Sometimes, though, researchers need to create their own corpora. For example, Ken Hyland built a corpus of Hong Kong student essays to study pronouns like I and we. He was interested in how students position themselves in their writing. Harwood created a corpus of journal articles to investigate how I and we are used in different disciplines. And Ooi compiled a corpus of online personal ads to study how people construct identities in that genre. What these cases show us is that corpora don’t always need to be enormous. Even small, carefully built corpora can be incredibly insightful.

But, of course, constructing a corpus raises a lot of questions. What kinds of texts should we include? Should they be spoken or written? Should we balance different genres? How large should the corpus be? And what about sociolinguistic variables – the age, gender, or nationality of speakers? Another important decision is whether the corpus should be static, like a snapshot of language at a given time, or dynamic, meaning updated regularly to capture changes. In practice, corpus design is rarely linear – it’s usually a cycle of planning, collecting, analyzing, and refining.

One of the largest and most influential corpora is the Longman Spoken and Written English corpus, or LSWE, which contains about 40 million words. It was the basis for the Longman Grammar of Spoken and Written English, and it covers conversations, fiction, news, academic prose, and more. The conversational data is especially extensive and diverse, representing different speakers and situations.

And what did LSWE tell us? Well, one of its key contributions was to show the distinctive nature of conversation. Conversation often uses non-clausal units like Right? or Why?, which are perfectly functional without a full subject and verb. It relies heavily on pronouns and ellipsis, since speakers share context and don’t need to say everything explicitly. It uses situational ellipsis, as in Common sense instead of It is common sense. There’s also a lot of repetition and many lexical bundles like you know, I mean, I’m just saying, which help speakers manage time and structure.
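Lexical bundles are, in essence, word sequences that keep recurring, so a first approximation is to count n-grams and keep the frequent ones. The sketch below does exactly that on an invented utterance. The helper names and the `min_count` threshold are assumptions for illustration, and published studies apply frequency and dispersion criteria across many texts that this sketch leaves out.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens, joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def lexical_bundles(text, n=2, min_count=2):
    """Return the n-grams that occur at least `min_count` times, with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(ngrams(tokens, n))
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Invented utterance, for illustration only (not LSWE data).
talk = "you know I mean it was late you know and I mean really late"
bundles = lexical_bundles(talk)

print(bundles)
```

Raising `n` to 3 or 4 targets longer sequences such as I’m just saying, which need far more data than a single utterance before they recur.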

And then, of course, conversation is full of performance features: pauses, fillers like um and er, launchers like well and right, short responses like yeah or uh huh, and extended coordination with and or but. These reflect the fact that in conversation, we are planning and speaking at the same time.

Researchers have summarized this with three principles: keep the conversation going, plan only a little ahead, and qualify what you have said with clarifications or afterthoughts. That’s why conversation often sounds a bit fragmented – but that’s actually what makes it efficient and interactive.

SUMMARY

This chapter explains what corpora are, the different types we can use, and how they help us understand spoken and written discourse. It also deals with methodological issues, practical applications, and even criticisms of corpus studies.

  1. What is a corpus?

A corpus is basically a collection of authentic texts, spoken or written, which are stored electronically so that we can analyze them. What makes corpora powerful is that they are usually much larger than the small sets of data traditionally used in discourse analysis. With corpora, we can look at frequencies, collocations, or typical patterns in genres and registers. But we must always keep in mind the source and scope of a corpus, because not every corpus is fully representative of language in general.

  2. Kinds of corpora

We usually distinguish between general corpora and specialized corpora.

  • A general corpus represents the language in the broadest sense. For example, we can study collocates of girl and lady across English as a whole.
  • A specialized corpus, on the other hand, focuses on a specific domain or genre: for instance, student essays, academic lectures, or casual conversations.

  3. Examples of specialized corpora

Let me give you some examples:

  • MICASE, the Michigan Corpus of Academic Spoken English, contains data from many spoken academic genres. It has been used to study hedges like sort of or discourse markers like OK and so.
  • BASE, the British Academic Spoken English corpus, includes university lectures. One study showed that faster lectures tend to be less lexically dense, while slower lectures are more dense.
  • BAWE, the British Academic Written English corpus, collects student assignments in different disciplines, together with contextual information such as gender, year of study, and grade.
  • Finally, the TOEFL 2000 corpus covers both spoken and written academic language in US universities. One important finding was that classroom talk often resembles conversation more than academic writing.

  4. Designing and building corpora

Sometimes, if no corpus exists for a certain research question, scholars create their own. For example, Hyland studied personal pronouns in Hong Kong students’ writing; Harwood analyzed I and we in research articles; and Ooi collected online personal ads to examine linguistic features of that genre. Interestingly, even small corpora can be very effective if they are carefully targeted.

  5. Issues in corpus construction

There are several issues to consider:

  • What kind of texts should be included? Spoken or written? Monologic or dialogic?
  • How large should the corpus be, and how many texts of each type?
  • Which sociolinguistic variables, such as the age, gender, or education of participants, should be recorded?
  • Should the corpus be static (a snapshot of one moment in time) or dynamic (regularly updated to monitor change)?

  6. The Longman Spoken and Written English Corpus

One of the most important corpora is the LSWE corpus, with about 40 million words. It includes conversation, fiction, news, academic prose, and more. It was the basis for the Longman Grammar of Spoken and Written English, which aimed to describe English grammar as it is really used, especially in conversation.

  7. Conversational discourse

Thanks to LSWE, we know much more about how conversation works. Some typical features are:

  • Non-clausal units, like Right? or Why?.
  • Heavy use of pronouns and ellipsis, since meaning is shared through context.
  • Repetition, often for emphasis or echoing what the other person said.
  • Lexical bundles like you know, I’m just saying, which help speakers hold the floor or organize what they are saying.

Conversation also shows performance phenomena: pauses, fillers like um and er, attention signals such as names, and short responses like yeah or uh huh. The general principle is to keep talking while planning ahead, which explains many of these features.

Corpora provide us with large, reliable data to understand how people actually use language in both spoken and written contexts. They reveal typical features of conversation, help us understand academic writing, and even show us how social identities are constructed in discourse. At the same time, corpora must be used carefully, with an awareness of their limitations and the importance of context.