



This document gives an overview of corpus linguistics, detailing its methodologies, types of corpora (general and specialized), and applications in discourse analysis and academic writing. It explores how corpora are used to analyze language patterns, study academic writing, and understand conversational discourse. It also addresses criticisms of corpus studies and emphasizes the importance of combining corpus analysis with contextual data. Examples of specialized corpora such as MICASE, BASE, BAWE, and the TOEFL 2000 Corpus illustrate the practical applications of corpus linguistics in various academic contexts. The notes are useful for students and researchers interested in language analysis and discourse studies, offering insights into both the theoretical and practical aspects of corpus linguistics.
Type: Notes




David Crystal
This chapter is fascinating because it shows us how corpus linguistics has changed the way we look at discourse. In the past, researchers often worked with very small amounts of data.
What exactly is a corpus? Well, simply put, a corpus is a collection of authentic spoken or written texts. The key thing is that these texts are real, not invented for the study. And because they are stored electronically, we can search them, count occurrences, and analyze them in many different ways. For example, we can look at the collocates of a word – the words that tend to appear next to it – or we can compare how often certain structures occur in spoken versus written English. But here’s something important: corpora are not perfect mirrors of the entire language. They reflect only the data they contain. So whenever we use a corpus, we always need to be aware of its scope and its limitations.
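To make the idea of collocates concrete, here is a minimal sketch in Python. It is not taken from any particular corpus tool: the toy text, the function names `tokenize` and `collocates`, and the window size are all assumptions made for illustration. It simply counts the words that appear within a fixed window around a node word.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def collocates(tokens, node, window=2):
    """Count words occurring within `window` tokens to the left or right of `node`.

    Note: this simple version ignores sentence boundaries, so the window
    can reach into a neighbouring sentence.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        counts.update(left + right)
    return counts


# Invented toy text, standing in for a real corpus file.
sample = (
    "The young girl smiled. The old lady smiled at the young girl. "
    "A young girl and an old lady walked home."
)
tokens = tokenize(sample)
girl_collocates = collocates(tokens, "girl", window=1)
lady_collocates = collocates(tokens, "lady", window=1)
print(girl_collocates.most_common(3))
print(lady_collocates.most_common(3))
```

On a toy text like this the counts mean very little, which is exactly the point made above: the result reflects only the data the corpus contains.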
Now, there are different kinds of corpora, and this is where things get interesting. On the one hand, we have general corpora, which try to represent the language as a whole. These are broad, often very large, and they give us an overview of English in general. For instance, we could use a general corpus to compare how the words girl and lady are used, or how often hedges like sort of or kind of appear. On the other hand, we have specialized corpora, which are much more focused. These are built to represent a very specific text type, such as academic lectures, student essays, or even casual conversations.
Let me give you some concrete examples of specialized corpora. One of the most famous is MICASE, the Michigan Corpus of Academic Spoken English. It includes lectures, seminars, office hours, and group discussions. Researchers have found that hedges like sort of and kind of are more common in the humanities than in the sciences, and that discourse markers like OK, so, and now often mark transitions in lectures. What I find really interesting is that MICASE is not just a research tool: it’s also used in practice, for example to train international teaching assistants or to develop oral presentation courses.
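Comparisons like "more common in the humanities than in the sciences" only make sense if we normalize for the size of each subcorpus. Here is a rough sketch of that step, assuming two invented snippets in place of real MICASE transcripts. The names `per_thousand` and `hedge_rate` are hypothetical, and plain string matching will overcount, since sort of is not always a hedge.

```python
import re

HEDGES = ["sort of", "kind of"]


def per_thousand(text, phrase):
    """Return occurrences of `phrase` per 1,000 word tokens of `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    joined = " ".join(tokens)
    # Whole-word match, so "kind of" is not found inside "mankind of".
    hits = len(re.findall(r"\b" + re.escape(phrase) + r"\b", joined))
    return 1000 * hits / len(tokens)


def hedge_rate(text):
    """Sum the normalized frequencies of all phrases in HEDGES."""
    return sum(per_thousand(text, hedge) for hedge in HEDGES)


# Invented examples, not real corpus data.
humanities = "so this is sort of a reading of the text and it is kind of open"
sciences = (
    "the reaction is sort of slow so we heat the sample "
    "and measure the rate again"
)

humanities_rate = hedge_rate(humanities)
sciences_rate = hedge_rate(sciences)
print(humanities_rate, sciences_rate)
```

Corpus studies often report such rates per thousand or per million words, so that subcorpora of different sizes can be compared directly.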
Another corpus is BASE, the British Academic Spoken English corpus. One fascinating finding from BASE was about lecture speed and lexical density. Faster lectures tend to be less dense, while slower lectures are more lexically dense. And that makes perfect sense: when lecturers are covering difficult material, they slow down; when they tell anecdotes, they speed up. The problem is that many textbooks for students don’t reflect this natural variation, which can make real lectures harder to follow.
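The two measures behind this finding can be sketched very simply. Lexical density is usually understood as the proportion of content words among all words, and speed as words per minute. The version below is only an approximation under stated assumptions: the tiny `FUNCTION_WORDS` list, the two invented snippets, and their durations are all made up for illustration, and real studies would normally identify content words with part-of-speech tagging.

```python
import re

# A deliberately tiny function-word list (an assumption for this sketch).
FUNCTION_WORDS = {
    "a", "an", "the", "and", "but", "or", "so", "of", "in", "on", "at", "to",
    "is", "are", "was", "were", "be", "it", "this", "that", "i", "you", "we",
    "he", "she", "they", "my", "me", "then", "when", "with", "for", "as",
}


def words(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_density(text):
    """Share of tokens that are not in FUNCTION_WORDS (0.0 for empty text)."""
    tokens = words(text)
    if not tokens:
        return 0.0
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens)


def words_per_minute(text, seconds):
    """Speech rate, given how long the stretch of talk lasted in seconds."""
    return len(words(text)) * 60 / seconds


# Invented snippets with invented durations, not BASE data.
anecdote = "so then i was at the station and it was late and i was on my own"
explanation = "lexical density measures the proportion of content words in a text"

for label, text, seconds in [("anecdote", anecdote, 5), ("explanation", explanation, 6)]:
    print(label, round(lexical_density(text), 2), words_per_minute(text, seconds))
```

In this made-up pair, the quick anecdote comes out less dense than the slower explanation, which mirrors the pattern described above without proving anything about real lectures.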
There is also BAWE, the British Academic Written English corpus. This one collects student assignments from different disciplines, and it’s very rich in metadata: we know the discipline, the year of study, the gender of the student, and even the grade awarded. This means we can study how writing practices change across levels and subjects.
And then we have the TOEFL 2000 Corpus, which includes both spoken and written academic English from US universities. One important discovery was that spoken academic genres – for example, classroom teaching – are much closer to conversation than to written academic prose. In other words, lectures are not just “spoken writing”: they have their own interactive, conversational flavor.
Sometimes, though, researchers need to create their own corpora. For example, Ken Hyland built a corpus of Hong Kong student essays to study pronouns like I and we. He was interested in how students position themselves in their writing. Harwood created a corpus of journal articles to investigate how I and we are used in different disciplines. And Ooi compiled a corpus of online personal ads to study how people construct identities in that genre. What these cases show us is that corpora don’t always need to be enormous. Even small, carefully built corpora can be incredibly insightful.
But, of course, constructing a corpus raises a lot of questions. What kinds of texts should we include? Should they be spoken or written? Should we balance different genres? How large should the corpus be? And what about sociolinguistic variables – the age, gender, or nationality of speakers? Another important decision is whether the corpus should be static, like a snapshot of language at a given time, or dynamic, meaning updated regularly to capture changes. In practice, corpus design is rarely linear – it’s usually a cycle of planning, collecting, analyzing, and refining.
One of the largest and most influential corpora is the Longman Spoken and Written English corpus, or LSWE, which contains about 40 million words. It was the basis for the Longman Grammar of Spoken and Written English, and it covers conversations, fiction, news, academic prose, and more. The conversational data is especially extensive and diverse, representing different speakers and situations.
And what did LSWE tell us? Well, one of its key contributions was to show the distinctive nature of conversation. Conversation often uses non-clausal units like Right? or Why?, which are perfectly functional without a full subject and verb. It relies heavily on pronouns and ellipsis, since speakers share context and don’t need to say everything explicitly. It uses situational ellipsis, as in Common sense instead of It is common sense. There’s also a lot of repetition and many lexical bundles like you know, I mean, I’m just saying, which help speakers manage time and structure.
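Lexical bundles are, in practice, word sequences that recur again and again, so one simple way to look for them is to count n-grams and keep the frequent ones. The sketch below assumes an invented three-turn conversation and arbitrary values for `n` and `min_count`; the function name `lexical_bundles` is hypothetical, and published studies apply much stricter frequency and distribution thresholds.

```python
from collections import Counter
import re


def lexical_bundles(utterances, n=2, min_count=2):
    """Return n-word sequences that occur at least `min_count` times.

    N-grams are built inside each utterance only, so a bundle never
    spans two speaker turns.
    """
    counts = Counter()
    for utterance in utterances:
        tokens = re.findall(r"[a-z']+", utterance.lower())
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return {bundle: c for bundle, c in counts.items() if c >= min_count}


# Invented conversation, not LSWE data.
conversation = [
    "You know, I mean, it was fine.",
    "Yeah, I mean, you know what she said.",
    "I'm just saying, you know.",
]
bundles = lexical_bundles(conversation, n=2, min_count=2)
print(bundles)
```

Raising `n` to 3 or 4 would pick up longer bundles such as I'm just saying, provided they recur often enough in the data.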
And then, of course, conversation is full of performance features: pauses, fillers like um and er, launchers like well and right, short responses like yeah or uh huh, and extended coordination with and or but. These reflect the fact that in conversation, we are planning and speaking at the same time.
Researchers have summarized this with three principles: keep the conversation going, plan only a little ahead, and qualify what you have said with clarifications or afterthoughts. That’s why conversation often sounds a bit fragmented – but that’s actually what makes it efficient and interactive.
This chapter explains what corpora are, the different types we can use, and how they help us understand spoken and written discourse. It also deals with methodological issues, practical applications, and even criticisms of corpus studies.
⸻
A corpus is basically a collection of authentic texts, spoken or written, which are stored electronically so that we can analyze them. What makes corpora powerful is that they are usually much larger than the small sets of data traditionally used in discourse analysis. With corpora, we can look at frequencies, collocations, or typical patterns in genres and registers. But we must always keep in mind the source and scope of a corpus, because not every corpus is fully representative of language in general.
⸻
We usually distinguish between general corpora and specialized corpora.
⸻
Let me give you some examples of specialized corpora: MICASE, BASE, BAWE, and the TOEFL 2000 Corpus.
⸻
Sometimes, if no corpus exists for a certain research question, scholars create their own. For example, Hyland studied personal pronouns in Hong Kong students’ writing; Harwood analyzed I and we in research articles; and Ooi collected online personal ads to examine linguistic features of that genre. Interestingly, even small corpora can be very effective if they are carefully targeted.
⸻
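There are several issues to consider when building a corpus: what kinds of texts to include, whether they should be spoken or written, how to balance genres, how large the corpus should be, which sociolinguistic variables to record, and whether the corpus should be static or dynamic.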
⸻
One of the most important corpora is the LSWE corpus, with about 40 million words. It includes conversation, fiction, news, academic prose, and more. It was the basis for the Longman Grammar of Spoken and Written English, which aimed to describe English grammar as it is really used, especially in conversation.
⸻
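Thanks to LSWE, we know much more about how conversation works. Some typical features are non-clausal units such as Right? or Why?, heavy reliance on pronouns and ellipsis, frequent repetition, and lexical bundles like you know and I mean.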
Conversation also shows performance phenomena: pauses, fillers like um and er, attention signals such as names, and short responses like yeah or uh huh. The general principle is to keep talking while planning ahead, which explains many of these features.
⸻
Corpora provide us with large, reliable data to understand how people actually use language in both spoken and written contexts. They reveal typical features of conversation, help us understand academic writing, and even show us how social identities are constructed in discourse. At the same time, corpora must be used carefully, with an awareness of their limitations and the importance of context.