Corpus Linguistics and Discourse Analysis: Patterns and Power, Course summary for Linguistics

Summary of the text USING CORPORA IN DISCOURSE ANALYSIS by Paul Baker

Type: Course summary

2015/2016

On sale since 22/11/2016

Clarissa.RFr

USING CORPORA IN DISCOURSE ANALYSIS
Paul Baker
1. INTRODUCTION
This book is about using corpora and corpus processes in order to uncover linguistic patterns which can enable us to make sense of the ways that language is used in the construction of discourses. Some people may know a lot about discourse analysis but not about corpus linguistics; for others the opposite may be the case; for others still, both areas might be equally opaque. We will begin by giving a description of corpus linguistics and of discourse.
Corpus linguistics
Corpus linguistics is the study of language based on examples of real-life language use. Corpora are generally large (consisting of thousands or even millions of words), representative samples of a particular type of naturally occurring language, so they can be used as a standard reference against which claims about language can be measured. Electronic corpora are often annotated with additional linguistic information. Other types of information can also be encoded within corpora: for example, in spoken corpora (containing transcripts of dialogue), attributes such as sex, age, socio-economic group and region can be encoded for each participant. This allows language comparisons to be made between different types of speakers. Up until the 1970s only a small number of studies used corpus-based approaches, and it was in the 1980s that corpus linguistics became popular as a methodology. Between 1976 and 1991 corpus linguistics was employed in a number of areas of linguistics, including dictionary creation, the interpretation of literary texts, forensic linguistics, language description, language variation studies and language teaching materials.
Discourse
The term discourse is used in social and linguistic research in a number of inter-related yet different ways. In traditional linguistics it is defined as language above the sentence or above the clause. The term is also sometimes applied to different types of language use or topic: for example, we can talk about political discourse, colonial discourse, media discourse and environmental discourse. A number of researchers have used corpora to examine the discourse styles of people who are learners of English. Discourse can also be defined as practices which systematically form the objects of which they speak. To expand on this, a discourse is a system of statements which constructs an object: a set of meanings, metaphors, representations, images, stories, statements and so on that in some way together produce a particular version of events. Therefore, discourses are not valid descriptions of people’s beliefs or opinions, and they cannot be taken as representing an inner aspect of identity such as personality or attitude. They are connected to practices and structures that are lived out in society from day to day. Discourses can therefore be difficult to pin down or describe, because they are constantly changing, interacting with each other, breaking off and merging. One way that discourses are constructed is via language. Language is not the same as discourse, but we can carry out analyses of language in texts in order to uncover traces of discourses.

The shift to post-structuralism

Discourse analysts have used corpora in order to analyse data such as political texts, teaching materials, scientific writing and newspaper articles. Such studies have shown how corpus analysis can uncover ideologies and evidence for disadvantage. Corpus-based techniques have also been employed in studies which have attempted to analyse differences in language usage based on identity. Although a small number of researchers are applying corpus methodologies in discourse analysis, this is still a cross-disciplinary field which is somewhat under-subscribed and appears to be subject to some resistance. All methods of research have associated problems which need to be addressed, and all are limited in terms of what they can and cannot achieve. One criticism of corpus-based approaches is that they are too broad. Other researchers have problematized corpus work as constituting “linguistics applied” rather than applied linguistics: for example, Widdowson claims that corpus linguistics offers only a partial account of real language because it does not address the lack of correspondence between corpus findings and native speaker intuitions. Researchers should be encouraged to carry out corpus-based work which takes potential problems into account, perhaps supplementing their approach with other methodologies. There is no reason why corpus-based research on lexical items should not use diachronic corpora in order to track changes in word meaning and usage over time, and several large-scale corpus building projects have been carried out with the aim of creating historical corpora from different time periods.

Corpus linguistics also tends to be conceptualized as a quantitative method of analysis. Before the 1980s, corpus linguistics had struggled to make an impact upon linguistic research because computers were not sufficiently powerful or widely available to put the theoretical principles into practice. By the 1980s, an alternative means of producing knowledge had become available, roughly based around the concept of post-modernism and referred to as post-structuralism or social constructionism. Post-structuralists have developed close formulations between the concepts of language, ideology and hegemony, based on the work of writers such as Gramsci. One area in which corpus linguistics has excelled is in generating descriptive grammars of languages based on naturally occurring language use, but with a focus on language as an abstract system. A corpus linguistics approach can also be perceived as time-consuming. Large numbers of texts must first be collected, and their analysis often requires learning how to use computer programs to manipulate data. Access to corpora is not always easy, and it is often simply less effort to collect a smaller sample of data which can be transcribed and analysed by hand, without the need to use computers or mathematical formulae.

Advantages of the corpus-based approach to discourse analysis

Reducing researcher bias (bias: prejudice, predilection; error, falsehood).

Older empirical views were concerned with the removal of researcher bias in favour of empiricism and objectivity, whereas post-modern forms of research have argued that the unbiased researcher is in itself a “discourse of science”. Acknowledging our own biases should be a prerequisite for carrying out and reporting research. The term critical realism is useful here: it outlines an approach to social research which accepts that we perceive the world from a particular viewpoint, but that the world acts back on us to constrain the ways that we can perceive it. A lot of academic discourse is written in an impersonal, formal style, so introducing some sort of personal statement may still seem jarring, particularly in some disciplines.

[...] type of words which occur much more frequently than in the 1960 corpus. In addition, we can find that certain terms have become less frequent: girl, Mr and Mrs were more popular in 1960 than they were in 1990, suggesting that perhaps sexist discourses or formal ways of addressing people have become less common. It may also be that a word is no more or less frequent than it used to be, but that its meanings have changed over time. For example, in the early 1960s the word “blind” appears in a literal sense, referring to people or animals who cannot see, whereas in the 1990 corpus it is used in a range of more metaphorical (and negative) ways.

Triangulation

Tognini-Bonelli (2001) makes a useful distinction between corpus-based and corpus-driven investigations. A corpus-driven analysis proceeds in a more inductive way: the corpus is the data, and the patterns in it are noted as a way of expressing regularities in language. Triangulation is a term coined by Newby in 1977 and is now accepted by most researchers. There are several advantages of triangulation: it facilitates validity checks of hypotheses, it anchors findings in more robust interpretations and explanations, and it allows researchers to respond flexibly to unforeseen problems and aspects of their research.

Some concerns

Although corpus linguistics is a useful method of carrying out discourse analysis, there are still a few concerns which need to be discussed. First, corpus data is usually only language data (written or transcribed spoken), and discourses are not confined to verbal communication. Discourses can be embedded within images: for example, pictures of heterosexual couples often occur in advertising. In many cases discourses are produced via interaction between verbal and visual texts. The social conditions of production and interpretation of texts are important in helping the researcher understand the discourses surrounding them. Researchers may choose to interpret a corpus-based analysis of language in different ways, depending on their own positions. For example, people from socially disadvantaged groups tend to use more non-standard language and taboo terms than those from more advantaged groups; in this case such terms help to show identity and group membership. A corpus-based analysis will tend to place its focus on patterns, but frequent patterns of language do not necessarily imply underlying hegemonic discourses. The power of individual texts or speakers in a corpus may not be evenly distributed. General corpora are often composed of data from numerous sources (newspapers, novels, letters, etc.). We may be able to annotate texts in a corpus to take into account aspects of production and reception, such as author occupation/status or readership, but this will not always be possible. A hegemonic discourse can be most powerful when it does not even have to be invoked, because it is simply taken for granted. A corpus-based analysis of language is only one possible analysis out of many, and is open to contestation. It is an analysis which focuses on norms and frequent patterns within language, and there can be analyses of language that go against the norms of corpus data. Corpus linguistics does not provide a single way of analysing data. There are numerous ways of making sense of linguistic patterns: collocations, keywords, frequency lists, clusters, dispersion plots, etc. We may decide, for example, to investigate co-occurrences in a corpus in relation to how discourses are formed.

2. CORPUS BUILDING

Introduction

One of the potential problems with using corpora in the analysis of discourse is that the data is decontextualized. The relationship between different texts in a corpus, or between sentences in the same file, may be obscured in quantitative analyses. The process of finding and selecting texts, obtaining permissions, transferring to electronic format, checking and annotating files may also provide the researcher with initial hypotheses as certain patterns are noticed, and such hypotheses could form the basis for the first stages of corpus research.

Some types of corpora

Corpora can be categorized into types. In terms of discourse analysis, the first and most important type of corpus is the “specialized corpus”, used to study aspects of a particular variety or genre of language. A good example of a specialized corpus is the Michigan Corpus of Academic Spoken English, which consists of transcripts of spoken language recorded in academic institutions across America. It may be useful to make a distinction between corpora and text archives or databases. An archive is generally defined as being similar to a corpus, but the difference between them is that a corpus is designed for a particular “representative” function. An archive or database is simply a text repository, often huge and opportunistically collected, and normally not structured. Corpora tend to be more balanced. Archives or databases may contain all of the published work of a single author, or all of the editions of a newspaper from a given year. One aspect of traditional corpus building is sampling: many corpora are composed of a variety of texts, from which samples are taken. This technique is in place to ensure that the corpus is not skewed by the presence of a few very large texts taken from the same source. For the purposes of discourse analysis, it may be a good idea to build a corpus which includes samples taken at different points from complete texts. When using corpora for discourse analysis, it is also possible to carry out corpus-based analyses on much smaller amounts of data. If we are interested in examining a particular genre of language, then it is not usually necessary to build a corpus consisting of millions of words, especially if the genre is linguistically restricted in some way. One consideration when building a specialized corpus in order to investigate a particular subject is how often we would expect to find that subject mentioned within it. Therefore, when building a specialized corpus for the purposes of investigating a particular subject or set of subjects, we may want to be more selective in choosing our texts, meaning that the quality or content of the data takes equal or greater precedence over issues of quantity.

An aspect of corpus-based analysis that can often be extremely useful in terms of analysing discourses is the process of checking changes over time. A diachronic corpus is a corpus which has been built in order to be representative of a language or language variety over a particular period of time, making it possible for researchers to track linguistic changes within it. A diachronic corpus may not be able to take language change fully into account, but it can introduce a more dynamic aspect into corpus-based analysis. A reference corpus is what purists would refer to when they use the term “corpus”. It is a large corpus, usually consisting of millions of words from a wide range of texts.

Capturing data

There are also good reasons for building a specialized corpus. One of the easiest ways to collect corpus texts is to use data which already exists in electronic format. For example, the United Kingdom Parliament website contains full transcripts of daily debates from the British House of Lords and House of Commons. There are many internet archives: Bibliomania, the Oxford Text Archive, the Electronic Text Centre, etc. It is also possible to save files in other formats which retain the images, styles and layout of the page. A problem with saving files as plain text is that we need to assume that all of the language data we are collecting is going to be recognizable in plain text, which is not always the case. One problem with saving the entire page from a website address is that we may end up with unwanted text such as menus, titles or links to other pages. Once the site has been copied, it may still be necessary to strip the files of unwanted text in any case, and some websites are constructed in order to prevent copiers from taking their content in this way.

Grammatical annotation is a procedure that is commonly applied to corpora at some stage towards the end of the building process; it can be useful in that it enables corpus users to make more specific analyses. The important point is that different forms of annotation are often carried out on corpora and can result in more sophisticated analyses of data, but annotation is not compulsory.

Using a reference corpus

Obtaining access to a reference corpus can be helpful for two reasons. First, reference corpora are large and representative enough of a particular genre of language that they can themselves be used to uncover evidence of particular discourses. Secondly, a reference corpus acts as a good benchmark of what is normal in language, against which your own data can be compared. We can compare a large reference corpus to a smaller corpus in order to examine which words occur in the smaller corpus more frequently than we would normally expect them to occur by chance alone. Access to a reference corpus is therefore potentially useful for carrying out discourse analysis, even if the corpus itself is not the main focus of the analysis. A perhaps more problematic issue is gaining access to corpora, and some researchers will be at an advantage here. Some corpus builders allow users limited access for a trial period before buying, or access to a smaller sample of their corpus.
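
The comparison of a small corpus against a reference corpus can be sketched in Python. The summary does not say which statistic is used, so the log-likelihood measure below is an assumption chosen for illustration, and the function name and the figures in the usage lines are hypothetical. The sketch asks whether a word is more frequent in the study corpus than its frequency in the reference corpus would lead us to expect.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood score for one word across two corpora (assumed measure).

    freq_* are raw counts of the word, size_* are total token counts.
    """
    total = size_study + size_ref
    combined = freq_study + freq_ref
    # Counts we would expect if the word were equally common in both corpora.
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    score = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score

# Hypothetical figures: a word seen 50 times in a 1,000-word study corpus
# and 100 times in a 10,000-word reference corpus.
print(round(log_likelihood(50, 1000, 100, 10000), 2))
```

A score above 3.84 corresponds to p < 0.05 on a chi-square distribution with one degree of freedom, which is one conventional cut-off for treating a difference as unlikely to be due to chance alone.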

3. FREQUENCY AND DISPERSION

Introduction

Frequency is one of the most central concepts underpinning the analysis of corpora. This chapter shows how frequency lists can be employed to direct the researcher to investigate various parts of a corpus, how measures of dispersion can reveal trends across texts, and how frequency data can help to give the user a sociological profile of a given word or phrase, enabling greater understanding of its use in particular contexts. Related to the concept of frequency is that of dispersion.

Join the club

Frequency and dispersion can be employed in a small corpus of data, for example a corpus which consists of 12 leaflets advertising holidays, published in 2005, with the goal of investigating discourses of tourism. Holiday brochures are an interesting text type to analyse because they are an inherently persuasive form of discourse: their main aim is to ensure that potential customers will be sufficiently impressed to book a holiday.

Frequency counts

Using the corpus analysis software WordSmith, a word list of the 12 text files was obtained. A word list is a list of all of the words in a corpus, along with their frequencies and the percentage contribution that each word makes towards the corpus. The most frequent words in the corpus are grammatical words: pronouns, determiners, conjunctions and prepositions. There are also words describing holiday residences (studios, facilities, apartments) and other attractions (beach, pool, club).
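
A word list of this kind can be sketched in a few lines of Python. This is a minimal illustration and not WordSmith's own procedure: the tokenizing rule (lower-cased runs of letters and apostrophes), the function name and the two sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter

def word_list(texts):
    """Return (word, count, percent) rows, most frequent word first."""
    tokens = []
    for text in texts:
        # Assumed tokenizer: lower-case runs of letters and apostrophes.
        tokens.extend(re.findall(r"[a-z']+", text.lower()))
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(word, count, 100.0 * count / total)
            for word, count in counts.most_common()]

# Two invented sentences standing in for the holiday leaflets.
sample = ["The pool and the bar are by the beach.",
          "The club is near the beach."]
for word, count, percent in word_list(sample)[:3]:
    print(f"{word:<8}{count:>4}{percent:>8.2f}%")
```

Even in this tiny sample the grammatical word "the" heads the list, which mirrors the observation that the most frequent items in a corpus are usually grammatical words.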

Considering clusters

We need to consider frequencies beyond single words. Using WordSmith it is possible to derive frequency lists for clusters of words. BAR and CLUB are the most frequent lexical lemmas in the holiday corpus; they are also the only lemmas in the top ten that relate to alcohol. We can also consider another class of words, verbs, which play a particular role in tourist discourse. In the holiday corpus, the most frequent verbs appear in imperative clusters.
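
Cluster frequencies can be approximated by counting fixed-length word sequences (n-grams). The sketch below is an illustration under stated assumptions rather than WordSmith's implementation: the function name, the simple tokenizer and the sample sentence are hypothetical, and clusters are counted within a single text so that they never span a file boundary.

```python
import re
from collections import Counter

def clusters(text, n=3):
    """Count every sequence of n consecutive words in one text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # zip over n staggered copies of the token list to form the n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

# Invented sentence; in practice this would be one leaflet at a time.
sample = "join the club and join the club tonight"
for cluster, count in clusters(sample, n=3).most_common(2):
    print(count, cluster)
```

Summing the counters returned for each file gives a cluster frequency list for the whole corpus.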

Dispersion plots

Another way of looking at a word is to think about where it occurs within individual texts and within the corpus as a whole. A dispersion plot gives a visual representation of where a search term occurs in the corpus. The plot is also standardized so that each file in the corpus appears to be of the same length. This is useful in that it allows us to compare where occurrences of the search term appear across multiple files.
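
The standardization step can be illustrated with a text-only dispersion plot. This is a rough sketch, not the plot WordSmith draws: the function name, the plot width and the sample texts are assumptions. Each file is mapped onto a row of the same width, so a hit halfway through a short file lands in the same column as a hit halfway through a long one.

```python
import re

def dispersion(text, term, width=40):
    """Return (row, hits): a fixed-width row with '|' wherever term occurs."""
    tokens = re.findall(r"[a-z']+", text.lower())
    row = ["."] * width
    hits = 0
    for i, token in enumerate(tokens):
        if token == term.lower():
            # Scale the token position to the row width (standardization).
            row[i * width // len(tokens)] = "|"
            hits += 1
    return "".join(row), hits

# Invented files of different lengths, plotted on the same scale.
files = {
    "short.txt": "the bar is open and the bar is busy",
    "long.txt": "we walked along the beach for hours before finding a bar "
                "that was open late and then went back to the hotel",
}
for name, text in files.items():
    row, hits = dispersion(text, "bar")
    print(f"{name:<10}{hits:>3}  {row}")
```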

Comparing demographic frequencies

By examining the frequency list, the most frequent informal terms in the corpus were collected and presented in a table. It was necessary to explore the context of some words in detail, in order to remove occurrences that were not used in a colloquial or informal way. Most of the terms occurred more often in written than in spoken British English, but the spoken texts tended to contain more of the informal meanings of the words. For the authors of the holiday leaflets to use informal language in order to index youthful identities, we need to assume that they believed that such language was typical of this identity and that the target audience would also read the leaflets in the same way. By using a form of language which is strongly associated with youthful identities, the authors may make the audience feel that they are being spoken to in a narrative voice that they find desirable, or at least are comfortable with. The use of colloquialisms also contributes to the normalization of certain types of youthful identities. It suggests a shared way of speaking for young people; those who do not use informal language may be alerted to a discrepancy between their linguistic identities and those of the people featured in the brochure.

Conclusion

The analysis of frequent lexical lemmas revealed some of the most important concepts in the corpus, and a more detailed analysis of clusters and of individual instances containing these terms revealed some of the ways that holidaymakers were constructed. By investigating how high-frequency informal language occurred in a reference corpus of spoken British English, we were able to gain evidence in order to create hypotheses about how the readership of the holiday leaflets was constructed.

4. CONCORDANCES

Introduction

Concordance analysis is one of the most effective techniques for allowing researchers to carry out a close examination of corpus data. A concordance is simply a list of all the occurrences of a particular search term in a corpus, and is also sometimes referred to as key word in context or KWIC, although it should be noted that “key word in context” has a different meaning from the concept of keywords. Here key word simply means the word that is currently under examination, which can be any word that takes the interest of the researcher. In order to demonstrate how concordances can be of use to discourse analysis, we need to carry out an examination of a new set of data: a corpus of newspaper articles, which are one of the easiest text types to collect. The relative ease with which newspaper data can be appropriated for corpus use suggests that it should be employed with care rather than overused; nevertheless, newspaper data is a very useful area in which to study the production and reproduction of discourses. Journalists are able to influence readers by producing their own discourses or by helping to reshape existing ones. Texts can only take on meaning when consumers interact with them. Discourses within newspapers are usually the result of collaboration between multiple contributors, and single articles may express a variety of views on the same object. When using a corpus of newspaper articles it is important to bear in mind that the processes of production and reception of any particular article are complex and multiple.

Investigating discourses of refugees

Refugees are a particularly interesting subject to analyse in terms of discourse because they constitute one of the most powerless groups in society. One aspect of this conceptualization of discourse relating to ways of looking at the world is that it enables or encourages a critical [...]

Step-by-step guide to concordance analysis

  1. Build or obtain access to a corpus;
  2. Decide on the search term (e.g. refugee), bearing in mind that the search can be expanded to include plurals, euphemisms, anaphora and proper nouns of relevant individuals;
  3. Obtain a concordance of the search term(s);
  4. Clean the concordance by removing repetitions or other lines that are not relevant;
  5. Sort the concordance repeatedly on different words to the left and right while looking for evidence of grammatical, semantic or discourse patterns;
  6. Look for further evidence of such patterns in the corpus;
  7. Investigate the presence of particular terms;
  8. When no more patterns can be found, carry out a close analysis of the remaining concordance lines;
  9. Note rare or non-existent cases of discourses based on your own intuitions;
  10. Attempt to hypothesize why the patterns appear and relate this to issues of text production and reception.

5. COLLOCATES

Introduction

Carrying out a close analysis of search terms via a concordance can be helpful in revealing traces of discourses within texts; however, a concordance can in some cases consist of hundreds or even thousands of lines. Researchers can rely on sampling methods, which are helpful in reducing the length of time spent on analysis, but a problem is that these may also fail to reveal salient aspects of the concordance. Another problem is that patterns are not always as clear-cut in a concordance as we would like them to be. In the British National Corpus, all words co-occur with each other to some degree. When a word regularly appears near another word, and the relationship is statistically significant, such co-occurrences are referred to as collocates, and the phenomenon of certain words frequently occurring next to or near each other is collocation. Collocation is a way of understanding meanings and associations between words which are otherwise difficult to ascertain from a small-scale analysis of a single text. Words can take on meaning from the context that they occur in. This chapter explores how discourse analysis can be carried out by focusing primarily on collocation. In order to carry out a linguistic analysis it is useful to examine the usage of words in a corpus. The British National Corpus is a large corpus which we can view as being more or less representative of general British English.

Deriving collocates

There are a number of different procedures for calculating collocation. The simplest is to count the number of times a given word appears within, say, a five-word window to the left or right of a search term. If we use this procedure we get a list of words. One problem with this technique is that high-frequency words generally tend to be function words, which do not always reveal much of interest, particularly in terms of discourse. A number of statistical tests take into account the frequency of words in a corpus and their relative number of occurrences both next to and away from each other. One such test is called Mutual Information (MI). Mutual information is calculated by examining all of the places where two potential collocates occur in a text or corpus. An algorithm then computes the expected probability of these two words occurring near each other.
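
Both procedures can be sketched in Python. The window count follows the description above directly. The MI calculation is one common formulation (the base-2 logarithm of observed over expected co-occurrences) and is an assumption here, since the summary does not give the formula; the function names and the sample sentence are likewise hypothetical.

```python
import math
from collections import Counter

def collocates(tokens, node, window=5):
    """Count words within `window` tokens to the left or right of node."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start = max(0, i - window)
            counts.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return counts

def mutual_information(tokens, node, collocate, window=5):
    """MI score as log2(observed / expected), an assumed formulation."""
    freq = Counter(tokens)
    observed = collocates(tokens, node, window)[collocate]
    if observed == 0:
        return float("-inf")  # the two words never co-occur in the window
    # Expected co-occurrences if the two words were spread independently,
    # given a window of `window` tokens on each side of the node.
    expected = freq[node] * freq[collocate] * 2 * window / len(tokens)
    return math.log2(observed / expected)

# Invented sentence standing in for a corpus.
tokens = "the eligible bachelor met the old bachelor and the spinster".split()
print(collocates(tokens, "bachelor", window=5).most_common(3))
print(round(mutual_information(tokens, "bachelor", "eligible", window=1), 2))
```

The raw window count is dominated by the function word "the", which is the problem described above; the MI score instead rewards pairs that co-occur more often than their individual frequencies would predict. On a sample this small the scores are not meaningful in themselves.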

Identifying discourses from collocates

The word bachelor occurs more frequently in the corpus than spinster. Examining concordances which contain bachelor along with these collocates, it is clear that they all relate to having a degree (e.g. bachelor of arts).

Here the meaning of bachelor (a type of degree) is different from the meaning we are concerned with (a man who has not married). Homonyms are a rare and accidental phenomenon. Polysemy, where words with the same spelling have interrelated meanings, is much more common. While the collocates of bachelor which suggest a meaning of university education no longer have the same association with bachelor as unmarried man, the two meanings are perhaps due to historical polysemy rather than being accidental homonyms. What we see with the strongest collocates of bachelor is a somewhat dualistic picture of discourse. A young bachelor receives a positive discourse prosody connected to living a happy, possibly urban existence. This is supported by an analysis of the collocates days, life, eligible and party. The positive discourse prosody is tied to the fact that bachelor life is expected to be a short-term situation. When being a bachelor becomes a long-term state, it is viewed as more problematic: it is repeatedly characterized in the corpus by poverty, eccentricity, old age and loneliness. There is an implication that there is something wrong or unfortunate about a man who goes through his whole life without marrying.

Resistant discourses

A collocational analysis has shown us some of the most salient discourses and different ways of referring to bachelors and spinsters. A collocational analysis is useful for two reasons. First, it provides a focus for our initial analysis, which is particularly helpful when a large number of concordance lines need to be sorted multiple times in order to reveal lexical patterns. Secondly, it gives us the most salient lexical patterns surrounding a subject, from which a number of discourses can be obtained. When two words frequently collocate, there is evidence that the discourses surrounding them are particularly powerful, perhaps to the point where even one half of the pair is likely to prime someone who hears or reads that word to think of the other half. Collocates can act as triggers, suggesting unconscious associations which are ways that discourses can be maintained. Corpus data gives us one way of understanding language, based on what is typical. Collocates may also contain traces of resistant discourses, which are worth exploring in the remaining concordance lines.

Collocational networks

We have tended to consider collocates individually, or we have looked at groups of collocates together because their meanings are similar (days/life/living). This methodology, based on researcher interpretation, did prove to be productive. Collocates are useful in that they help to summarize the most significant relationships between words in a corpus. This can be incredibly time-saving and give analysts a clear focus. Collocates are also useful in helping to spell out mainstream discourses, while a closer analysis of them can reveal resistant discourses too. It is important that we do not over-interpret collocational data. We should check the context that collocates occur in by examining concordances in more detail. There are different methods of calculating collocation, and they give different results. We have considered collocates of bachelor and spinster in a corpus of general British English and we focused on a particular genre of text (novels, newspapers, etc.).
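Checking the context of a collocate means reading concordance lines. The following is a minimal Python sketch of a key-word-in-context (KWIC) display, not the tool used in the book. The function name concordance, the width parameter and the sample text are all assumptions made for illustration.

```python
import re


def concordance(text, node, width=30):
    """Return KWIC lines: each hit of `node` with `width` characters of context."""
    # Collapse line breaks and repeated spaces so every hit fits on one line.
    text = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", flags=re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the node words line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Invented sample text, used only to show the output format.
sample = (
    "He was an eligible bachelor living in the city. "
    "The elderly bachelor lived alone. She held a Bachelor of Arts degree."
)
for line in concordance(sample, "bachelor"):
    print(line)
```

Sorting the returned lines on the left or right context is then one simple way to bring repeated patterns next to each other.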

Step-by-step guide to collocational analysis.

  1. Build or obtain access to a corpus;
  2. Decide on a search term (e.g. bachelor), bearing in mind that the term can be expanded to include plurals or other forms, euphemisms, anaphora or relevant proper nouns;
  3. Obtain a list of collocates;
  4. Decide how many collocates you want to look at and decide whether to “clean” the collocate list by removing proper nouns or grammatical words;
  5. Consider whether the collocates can be grouped semantically, thematically or grammatically, as a basis for ordering them;
  6. Obtain concordances of the collocates and look for patterns within the context.
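The step of obtaining a list of collocates can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses mutual information (MI), which is one of several ways of calculating collocation, in the common form MI = log2(observed / expected), and the function names, the span of 3 words, the minimum frequency and the toy text are all invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep alphabetic word forms only."""
    return re.findall(r"[a-z']+", text.lower())


def collocates(tokens, node, span=3, min_freq=2):
    """Return (word, co-occurrence count, MI score) tuples, highest MI first."""
    total = len(tokens)
    word_freq = Counter(tokens)
    co_freq = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        # Count every word within `span` positions to the left and right.
        for j in range(max(0, i - span), min(total, i + span + 1)):
            if j != i and tokens[j] != node:
                co_freq[tokens[j]] += 1
    results = []
    for word, observed in co_freq.items():
        if observed < min_freq:
            continue
        # Co-occurrences expected by chance in a window of 2 * span words.
        expected = word_freq[node] * word_freq[word] * (2 * span) / total
        results.append((word, observed, math.log2(observed / expected)))
    return sorted(results, key=lambda row: row[2], reverse=True)


# Toy text invented for illustration; a real study would load a corpus.
toy = (
    "the eligible bachelor enjoyed his bachelor days in the city "
    "an eligible bachelor threw a party the old bachelor lived alone "
    "the teacher lived in the city and enjoyed the party"
)
for word, observed, mi in collocates(tokenize(toy), "bachelor"):
    print(word, observed, round(mi, 2))
```

A different statistic or a different span would rank the collocates differently, which is why the choice should be reported alongside the results.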

The lemma MAKE seems to be a relatively important collocate of "criminal". Looking at the concordance of the word "criminal", there are other concordance lines which suggest a similar pattern but do not include MAKE. Terms like "invoke" or "impose" are rhetorical strategies used to support a particular discourse position. The word "dogs" occurs 182 times in the speech of the anti-hunters and 74 times in the speech of those who want hunting to remain legal. A concordance of "use of dogs" was carried out for the whole corpus. The keyword list has given us a small number of words to examine, and once the proper nouns have been discounted this leaves us with just 16 words in total. Finally, consider another word used by pro-hunt speakers: practices. This word is interesting because it is difficult to determine exactly what it means. It occurs as a plural (veterinary practices, slaughter practices, livestock practices, etc.). The term is therefore used to refer to a multitude of techniques connected to animals, and it also creates an association with non-lethal ways of dealing with animals.

Using a reference corpus

So far our keywords analysis has been based on the idea that there are 2 sides to the debate and that by comparing one side against another we are likely to find a list of keywords which will then act as signposts to the underlying discourses within the debate on fox-hunting. Our analysis so far has uncovered some interesting differences between the 2 sides of the debate. However, this approach requires separating all of the speech in the different debates into different files. The task of creating these files can be off-putting and is in any case not always necessary. In terms of proportions, taking into account the relative size of the sub-corpora, the anti-hunt speakers actually used the term "cruelty", for example, less than the pro-hunters. Examining this word in more detail shows that it occurs with some frequency on each side of the debate. Comparing a smaller corpus or set of texts to a larger reference corpus is therefore a useful way of determining key concepts across the smaller corpus as a whole. For many studies where the text or set of texts under scrutiny is relatively uniform, using a reference corpus may be all that is needed. Using a reference corpus may also be useful in revealing those words that are under-represented in the data. When comparing a smaller corpus with a reference corpus, WordSmith also gives a list of all the negative keywords, although this list doesn't take into account words which appeared zero times in the small corpus. Negative keywords can help to show topics or features of style which are not favoured in a corpus, which in itself can be illuminating.
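Keyness depends on the size of each corpus as well as on raw frequency. The Python sketch below shows one widely used keyness statistic, log-likelihood, for a single word. It is an illustration only: the summary does not say which statistic was used in the fox-hunting study, the function names are assumptions, and the sub-corpus sizes (60,000 and 70,000 words) are invented because the text gives only the raw counts for "dogs" (182 and 74).

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood keyness score for one word in corpus A versus corpus B."""
    combined = freq_a + freq_b
    total = size_a + size_b
    # Frequencies expected if the word were spread evenly across both corpora.
    expected_a = size_a * combined / total
    expected_b = size_b * combined / total
    score = 0.0
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score


def keyness(freq_a, size_a, freq_b, size_b):
    """Return the score and whether the word is a positive or negative keyword in A."""
    score = log_likelihood(freq_a, size_a, freq_b, size_b)
    # Equal relative frequencies fall through to "negative" with a score of zero.
    direction = "positive" if freq_a / size_a > freq_b / size_b else "negative"
    return score, direction


# Counts for "dogs" come from the text; both corpus sizes are HYPOTHETICAL.
ANTI_SIZE, PRO_SIZE = 60_000, 70_000
score, direction = keyness(182, ANTI_SIZE, 74, PRO_SIZE)
print(round(score, 2), direction)
```

A score above 3.84 corresponds to p < 0.05 with one degree of freedom, although keyword studies usually apply a much stricter cut-off. Swapping the two corpora gives the same score with the opposite direction, which is how a negative keyword arises.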

Key clusters

Another way of spotting words which occur frequently in 2 comparable sets of text but may be used for different purposes is to focus on key clusters of words. Using WordSmith it is possible to derive wordlists of clusters of words, rather than single words. WordSmith allows the user to specify the size of the cluster under examination; generally, the larger the cluster size we specify, the fewer the key clusters that are produced. Taking a cluster size of three, a list of key clusters was obtained by comparing the speech of pro-hunters with that of those who were against hunting. This list contained some interesting clusters. When reporting the analysis of keyness, it is worth mentioning dispersion, particularly in cases like this where dispersion brings up something unexpected. This requires a closer analysis of words and phrases in the corpus, rather than simply recounting frequencies from wordlists.
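A cluster wordlist is simply a frequency list of word sequences rather than single words. The Python sketch below is an assumption-laden illustration and not WordSmith's own procedure: the function name clusters, the default size of three and the sample sentence are invented.

```python
import re
from collections import Counter


def clusters(text, size=3):
    """Count every run of `size` consecutive words in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(
        " ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)
    )


# Invented sample sentence, used only to show the effect of cluster size.
sample = "the use of dogs is cruel and the use of dogs should be banned"
for size in (2, 3, 4):
    repeated = {c: n for c, n in clusters(sample, size).items() if n > 1}
    print(size, repeated)
```

Cluster counts from two sets of texts can then be compared in the same way as single-word frequencies to find the clusters that are key.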

Key categories

Although a simple keyword list will reveal differences between sets of texts or corpora, it is sometimes the case that lower frequency words will not appear in the list because they do not occur often enough to make a sufficient impact. This may be a problem, as low frequency synonyms tend to be overlooked in a keyword analysis. Finding key categories could help to point to the existence of particular discourse types, and would be a useful way of revealing discourse prosodies. In order for such analyses to be carried out it is necessary to undertake the appropriate forms of annotation. The automatic semantic annotation system used to tag the fox-hunting corpus was the USAS (UCREL Semantic Analysis System). Tags can be assigned a number of plus or minus codes to show where meaning resides on a binary or linear distinction. Once the semantic annotation had been carried out, word lists of the two sides of the fox-hunting debate were created and compared with each other to create a keyword list. From this list, the relevant key semantic tags were singled out for analysis.
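The point about low frequency synonyms can be illustrated by pooling word counts under a shared category before comparing the two sides. The Python sketch below is an assumption throughout: the TOY_TAGS lexicon and its labels are invented and are not USAS tags, and a real study would take its categories from the output of a semantic tagger.

```python
from collections import Counter

# HYPOTHETICAL hand-made lexicon; these labels are not real USAS codes.
TOY_TAGS = {
    "cruel": "CRUELTY",
    "cruelty": "CRUELTY",
    "barbaric": "CRUELTY",
    "savage": "CRUELTY",
    "fox": "ANIMALS",
    "dogs": "ANIMALS",
    "hounds": "ANIMALS",
}


def category_counts(tokens, lexicon):
    """Pool word frequencies under the category each word is tagged with."""
    counts = Counter()
    for token in tokens:
        counts[lexicon.get(token, "UNTAGGED")] += 1
    return counts


# Invented token list: each synonym is rare, but the category is not.
tokens = "hunting with hounds is cruel barbaric and savage".split()
print(category_counts(tokens, TOY_TAGS))
```

The resulting category frequencies for each side can be compared with a keyness statistic in the same way as word frequencies.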

Possible uses of keywords

A keywords analysis can therefore be used to compare 2 or more sides of an argument, as in political debates, or it could simply be used to compare the linguistic styles of different speakers. A keyword analysis can also be carried out on texts which are from different genres. Keywords taken from comparing 2 sets of party political texts were used to examine diachronic change between traditional Labour and the values of the New Labour party headed by Tony Blair in the UK. New Labour keywords included "partnership", "new", "deal", "business", etc., suggesting a more managerial style of government which focused on business interests and competition. British English contains more time- and order-oriented keywords: afterwards, yesterday, again, last, secondly. Keywords not only point to the existence of discourses, but they also help to reveal the rhetorical techniques that are used in order to present discourses as common sense or the correct ways of thinking.

Conclusion

A keyword list is a useful tool for directing researchers to significant lexical differences between texts. Carrying out comparisons between 3 or more sets of data, grouping infrequent keywords according to discursive similarity, showing awareness of keywords or dispersion plots, and carrying out analyses on key clusters will enable researchers to obtain a more accurate picture of how keywords function in texts. Keywords can reveal a great deal about frequencies in texts which is unlikely to be matched by researchers' intuition. As with all statistical methods, how the researchers choose to interpret the data is ultimately the most important aspect of corpus-based research.

7 BEYOND COLLOCATION

Introduction

In this chapter we focus on aspects of discourse analysis which are more concerned with grammatical rather than lexical patterns. We will be considering a number of ways that a more grammar-based analysis can be of value to researchers looking at discourse via corpora. We look at a single term, the lemma ALLEGE and its forms. The analysis of this lemma was inspired by reading an article on a news website about an alleged rape. The article acted as a springboard, raising a number of questions about ALLEGE. A corpus analysis would help us to establish whether or not the patterns of language found in the article are typical or atypical of general English usage. The verb allege, and its related forms, is therefore a key aspect in the discursive construction of stories about rape.

Nominalization

Nominalization involves a process being converted from a verb or adjective into a noun or a multi-noun compound (e.g. discover --> discovery, solve --> solution). Nominalizations often involve reductions or deletions in some way. In the BNC, the words “allege, alleging, alleged, alleges, allegedly, allegation and allegations” collectively occur [...] domain. They also occur much more often in written-to-be-spoken texts than in written or spoken texts. The lemma "allege" is also associated with a variety of forms of news reporting.

Metaphor

We have seen that the word choice allegations is of particular salience to the news article, because it carries a strong association with denial. In the article it was used in a direct quote by a spokeswoman for the person about whom the allegations were being made, but it was not used by the actual narrative voice of the article, with the more neutral term alleged occurring 3 times instead. We have used a reference corpus to look at collocates of the word allegations, as well as patterns of modal use and the presence or absence of various types of actor associated with allegations of rape. Another way of understanding some of the hidden associations of the word allegation is to consider it in terms of metaphor. Metaphors are a particularly useful way of revealing discourses surrounding a subject. Looking at the presence of metaphors in a corpus, and noting their frequencies relative to each other, should provide researchers with a different way of focusing on discourse. There isn't a simple way of carrying out a metaphor-based analysis on a corpus: the researcher carries out a close reading of a sample of text in order to identify candidate metaphors, and corpus contexts are then examined to determine whether keywords are metaphoric or literal. In our corpus abstract concepts are often constructed via metaphors which reference concrete entities, and it is the case that allegation(s) will have metaphors in common with similar terms like "accusation" or "claim". The corpus not only helps to uncover the possible metaphors surrounding a word or concept, but it can also be useful in revealing how that metaphor works in a range of other cases, enabling researchers to gain a greater understanding of its meaning. We see allegations referred to in terms of heavy weight, violence, penetration, waste, fire, flight and horses. Some of these metaphors appear to be more frequent than others. Because the term allegation is found in a range of general metaphorical patterns in British English, it is not possible to say that any single metaphor dominates the way that we think of the term.

Further directions

We could have expanded our analysis of the term allegations to consider other linguistic phenomena, such as a range of lexical, semantic and grammatical features, or we might also have considered co-ordination. There are some techniques in critical discourse analysis which are more difficult to carry out on corpora. At present, a great deal of corpus-based discourse analysis is still focused at the lexical level. The challenge to future researchers is to find ways to make grammar- and semantic-based analysis of corpora a more feasible proposition.

8 CONCLUSION

This book identified some of the most useful methodological techniques of corpus-based research (frequencies, collocations, keywords, concordances, dispersions) and showed how they can be effectively used in the analysis of discourse. The main points about language and discourse that our corpus-based analyses have revealed are:

  • Corpus-based discourse analysis is not simply a quantitative procedure but one which involves a great deal of human choice at every stage: forming research questions, designing and building corpora, deciding which techniques to use, interpreting the results and framing explanations for them.
  • Attitudes and discourses are embedded in language via our cumulative, lifelong exposure to language patterns and choices: collocations, semantic and discourse prosodies.
  • We are often unconscious of the patterns of language we encounter across our lifetime, but corpora are useful in identifying them: they emulate and reveal this cumulative exposure.

Corpus building

The design and availability of corpora are paramount to their analysis. Diachronically, language and society are constantly changing, and discourses are changing as well. There is an urgent need to build more up-to-date corpora in order to reflect this passing of time. Some aspects of language use do not change as rapidly as others. The contents of the BNC are a testament to the way that people wrote and spoke in the early 1990s. Using corpora of texts that were created decades or centuries ago will help researchers to explore the ways that language was once used, shedding light on the reasons behind the current meanings, collocations and discourse prosodies of particular words, phrases or grammatical constructions. Comparing a range of corpora from different historical time periods will give us a series of linguistic "snapshots" which will allow discourses to appear to come to life. An aspect of corpus building which is particularly relevant for discourse analysis is the fact that context is so important. Corpora that include both the electronic text-only form with annotation and the original texts would be useful for making sense of individual texts within them. In the case of newspaper or magazine articles it would be useful to make references back to the original page(s) so that we could note aspects such as font size and style, colours, layout and visuals.

Corpus analysis

It is important to note that a corpus-based analysis will not give researchers a list of discourses around a subject. The analysis will point to patterns in language which must then be interpreted in order to suggest the existence of discourses. A corpus-based analysis can only show what is in the corpus: although it may be far-reaching, it can never be exhaustive. Because corpora are so large, we may be tempted to think that our analysis has covered every potential discursive construction around a given subject. The wide variety of alternative statistical techniques available to the corpus user might mean that data can be subtly massaged in order to reveal results that are interesting, controversial or confirm our suspicions. When using a general corpus, issues surrounding the various types of production and reception for all of the texts within it can become highly problematic. One option could be to recognize that the general corpus consists of a multitude of voices and to use such data sparingly, instead carrying out the analysis of discourses on more specialized corpora, where issues of production and reception can be more easily articulated. Another possibility could simply be to argue from a perspective that society is inter-connected and all texts influence each other. A corpus-based analysis of discourse affords researchers the patterns and trends in language. People are not computers, though, and their ways of interacting with texts are very different, both from computers and from each other. Corpus-based discourse analysis should play an important role in terms of removing bias, testing hypotheses, identifying norms and outliers and raising new research questions.