

































Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
This chapter introduces the concept of descriptive statistics in psychology, including levels of measurement, the normal distribution, measures of dispersion and central tendency, and graphical representations. It also covers using SPSS for data analysis.
Typology: Lecture notes
1 / 41
This page cannot be seen from the preview
Don't miss anything!


































Descriptive statistics
binary measures interval measures boxplot kurtosis categorical measures mean ceiling effect median central tendency mode continuous measures nominal measures descriptive statistics normal distribution discrete measures ordinal measures dispersion outliers distribution range exploratory data analysis ratio measures floor effect skew frequencies standard deviation histogram variable inter-quartile range
The purpose of descriptive statistical analysis is (you probably won’t be surprised to hear) to describe the data that you have. Sometimes people distinguish between descriptive statistics and exploratory data analysis. Exploratory data analysis helps you to understand what is hap- pening in your data, while descriptive statistics help you to explain to other people what is hap- penin in your data. While these two are closely related, they are not quite the same thing, and the best way of looking for something is not necessarily the best way of presenting it to others.
You should bear in mind that descriptive statistics do just what they say they will do – they describe the data that you have. They don’t tell you anything about the data that you don’t have. For example, if you carry out a study and find that the average number of times students in your study are pecked by ducks is once per year, you cannot conclude that all students are pecked by ducks once per year. This would be going beyond the information that you had.
Chance News 10.11 (http://www.dartmouth.edu/~chance/chance_news/chance_ news_10.11.html) cities an article by Marry Carmichael (Newsweek, 26 November 2001 p. 10), entitled ‘Harry Potter: What the real critics thought’: ‘Real critics,’ of course, refers to the kids who couldn’t get enough of the Harry Potter movie, not the professional reviewers who panned it as ‘sugary and over- stuffed.’ The article reports that: ‘On average, the fifth graders Newsweek talked to wanted to see it 100,050,593 more times each. (Not counting those who said they’d see it more than 10 times, the average dropped to a reasonable three.)’ What can you say about the data based on these summaries? How many children do you think were asked? What did they say? Write down your answers now, and when you’ve finished reading this chapter, com- pare them with ours (see page xxx).
1122 Understanding and Using Statistics in Psychology
say, but this does not happen with ordinal measures even when they are presented as numbers. If we imagine the final list of people who completed a marathon, it might be that the people who came 6th and 7th crossed the line almost together and so were only half a second apart, but the people who came 8th and 9th were miles away from each other so crossed the line several minutes apart. On the final order, however, they appear as next to each and the same distance apart as the 6th and 7th runner. Ratio measures are a special type of interval measure. They are a true, and meaningful, zero point, whereas interval measures do not. Temperature in Fahrenheit or Celsius is an interval measure, because 0 degrees is an arbitrary point – we could have made anywhere at all zero (in fact, when Celsius devised his original scale, he made the freezing point of water 100 degrees, and boiling point 0 degrees). Zero degrees Celsius does not mean no heat, it just refers to the point we chose to start counting from. On the other hand, tem- perature on the kelvin scale is a ratio measure, because 0 k is the lowest possible temper- ature (equivalent to – 273°c, in case you were wondering). However, it is not commonly used.) In psychology, ratio data are relatively rare, and we don’t care very often about whether data are interval or ratio.
Continuous measures may (theoretically) take any value. Although people usually give their height to a full number of inches (e.g. 5 feet 10 inches), they could give a very large number of decimal places – say, 5 feet 10.23431287 inches. Discrete measures can usually only take whole numbers so cannot be divided any more finely. If we ask how many brothers you have, or how many times you went to the library in the last month, you have to give a whole number as the answer.
What level of measurement are the following variables:
Answers are given at the end of the chapter.
1144 Understanding and Using Statistics in Psychology
Luckily for us, in psychology we don’t need to distinguish between discrete and continuous measures very often. In fact, as long as the numbers we are talking about are reasonably high, we can safely treat our variables as continuous.
There is an extra kind of data, that you might encounter, and that is continuous data which do not satisfy the interval assumption. For example, the Satisfaction With Life Scale (Diener, Emmons, Larsen & Griffin, 1985) contains five questions (e.g. ‘The con- ditions of my life are excellent’, ‘So far, I have gotten [sic] the important things from life’), which you answer on a scale from 1 (strongly disagree) to 7 (strongly agree). Each person therefore has a score from 5 to 35. It’s not quite continuous, in the strict sense of the word, but it is very close. We treat height as continuous, but people tend to give their height in whole inches, from (say) 5 feet 0 inches to 6 feet 6 inches. This has 30 divisions, the same as our scale. Should we treat this scale as interval? If we do this, we are saying that the differ- ence between a person who scores 5 and a person who scores 10 (5 points) is the same difference as the difference between a person who scores 30 and one who scores 35 – and by difference, we don’t mean five more points, we mean the same amount more quality of life, and we are not sure what that means. Should we treat this scale as ordinal? If we do, we are saying that a higher score just means a higher score. If one person scores 10, and another scores 20, we can just a say that the person who scored 20 scored ‘higher’. If a third person scores 21, we can only say that they scored higher still. We cannot say anything about the size of the gap from 10 to 20, and from 20 to 21. We can just say that it is higher. This does- n’t seem very sensible either. So what are we to do? One option is to use sophisticated (and difficult) methods that can deal with ordinal data more appropriately, and treat these data as a special kind of continuous data. These are, frankly, so difficult and frightening that we’re not going to even give you a reference (they even frighten us). Anyway, these methods usually can’t be used for variables that have more than (about) nine categories. The solution, used by almost everyone, almost all of the time, is to treat the measures as if they are continuous. It isn’t ideal, but it doesn’t actually seem to cause any problems.
We carry out a study and collect the data. We then want to describe the data that we have collected. The first thing to describe is the distribution of the data, to show the kinds of numbers that we have. Table 2.1 shows the extraversion scores of 100 students. We could present these data just as they are. This would not be very useful, but it would be very accurate.
Descriptive Statistics 1155
the data. For example, you can see that the high and low scores (extreme scores) have only a few individuals, whereas the middle scores (14, 15, 16, 17) have the most individuals. You’ll notice that the percentage scores are the same as the number of people. This has only happened because we had 100 people in the dataset, and usually the numbers would be different.
A chart can be a useful way to display data. Look at Figures 2.1, 2.2 and 2.3, and decide which one you think best represents the data. Figure 2.1 shows a histogram with a bin size of 1, which means that there is one score represented in each bar. This chart represents exactly the information that was shown in Table 2.2. Figure 2.2 shows a histogram with a bin size of 2, which means we have combined two sets of scores into one bar. We can see that a total of two people scored 4 or 5, and two people scored 6 or 7.
Before reading on, try to decide which of those two charts is the better.
Descriptive Statistics 1177
0 10 20 30 40
0
2
4
6
8
10
12
14
Frequency
Figure 2.1 Histogram with bin size 1
When we asked you to decide which was better, we raised the issue of summarising our data, and presenting our data accurately. You can’t argue with the fact that the first chart is accurate – it contains all of the information in the data. However, if we want to present our data accurately we use a table and present the numbers. A chart is used to present the pat- tern in our data. Using a bin size of 1 – that is, having each bar represent one point on the scale – leads to a couple of problems. First, when we have a very large number of points, we will have an awful lot of very thin stripes. Second, we are using a graph to show the pattern, and by using small bin sizes we get a very lumpy pattern, so we would rather smooth it a little by using larger bins. A different way of presenting the data is shown in Figure 2.3. This is a bar chart.
Statisticians have developed a number of formulae to determine the best number of bins. However, the best thing to do is to draw your histogram, see what it looks like, and then if you don’t like it, try a different bin size.
1188 Understanding and Using Statistics in Psychology
0 10 20 30 40
0
5
10
15
20
Frequency
Figure 2.2 Histogram with bin size 1
Stigler’s law of eponymy (Stigler, 1980) states that all statistical concepts which are named after someone, are named after someone who did not discover them. Gauss was not the first to mention the normal (or Gaussian) distribution – it was first used by De Moivre in 1733. Gauss first mentioned it in 1809, but claimed to have used it since 1794. It still got named after Gauss though. The sharper-eyed amongst you will have noticed that for Stigler’s law of eponymy to be correct, Stigler should not have first noted it. And, of course, he didn’t.
A very large number of naturally occurring variables are normally distributed, and there are good reasons for this to be the case (we’ll see why in the next chapter). A large number of statistical tests make the assumption that the data form a normal distribution. The histogram in Figure 2.4 shows a normal distribution. A normal distribution is symmetrical and bell-shaped. It curves outwards at the top and then inwards nearer the bottom, the tails getting thinner and thinner. Figure 2.4 shows a perfect normal distribution. Your data will never form a perfect normal distribution, but as long as the distribution you have is close to a normal distribution, this probably does not matter too much (we’ll be talking about this later on, when it does matter). If the dis- tribution formed by your data is symmetrical, and approximately bell-shaped – that is, thick in the middle and thin at both ends – then you have something close to a normal distribution.
2200 Understanding and Using Statistics in Psychology
Value
Frequency
Figure 2.4 Histogram showing the shape of a normal distribution
When we talk about a normal distribution, we are using the word ‘normal’ as a tech- nical term, with a special meaning. You cannot, therefore, refer to a usual distribu- tion, a regular distribution, a standard distribution or an even distribution. Throughout this book, we will come across some more examples of seemingly common words that have been requisitioned by statistics, and which you need to be careful with.
Some distributions are nearly normal but not quite. Look at Figures 2.5 and 2.6. Neither of these distributions is normal, but they are non-normal in quite different ways. Figure 2. does not have the characteristic symmetrical bell shape: it is the wrong shape. The second, on the other hand, looks to be approximately the correct shape, but has one or two awk- ward people on the right-hand side, who do not seem to be fitting in with the rest of the group. We will have a look at these two reasons for non-normality in turn.
The reason why we try and see the distributions as normal is that we have mathe- matical equations that can be used to draw a normal distribution. And we can use these equations in statistical tests. A lot of tests depend on the data being from a normal distribution. That is why sta- tisticians are often delighted to observe a normal distribution.
Descriptive Statistics 2211
Value
Frequency
Figure 2.5 Histogram showing positively skewed distribution
can be the wrong shape because it is not the characteristic bell shape – this is called kurtosis.
A non-symmetrical distribution is said to be skewed. Figures 2.5 and Figure 2.7 both show distributions which are non-symmetrical. Figure 2.5 shows positive skew: this is where the curve rises rapidly and then drops off slowly. Figure 2.7 shows negative skew, where the curve rises slowly and then decreases rapidly. Skew, as we shall see later on, has some seri- ous implications for some types of data analysis.
Negative skew starts off flat, like a minus sign. Positive skew starts off going up, like part of a plus sign.
Skew often happens because of a floor effect or a ceiling effect. A floor effect occurs when only few of your subjects are strong enough to get off the floor. If you are interested in measuring how strong people are, you can give them weights to lift up. If your weights are too heavy most of the people will not get the weights off the floor, but some can lift very heavy weights, and you will find that you get a positively skewed distribution, as shown in Figure 2.8. Or if you set a maths test that is too hard then most of the class will
Descriptive Statistics 2233
0
10
20
30
40
50
100kg 120kg 140kg 160kg Weight
Number
Figure 2.8 Histogram showing how many people lift different weights and illustrating a floor effect, which leads to positive skew
get zero and you won’t find out very much about who was really bad at maths, and who was just OK. Floor effects are common in many measures in psychology. For example, if we measure the levels of depression in a ‘normal’ population, we will find that most people are not very depressed, some are a little depressed and a small number are very depressed. The distribution of depression scores would look something like Figure 2.8. Often we want to measure how many times something has happened – and something cannot have happened less frequently than never. If we were interested in criminal behaviour, we could count the number of times a group of people had been arrested (Figure 2.9). Most people have never been arrested, some people have been arrested once (among them one of the authors), fewer have been arrested twice, and so on. In a similar way, if you were carrying out a study to see how high people could jump, but found that the only room available was one that had a very low ceiling, you would find that how high people could jump will be influenced by them banging their heads on the ceiling. Figure 2.10 shows the distribution that is found in this experiment. We find that most people cannot jump over a hurdle higher than 80 cm, because they bang their heads on the ceiling. A few short people can jump over such a barrier, before they hit their head. The ceiling effect causes negative skew and a lot of headaches. Ceiling effects are much less common in psychology, although they sometimes occur – most commonly when we are trying to ask questions to measure the range of some variable, and the questions are all too easy, or too low down the scale. For example, if you
2244 Understanding and Using Statistics in Psychology
0
10
20
30
40
50
60
Number of People
0 1 2 3 4 5 6 7 8 9 10
Figure 2.9 Histogram showing the number of times a group of people have been arrested
Have a look at the three distributions shown in Figure 2.12. If we told you that the middle distribution was normal, what would you say about the kurtosis of the other two? You might say that the bottom one is positively kurtosed, because there are too many people in the tails. You might say that the top one were negatively kurtosed, because there were too many people in the tails. You’d be wrong. They are all normally distributed, but they are just spread out differently. When comparing distributions in the terms of kurtosis, it’s hard to take into account the different spread, as well as the different shape. (Continued)
2266 Understanding and Using Statistics in Psychology
50 40 30 20 10 0 (a)
Frequency
50 40 30 20 10 0 (b) 50 40 30 20 10 0 (c)
Figure 2.11 Three different distribution randomly sampled from (a) a negatively kurtosed distribution, (b) a positively kurtosed distribution, and (c) a normal distribution
Although your distribution is approximately normal, you may find that there are a small number of data points that lie outside the distribution. These are called outliers. They are usually easily spotted on a histogram such as that in Figure 2.13. The data seem to be normally distributed, but there is just one awkward person out there on the right-hand side. Outliers are easy to spot but deciding what to do with them can be much trickier. If you have an outlier such as this you should go through some checks before you decide what to do. First, you should see if you have made an error. The most common cause of outliers is that are using a computer to analyse your data and you have made an error while entering them. Look for numbers that should not be there. If the maximum score on a test is 10, and someone has scored 66, then you have made a mistake.
Descriptive Statistics 2277
100 80 60 40 20 0 (a) 100 80 60 40 20 0 (b)
100 80 60 40 20 0 −5.00 −2.50 0.00 2.50 5. (c) Figure 2.12 Histogram (b) shows a normal distribution, but what about (a) and (c)?
(Continued)
The table below shows the variable x. We are going to find the mean of x.
x 5 6 7 8 9 10
To solve this problem we need to take the equation, and substitute each of the letters and symbols in it, with a number. Then we can work the answer out. We will take this one step at a time. If you think this is easy, that is good, because this is almost the hardest thing we are going to have to do. (And you thought statistics was a difficult subject!)
Descriptive Statistics 2299
x - --^ = 45 6
x - --^ = 5 +^6 +^7 +^8 +^9 +^10 6
x - --^ = 5 + 6 + 7 + 8 + 9 + 10 N
x - --^ =
∑ x N
We often need to make assumptions in everyday life. We assume that everyone else will drive on the left-hand side of the road (or the right-hand side, depending on which coun- try you are in). We assume that when we press a light switch, the light will come on. We assume that if we do enough work and understand enough then we will pass our exams. It’s a bit dis- turbing if our assumptions are wrong. If we find that passing exams is not related to the amount of work we do, but that to pass an exam we actually need to have an uncle who works in the exams office, then our assumption is wrong. Similarly, if we assume everyone will drive on the left, but actually some people drive on the right, again our assumptions are wrong. In both these cases, the things that we say will be wrong. ‘Work hard, and you’ll pass,’ is no longer correct. Assumptions in statistics are much the same as this. For our statistics to be correct, we need to make some assumptions. If these assumptions are wrong (statisticians usually say violated ), then some of the things we say (the results of our statistical analysis) will be wrong. However, while our assumptions will usually be broadly correct, they will never be exactly correct. People sometimes drive on the wrong side of the road (when parking, for example) and we manage not to crash into them (most of the time). In the same way, if our assumptions about data are wrong, but not too wrong, we need to be aware that our statistics will not be perfectly correct, but as long as the assumptions are not violated to any great extent, we will be OK.
When we calculate and interpret the mean, we are required to make two assumptions about our data.
When you use a computer program, it will not check that you are asking for sensible statistics. If you ask for the mean gender, the mean town people live in, and the mean shoe colour, it will give them. Just don’t put them into your report.
The second most common measure of central tendency is the median. It is the middle score in a set of scores. The median is used when the mean is not valid, which might be because the data are not symmetrically or normally distributed, or because the data are measured at an ordinal level. To obtain the median the scores should be placed in ascending order of size, from the smallest to the largest score. When there is an odd number of scores in the distribution, halve the number and take the next whole number up. This is the median. For example if there are 29 scores, the median is the 15th score. If there are an even number of scores, the median is the mean of the two middle scores. We are going to find the median of the variable x below:
x: 1 14 3 7 5 4 3
The first thing to do is to rearrange the scores, in order, from smallest to largest. The rearranged data are presented below.
x: 1 3 3 4 5 7 14
3300 Understanding and Using Statistics in Psychology