






This chapter introduces numerical methods for summarizing data, focusing on measures of location (mean, median, mode) and spread (range, inter-quartile range, sample variance, and standard deviation). The document also discusses the use of box plots for displaying data and comparing groups.







So far we have only considered graphical methods for presenting data. These are always useful starting points. As we shall see, however, for many purposes we might also require numerical methods for summarising data: perhaps one or two numbers can summarise the key information about location and variability in the data.
Before we can talk about numerical techniques we first need to define some basic notation. This will allow us to describe the general case with a simple shorthand.
Very often in statistics we replace actual numbers with letters in order to be able to write general formulae. We generally use a single letter to represent sample data and use subscripts to distinguish individual observations in the sample. The most common letter to use is $x$, although $y$ and $z$ are frequently used as well. For example, suppose we ask a random sample of three people how many mobile phone calls they made yesterday. We might get the following data: 1, 5, 7. If we take another sample we will most likely get different data, say 2, 0, 3. Using algebra we can represent the general case as $x_1, x_2, x_3$:
1st sample        1      5      7
2nd sample        2      0      3
typical sample    $x_1$  $x_2$  $x_3$
This can be generalised further by referring to the data as a whole as $x$ and the $i$th observation in the sample as $x_i$. Hence, in the first sample above, the second observation is $x_2 = 5$ whilst in the second sample it is $x_2 = 0$. The letters $i$ and $j$ are most commonly used as the index numbers for the subscripts.
The total number of observations in a sample is usually referred to by the letter n. Hence in our simple example above n = 3.
The next important piece of notation to introduce is the symbol $\sum$. This is the upper case of the Greek letter “sigma”. It is used to represent the phrase “sum the values”. This symbol is used as follows:
\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n. \]
This notation is used to represent the sum of all the values in our data (from the first, $i = 1$, to the last, $i = n$), and is often abbreviated to $\sum x$ when we sum over all the data in our sample.
Two other mathematical basics need to be introduced. First, the use of powers is important in many statistical formulae. We all know that, for example, the square of three means raising 3 to the power 2, i.e. $3^2 = 3 \times 3 = 9$. This can be generalised to $x^k$, which means multiplying together $k$ copies of $x$.
The other important idea is the use of brackets. Brackets are used to impose an ordering on the way operations are carried out. The operation inside the bracket is carried out before the one outside. Consider the following three cases:
\[ 3 + 4^2 = 19, \qquad 3^2 + 4^2 = 25, \qquad (3 + 4)^2 = 49. \]
In the first case, we simply square 4 and then add this to 3. In the second case, we square both numbers and then add them together, while in the third case, because of the brackets, we add the numbers together and then square the result. Each one of these seemingly similar formulae gives a very different result. If we consider the last two formulae in general terms we could represent the second as $\sum x^2$, that is, we raise all the $x$s to the power 2 and then add them together. The third equation can be represented as $\left(\sum x\right)^2$, that is, all the $x$s are summed together and then this sum is raised to the power 2. This is an important distinction which we will use later.
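The distinction between the two expressions can be checked numerically. The following is a minimal sketch using the first phone-call sample (1, 5, 7) from earlier; the variable names are illustrative only.

```python
# First sample of mobile phone calls from the text.
x = [1, 5, 7]

n = len(x)                                # number of observations
total = sum(x)                            # sum of x: 1 + 5 + 7
sum_of_squares = sum(v ** 2 for v in x)   # square each value, then add
square_of_sum = sum(x) ** 2               # add the values, then square

print(n, total, sum_of_squares, square_of_sum)
```

The last two quantities differ considerably even for three observations, which is why the position of the brackets matters.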
Measures of location are also referred to as measures of centrality or, more commonly, averages. In general terms, they tell us the value of a “typical” observation. There are three measures which are commonly used: the mean, the median and the mode. We will consider these in turn.
The arithmetic mean is perhaps the most commonly used measure of location. We often refer to it as the average or just the mean. The arithmetic mean is calculated by simply adding all our data together and dividing by the number of observations we have. So if our data were 10, 12, and 14, then our mean would be
\[ \frac{10 + 12 + 14}{3} = 12. \]
These data can be presented as the frequency table
Cars sold ($x_{(j)}$)   Frequency ($f_j$)   $x_{(j)} \times f_j$
 4                      1                    4
 5                      1                    5
 6                      2                   12
 7                      2                   14
 8                      3                   24
 9                      2                   18
10                      2                   20
11                      1                   11
Total                   n = 14             108
The sample mean can be calculated from these data as
\[ \bar{x} = \frac{108}{14} \approx 7.71. \]
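As a quick check of the table totals, the calculation can be sketched in a few lines. The values and frequencies are taken from the cars-sold table; the variable names are illustrative only.

```python
# Distinct values and their frequencies from the cars-sold table.
cars_sold = [4, 5, 6, 7, 8, 9, 10, 11]
frequency = [1, 1, 2, 2, 3, 2, 2, 1]

n = sum(frequency)  # total number of observations
# Each distinct value is weighted by how often it occurs.
weighted_total = sum(x * f for x, f in zip(cars_sold, frequency))
sample_mean = weighted_total / n

print(n, weighted_total, round(sample_mean, 2))
```

Weighting each value by its frequency gives the same answer as adding up the raw observations one at a time.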
We can express this calculation of the sample mean from discrete tabulated data as
\[ \bar{x} = \frac{1}{n} \sum_{j=1}^{k} x_{(j)} \times f_j. \]
Here the different values of $X$ which occur in the data are $x_{(1)}, x_{(2)}, \ldots, x_{(k)}$. In the example $x_{(1)} = 4, x_{(2)} = 5, \ldots, x_{(k)} = 11$ and $k = 8$. If we only have grouped frequency data, it is still possible to approximate the value of the sample mean. Consider the following (ordered) data:
8.4   8.7   9.0   9.0   9.2   9.3   9.3   9.5   9.6   9.
9.6   9.7   9.7   9.9   10.3  10.4  10.5  10.7  10.8  11.
The sample mean of these data is 9.73. Grouping these data into a frequency table gives
Class interval   Mid-point ($m_j$)   Frequency ($f_j$)   $f_j \times m_j$
When the raw data are not available, we don’t know where each observation lies in each interval. The best we can do is to assume that all the values in each interval lie at the central value of the interval, that is, at its mid-point. Therefore, the (approximate) sample mean is calculated using the frequencies ($f_j$) and the mid-points ($m_j$) as
\[ \bar{x} = \frac{1}{n} \sum_{j=1}^{k} f_j \times m_j. \]
For the grouped data above, we obtain
x ¯ =
This value is fairly close to the correct sample mean and is a reasonable approximation given the partial information we have in the table.
For large samples with narrow intervals, this approximate value will be very close to the correct sample mean (calculated using the raw data).
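The mid-point approximation can be sketched as follows. The class intervals here are an assumption made purely for illustration (unit-width classes 8 to 9, 9 to 10, 10 to 11 and 11 to 12, which need not match the table in the original notes); the frequencies are counted from the 20 ordered observations listed above.

```python
# ASSUMPTION: unit-width classes [8, 9), [9, 10), [10, 11), [11, 12).
# These intervals are illustrative and may differ from the original table.
midpoints = [8.5, 9.5, 10.5, 11.5]
frequencies = [2, 12, 5, 1]   # counts of the 20 observations per class

n = sum(frequencies)
# Treat every observation as if it sat at the mid-point of its class.
approx_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

print(n, approx_mean)
```

Under these assumed classes the approximation is 9.75, which is close to the sample mean of 9.73 quoted for the raw data.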
The median is occasionally used instead of the mean, particularly when the data have an asymmetric profile (as indicated by a histogram – think back to last week) or there are outlying or unusual observations. The median is the middle value of the observations when they are listed in ascending order. It is straightforward to determine the median for small data sets, particularly via a stem and leaf plot.
The median is that value that has half the observations above it and half below. For example, ordering the student alcohol data gives {0, 0, 6, 10, 16, 21, 52}. Clearly the middle value is 10, so the median is 10 units per week.
Suppose we also asked four Stage 2 Marketing and Management students how many units of alcohol they drank last week, and got {21, 0, 12, 14}. Ordering the data gives {0, 12, 14, 21} and there are now two middle values in the sample, 12 and 14. If there are two middle values we take the average of these two numbers as the median, so in this case the median is (12 + 14)/2 = 13 units per week.
In general, the median is calculated as the $\left(\frac{n+1}{2}\right)$th smallest observation in the sample.
For example, with the original alcohol data there were $n = 7$ observations and so the median was the $\frac{n+1}{2} = 4$th smallest observation, which is what we observed previously; for these data the median is 10 units per week.
For the second alcohol dataset we had $n = 4$ and so the median was the $\frac{n+1}{2} = 2.5$th smallest observation, which just means that it is half-way between the 2nd and 3rd smallest observations. Again, this is what we found; the median is 13 units per week.
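The position rule can be sketched as a small function. This is a minimal illustration of the (n + 1)/2 rule applied to the two alcohol datasets; the function name is hypothetical.

```python
def median_by_position(data):
    """Median as the (n + 1)/2-th smallest observation (1-based position)."""
    if not data:
        raise ValueError("data must contain at least one observation")
    ordered = sorted(data)
    n = len(ordered)
    position = (n + 1) / 2        # may end in .5 when n is even
    lower = int(position)         # whole part of the position
    fraction = position - lower   # 0 for odd n, 0.5 for even n
    if fraction == 0:
        return ordered[lower - 1]
    # Move the required fraction of the way to the next observation.
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


first_sample = [0, 0, 6, 10, 16, 21, 52]
second_sample = [21, 0, 12, 14]
print(median_by_position(first_sample), median_by_position(second_sample))
```

For an even sample size the fractional part is always one half, so the result is simply the average of the two middle values.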
It is possible to estimate the median value from an ogive as it is half way through the ordered data and hence is at the 50% level of the cumulative frequency. The accuracy of this estimate will depend on the accuracy of the drawn ogive.
A measure of location is insufficient in itself to summarise data as it only describes the value of a typical outcome and not how much variation there is in the data. For example, consider the following two samples:
Sample 1     6   22   38    mean = 22    median = 22
Sample 2    21   22   23    mean = 22    median = 22
Both samples have the same measures of location but they are clearly very different samples! The first set of data ranges considerably from the mean or median value while the second stays very close. Neither the mean nor the median fully represents the data. As well as knowing the location statistics of a data set, we also need to know how variable or ‘spread-out’ our data are.
There are three basic measures of spread which we will consider: the range, the inter–quartile range and the sample variance.
The range is the simplest measure of spread. It is simply the difference between the largest and smallest observations. In our simple example above the range for the first set of numbers is 38 − 6 = 32 and for the second set it is 23 − 21 = 2. These clearly describe very different data sets. The first set has a much wider range than the second.
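The two ranges quoted above can be reproduced with a short sketch; the function name is illustrative only.

```python
def data_range(data):
    """Difference between the largest and smallest observations."""
    return max(data) - min(data)


sample_1 = [6, 22, 38]
sample_2 = [21, 22, 23]
print(data_range(sample_1), data_range(sample_2))
```

Because only the two extreme observations enter the calculation, changing any single extreme value changes the range directly.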
There are two problems with the range as a measure of spread. When calculating the range you are looking at the two most extreme points in the data, and hence the value of the range can be unduly influenced by one particularly large or small value, known as an outlier. The second problem is that the range is only really suitable for comparing (roughly) equally sized samples as it is more likely that large samples contain the extreme values of a population.
The inter–quartile range describes the range of the middle half of the data and so is less prone to the influence of the extreme values.
To calculate the inter–quartile range (IQR) we simply divide the ordered data into four quarters. The three values that split the data into these quarters are called the quartiles. The first quartile (lower quartile, $Q_1$) has 25% of the data below it; the second quartile (median, $Q_2$) has 50% of the data below it; and the third quartile (upper quartile, $Q_3$) has 75% of the data below it. We already know how to find the median, and the other quartiles are calculated as follows:
\[ Q_1 = \frac{(n + 1)}{4}\text{th smallest observation} \]
\[ Q_3 = \frac{3(n + 1)}{4}\text{th smallest observation.} \]
Just as with the median, these quartiles might not correspond to actual observations. For example, in a dataset with $n = 20$ values, the lower quartile is the $(20 + 1)/4 = 5\tfrac{1}{4}$th smallest observation, that is, a quarter of the way between the 5th and 6th smallest observations. This calculation is essentially the same process we used when calculating the median. Consider the data:
8.4   8.7   9.0   9.0   9.2   9.3   9.3   9.5   9.6   9.
9.6   9.7   9.7   9.9   10.3  10.4  10.5  10.7  10.8  11.
Here the 5th and 6th smallest observations are 9.2 and 9.3 respectively. Therefore, the lower quartile is
\[ Q_1 = 9.2 + \tfrac{1}{4}(9.3 - 9.2) = 9.225. \]
Similarly the upper quartile is the $3 \times (20 + 1)/4 = 15\tfrac{3}{4}$th smallest observation, that is, three quarters of the way between the 15th and 16th smallest observations, which are 10.3 and 10.4 respectively; so
\[ Q_3 = 10.3 + \tfrac{3}{4}(10.4 - 10.3) = 10.375. \]
The inter–quartile range is simply the difference between the upper and lower quartiles, that is
\[ \text{IQR} = Q_3 - Q_1, \]
which for these data is
\[ \text{IQR} = 10.375 - 9.225 = 1.15. \]
The inter–quartile range can also be estimated from an ogive in a similar manner to the median. Simply draw the ogive and then read off the values for 75% and 25% and calculate the difference between them. This is especially useful if you only have grouped data. Again the accuracy depends on the quality of your graph.
The inter–quartile range is useful as it allows us to make comparisons between the ranges of two data sets, without the problems caused by outliers or uneven sample sizes.
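The quartile position rules can be sketched in code. This is a minimal illustration of the (n + 1)/4 and 3(n + 1)/4 rules, applied to the two alcohol datasets used for the median; the function names are hypothetical, and the sketch assumes at least three observations so that both positions fall inside the data.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    lower = int(position)         # whole part of the position
    fraction = position - lower   # how far to move towards the next value
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


def quartiles(data):
    """Lower and upper quartiles by the (n + 1)/4 and 3(n + 1)/4 rules."""
    ordered = sorted(data)
    n = len(ordered)
    if n < 3:
        raise ValueError("this sketch assumes at least three observations")
    q1 = value_at_position(ordered, (n + 1) / 4)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return q1, q3


q1_a, q3_a = quartiles([0, 0, 6, 10, 16, 21, 52])   # positions 2 and 6
q1_b, q3_b = quartiles([21, 0, 12, 14])             # positions 1.25 and 3.75
print(q1_a, q3_a, q3_a - q1_a)
print(q1_b, q3_b, q3_b - q1_b)
```

Statistical packages use several slightly different conventions for quartiles, so software output may not agree exactly with this rule for small samples.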
Note also that a different calculation is needed when the data are given in the form of a grouped frequency table with frequencies ($f_i$) in intervals with mid-points ($m_i$). First the sample mean $\bar{x}$ is approximated (as described earlier) and then the sample variance is approximated as
\[ s^2 = \frac{1}{n - 1} \left\{ \sum_{i=1}^{k} f_i m_i^2 - n \bar{x}^2 \right\}. \]
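The grouped variance formula can be sketched as follows. The class intervals are an assumption made purely for illustration (unit-width classes 8 to 9, 9 to 10, 10 to 11 and 11 to 12, which need not match any table in the original notes); the frequencies are counted from the 20 ordered observations used in the grouped mean and quartile examples.

```python
# ASSUMPTION: unit-width classes [8, 9), [9, 10), [10, 11), [11, 12).
# These intervals are illustrative and may differ from the original table.
midpoints = [8.5, 9.5, 10.5, 11.5]
frequencies = [2, 12, 5, 1]

n = sum(frequencies)
# Step 1: approximate the sample mean from the mid-points.
approx_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n
# Step 2: apply the grouped variance formula.
sum_f_m_squared = sum(f * m ** 2 for f, m in zip(frequencies, midpoints))
approx_variance = (sum_f_m_squared - n * approx_mean ** 2) / (n - 1)
approx_sd = approx_variance ** 0.5   # standard deviation is the square root

print(approx_mean, round(approx_variance, 4), round(approx_sd, 4))
```

Note that the divisor is n − 1, where n is the total number of observations, not the number of classes k.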
Box plots (or “box and whisker” plots) are another graphical method for displaying data and are particularly useful for highlighting differences between groups, for example, different spending patterns between males and females or comparing pricing within designated market segments. These plots use some of the key summary statistics we have looked at earlier: the quartiles and also the maximum and minimum observations. The plot is constructed as follows. After laying out an x–axis for the full range of the data, a rectangle is drawn with ends at the upper and lower quartiles. The rectangle is split into two at the median. This is the “box”. Finally, lines are drawn from the box to the minimum and maximum values – these are the “whiskers”.
Suppose that, from our data, we obtain the following summary statistics:
Minimum   Lower quartile ($Q_1$)   Median ($Q_2$)   Upper quartile ($Q_3$)   Maximum
10        40                       43               45                       50
In the space below, construct the associated box plot.
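One rough way to check a hand-drawn plot is to render the five summary statistics as a line of text. The sketch below follows the construction steps described above (box between the quartiles, split at the median, whiskers out to the minimum and maximum); the function name, the characters used and the default width are all arbitrary choices for illustration.

```python
def ascii_box_plot(minimum, q1, median, q3, maximum, width=41):
    """Return a one-line text box plot scaled to `width` characters."""
    span = maximum - minimum
    if span <= 0:
        raise ValueError("maximum must be greater than minimum")

    def column(value):
        # Map a data value to a character column along the axis.
        return round((value - minimum) / span * (width - 1))

    line = [" "] * width
    for c in range(column(minimum), column(q1)):
        line[c] = "-"                      # left whisker
    for c in range(column(q1), column(q3) + 1):
        line[c] = "="                      # the box
    for c in range(column(q3) + 1, column(maximum) + 1):
        line[c] = "-"                      # right whisker
    line[column(q1)] = "["                 # lower quartile
    line[column(q3)] = "]"                 # upper quartile
    line[column(median)] = "|"             # median splits the box
    return "".join(line)


plot = ascii_box_plot(10, 40, 43, 45, 50)
print(plot)
```

With these statistics the left whisker is much longer than the right, reflecting the long gap between the minimum of 10 and the lower quartile of 40.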
Displaying group structure is one of the main uses of box plots. Shown below is a plot produced by Minitab.
It clearly shows that although there is overlap between the three sets of data, the first and second datasets contain roughly similar responses and that these are quite different from those in the third set. Note that the asterisks (*) at the ends of the whiskers are the way Minitab highlights outlying values.