Understanding Means, Medians, Variance, and Standard Deviation in Applied Biostatistics, Study notes of Mathematical Methods

This document, by Professor Martin Bland of the University of York, introduces the mean, median, variance, and standard deviation in the context of applied biostatistics. It covers the calculation and interpretation of these measures of central tendency and variability, as well as their relationship to skewness and the Normal distribution.

Typology: Study notes

2010/2011

Uploaded on 09/10/2011

Applied Biostatistics

Mean and Standard Deviation

Martin Bland

Professor of Health Statistics, University of York

http://www-users.york.ac.uk/~mb55/

The mean

The arithmetic mean or average, usually referred to simply as the mean, is found by taking the sum of the observations and dividing by their number.

The mean is often denoted by a little bar over the symbol for the variable, e.g. $\bar{x}$.

The sample mean has much nicer mathematical properties than the median and is thus more useful for the comparison methods described later.

The median is a very useful descriptive statistic, but not much used for other purposes.
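
The definition of the mean given above can be written compactly as a formula. The symbols $n$ (the number of observations) and $x_i$ (the $i$th observation) are notation assumed here for illustration; they do not appear on the slides.

```latex
\[
  \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
\]
```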

Median, mean and skewness:

Mean FEV1 = 4.06. Median FEV1 = 4.1, so the median is within 1% of the mean.

Mean triglyceride = 0.51. Median triglyceride = 0.46. The median is 10% away from the mean.

If the distribution is symmetrical, the sample mean and median will be about the same, but in a skew distribution they will usually be different.

If the distribution is skew to the right, as for serum triglyceride, the mean will usually be greater; if it is skew to the left, the median will usually be greater.

This is because the values in the tails affect the mean but not the median.

Increasing the largest observation will pull the mean higher.

It will not affect the median.
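
A minimal sketch of this effect, using a small hypothetical sample (the values are invented for illustration and are not the triglyceride data). Only the largest observation is changed between the two lists.

```python
from statistics import mean, median

# Hypothetical right-skewed sample (illustrative values, not the study data).
values = [0.3, 0.4, 0.4, 0.5, 0.5, 0.6, 0.9]

# The same sample with only the largest observation increased.
stretched = values[:-1] + [1.8]

print(round(mean(values), 3), median(values))
print(round(mean(stretched), 3), median(stretched))
```

The mean rises when the largest value is increased, while the median stays at the middle observation.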

[Figure: histogram of serum triglyceride (frequency against triglyceride), with the median and mean marked.]

Variability

The mean and median are measures of the central tendency or position of the middle of the distribution. We shall also need a measure of the spread, dispersion or variability of the distribution.

For use in the analysis of data, the range and interquartile range (IQR) are not satisfactory. Instead we use two other measures of variability: variance and standard deviation. These both measure how far observations are from the mean of the distribution. Variance is the average squared difference from the mean. Standard deviation is the square root of the variance.
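
A minimal sketch of both calculations on a small hypothetical sample. The slide describes variance as the average squared difference from the mean; the sketch assumes the common sample convention of dividing the sum of squares by n - 1 rather than n, which is what Python's `statistics.variance` does, and prints the divide-by-n version (`statistics.pvariance`) for comparison. The data and the helper name `sample_variance` are invented for illustration.

```python
from math import sqrt
from statistics import mean, pvariance, stdev, variance

# Hypothetical FEV1-like measurements in litres (not the study data).
sample = [3.2, 3.8, 4.0, 4.1, 4.4, 4.9]

def sample_variance(xs):
    """Sum of squared differences from the mean, divided by n - 1 (assumed convention)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

var = sample_variance(sample)
sd = sqrt(var)  # standard deviation is the square root of the variance

print(round(var, 3), round(sd, 3))
print(round(pvariance(sample), 3))  # divide-by-n version, slightly smaller
```

Because the standard deviation is a square root of squared differences, it is in the same units as the observations (litres here), which is why it is the figure usually quoted.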

Standard deviation

FEV1: $s = \sqrt{0.449} = 0.67$ litres.

[Figure: histogram of FEV1 (litres) against frequency, with the mean and the points one and two standard deviations either side of it marked.]

Majority of observations within one SD of mean (usually about 2/3). Almost all within about two SD of mean (usually about 95%).

Triglyceride: $s = \sqrt{0.04802} = 0.22$ mmol/litre.

Majority of observations within one SD of mean (usually about 2/3). Almost all within about two SD of mean (usually about 95%), but those outside may be all at one end.

[Figure: histogram of triglyceride against frequency, with the mean and the points one and two standard deviations either side of it marked.]

Gestational age: $s = \sqrt{5.242} = 2.29$ weeks.

Majority of observations within one SD of mean (usually about 2/3). Almost all within about two SD of mean (usually about 95%), but those outside may be all at one end.

[Figure: histogram of gestational age (weeks) against frequency, with the mean and the points one and two standard deviations either side of it marked.]
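
The proportions quoted in these examples can be checked by simulation. The sketch below uses simulated data, not the FEV1, triglyceride or gestational age measurements: a roughly Normal sample and a right-skewed (log-Normal) sample, with parameters chosen arbitrarily for illustration. The helper names `share_within` and `share_below` are hypothetical.

```python
import random
from statistics import mean, stdev

def share_within(xs, k):
    """Proportion of observations within k standard deviations of the mean."""
    m, s = mean(xs), stdev(xs)
    return sum(m - k * s <= x <= m + k * s for x in xs) / len(xs)

def share_below(xs, k):
    """Proportion of observations more than k standard deviations below the mean."""
    m, s = mean(xs), stdev(xs)
    return sum(x < m - k * s for x in xs) / len(xs)

random.seed(1)  # fixed seed so the sketch is repeatable
symmetric = [random.gauss(4.0, 0.7) for _ in range(10_000)]        # roughly Normal
skewed = [random.lognormvariate(0.0, 0.5) for _ in range(10_000)]  # skew to the right

for name, xs in [("symmetric", symmetric), ("skewed", skewed)]:
    print(name,
          round(share_within(xs, 1), 2),
          round(share_within(xs, 2), 2),
          round(share_below(xs, 2), 3))
```

In the skewed sample about 95% of observations still fall within two standard deviations of the mean, but none fall more than two standard deviations below it, so those outside are all at the upper end.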

Spotting skewness

If the mean is less than two standard deviations, two standard deviations below the mean will be negative.

For any variable which cannot be negative, this tells us that the distribution must be positively skew.

If the mean or the median is near to one end of the range or interquartile range, this tells us that the distribution must be skew. If the mean or median is near the lower limit, the distribution will be positively skew; if near the upper limit, it will be negatively skew.

Triglyceride: median = 0.46, mean = 0.51, SD = 0.22, range = 0.15 to 1.66, IQR = 0.35 to 0. mmol/l.

These rules of thumb only work one way: for example, the mean may exceed two SDs and the distribution may still be skew.

Gestational age: median = 39, mean = 38.95, SD = 2.29, range = 21 to 44, IQR = 38 to 40 weeks.
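
The two rules of thumb can be sketched as a small function applied to the summary statistics quoted above. The function name `skew_hints` and the cut-off `near=0.25` (the fraction of the range that counts as "near to one end") are assumptions made for illustration; the slides give no numerical threshold.

```python
def skew_hints(mean, median, sd, low, high, near=0.25, non_negative=True):
    """Apply the two rules of thumb to summary statistics.

    `near` is an assumed cut-off, as a fraction of the range, for what
    counts as "near to one end"; it is not taken from the slides.
    """
    hints = []
    # Rule 1: for a variable that cannot be negative, a mean smaller than
    # two standard deviations implies positive skewness.
    if non_negative and mean < 2 * sd:
        hints.append("positive (mean less than two SDs)")
    # Rule 2: a median close to one end of the range implies skewness.
    position = (median - low) / (high - low)
    if position < near:
        hints.append("positive (median near lower limit)")
    elif position > 1 - near:
        hints.append("negative (median near upper limit)")
    return hints

triglyceride = skew_hints(mean=0.51, median=0.46, sd=0.22, low=0.15, high=1.66)
gestational_age = skew_hints(mean=38.95, median=39, sd=2.29, low=21, high=44)

print(triglyceride)     # ['positive (median near lower limit)']
print(gestational_age)  # ['negative (median near upper limit)']
```

For triglyceride the mean (0.51) exceeds two SDs (0.44), so the first rule gives no signal even though the distribution is skew; only the position of the median within the range picks it up.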

The Normal distribution

Many statistical methods are only valid if we can assume that our data follow a distribution of a particular type, the Normal distribution. This is a continuous, symmetrical, unimodal distribution described by a mathematical equation, which we shall omit.

[Figure: histogram of FEV1 (litres) against frequency.]

The Normal distribution has two parameters, which happen to be equal to the mean and variance of the distribution. These two numbers tell us which member of the Normal family we have.

The Normal distribution with mean = 0 and variance = 1 is called the Standard Normal distribution.

[Figure: relative frequency density curves for three Normal distributions, labelled by mean and variance: one with mean 0 and variance 1, and two with mean 3.]

The distributions are the same in terms of standard deviations from the mean.

[Figure: relative frequency density curves for three Normal distributions, labelled by mean and SD: one with mean 0 and SD 1, and two with mean 3.]
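
One way to see that all Normal distributions are the same in terms of standard deviations from the mean is to compute, for several members of the family, the proportion lying within a given number of SDs of the mean. The sketch uses `statistics.NormalDist` from the Python standard library; the three sets of parameters are hypothetical and are not the ones plotted on the slides.

```python
from statistics import NormalDist

# Hypothetical members of the Normal family (parameters chosen for illustration).
family = [
    NormalDist(mu=0, sigma=1),
    NormalDist(mu=3, sigma=1),
    NormalDist(mu=3, sigma=2),
]

def share_within(dist, k):
    """Proportion of the distribution within k SDs of its mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower

for dist in family:
    print(dist.mean, dist.stdev,
          round(share_within(dist, 1), 4),
          round(share_within(dist, 2), 4))
```

Each row prints the same two proportions, whatever the mean and SD.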

The Normal distribution is important for two reasons.

  1. Many natural variables follow it quite closely, certainly sufficiently closely for us to use statistical methods which require this.
  2. Even when we have a variable which does not follow a Normal distribution, if we take the mean of a sample of observations, such means will approximately follow a Normal distribution, and the approximation improves as the sample size increases.

An illustration of the Central Limit Theorem

[Figure: four histograms (frequency against value) showing a single Uniform variable, the mean of two Uniform variables, the mean of four Uniform variables, and the mean of ten Uniform variables.]
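
The illustration can be reproduced approximately by simulation. The sketch below is not the code used for the slides: the number of simulated means (5,000), the seed, and the function name `means_of_uniforms` are assumptions. For each of 1, 2, 4 and 10 Uniform variables it reports the mean and SD of the simulated means and the proportion lying within one SD of their mean.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

def means_of_uniforms(k, draws=5_000):
    """Simulate `draws` means, each of k independent Uniform(0, 1) values."""
    return [sum(random.random() for _ in range(k)) / k for _ in range(draws)]

results = {}
for k in (1, 2, 4, 10):
    ms = means_of_uniforms(k)
    m, s = mean(ms), stdev(ms)
    within_one = sum(abs(x - m) <= s for x in ms) / len(ms)
    results[k] = (m, s, within_one)
    print(k, round(m, 3), round(s, 3), round(within_one, 3))
```

As more Uniform variables are averaged, the spread of the means shrinks and the proportion within one SD moves from the flat Uniform value (a little under 0.6) towards the Normal value of about 0.68.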

There is no simple formula linking the variable and the area under the curve. Hence we cannot find a formula to calculate the frequency between two chosen values of the variable, nor the value which would be exceeded for a given proportion of observations. Numerical methods for calculating these things with acceptable accuracy were used to produce extensive tables of the Normal distribution. These numerical methods for calculating Normal frequencies have been built into statistical computer programs and computers can estimate them whenever they are needed.
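
As a sketch of how a program might do this, the proportion of a Normal distribution below a given value can be written in terms of the error function, which Python provides as `math.erf` and evaluates numerically. The example assumes, purely for illustration, that FEV1 follows a Normal distribution with the mean (4.06) and SD (0.67) quoted earlier; the cut-offs of 3 and 5 litres and the 10% proportion are hypothetical, as is the helper name `normal_cdf`.

```python
from math import erf, sqrt
from statistics import NormalDist

def normal_cdf(x, mean, sd):
    """Proportion of a Normal distribution with this mean and SD lying below x."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

mean_fev1, sd_fev1 = 4.06, 0.67  # summary values quoted in the notes

# Frequency between two chosen values of the variable.
between = normal_cdf(5.0, mean_fev1, sd_fev1) - normal_cdf(3.0, mean_fev1, sd_fev1)

# Value which would be exceeded by a given proportion (here 10%) of observations.
exceeded_by_10pct = NormalDist(mean_fev1, sd_fev1).inv_cdf(0.90)

print(round(between, 3), round(exceeded_by_10pct, 2))
```

The first calculation goes from values to a proportion; the second goes the other way, from a proportion to a value, using the standard library's inverse cumulative function.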

Two numbers from tables of the Normal distribution:

  1. we expect 68% of observations to lie within one standard deviation from the mean,
  2. we expect 95% of observations to lie within 1.96 standard deviations from the mean.

This is true for all Normal distributions, whatever the mean, variance, and standard deviation.
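
Both table values can be reproduced with `statistics.NormalDist` from the Python standard library. This is a minimal check on the Standard Normal distribution, not a replacement for the tables; the variable names are chosen for illustration.

```python
from statistics import NormalDist

standard = NormalDist()  # Standard Normal: mean 0, SD 1

# Proportion of observations within one SD of the mean.
within_one_sd = standard.cdf(1) - standard.cdf(-1)

# Number of SDs either side of the mean that encloses the central 95%:
# 2.5% is left in each tail, so we need the 97.5% point.
sds_for_95 = standard.inv_cdf(0.975)

print(round(within_one_sd, 4), round(sds_for_95, 2))  # 0.6827 1.96
```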