
**Testing for Normality**

For each mean and standard deviation combination a theoretical normal distribution can be determined. This distribution is based on the proportions shown below.

This theoretical normal distribution can then be compared to the actual distribution of the data.

Are the actual data statistically different than the computed normal curve?

Theoretical normal distribution **calculated** from a mean of 66.51 and a standard deviation of 18.265.

The **actual** data distribution that has a mean of 66.51 and a standard deviation of 18.265.
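As a rough illustration of how a theoretical normal distribution follows from a mean and standard deviation alone, the sketch below uses Python's standard library. It assumes the proportions in question are the familiar shares of a normal curve lying within 1, 2 and 3 standard deviations of the mean; the names `theoretical` and `share_within` are illustrative only.

```python
from statistics import NormalDist

# Mean and standard deviation quoted in the text above.
MEAN, SD = 66.51, 18.265
theoretical = NormalDist(mu=MEAN, sigma=SD)

def share_within(k):
    """Proportion of a normal curve lying within k standard deviations of the mean."""
    return theoretical.cdf(MEAN + k * SD) - theoretical.cdf(MEAN - k * SD)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} SD ({MEAN - k * SD:.2f} to {MEAN + k * SD:.2f}): {share:.1%}")
```

Comparing these expected shares with the shares actually observed in the data is the informal version of what the formal tests quantify.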

There are several methods of assessing whether data are normally distributed or not. They fall into two broad categories: *graphical* and *statistical*. Some common techniques are:

Graphical

• Q-Q probability plots

• Cumulative frequency (P-P) plots

Statistical

• W/S test

• Jarque-Bera test

• Shapiro-Wilk test

• Kolmogorov-Smirnov test

• D’Agostino test

Q-Q plots display the observed values against normally distributed data (represented by the line).

Normally distributed data fall along the line.
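The points behind a Q-Q plot can be sketched in a few lines. This is a minimal example, not a definitive implementation: it uses the village population densities from the worked examples later in this document, and it assumes plotting positions of (i − 0.5)/n, which is only one of several common conventions. The function name `qq_points` is hypothetical.

```python
from statistics import NormalDist, mean, stdev

# Village population densities from this document's worked examples.
densities = [4.13, 4.53, 4.69, 4.76, 4.77, 4.96, 4.97, 5.00, 5.04,
             5.10, 5.25, 5.36, 5.94, 6.06, 6.19, 6.30, 7.73]

def qq_points(data):
    """Pair each ordered observation with its expected normal quantile."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    # Assumption: plotting positions (i - 0.5) / n; other conventions exist.
    return [(fitted.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(xs, start=1)]

points = qq_points(densities)
for expected, observed in points:
    print(f"expected {expected:5.2f}   observed {observed:5.2f}")
```

If the data were perfectly normal, each observed value would equal its expected value, which is the reference line on the plot.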

Graphical methods are typically not very useful when the sample size is small. This is a histogram of the last example. These data do not ‘look’ normal, but they are not statistically different than normal.

**Tests of Normality**

| | Kolmogorov-Smirnov (a): Statistic | df | Sig. | Shapiro-Wilk: Statistic | df | Sig. |
|---|---|---|---|---|---|---|
| Age | .110 | 1048 | .000 | .931 | 1048 | .000 |

a. Lilliefors Significance Correction

**Tests of Normality**

| | Kolmogorov-Smirnov (a): Statistic | df | Sig. | Shapiro-Wilk: Statistic | df | Sig. |
|---|---|---|---|---|---|---|
| TOTAL_VALU | .283 | 149 | .000 | .463 | 149 | .000 |

a. Lilliefors Significance Correction

**Tests of Normality**

| | Kolmogorov-Smirnov (a): Statistic | df | Sig. | Shapiro-Wilk: Statistic | df | Sig. |
|---|---|---|---|---|---|---|
| Z100 | .071 | 100 | .200\* | .985 | 100 | .333 |

\*. This is a lower bound of the true significance.

a. Lilliefors Significance Correction

Statistical tests for normality are more precise since actual probabilities are calculated.

Tests for normality calculate the probability that the sample was drawn from a normal population.

The hypotheses used are:

Ho: The sample data are not significantly different than a normal population.

Ha: The sample data are significantly different than a normal population.

Typically, we are interested in finding a difference between groups. When we are, we ‘look’ for small probabilities.

• If the probability of finding an event is rare (less than 5%) and we actually find it, that is of interest.

When testing normality, we are not ‘looking’ for a difference.

• In effect, we want our data set to be NO DIFFERENT than normal. We want to accept the null hypothesis.

So when testing for normality:

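• Probabilities > 0.05 mean the data are not significantly different from normal, so we treat them as normal.

• Probabilities < 0.05 mean the data are significantly different from normal, so we treat them as NOT normal.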
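The decision rule can be written out as a small helper. This is only a sketch: the function name `normality_decision` is hypothetical, and treating a probability exactly equal to alpha as a rejection is an assumed convention. The two probabilities are the Shapiro-Wilk Sig. values reported for Z100 and Age in the SPSS tables above.

```python
ALPHA = 0.05  # the typical alpha level

def normality_decision(p_value, alpha=ALPHA):
    """Apply the rule above to the probability reported by a normality test."""
    if p_value > alpha:
        return "not significantly different from normal"
    # Assumption: p exactly equal to alpha is treated as a rejection.
    return "significantly different from normal"

z100 = normality_decision(0.333)  # Shapiro-Wilk Sig. for Z100
age = normality_decision(0.000)   # Shapiro-Wilk Sig. for Age
print("Z100:", z100)
print("Age: ", age)
```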


Remember that LARGE probabilities denote normally distributed data. Below are examples taken from SPSS.

**Normally Distributed Data**

| | Kolmogorov-Smirnov (a): Statistic | df | Sig. | Shapiro-Wilk: Statistic | df | Sig. |
|---|---|---|---|---|---|---|
| Asthma Cases | .069 | 72 | .200\* | .988 | 72 | .721 |

\*. This is a lower bound of the true significance.

a. Lilliefors Significance Correction

In the SPSS output above the probabilities are greater than 0.05 (the typical alpha level), so we accept Ho… these data are not different from normal.


**Non-Normally Distributed Data**

| | Kolmogorov-Smirnov (a): Statistic | df | Sig. | Shapiro-Wilk: Statistic | df | Sig. |
|---|---|---|---|---|---|---|
| Average PM10 | .142 | 72 | .001 | .841 | 72 | .000 |

a. Lilliefors Significance Correction

In the SPSS output above the probabilities are less than 0.05 (the typical alpha level), so we reject Ho… these data are significantly different from normal.

**Important:** As the sample size *increases*, normality parameters become *MORE* restrictive and it becomes harder to declare that the data are normally distributed. So for very large data sets, normality testing becomes less important.

**Three Simple Tests for Normality**

**W/S Test for Normality**

• A fairly simple test that requires only the sample standard deviation and the data range.

• Based on the q statistic, which is the ‘studentized’ (meaning t distribution) range, or the range expressed in standard deviation units. Tests kurtosis.

• Should not be confused with the Shapiro-Wilk test.

$$q = \frac{w}{s}$$

where *q* is the test statistic, *w* is the range of the data and *s* is the standard deviation.

Range constant, SD changes

Range changes, SD constant

| Village | Population Density |
|---|---|
| Aranza | 4.13 |
| Corupo | 4.53 |
| San Lorenzo | 4.69 |
| Cheranatzicurin | 4.76 |
| Nahuatzen | 4.77 |
| Pomacuaran | 4.96 |
| Sevina | 4.97 |
| Arantepacua | 5.00 |
| Cocucho | 5.04 |
| Charapan | 5.10 |
| Comachuen | 5.25 |
| Pichataro | 5.36 |
| Quinceo | 5.94 |
| Nurio | 6.06 |
| Turicuaro | 6.19 |
| Urapicho | 6.30 |
| Capacuaro | 7.73 |

Standard deviation (s) = 0.866 Range (w) = 3.6 n = 17

$$q = \frac{w}{s} = \frac{3.6}{0.866} = 4.16$$

$$q_{\text{Critical Range}} = 3.06 \text{ to } 4.31$$

The W/S test uses a critical range. IF the calculated value falls WITHIN the range, then accept Ho. IF the calculated value falls outside the range then reject Ho.

Since 3.06 < *q* = 4.16 < 4.31, we accept Ho.

Since we have a critical range, it is difficult to determine a probability range for our results. Therefore we simply state our alpha level.

The sample data set is not significantly different than normal (W/S = 4.16, p > 0.05).
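The W/S calculation above can be reproduced with a short script. This is a minimal sketch using only the standard library; the function name `ws_statistic` is hypothetical, and the critical range 3.06 to 4.31 is copied from the worked example rather than looked up by the code.

```python
from statistics import stdev

# Village population densities from the table above.
densities = [4.13, 4.53, 4.69, 4.76, 4.77, 4.96, 4.97, 5.00, 5.04,
             5.10, 5.25, 5.36, 5.94, 6.06, 6.19, 6.30, 7.73]

def ws_statistic(data):
    """q = w / s: the range expressed in sample standard deviation units."""
    w = max(data) - min(data)
    s = stdev(data)  # sample standard deviation (n - 1 denominator)
    return w / s

# Critical range quoted in the worked example (n = 17, alpha = 0.05).
Q_LOWER, Q_UPPER = 3.06, 4.31

q = ws_statistic(densities)
within_range = Q_LOWER < q < Q_UPPER
print(f"q = {q:.2f}; within critical range: {within_range}")
```

For a different sample size the critical range has to be replaced with the values from the W/S table for that *n*.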

**Jarque–Bera Test**

A goodness-of-fit test of whether sample data have the skewness and kurtosis matching a normal distribution.

$$k_3 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^3}{n s^3} \qquad k_4 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^4}{n s^4} - 3$$

$$JB = n\left(\frac{(k_3)^2}{6} + \frac{(k_4)^2}{24}\right)$$

Where *x* is each observation, *n* is the sample size, *s* is the standard deviation, *k3* is skewness, and *k4* is kurtosis.

| Village | Population Density | Mean Deviates | Mean Deviates³ | Mean Deviates⁴ |
|---|---|---|---|---|
| Aranza | 4.13 | -1.21 | -1.771561 | 2.14358881 |
| Corupo | 4.53 | -0.81 | -0.531441 | 0.43046721 |
| San Lorenzo | 4.69 | -0.65 | -0.274625 | 0.17850625 |
| Cheranatzicurin | 4.76 | -0.58 | -0.195112 | 0.11316496 |
| Nahuatzen | 4.77 | -0.57 | -0.185193 | 0.10556001 |
| Pomacuaran | 4.96 | -0.38 | -0.054872 | 0.02085136 |
| Sevina | 4.97 | -0.37 | -0.050653 | 0.01874161 |
| Arantepacua | 5.00 | -0.34 | -0.039304 | 0.01336336 |
| Cocucho | 5.04 | -0.30 | -0.027000 | 0.00810000 |
| Charapan | 5.10 | -0.24 | -0.013824 | 0.00331776 |
| Comachuen | 5.25 | -0.09 | -0.000729 | 0.00006561 |
| Pichataro | 5.36 | 0.02 | 0.000008 | 0.00000016 |
| Quinceo | 5.94 | 0.60 | 0.216000 | 0.12960000 |
| Nurio | 6.06 | 0.72 | 0.373248 | 0.26873856 |
| Turicuaro | 6.19 | 0.85 | 0.614125 | 0.52200625 |
| Urapicho | 6.30 | 0.96 | 0.884736 | 0.84934656 |
| Capacuaro | 7.73 | 2.39 | 13.651919 | 32.62808641 |
| **Sum** | | | **12.595722** | **37.433505** |

$$\bar{x} = 5.34 \qquad s = 0.87$$

$$k_3 = \frac{12.6}{(17)(0.87^3)} = 1.13 \qquad k_4 = \frac{37.43}{(17)(0.87^4)} - 3 = 0.843$$

$$JB = 17\left(\frac{(1.13)^2}{6} + \frac{(0.843)^2}{24}\right) = 17\left(\frac{1.2769}{6} + \frac{0.711}{24}\right) = 17(0.2128 + 0.0296) = 4.12$$

The Jarque-Bera statistic can be compared to the χ² distribution (table) with 2 degrees of freedom (df or *v*) to determine the critical value at an alpha level of 0.05.

The critical χ² value is 5.991. Our calculated Jarque-Bera statistic is 4.12, which is less than the critical value; its probability falls between 0.5 and 0.1, which is greater than the alpha level of 0.05.

Therefore we accept Ho that there is no difference between our distribution and a normal distribution (Jarque-Bera χ² = 4.12, 0.5 > p > 0.1).
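The Jarque-Bera arithmetic can be checked with a short script. This sketch follows the formulas as given in this document, using the sample standard deviation; library implementations may define the moments differently, so treat the function `jarque_bera` as illustrative only. Because the script carries *s* at full precision instead of rounding it to 0.87, its statistic comes out slightly larger than the hand-calculated 4.12, although the conclusion is unchanged. The probability uses the fact that the upper tail of a χ² distribution with 2 degrees of freedom is exp(−x/2).

```python
from math import exp
from statistics import mean, stdev

# Village population densities from the table above.
densities = [4.13, 4.53, 4.69, 4.76, 4.77, 4.96, 4.97, 5.00, 5.04,
             5.10, 5.25, 5.36, 5.94, 6.06, 6.19, 6.30, 7.73]

def jarque_bera(data):
    """Return (k3, k4, JB) using the formulas given in this document."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)  # sample standard deviation, not rounded
    k3 = sum((x - x_bar) ** 3 for x in data) / (n * s ** 3)      # skewness
    k4 = sum((x - x_bar) ** 4 for x in data) / (n * s ** 4) - 3  # kurtosis
    jb = n * (k3 ** 2 / 6 + k4 ** 2 / 24)
    return k3, k4, jb

CHI2_CRITICAL = 5.991  # chi-square critical value, 2 df, alpha = 0.05

k3, k4, jb = jarque_bera(densities)
p_value = exp(-jb / 2)  # exact upper-tail probability for chi-square with 2 df
print(f"k3 = {k3:.3f}, k4 = {k4:.3f}, JB = {jb:.2f}, p = {p_value:.3f}")
print("accept Ho" if jb < CHI2_CRITICAL else "reject Ho")
```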

**D’Agostino Test**

• A very powerful test for departures from normality.

• Based on the D statistic, which gives an upper and lower critical value.

• First the data are ordered from smallest to largest or largest to smallest.

$$D = \frac{T}{\sqrt{n^3 \, SS}} \quad \text{where} \quad T = \sum \left(i - \frac{n+1}{2}\right) X_i$$

where *D* is the test statistic, *SS* is the sum of squared deviations of the data from the mean, *n* is the sample size, and *i* is the order or rank of observation *X*. The df for this test is n (sample size).

| Village | Population Density | i | Mean Deviates² |
|---|---|---|---|
| Aranza | 4.13 | 1 | 1.46410 |
| Corupo | 4.53 | 2 | 0.65610 |
| San Lorenzo | 4.69 | 3 | 0.42250 |
| Cheranatzicurin | 4.76 | 4 | 0.33640 |
| Nahuatzen | 4.77 | 5 | 0.32490 |
| Pomacuaran | 4.96 | 6 | 0.14440 |
| Sevina | 4.97 | 7 | 0.13690 |
| Arantepacua | 5.00 | 8 | 0.11560 |
| Cocucho | 5.04 | 9 | 0.09000 |
| Charapan | 5.10 | 10 | 0.05760 |
| Comachuen | 5.25 | 11 | 0.00810 |
| Pichataro | 5.36 | 12 | 0.00040 |
| Quinceo | 5.94 | 13 | 0.36000 |
| Nurio | 6.06 | 14 | 0.51840 |
| Turicuaro | 6.19 | 15 | 0.72250 |
| Urapicho | 6.30 | 16 | 0.92160 |
| Capacuaro | 7.73 | 17 | 5.71210 |

**Mean = 5.34, SS = 11.9916**

$$\frac{n+1}{2} = \frac{17+1}{2} = 9$$

$$T = \sum (i - 9) X_i = (1-9)4.13 + (2-9)4.53 + (3-9)4.69 + \dots + (17-9)7.73 = 63.23$$

$$D = \frac{63.23}{\sqrt{(17^3)(11.9916)}} = 0.26050$$

$$D_{\text{Critical}} = 0.2587,\ 0.2860$$

Df = n = 17

If the calculated value falls within the critical range, accept Ho.

Since 0.2587 < *D* = 0.26050 < 0.2860, accept Ho.

The sample data set is not significantly different than normal (D = 0.26050, p > 0.05).
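The D statistic can be recomputed with a few lines of standard-library Python. This is a minimal sketch of the formula as given in this document; the function name `dagostino_d` is hypothetical, and the critical values 0.2587 and 0.2860 are copied from the worked example rather than looked up by the code.

```python
from math import sqrt
from statistics import mean

# Village population densities from the table above.
densities = [4.13, 4.53, 4.69, 4.76, 4.77, 4.96, 4.97, 5.00, 5.04,
             5.10, 5.25, 5.36, 5.94, 6.06, 6.19, 6.30, 7.73]

def dagostino_d(data):
    """D = T / sqrt(n^3 * SS), with T = sum((i - (n + 1) / 2) * X_i)."""
    xs = sorted(data)  # rank the observations from smallest to largest
    n = len(xs)
    x_bar = mean(xs)
    ss = sum((x - x_bar) ** 2 for x in xs)  # sum of squared mean deviates
    t = sum((i - (n + 1) / 2) * x for i, x in enumerate(xs, start=1))
    return t / sqrt(n ** 3 * ss)

# Critical values quoted in the worked example (n = 17, alpha = 0.05).
D_LOWER, D_UPPER = 0.2587, 0.2860

d = dagostino_d(densities)
print(f"D = {d:.5f}; within critical range: {D_LOWER < d < D_UPPER}")
```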

$$(4.13 - 5.34)^2 = (-1.21)^2 = 1.46410$$

Use the next lower *n* on the table if your sample size is NOT listed.