ANOVA in SPSS (Practical)

Centre for Multilevel Modelling

The development of this E-Book has been supported by the British Academy.

Analysis of Variance practical

In this practical we will investigate how we model the influence of a categorical predictor on a continuous response.

In this example, we will test formally for an association between a student's test score on the PISA science test, SCISCORE, and their parents' educational attainment, an indicator often used to capture a family's socio-economic resources. The variable PAREDU classifies highest parental educational qualification into three categories and, as such, it is an appropriate predictor or independent variable to use in ANOVA (known as a factor with three levels in ANOVA terminology). The statistical test explores whether the mean science test score differs across the three educational groups. Evidence of achievement gaps of this kind is often used in the discussion and quantification of inter-generational social mobility.
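A minimal way to write down the model and hypotheses behind this test is sketched below. The symbols (y_ij for the science score of student i in parental education group j, mu_j for the group mean, e_ij for the residual and sigma^2 for its variance) are our own notation, introduced here for illustration rather than taken from the practical.

```latex
y_{ij} = \mu_j + e_{ij}, \qquad e_{ij} \sim N(0, \sigma^2), \qquad
j \in \{\text{Low}, \text{Medium}, \text{High}\}

H_0 : \mu_{\text{Low}} = \mu_{\text{Medium}} = \mu_{\text{High}}
\qquad \text{versus} \qquad
H_1 : \text{at least one } \mu_j \text{ differs from the others}
```

The single residual variance shared by all three groups is the equal-variance assumption that ANOVA relies on.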


In this case we are interested in the effect of PAREDU on our response variable SCISCORE. PAREDU has 3 categories: Low: GCSE or equiv, Medium: A-level or equiv, High: University degree. We will begin by looking at this relationship graphically, examining the distribution of SCISCORE separately for each category; a good way to do this is via a box plot. To do this in SPSS:

Select Boxplot from the Legacy Dialogs submenu of the Graphs menu.

Select Simple and Summaries for groups of cases and click on the Define button.

Transfer the Science test score[SCISCORE] variable to the Variable box.

Transfer the Highest qualification of parent[PAREDU] variable to the Category Axis box.

Click on the OK button.

This will produce a table detailing the numbers of valid observations, which we do not show here, and then a plot with a box for each category as shown below:

A boxplot uses the 1st and 3rd quartiles of the data to form the box, with the median presented as a line inside the box. In this case the medians for each category are as follows: Low: GCSE or equiv = 502.5175, Medium: A-level or equiv = 523.0240 and High: University degree = 561.2170. We see that category High: University degree has the highest median whilst category Low: GCSE or equiv has the lowest median.
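For readers who want to see what the box is built from, the following is a minimal Python sketch of the three values (1st quartile, median, 3rd quartile) computed per category. The scores are invented purely for illustration and are not the PISA data, and the function name box_summary is our own. SPSS may use a different quartile convention, so the box edges it draws can differ slightly from these.

```python
import statistics

# Hypothetical scores, invented for illustration only (NOT the PISA data).
scores_by_group = {
    "Low: GCSE or equiv": [430.0, 465.5, 498.0, 507.0, 541.5, 580.0],
    "Medium: A-level or equiv": [455.0, 490.0, 519.5, 526.5, 560.0, 601.0],
    "High: University degree": [480.0, 530.0, 556.0, 566.5, 598.0, 640.0],
}


def box_summary(values):
    """Return (Q1, median, Q3), the three values that define the box."""
    # Assumption: the "inclusive" interpolation rule. Other packages,
    # including SPSS, may use a slightly different quartile definition.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, median, q3


summaries = {group: box_summary(vals) for group, vals in scores_by_group.items()}
for group, (q1, median, q3) in summaries.items():
    print(f"{group}: Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}")
```

Comparing the printed medians across categories is the numerical counterpart of comparing the lines inside the boxes.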

In fact we will be fitting an analysis of variance (ANOVA) model, which is based on group means rather than medians, and so an alternative plot, the error bar plot, is useful. To get this in SPSS do the following:

Select Error Bar from the Legacy Dialogs submenu of the Graphs menu.

Select Simple and Summaries for groups of cases as for the boxplot and click on the Define button.

Transfer the Science test score[SCISCORE] variable to the Variable box.

Transfer the Highest qualification of parent[PAREDU] variable to the Category Axis box.

Click on the OK button.

The graph will appear as shown below:

Here we are plotting the means rather than the medians, and the error bars represent 95% confidence intervals for the group means. The highest mean is 549.1397 for category High: University degree, whilst the lowest mean is 502.7856 for category Low: GCSE or equiv. Here we see that the confidence intervals for all pairs of categories overlap and so we might not expect an effect of PAREDU on SCISCORE, although overlap between intervals is only an informal guide and does not replace the formal test.
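The quantities drawn in an error bar plot can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reproduction of the SPSS calculation: the scores are invented (not the PISA data), the helper names mean_ci and intervals_overlap are our own, and the interval uses a normal critical value (about 1.96) where SPSS uses the t distribution, which gives slightly wider intervals in small groups.

```python
import itertools
import math
import statistics

# Hypothetical scores, invented for illustration only (NOT the PISA data).
scores_by_group = {
    "Low": [430.0, 465.5, 498.0, 507.0, 541.5, 580.0],
    "Medium": [455.0, 490.0, 519.5, 526.5, 560.0, 601.0],
    "High": [480.0, 530.0, 556.0, 566.5, 598.0, 640.0],
}


def mean_ci(values, level=0.95):
    """Return (mean, lower, upper) for a confidence interval on the mean."""
    mean = statistics.fmean(values)
    # Standard error of the mean from the sample standard deviation.
    se = statistics.stdev(values) / math.sqrt(len(values))
    # Assumption: normal approximation instead of the t distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return mean, mean - z * se, mean + z * se


def intervals_overlap(a, b):
    """True when two (mean, lower, upper) intervals share any values."""
    return a[1] <= b[2] and b[1] <= a[2]


intervals = {group: mean_ci(vals) for group, vals in scores_by_group.items()}
for group, (mean, lower, upper) in intervals.items():
    print(f"{group}: mean={mean:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")

# Informal pairwise check of whether the error bars overlap.
for g1, g2 in itertools.combinations(intervals, 2):
    print(g1, "vs", g2, "overlap:", intervals_overlap(intervals[g1], intervals[g2]))
```

With groups as small as these invented ones the intervals are wide; with several thousand observations, as in the practical, they would be much narrower.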

We have mentioned summary statistics here (medians and means) and we can access these via the Explore option as follows:

Choose Explore from the Descriptive Statistics submenu within the Analyze menu.

Add Science test score[SCISCORE] to the Dependent list.

Add Highest qualification of parent[PAREDU] to the Factor list.

Click on the OK button.

The summary statistics will appear as below.

Click on the Continue button.

Click on the OK button.

This will produce lots of outputs so we will talk about these in turn. First Levene's test:

Tests the null hypothesis that the error variance of the dependent variable is equal across groups.

Levene's test is used to check one of the underlying assumptions of ANOVA, the homogeneity of variances, i.e. that the residual variances are equal in each group. The test statistic here has the value 12.142 and, under the null hypothesis of equal variances, it follows an F distribution with 2 and 4756 degrees of freedom, where 2 is the number of categories - 1 and 4756 is the number of observations - the number of categories. The p value is reported as .000 (i.e. p < .001), which is less than 0.05 and therefore significant. We can therefore reject the null hypothesis: the data are not consistent with the assumption of equal variances, and we should adjust the variable in some way (maybe via a transformation) if we wish to use an ANOVA. The ANOVA itself is described by the ANOVA table given below:
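Before moving on to the ANOVA table, the calculation behind Levene's statistic can be sketched in Python. This is a minimal version of the classic form of the test, which runs a one-way ANOVA on the absolute deviations of each observation from its own group mean; the function name levene_statistic and the toy data are our own, and we assume, rather than know, that this is the variant behind the value reported above. The p value is omitted because the Python standard library has no F distribution.

```python
import statistics


def levene_statistic(groups):
    """Levene's statistic from absolute deviations about the group means.

    Returns (W, df1, df2), where W is compared with an F(df1, df2) distribution.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Step 1: absolute deviation of each observation from its own group mean.
    z = [[abs(x - statistics.fmean(g)) for x in g] for g in groups]
    z_means = [statistics.fmean(zg) for zg in z]
    z_grand = sum(sum(zg) for zg in z) / n_total
    # Step 2: one-way ANOVA F ratio computed on those deviations.
    between = sum(len(zg) * (m - z_grand) ** 2 for zg, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zg, m in zip(z, z_means) for v in zg)
    df1, df2 = k - 1, n_total - k
    return (between / df1) / (within / df2), df1, df2


# Toy groups with the same centre but increasing spread (illustration only).
toy_groups = [[9.0, 10.0, 11.0], [5.0, 10.0, 15.0], [0.0, 10.0, 20.0]]
w, df1, df2 = levene_statistic(toy_groups)
print(f"W = {w:.3f} on {df1} and {df2} degrees of freedom")
```

The degrees of freedom follow the same rule as in the text: with 3 categories and 4759 observations they are 3 - 1 = 2 and 4759 - 3 = 4756.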

The above table gives all the information required for us to decide if PAREDU is a significant predictor of SCISCORE. We will go through the table column by column to explain how the ANOVA works. SPSS gives rather a lot of rows in its ANOVA tables, largely because it also allows one to test the intercept term, which we are less interested in here.

We begin with the Type III Sum of Squares (SS) column. For the row Corrected total we see the value 49647340.403. This is calculated by taking, for each observation, the value of SCISCORE and subtracting the overall mean of SCISCORE; these differences are then squared and summed. (Note the value 1379293783.089 in the Total row is calculated similarly but without subtracting the mean from each observation.) Next, the value in row PAREDU is calculated by working out the mean of SCISCORE for each category of PAREDU. For each observation we take the mean of its category and subtract the overall mean, and again square and sum these differences to give the value 1770497.879. The Corrected Model row in a one-way ANOVA simply repeats the PAREDU row. Finally, the Error row SS is calculated by subtracting the PAREDU SS from the Corrected total SS, i.e. 47876842.524 = 49647340.403 - 1770497.879, which is effectively the sum of squares not explained by PAREDU.

For PAREDU to be a significant predictor of SCISCORE we hope its SS value is large relative to the Error SS, but these two numbers are based on different numbers of independent pieces of information and so we first need to adjust them to reflect this. The next column is therefore the degrees of freedom (df) column. Here we see we have 4759 total degrees of freedom, which is the number of observations, but 4758 corrected total df as we lose one by estimating the mean. For PAREDU we have 2 df, which is the number of categories - 1, again losing one because if we knew the overall mean and all bar one of the category means we could calculate the last one. Finally the Error df is 4756, which again is calculated by subtraction, i.e. 4756 = 4758 - 2.

We next use the df to convert the SS into Mean Squares (MS): the MS for PAREDU is the SS for PAREDU divided by the df for PAREDU, i.e. 885248.939 = 1770497.879 / 2. Similarly for the Error row we have an MS of 10066.620 = 47876842.524 / 4756. These two mean squares are now on the same scale and so we can compare their sizes by taking their ratio, F = 87.939 = 885248.939 / 10066.620. Under the null hypothesis this test statistic follows an F distribution with 2 and 4756 degrees of freedom, and it equates to a p value of .000 (i.e. p < .001) given in the Sig. column. This p value is less than 0.05 and so we can reject the null hypothesis and we find that PAREDU is a significant predictor of SCISCORE.
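The arithmetic in the table can be retraced with a short Python sketch. The function one_way_anova is our own illustrative helper, not part of SPSS, and it is run on toy data because the PISA data are not reproduced here. The second half simply redoes the divisions using the sums of squares and degrees of freedom quoted above.

```python
import statistics


def one_way_anova(groups):
    """Return SS, df, MS and F for a one-way ANOVA on a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_values)
    # Corrected total SS: squared deviations of observations from the overall mean.
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    # Factor SS: squared deviation of each group mean from the overall mean,
    # counted once for every observation in that group.
    ss_factor = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    ss_error = ss_total - ss_factor
    df_factor = len(groups) - 1
    df_error = len(all_values) - len(groups)
    ms_factor = ss_factor / df_factor
    ms_error = ss_error / df_error
    return {
        "ss_factor": ss_factor, "ss_error": ss_error, "ss_total": ss_total,
        "df_factor": df_factor, "df_error": df_error,
        "ms_factor": ms_factor, "ms_error": ms_error,
        "f": ms_factor / ms_error,
    }


# Toy data for illustration only (NOT the PISA data).
toy = one_way_anova([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]])
print(toy)

# Retracing the table using the values quoted in the text.
ss_paredu, ss_corrected_total = 1770497.879, 49647340.403
df_paredu, df_error = 2, 4756
ss_error = ss_corrected_total - ss_paredu
ms_paredu = ss_paredu / df_paredu
ms_error = ss_error / df_error
f_ratio = ms_paredu / ms_error
print(f"MS(PAREDU)={ms_paredu:.3f}, MS(Error)={ms_error:.3f}, F={f_ratio:.3f}")
```

Turning F into a p value needs the F distribution with 2 and 4756 degrees of freedom, which the standard library does not provide; a statistics package would be needed for that final step.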