







Analysis of variance (ANOVA) is a statistical technique for analyzing quantitative data from experimental units. It compares the means of two or more populations or treatments. These notes cover the concept of ANOVA, including the notation, the calculation of the test statistic, and the interpretation of the results, and include an R example.








In studying methods for the analysis of quantitative data, we first focused on problems involving a single sample of numbers and then turned to a comparative analysis of two such samples. In one-sample problems, the data consisted of observations on individuals randomly selected from a single population.
In two-sample problems, either the two samples were drawn from two different populations, or two different treatments were applied to elements selected from a single population.
The analysis of variance, or ANOVA, refers to a collection of procedures for the analysis of responses from experimental units. The simplest ANOVA problem is referred to as a single-factor or one-way ANOVA and involves the analysis either of data sampled from two or more populations or of data from experiments in which two or more treatments have been used. As such, the ANOVA setup is a generalization of the two-sample t-test.
The characteristic that differentiates the treatments or populations from one another is called the factor, and the different treatments or populations are referred to as the levels of the factor. Let's begin with an example.
Briefly, say a farmer wants to investigate whether flower production differs across gardens. There are three gardens: A, B, and C. We are given 10 weeks of data: the number of flowers grown per week in each garden. As always we introduce some notation. Let:
$j$ ≡ An index for the treatments or populations being compared.
$k$ ≡ The total number of treatments. Here $k = 3$.
$i$ ≡ An index for the observations.
$n_j$ ≡ The number of observations in treatment $j$.
$\mu_j$ ≡ The mean of population or treatment $j$.
Here $j = 1$ is garden A, $j = 2$ is garden B, and $j = 3$ is garden C. Of course $\bar{x}_{\cdot,j}$ is the sample mean of the $j$th treatment.
We seek to test for a difference in gardens. Our null hypothesis is, then, that there is no difference between gardens vs. an alternative that there is at least one difference between gardens. In notation:
$$H_0: \mu_1 = \mu_2 = \mu_3$$
$$H_a: \text{at least two means differ.}$$
Now we need, of course, a test statistic, that is, a function of the observed values that we will link to some probability distribution. Let's introduce some more notation.
$X_{i,j}$ = the random variable that denotes the $i$th observation on the $j$th treatment. What is $X_{2,2}$?
$x_{i,j}$ = the observed value of $X_{i,j}$ when the experiment is performed or the data are recorded.
The mean across treatments for each observation index $i$ is calculated as
$$\bar{x}_{i,\cdot} = \frac{1}{k}\sum_{j=1}^{k} x_{i,j}$$
The individual sample means, that is, the means within treatments, are calculated as
$$\bar{x}_{\cdot,j} = \frac{1}{n_j}\sum_{i=1}^{n_j} x_{i,j}$$
Let's call the total number of observations
$$n = \sum_{j=1}^{k} n_j$$
and then the grand mean is:
$$\bar{x}_{\cdot,\cdot} = \frac{1}{n}\sum_{j=1}^{k}\sum_{i=1}^{n_j} x_{i,j}$$
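As a minimal sketch of the quantities just defined, the R lines below compute the treatment means, the total sample size, and the grand mean. The data frame `gardens` and its simulated counts are hypothetical stand-ins (20 weekly counts per garden is an assumption), not the notes' gardens.txt file.

```r
# Hypothetical data: 20 simulated weekly flower counts per garden (an assumption).
set.seed(1)
gardens <- data.frame(A = rpois(20, 30), B = rpois(20, 33), C = rpois(20, 36))

k  <- ncol(gardens)                 # number of treatments
nj <- sapply(gardens, length)       # n_j, observations in each treatment
n  <- sum(nj)                       # total number of observations

treatment_means <- colMeans(gardens)   # sample mean within each treatment
grand_mean <- sum(gardens) / n         # mean of all n observations

treatment_means
grand_mean
```

Because the design here is balanced, the grand mean is also the plain average of the three treatment means.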
The interpretation of the identity SST = SSTr + SSE is an important aid to an understanding of ANOVA. SST is a measure of the total variation in the data: the sum of all squared deviations about the grand mean. The identity says that this total variation can be partitioned into two pieces. SSE measures variation that would be present even if $H_0$ is true, and is thus the part of the total variation that is unexplained by the truth value of $H_0$. SSTr is the amount of variation that can be explained by possible differences in the treatment means. If the explained variation is large relative to the unexplained variation, then $H_0$ is rejected in favor of $H_a$.
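The partition can be checked numerically. The sketch below uses the usual definitions of the three sums of squares (assumed here, since the notes' own formulas for them are not shown in this excerpt) on simulated data; `gardens` and its values are hypothetical.

```r
# Hypothetical data: 20 simulated observations per garden (an assumption).
set.seed(1)
gardens <- data.frame(A = rnorm(20, 30, 4), B = rnorm(20, 33, 4), C = rnorm(20, 36, 4))
x <- as.matrix(gardens)

grand_mean <- mean(x)
col_means  <- colMeans(x)

SST  <- sum((x - grand_mean)^2)                    # total variation about the grand mean
SSTr <- sum(nrow(x) * (col_means - grand_mean)^2)  # variation between treatment means
SSE  <- sum(sweep(x, 2, col_means)^2)              # variation within treatments

all.equal(SST, SSTr + SSE)   # checks the partition numerically
```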
We can divide each of the terms in the identity by its associated degrees of freedom to obtain mean square estimates.
$$\frac{SST}{n-1} = \hat{\sigma}^2, \quad \text{the overall variance.}$$
$$\frac{SSTr}{k-1} = MSTr, \quad \text{the Mean Square for Treatments or Mean Treatment Variance.}$$
$$\frac{SSE}{n-k} = MSE, \quad \text{the Mean Square for Error.}$$
A natural test for the comparison of the variances is the F-test, using the F-distribution. Our observed test statistic, using the data, is
$$f = \frac{MSTr}{MSE}$$
We compare it to the F-distribution with numerator degrees of freedom $\nu_1 = k - 1$ and denominator degrees of freedom $\nu_2 = n - k$, at type I error level $\alpha$.
This is all often summarized in an ANOVA table.
Source of Variation   df     Sum of Squares   Mean Square           f
Treatments            k-1    SSTr             MSTr = SSTr/(k-1)     MSTr/MSE
Error                 n-k    SSE              MSE = SSE/(n-k)
Total                 n-1    SST
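The table above can be filled in by hand and compared with R's built-in table. This is a sketch on simulated data; `gardens`, `anova_table`, and the group means used to simulate are hypothetical.

```r
# Hypothetical data: 20 simulated observations per garden (an assumption).
set.seed(2)
gardens <- data.frame(A = rnorm(20, 30, 4), B = rnorm(20, 33, 4), C = rnorm(20, 36, 4))
x <- as.matrix(gardens)
n <- length(x); k <- ncol(x)

SSTr <- sum(nrow(x) * (colMeans(x) - mean(x))^2)
SSE  <- sum(sweep(x, 2, colMeans(x))^2)
MSTr <- SSTr / (k - 1)
MSE  <- SSE / (n - k)
f    <- MSTr / MSE

anova_table <- data.frame(
  df = c(k - 1, n - k, n - 1),
  SS = c(SSTr, SSE, SSTr + SSE),
  MS = c(MSTr, MSE, NA),
  f  = c(f, NA, NA),
  row.names = c("Treatments", "Error", "Total")
)
anova_table

# Cross-check against R's built-in table; stack() reshapes to long format.
long <- stack(gardens)               # columns: values (response), ind (garden)
builtin <- anova(lm(values ~ ind, data = long))
builtin
```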
Here is the example in R
3 R Example
Our Null Hypothesis:
$$H_0: \mu_1 = \mu_2 = \mu_3$$
Our Alternative Hypothesis:
$$H_a: \text{at least two means differ}$$
Our Test Statistic
$$F = MSTr/MSE$$
is the ratio of the variation across treatments (gardens) to the variation within treatments, i.e. the error (or variance) from week to week within a garden.
In R
data<-read.csv(file="gardens.txt")
names(data)
Did we get the right stuff?
data[1:5,]
colmeans<-apply(data,2,mean) #the apply function applies functions over cols (2) or rows (1)
Let’s take a brief look
boxplot(data, horizontal=T) #very very nice, many R functions produce nice plots when the data frame is name()’d
Now let’s calculate our observed values and test statistic
squareddata<-data^2 #just squares every term
n<-dim(data)[1]*dim(data)[2] #dim() gives the rows and cols of a data frame
SST<-sum(squareddata)-(sum(data)^2)/n #unpack it, it's the same as the formula
SSTR<-sum((apply(data,2,sum)^2)*(rep(1/20,3)))-(sum(data)^2)/n #20 observations in each of the 3 gardens
SSE<-SST-SSTR
MSTr<-SSTR/(3-1)
MSE<-SSE/(60-3)
f<-MSTr/MSE
f
Now for p-values and critical statistic
pf(3.23,2,57,lower.tail=F)
qf(.05,2,57,lower.tail=F)
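The two calls above lead to the same decision, which the following sketch makes explicit. The value 3.23 and the degrees of freedom are taken from the calls in the notes; the level 0.05 and the variable names are assumptions for illustration.

```r
f_obs <- 3.23   # observed statistic used in the notes' pf() call
alpha <- 0.05   # assumed type I error level
df1 <- 2; df2 <- 57

p_value <- pf(f_obs, df1, df2, lower.tail = FALSE)   # P(F >= f_obs) under H0
f_crit  <- qf(alpha, df1, df2, lower.tail = FALSE)   # upper-alpha critical value

reject_by_p    <- p_value < alpha    # decision from the p-value
reject_by_crit <- f_obs > f_crit     # decision from the critical value
c(p_value = p_value, f_crit = f_crit)
```

Comparing the p-value with the level and comparing the statistic with the critical value are two views of the same rule, so the two logical values always agree.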
$$\frac{s_1^2 + s_2^2}{2}\cdot\frac{2}{n} = \frac{s_1^2 + s_2^2}{n} \approx \frac{s_1^2 + s_2^2}{n - k} = \frac{SSE}{n - k} = MSE$$
when $n \gg k$. So look again at the t-statistic.
$$t^2 = \frac{(\bar{x}_1 - \bar{x}_2)^2}{(s_1^2 + s_2^2)/n} \sim F$$
under the assumptions of the null hypothesis for the ANOVA. Why is this important? First, to highlight that an F-test for the ANOVA is a generalization of a t-test for difference of means. Which is nice to know and makes us feel the world has meaning. Second, to point out that a univariate confidence interval (for just 1 pair at a time) should be derived from the familiar t-test.
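The link between the two tests can be seen directly in R: with two groups, the square of the pooled-variance t-statistic equals the one-way ANOVA F-statistic. A sketch on simulated data (the vectors `a` and `b` are hypothetical):

```r
# Hypothetical data: two groups of 20 simulated observations each.
set.seed(3)
a <- rnorm(20, 30, 4)
b <- rnorm(20, 33, 4)

t_obs <- unname(t.test(a, b, var.equal = TRUE)$statistic)   # pooled-variance t

long  <- data.frame(y = c(a, b), g = factor(rep(c("A", "B"), each = 20)))
f_obs <- anova(lm(y ~ g, data = long))[["F value"]][1]      # one-way ANOVA F

all.equal(t_obs^2, f_obs)
```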
Fisher's Least Significant Difference (LSD) method is just that, a generalization of the two-sample pooled-variance t-test. The confidence interval estimator is
$$(\bar{x}_i - \bar{x}_j) \pm LSD$$
$$LSD = t_{n-k,\,\alpha/2}\sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$
We reject the null hypothesis, $H_0: \mu_i = \mu_j$, if $|\bar{x}_i - \bar{x}_j| > LSD$.
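A minimal sketch of the LSD rule follows, using the standard form of the LSD (the t quantile times the square root of MSE(1/n_i + 1/n_j)). The helper `fisher_lsd` and the simulated `gardens` data are hypothetical, not part of the notes.

```r
# Hypothetical helper (not from the notes): LSD for comparing groups i and j.
fisher_lsd <- function(mse, n_i, n_j, df_error, alpha = 0.05) {
  qt(alpha / 2, df_error, lower.tail = FALSE) * sqrt(mse * (1 / n_i + 1 / n_j))
}

# Hypothetical data: 20 simulated observations per garden (an assumption).
set.seed(4)
gardens <- data.frame(A = rnorm(20, 30, 4), B = rnorm(20, 33, 4), C = rnorm(20, 36, 4))
x <- as.matrix(gardens)
df_error <- length(x) - ncol(x)                       # n - k
MSE <- sum(sweep(x, 2, colMeans(x))^2) / df_error

lsd   <- fisher_lsd(MSE, 20, 20, df_error)
diffs <- abs(outer(colMeans(x), colMeans(x), "-"))    # |xbar_i - xbar_j| for every pair
diffs > lsd                                           # TRUE marks a rejected pair
```

In a balanced design a single LSD value serves every pair, which is why one number is compared against the whole matrix of differences.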
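We need to rethink how we set our $\alpha$ level. Remember that $\alpha \equiv P_{H_0}(H_0 \text{ is rejected})$.
So $(1 - \alpha) = P_{H_0}(H_0 \text{ is accepted})$.
So, treating the $c$ tests as independent, $(1 - \alpha)^c = P_{H_{0,1}}(H_{0,1} \text{ is accepted}) \cdots P_{H_{0,c}}(H_{0,c} \text{ is accepted})$.
So let $\alpha_c = 1 - (1 - \alpha)^c = P_{H_0}(H_0 \text{ is rejected at least once})$.
We call $\alpha_c$ the experimentwise type I error, as opposed to the comparisonwise level $\alpha$, and $c$ is the number of possible pairwise comparisons. It is true that $\alpha_c \le c\alpha$. We usually set, then, our $\alpha = \alpha_c/c$. This is the Bonferroni correction for the error.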
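The arithmetic behind the correction is short enough to check directly. The sketch below assumes a per-comparison level of 0.05 and three gardens; the variable names are illustrative only.

```r
alpha   <- 0.05
c_pairs <- choose(3, 2)                       # 3 pairwise comparisons among 3 gardens

alpha_c <- 1 - (1 - alpha)^c_pairs            # chance of at least one false rejection
upper   <- c_pairs * alpha                    # the bound c * alpha

alpha_star   <- alpha / c_pairs               # Bonferroni per-comparison level
alpha_c_bonf <- 1 - (1 - alpha_star)^c_pairs  # overall level after the correction

c(alpha_c = alpha_c, upper = upper, alpha_star = alpha_star, alpha_c_bonf = alpha_c_bonf)
```

With these inputs, running each test at 0.05 gives an overall error of 1 - 0.95^3 = 0.142625, under the bound of 0.15, while testing each pair at 0.05/3 keeps the overall error at or below 0.05.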
We seek to test for a difference in gardens. Our null hypothesis is, then, that there are no differences between gardens vs. an alternative that there is at least one difference between gardens. In notation:
Let $\mu_i$ ≡ the mean for garden $i$. Then $H_{0,ij}: \mu_i = \mu_j$ vs. $H_{a,ij}: \mu_i \neq \mu_j$. There are $c = 3(2)/2 = 3$ of these hypotheses. To test them all simultaneously set $\alpha^* = \alpha/c$.
Let’s read in the data
data<-read.csv(file="gardens.txt")
This is a balanced design, so all of the columns are of equal length.
alphastar<-.05/3;alphastar
dim(data);data[18:20,]
The Fisher LSD is
squareddata<-data^2 #just squares every term
n<-dim(data)[1]*dim(data)[2] #total number of observations
SST<-sum(squareddata)-(sum(data)^2)/n
SSTR<-sum((apply(data,2,sum)^2)*(rep(1/20,3)))-(sum(data)^2)/n
SSE<-SST-SSTR
MSTr<-SSTR/(3-1)
MSE<-SSE/(60-3)
#all of that from before, now the LSD
LSD<-qt(alphastar/2,57,lower.tail=F)*sqrt(MSE*(2/20))
LSD
How do the means look?
means<-colMeans(data) #mean() does not return column means for a data frame in current R
abs(means[1]-means[2]); abs(means[2]-means[3]); abs(means[3]-means[1])
order(means); means[order(means)]
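As a cross-check on the hand computation, base R's `pairwise.t.test` with a pooled standard deviation and Bonferroni adjustment should lead to the same decisions as comparing each difference with the LSD at level 0.05/3. The sketch below runs on simulated data; `gardens` and its values are hypothetical.

```r
# Hypothetical data: 20 simulated observations per garden (an assumption).
set.seed(5)
gardens <- data.frame(A = rnorm(20, 30, 4), B = rnorm(20, 33, 4), C = rnorm(20, 36, 4))
long <- stack(gardens)     # columns: values (response), ind (garden label)

pw <- pairwise.t.test(long$values, long$ind,
                      p.adjust.method = "bonferroni", pool.sd = TRUE)
pw$p.value                 # Bonferroni-adjusted p-values, one per pair

# Same decision by hand: |difference| > LSD at level 0.05 / 3
x <- as.matrix(gardens)
MSE <- sum(sweep(x, 2, colMeans(x))^2) / (60 - 3)
LSD <- qt((0.05 / 3) / 2, 57, lower.tail = FALSE) * sqrt(MSE * (2 / 20))
abs(mean(gardens$A) - mean(gardens$B)) > LSD
```

An adjusted p-value below 0.05 for a pair corresponds to that pair's difference in means exceeding the LSD.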
and the parameters $\beta_1, \ldots, \beta_k$ by
$$\beta_j = \mu_j - \mu$$
That is, let $\mu$ be the mean of the treatment means; then the treatment mean $\mu_j$ can be written as $\mu + \beta_j$, where $\mu$ represents the true average overall response in the experiment. Then $\beta_j$ is the effect, measured as a departure from $\mu$, due to the $j$th treatment.
When we add the constraint:
$$\sum_{j=1}^{k} \beta_j = 0$$
(the average departure from the overall mean response is zero) we have $(\mu, \beta_1, \ldots, \beta_{k-1})$ as independent parameters. Same as before.
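One way to see this parameterization in R is sum-to-zero coding, `contr.sum`, which estimates the overall mean and the first k - 1 effects and leaves the last effect to the constraint. This is a sketch on simulated data, with `gardens` hypothetical; the notes themselves do not say which coding they use.

```r
# Hypothetical data: 20 simulated observations per garden (an assumption).
set.seed(6)
gardens <- data.frame(A = rnorm(20, 30, 4), B = rnorm(20, 33, 4), C = rnorm(20, 36, 4))
long <- stack(gardens)     # columns: values (response), ind (garden label)

# Sum-to-zero coding: the intercept estimates mu, and the next k - 1
# coefficients estimate beta_1, ..., beta_{k-1}.
fit <- lm(values ~ ind, data = long, contrasts = list(ind = "contr.sum"))
mu_hat   <- unname(coef(fit)[1])
beta_hat <- unname(coef(fit)[-1])
beta_hat <- c(beta_hat, -sum(beta_hat))   # beta_k follows from the constraint

mu_hat
beta_hat
sum(beta_hat)
```

Here the estimate of the overall mean is the average of the three garden means, and each estimated effect is that garden's mean minus this average.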
In terms of our redefinition the model is
$$X_{i,j} = \mu + \beta_j + \epsilon_{i,j}$$
and the null hypothesis is
$$H_0: \beta_1 = \beta_2 = \beta_3 = \cdots$$
The probabilistic model illustrated above is as follows. The random variable $X_{i,j}$ is the amount (sales, say) that we observe. The effect under treatment scenario $j$ is $\beta_j$, and $\epsilon_{i,j}$ is the random observational error for treatment $j$ and observation $i$ (sales strategy, say, and week, say, respectively). In general, ANOVA models test for the existence of a discrete treatment effect. Look at this equation again:
$$X_{i,j} = \mu + \beta_j + \epsilon_{i,j}$$
and let’s rewrite it, losing the index over the treatments and replacing the letter X with Y.
In order to "lose" the index over the treatments we'll need to indicate what treatment is being applied on the right hand side of the equation. Let's do that using what is called an indicator variable, and let's call that indicator variable $X_j$ (now that X is free).^1
$$Y_i = \mu + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \epsilon_i$$
(^1) $X_j = 1$ if treatment $j$ is operating, $X_j = 0$ otherwise. For example, $X_1 = 1$ when we use the convenience sales strategy.
Let's make one more adjustment in this equation. Let's call $\mu = \beta_0$ for consistency. Now the full set of parameters is $(\beta_0, \ldots, \beta_k)$. And the full equation, which we will now call a regression equation, is^2
$$Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \epsilon_i$$
Notice that
$$E(Y_i \mid X = x_j) = \beta_0 + \beta_j$$
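The indicator-variable form can be inspected in R with `model.matrix`. Note one difference from the equation above: R's default treatment coding drops the indicator for the first level, so the intercept is that level's mean and the other coefficients are differences from it. The sketch uses simulated, hypothetical `gardens` data.

```r
# Hypothetical data: 20 simulated observations per garden (an assumption).
set.seed(7)
gardens <- data.frame(A = rnorm(20, 30, 4), B = rnorm(20, 33, 4), C = rnorm(20, 36, 4))
long <- stack(gardens)     # columns: values (response), ind (garden label)

head(model.matrix(~ ind, data = long))   # intercept plus 0/1 indicators for B and C

fit <- lm(values ~ ind, data = long)
b <- unname(coef(fit))

b[1]          # estimated mean for the baseline garden A
b[1] + b[2]   # estimated mean for garden B
b[1] + b[3]   # estimated mean for garden C
```

Under this coding the fitted conditional mean for a garden is the intercept plus that garden's coefficient, which mirrors the conditional expectation written above.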
Statisticians are often looking to "fit a model" by estimating population parameters (e.g. $\mu$ or $\beta$), often in the course of estimating quantities (e.g. x, y). Notationally, we want a way to highlight the estimated values, or the functions of data that generate them: estimators. To do this, place hats such as $\hat{y}$ (read: "y-hat") to indicate an estimate of the true observed value $y$; or bars such as $\bar{X}$ (read: "X-bar") to indicate the sample expected value of the random variable $X$; or $\bar{y}$ to indicate the computed sample mean of the observed values $y_1, y_2, \ldots$.
Let's take a simple case where we have only two treatments (i.e. $k = 2$). We rewrite the above equation:
$$Y_i = \beta_0 + \beta_1 X_1 + \epsilon_i$$
Here we would let $X_1 = 1$ if we had treatment 1 (Convenience strategy, say), $X_1 = 0$ otherwise.
Bear with me and say we had some reasonable way of estimating the population mean $\beta_0$ and the difference in treatment effect $\beta_1$. So we have the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$. Then our estimate of the response variable $y_i$ is $\hat{y}_i$, and it should be based on our estimate of the population mean, our estimate of the treatment effect, and the treatment. It is
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_1$$
Notice that there is no $\epsilon$ here. What is our estimate $\hat{\epsilon}$? It is the residual, $\hat{\epsilon}_i = y_i - \hat{y}_i$.
(^2) In general, to regress something on something else, Y on X say, is to predict the mean value of Y for an observed value of X. That is, $E(Y \mid X = x)$ is the regression of Y on X: the conditional expectation of Y for a given $X = x$.
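For the two-treatment case, the fitted values and the usual estimate of the error term (the residual, observed minus fitted) can be computed as follows. This is a sketch on simulated data; `y`, `x1`, and the group means used to simulate are hypothetical.

```r
# Hypothetical data: 20 simulated responses under each of two treatments.
set.seed(8)
y  <- c(rnorm(20, 33, 4), rnorm(20, 30, 4))
x1 <- rep(c(1, 0), each = 20)          # indicator: 1 for treatment 1, 0 otherwise

fit   <- lm(y ~ x1)
b0    <- unname(coef(fit)[1])           # estimate of beta_0
b1    <- unname(coef(fit)[2])           # estimate of beta_1
y_hat <- b0 + b1 * x1                   # fitted values from the regression equation
e_hat <- y - y_hat                      # residuals: observed minus fitted

all.equal(e_hat, unname(residuals(fit)))
```

With this indicator coding, the intercept estimate is the sample mean of the observations with $X_1 = 0$, and the slope estimate is the difference between the two sample means.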
#and now, below look at it
plot(data)
boxplot(data)
#the linear model function lm and aov, analysis of
#variance both take arguments in the form of the models we
#wrote in the notes. Minus the parameter values -- those are understood.
lm1<-lm(y~x,data=data)
aov1<-aov(y~x,data=data)
summary(lm1)
summary(aov1)
aov(lm1)
#notice the similarity of the t-test for the effect
#parameter and the f-test for the aov model.
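The calls above expect a data frame with a response column `y` and a treatment column `x`. Assuming the garden data start out with one column per garden, a sketch of the reshaping step and of the agreement between the two fits might look like this; `gardens`, `long`, and the simulated values are hypothetical.

```r
# Hypothetical data: 20 simulated observations per garden (an assumption).
set.seed(9)
gardens <- data.frame(A = rnorm(20, 30, 4), B = rnorm(20, 33, 4), C = rnorm(20, 36, 4))

# Reshape to one row per observation; the names y and x match the formula y ~ x.
long <- stack(gardens)
names(long) <- c("y", "x")

lm1  <- lm(y ~ x, data = long)
aov1 <- aov(y ~ x, data = long)

f_lm  <- unname(summary(lm1)$fstatistic["value"])   # overall F from the regression
f_aov <- summary(aov1)[[1]][["F value"]][1]         # F from the ANOVA table
all.equal(f_lm, f_aov)
```

The overall F-statistic reported by the regression summary and the F in the ANOVA table come from the same fitted model, so they coincide.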
6 Exercises