Understanding the Analysis of Variance (ANOVA) Model - Prof. Kobi Abayomi, Exams of Data Analysis & Statistical Methods

Georgia Institute of Technology - Main Campus Data Analysis & Statistical Methods

Prof. Kobi Abayomi

The ANOVA model is a statistical technique for analyzing quantitative data from experimental units. It compares the means of two or more populations or treatments. These notes cover the concept of ANOVA, including the notation, the calculation of the test statistic, and the interpretation of the results, and include an R example.

Typology: Exams

Pre 2010

Uploaded on 08/04/2009

koofers-user-7q0-2 🇺🇸


ISYE 2028 A and B

Spring 2009

Lecture 16

Dr. Kobi Abayomi

April 13, 2009


1 Introduction - The "simplest" model - The ANOVA model

In studying methods for the analysis of quantitative data, we first focused on problems involving a single sample of numbers and then turned to a comparative analysis of two different such samples. In one-sample problems, the data consisted of observations of individuals randomly selected from a single population.

In two-sample problems, either the two samples were drawn from two different populations, or else two different treatments were applied to elements selected from a single population.

The analysis of variance or ANOVA refers to a collection of procedures for the analysis of responses from experimental units. The simplest ANOVA problem is referred to as a single-factor or one-way ANOVA and involves analysis either of data sampled from two or more populations or of data in which two or more treatments have been used. As such, the ANOVA setup is a generalization of the two-sample t-test.

The characteristic that differentiates the treatments or populations from one another is called the factor and the different treatments are referred to as the levels of the factor. Let’s begin with an example...

2 Single Factor or One Way ANOVA

2.1 Setup and Notation

Briefly, say a farmer wants to investigate if flower production differs across gardens. There are three gardens: A, B, and C. We are given 10 weeks of data; the number of flowers grown per week per garden. As always we introduce some notation. Let:

$j$ ≡ An index for the treatments or populations being compared. Here $j = 1$ is garden A, $j = 2$ is garden B, and $j = 3$ is garden C.

$k$ ≡ The number of treatments in total. Here $k = 3$.

$i$ ≡ An index for the observations within a treatment.

$n_j$ ≡ The number of observations in treatment $j$.

$\mu_j$ ≡ The mean of population or treatment $j$.

Of course $\bar{x}_j$ is the sample mean of the jth treatment, here the jth garden.

We seek to test for a difference in gardens. Our null hypothesis is, then, that there is no difference in gardens vs. an alternative that there is at least one difference between gardens. In notation...:

\[ H_0: \mu_1 = \mu_2 = \mu_3 \]

\[ H_a: \text{at least two means differ.} \]

Now we need, of course, a test statistic - or a function of the observed values that we will link to some probability distribution. Let’s introduce some more notation...

$X_{i,j}$ = the random variable that denotes the ith observation on the jth treatment. What is $X_{2,2}$?

$x_{i,j}$ = the observed value of $X_{i,j}$ when the experiment is performed or the data is recorded.

The means across treatments for each observation $i$ are calculated as

\[ \bar{x}_{i,\cdot} = \frac{\sum_{j=1}^{k} x_{i,j}}{k} \]

The individual sample means, that is, the means within treatments, are calculated as

\[ \bar{x}_{\cdot,j} = \frac{\sum_{i=1}^{n_j} x_{i,j}}{n_j} \]

Let's call the total number of observations

\[ n = \sum_{j=1}^{k} n_j \]

and then the grand mean is:

\[ \bar{x}_{\cdot,\cdot} = \frac{\sum_{j=1}^{k} \sum_{i=1}^{n_j} x_{i,j}}{n} \]
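As a minimal sketch of this notation, the means can be computed in R as follows. The data frame `gardens` is simulated here purely for illustration (it is not the lecture's gardens.txt), and all variable names are assumptions.

```r
# Hypothetical data: 20 weekly counts for each of k = 3 gardens (simulated)
set.seed(1)
gardens <- data.frame(A = rpois(20, 10), B = rpois(20, 12), C = rpois(20, 11))

k   <- ncol(gardens)                 # number of treatments
n_j <- colSums(!is.na(gardens))      # observations per treatment
n   <- sum(n_j)                      # total number of observations

row_means  <- rowMeans(gardens)      # mean across treatments for each observation i
col_means  <- colMeans(gardens)      # mean within each treatment j
grand_mean <- sum(gardens) / n       # grand mean over all observations

col_means
grand_mean
```

Because this simulated design is balanced, the grand mean coincides with the plain average of the three treatment means.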

The interpretation of the identity SST = SSTr + SSE is an important aid to an understanding of ANOVA. SST is a measure of the total variation in the data: the sum of all squared deviations about the grand mean. The identity says that this total variation can be partitioned into two pieces. SSE measures variation that would be present even if $H_0$ is true, and is thus the part of the total variation that is unexplained by the truth or falsity of $H_0$. SSTr is the amount of variation that can be explained by possible differences in the treatment means. If explained variation is large relative to unexplained variation, then $H_0$ is rejected in favor of $H_a$.
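The partition can be checked numerically. The sketch below uses the usual one-way ANOVA definitions of the three sums of squares (squared deviations about the grand mean, between treatment means, and within treatments); these definitional forms are an assumption here, since this excerpt only shows the computational shortcuts in the R example. The data are simulated.

```r
# Hypothetical balanced data: 20 observations in each of 3 gardens (simulated)
set.seed(1)
gardens <- data.frame(A = rpois(20, 10), B = rpois(20, 12), C = rpois(20, 11))
x <- as.matrix(gardens)

n_j        <- nrow(x)                # balanced design: same count in every column
grand_mean <- mean(x)
col_means  <- colMeans(x)

SST  <- sum((x - grand_mean)^2)                  # total variation about the grand mean
SSTr <- sum(n_j * (col_means - grand_mean)^2)    # variation between treatment means
SSE  <- sum(sweep(x, 2, col_means)^2)            # variation within treatments

c(SST = SST, SSTr = SSTr, SSE = SSE)
all.equal(SST, SSTr + SSE)                       # checks the partition numerically
```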

We can divide each of the terms in the identity by its associated degrees of freedom to obtain mean square estimates.

\[ \frac{SST}{n-1} = \hat{\sigma}^2, \quad \text{Overall variance.} \]

\[ \frac{SSTr}{k-1} = MSTr, \quad \text{Mean Square for Treatments or Mean Treatment Variance.} \]

\[ \frac{SSE}{n-k} = MSE, \quad \text{Mean Square for Error.} \]

A natural test for comparing these variances is the F-test, using the F-distribution. Our observed test statistic, using the data, is

\[ f = \frac{MSTr}{MSE} \]

We compare it to the F-distribution having numerator degrees of freedom $\nu_1 = k - 1$ and denominator degrees of freedom $\nu_2 = n - k$, at type I error level $\alpha$.

This is all often summarized in an ANOVA table.

Source of Variation   df     Sum of Squares   Mean Square          f
Treatments            k-1    SSTr             MSTr = SSTr/(k-1)    MSTr/MSE
Error                 n-k    SSE              MSE = SSE/(n-k)
Total                 n-1    SST

Here is the example in R

3 R Example

Our Null Hypothesis:

\[ H_0: \mu_1 = \mu_2 = \mu_3 \]

Our Alternative Hypothesis:

\[ H_a: \text{at least two means differ} \]

Our Test Statistic

\[ F = MSTr/MSE \]

is the ratio of the variation across treatments (gardens) to the variation within treatments, i.e. the error (or variance) from week to week within a garden.

In R

data<-read.csv(file="gardens.txt")

names(data)

Did we get the right stuff?

data[1:5,]

colmeans<-apply(data,2,mean) #the apply function applies functions over cols, 2, or rows, 1

Let’s take a brief look

boxplot(data, horizontal=T) #very very nice, many R functions produce nice plots when the data frame is name()’d

Now let’s calculate our observed values and test statistic

squareddata<-data^2 #just squares every term
SST<-(sum(squareddata))-(sum(data)^2)/(dim(data)[1]*dim(data)[2]) #unpack it, it's the same as the formula
SSTR<-sum((apply(data,2,sum)^2)*(rep(1/20,3))) - (sum(data)^2)/(dim(data)[1]*dim(data)[2]) #dim() gives the rows and cols of a data frame
SSE<-SST-SSTR
MSTr<-SSTR/(3-1)
MSE<-SSE/(60-3)
f<-MSTr/MSE
f

Now for the p-value and the critical value

pf(3.23,2,57,lower.tail=F)

qf(.05,2,57,lower.tail=F)
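The two calls above give two equivalent ways of reaching a decision. As a sketch, taking the value 3.23 from the `pf()` call as the observed statistic and α = .05 from the `qf()` call (the variable names are assumptions):

```r
f_obs <- 3.23          # observed statistic, as passed to pf() in the lecture
k <- 3; n <- 60        # three gardens, 60 observations in total
alpha <- 0.05

p_value <- pf(f_obs, k - 1, n - k, lower.tail = FALSE)   # P(F >= f_obs) under Ho
f_crit  <- qf(alpha, k - 1, n - k, lower.tail = FALSE)   # upper-alpha critical value

reject <- f_obs > f_crit             # critical-value rule
c(p_value = p_value, f_crit = f_crit)
reject
```

Rejecting when `f_obs > f_crit` is the same decision as rejecting when `p_value < alpha`.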

\[ \cdots = \frac{s_1^2 + s_2^2}{n} \approx \frac{s_1^2 + s_2^2}{n - k} = \frac{SSE}{n - k} = MSE \]

when $n \gg k$. So look again at the t-statistic.

\[ t^2 = \frac{(\bar{x}_1 - \bar{x}_2)^2}{(s_1^2 + s_2^2)/n} \sim F \]

under the assumptions of the null hypothesis for the ANOVA. Why is this important? First, to highlight that an F-test for the ANOVA is a generalization of a t-test for difference of means. Which is nice to know and makes us feel the world has meaning. Second, to point out that a univariate confidence interval (for just 1 pair at a time) should be derived from the familiar t-test.
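The link between the two tests can be seen numerically. In this sketch two simulated samples (hypothetical data and names) are compared with a pooled-variance t-test and with a one-way ANOVA; with only two treatments the squared t statistic and the F statistic agree.

```r
set.seed(2)
a <- rnorm(20, mean = 10, sd = 2)    # hypothetical treatment 1
b <- rnorm(20, mean = 11, sd = 2)    # hypothetical treatment 2

# Pooled-variance two-sample t statistic
t_stat <- t.test(a, b, var.equal = TRUE)$statistic

# One-way ANOVA on the same data, stacked into response + factor
long   <- data.frame(y = c(a, b), g = factor(rep(c("a", "b"), each = 20)))
f_stat <- summary(aov(y ~ g, data = long))[[1]][["F value"]][1]

c(t_squared = unname(t_stat)^2, f = f_stat)
```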

Fisher’s Least Significant Difference Method is just that, a generalization of the two sample pooled variance t-test. The confidence interval estimator is

\[ (\bar{x}_i - \bar{x}_j) \pm LSD \]

\[ LSD = t_{n-k,\,\alpha/2} \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)} \]

We reject the null hypothesis, $H_0: \mu_i = \mu_j$, if $|\bar{x}_i - \bar{x}_j| > LSD$.

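We need to rethink how we set our $\alpha$ level. Remember that $\alpha \equiv P_{H_0}(H_0 \text{ is rejected})$.

So $(1 - \alpha) = P_{H_0}(H_0 \text{ is accepted})$.

So, treating the $c$ comparisons as independent, $(1 - \alpha)^c = P_{H_{0,1}}(H_{0,1} \text{ is accepted}) \cdots P_{H_{0,c}}(H_{0,c} \text{ is accepted})$.

So let $\alpha_c = 1 - (1 - \alpha)^c = P_{H_0}(H_0 \text{ is rejected at least once})$.

We call $\alpha_c$ the experimentwise type I error, and $c$ is the number of possible pairwise comparisons. It is true that $\alpha_c \le c\alpha$. We usually set, then, our per-comparison level to $\alpha = \alpha_c / c$. This is the Bonferroni Correction for the error.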
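A small numerical sketch of these quantities for three treatments, with a per-comparison level of .05 chosen for illustration (the variable names are assumptions):

```r
alpha   <- 0.05
k       <- 3
c_pairs <- choose(k, 2)                  # number of pairwise comparisons, k(k-1)/2

alpha_c    <- 1 - (1 - alpha)^c_pairs    # P(at least one rejection), tests treated as independent
bound      <- c_pairs * alpha            # the upper bound c * alpha
alpha_star <- alpha / c_pairs            # per-comparison level after the Bonferroni correction

c(alpha_c = alpha_c, bound = bound, alpha_star = alpha_star)

# With the corrected level, the overall error stays at or below the target alpha
1 - (1 - alpha_star)^c_pairs
```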

4.2 R example

We seek to test for differences between gardens. Our null hypothesis is, then, that there are no differences between gardens vs. an alternative that there is at least one difference between gardens. In notation...

Let $\mu_i$ ≡ The mean for garden $i$. Then $H_{0,ij}: \mu_i = \mu_j$ vs. $H_{a,ij}: \mu_i \neq \mu_j$. There are $c = \frac{3(2)}{2} = 3$ of these hypotheses. To test them all simultaneously set $\alpha^* = \alpha/c$.

Let’s read in the data

data<-read.csv(file="gardens.txt")

This is a balanced design, so each of the columns is of equal length.

alphastar<-.05/3;alphastar

dim(data);data[18:20,]

The Fisher LSD is

squareddata<-data^2 #just squares every term

SST<-(sum(squareddata))-(sum(data)^2)/(dim(data)[1]*dim(data)[2])

SSTR<-sum((apply(data,2,sum)^2)*(rep(1/20,3))) - (sum(data)^2)/(dim(data)[1]*dim(data)[2]) #same correction term as in SST

SSE<-SST-SSTR
MSTr<-SSTR/(3-1)
MSE<-SSE/(60-3)

#all of that from before, now the LSD

LSD<-qt(alphastar/2,57,lower.tail=F)*sqrt(MSE*(2/20))

LSD

How do the means look?

means<-colMeans(data) #mean() no longer works on a whole data frame in current versions of R; colMeans() gives the per-garden means
abs(means[1]-means[2]); abs(means[2]-means[3]); abs(means[3]-means[1])

order(means); means[order(means)]
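As a cross-check on the hand calculation, base R's `pairwise.t.test()` runs the same pooled-variance comparisons and can apply the Bonferroni adjustment to the p-values instead of to α. The sketch below uses simulated data (not gardens.txt); the object names are assumptions.

```r
# Hypothetical balanced data: 20 observations in each of 3 gardens (simulated)
set.seed(1)
gardens <- data.frame(A = rpois(20, 10), B = rpois(20, 12), C = rpois(20, 11))
long <- stack(gardens)               # columns: values (response), ind (garden label)

# Pairwise t-tests with a pooled SD and Bonferroni-adjusted p-values
pw <- pairwise.t.test(long$values, long$ind,
                      p.adjust.method = "bonferroni", pool.sd = TRUE)
pw$p.value                           # lower-triangular matrix, one entry per pair
```

Comparing an adjusted p-value with .05 leads to the same decision as comparing the corresponding absolute difference in means with the LSD computed at α* = .05/3.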

and the parameters $\beta_1, \ldots, \beta_k$ by

\[ \beta_j = \mu_j - \mu \]

That is, let μ be the mean of the treatment means and then the treatment mean μj can be written as μ + βj where μ represents the true average overall response in the experiment. Then βj is the effect, measured as a departure from μ, due to the jth treatment.

When we add the constraint:

\[ \sum_{j=1}^{k} \beta_j = 0 \]

(the average departure from the overall mean response is zero) we have the independent parameters $(\mu, \beta_1, \ldots, \beta_{k-1})$. Same as before.
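A short sketch of this reparametrization with sample quantities standing in for the population ones. The data are simulated and the names are assumptions.

```r
# Hypothetical balanced data: 20 observations in each of 3 gardens (simulated)
set.seed(1)
gardens <- data.frame(A = rpois(20, 10), B = rpois(20, 12), C = rpois(20, 11))

mu_j_hat <- colMeans(gardens)        # estimates of the treatment means
mu_hat   <- mean(mu_j_hat)           # mean of the treatment means
beta_hat <- mu_j_hat - mu_hat        # estimated effects: departures from the overall mean

beta_hat
sum(beta_hat)                        # zero up to floating-point rounding
```

Because the effects sum to zero, any one of them is determined by the other two, which is why only k - 1 of them count as independent parameters.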

In terms of our redefinition the model is

\[ X_{i,j} = \mu + \beta_j + \epsilon_{i,j} \]

and the null hypothesis is

\[ H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 \]

5.2 Extension to Linear Regression

The probabilistic model illustrated above is: the random quantity (variable $X_{i,j}$) is the amount (sales, say) we observe in general. The effect under treatment scenario $j$ (a sales strategy, say) is $\beta_j$, and the observational, random error (from week to week, say) is $\epsilon_{i,j}$. In general, ANOVA models test the existence of a discrete treatment effect. Look at this equation again:

\[ X_{i,j} = \mu + \beta_j + \epsilon_{i,j} \]

and let’s rewrite it, losing the index over the treatments and replacing the letter X with Y.

In order to "lose" the index over the treatments we'll need to indicate what treatment is being applied on the right hand side of the equation. Let's do that using what is called an indicator variable, and let's call that indicator variable $X_j$ (now that X is free). [1]

[1] $X_j = 1$ if treatment $j$ is operating, $X_j = 0$ otherwise. For example $X_1 = 1$ when we use the convenience sales strategy.

\[ Y_i = \mu + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \epsilon_i \]

Let's make one more adjustment in this equation. Let's call $\mu = \beta_0$ for consistency. Now the full set of parameters is $(\beta_0, \ldots, \beta_k)$. And the full equation, which we will now call a regression equation, is [2]

\[ Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \epsilon_i \]

Notice that

\[ E(Y_i \mid X = x_j) = \beta_0 + \beta_j \]
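The indicator-variable form can be tried directly with `lm()`. One caveat, stated as an assumption of this sketch: an intercept together with all k indicators cannot be estimated uniquely without a constraint, so the sketch drops the indicator for garden A and treats A as the baseline. Under that coding β0 is the mean of garden A and each βj is a difference from A, which is a different but equivalent parametrization to the sum-to-zero one. The data are simulated and the names are hypothetical.

```r
# Hypothetical balanced data: 20 observations in each of 3 gardens (simulated)
set.seed(1)
gardens <- data.frame(A = rpois(20, 10), B = rpois(20, 12), C = rpois(20, 11))
long <- stack(gardens)               # values = response Y, ind = garden label

# Indicator variables: 1 when the treatment is operating, 0 otherwise
long$X_B <- as.integer(long$ind == "B")
long$X_C <- as.integer(long$ind == "C")

fit <- lm(values ~ X_B + X_C, data = long)
b   <- coef(fit)

# Conditional means implied by the fitted regression equation
fitted_means <- c(A = b[["(Intercept)"]],
                  B = b[["(Intercept)"]] + b[["X_B"]],
                  C = b[["(Intercept)"]] + b[["X_C"]])
fitted_means
colMeans(gardens)                    # the sample treatment means, for comparison
```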

5.3 Aside: Hats and Bars

Statisticians are often looking to "fit a model" by estimating population parameters (i.e. $\mu$ or $\beta$, etc.), often in the course of estimating quantities (i.e. $x$, $y$). Notationally, we want a way to highlight the estimated values, or the functions of data that generate them: estimators. To do this: place hats such as $\hat{y}$ (read: "y-hat") to indicate an estimate of the true observed value $y$; or bars such as $\bar{X}$ (read: "X-bar") to indicate the sample expected value of the random variable $X$; or $\bar{y}$ to indicate the computed sample mean of the observed values $y_1, y_2$.

5.4 Bear with the F-test again

Let’s take a simple case where we have only two treatments (i.e. k=2). We rewrite the above equation:

\[ Y_i = \beta_0 + \beta_1 X_1 + \epsilon_i \]

Here we would let $X_1 = 1$ if we had treatment 1 (Convenience strategy, say), $X_1 = 0$ otherwise.

Bear with me and say we had some reasonable way of estimating the population mean $\beta_0$ and the difference in treatment effect $\beta_1$... So we have the estimates $\hat{\beta}_0$, $\hat{\beta}_1$. Then our estimate of the response variable $y_i$ is $\hat{y}_i$ and should be based on: our estimate of the population mean, our estimate of the treatment effect, and the treatment, and it is...

\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_1 \]

Notice that there is no $\epsilon$ here. What is our estimate $\hat{\epsilon}$? It is ...

[2] In general, to regress something on something else, Y on X say, is to predict the mean value of Y for an observed value of X. That is, $E(Y \mid X = x)$ is the regression of Y on X, the conditional expectation of Y for a given X = x.
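A sketch of the two-treatment case with simulated data. The text above breaks off before giving the error estimate, so the sketch uses the residual (observed minus fitted value), which is the usual choice and is an assumption here. All names and numbers are hypothetical.

```r
set.seed(3)
x1 <- rep(c(0, 1), each = 20)                # indicator: 1 for treatment 1, 0 otherwise
y  <- 10 + 1.5 * x1 + rnorm(40, sd = 2)      # hypothetical responses

fit <- lm(y ~ x1)
b0  <- coef(fit)[["(Intercept)"]]            # estimated mean response when the indicator is 0
b1  <- coef(fit)[["x1"]]                     # estimated difference between the two treatments

y_hat <- b0 + b1 * x1                        # fitted values: no error term appears
e_hat <- y - y_hat                           # residuals: observed minus fitted

c(b0 = b0, b1 = b1)
sum(e_hat)                                   # zero up to floating-point rounding
```

Here `b0` equals the sample mean of the observations with `x1 == 0`, and `b1` equals the difference between the two sample means.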

#and now, below look at it

plot(data)

boxplot(data)

#the linear model function lm and aov, analysis of variance, both take
#arguments in the form of the models we specified earlier in the notes.
#Minus the parameter values -- those are understood.

lm1<-lm(y~x,data=data)

aov1<-aov(y~x,data=data)

summary(lm1)

summary(aov1)

aov(lm1)

#notice the similarity of the t-test for the effect parameter and the f-test for the aov model.
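The calls above expect a data frame with a response column `y` and a treatment column `x`. This excerpt does not show how the lecture arranged the data that way, so the sketch below assumes a wide data frame (one column per garden, simulated here) and stacks it with `stack()` before fitting.

```r
# Hypothetical wide data: one column per garden (simulated)
set.seed(1)
gardens <- data.frame(A = rpois(20, 10), B = rpois(20, 12), C = rpois(20, 11))

long <- stack(gardens)               # one row per observation
names(long) <- c("y", "x")           # y = response, x = garden factor, as in y ~ x

lm1  <- lm(y ~ x, data = long)
aov1 <- aov(y ~ x, data = long)

f_lm  <- summary(lm1)$fstatistic[["value"]]      # overall F statistic from the regression
f_aov <- summary(aov1)[[1]][["F value"]][1]      # F statistic from the ANOVA table
c(f_lm = f_lm, f_aov = f_aov)
```

The two F statistics coincide, since both functions fit the same linear model.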

6 Exercises

13.1, 13.2 page 521
13.3, 13.6, 13.8 page 520-
13.11 - use Fisher’s LSD Method
