Chi-Square and ANOVA Tests: Hypothesis Analysis in Statistics, Lecture notes of Economics

An overview of Chi-Square and Analysis of Variance (ANOVA) tests, including calculations, examples, and instructions on how to perform these tests using technology. Chi-Square tests are used to determine if observed frequencies fit the expected frequencies, while ANOVA tests compare the means of multiple groups.

Typology: Lecture notes

2021/2022

Uploaded on 09/12/2022

bartolix
Chapter 11: Chi-Square Tests and ANOVA
This chapter presents material on three more hypothesis tests. The first is used to determine
whether there is a significant relationship between two qualitative variables, the second is used
to determine whether the sample data follow a particular distribution, and the last is used to
determine whether there are significant differences among the means of three or more samples.
Section 11.1: Chi-Square Test for Independence
Remember, qualitative data is where you collect data on individuals that are categories or
names. Then you would count how many of the individuals had particular qualities. An
example is that there is a theory that there is a relationship between breastfeeding and
autism. To determine if there is a relationship, researchers could collect the time period
that a mother breastfed her child and if that child was diagnosed with autism. Then you
would have a table containing this information. Now you want to know if each cell is
independent of each other cell. Remember, independence says that one event does not
affect another event. Here it means that having autism is independent of being breastfed.
What you really want is to see if they are not independent. In other words, does one
affect the other? If you were to do a hypothesis test, this is your alternative hypothesis
and the null hypothesis is that they are independent. There is a hypothesis test for this
and it is called the Chi-Square Test for Independence. Technically it should be called
the Chi-Square Test for Dependence, but for historical reasons it is known as the test for
independence. Just as with previous hypothesis tests, all the steps are the same except for
the assumptions and the test statistic.
Hypothesis Test for Chi-Square Test
1. State the null and alternative hypotheses and the level of significance
Ho: the two variables are independent (this means that one variable is not affected by
the other)
HA: the two variables are dependent (this means that one variable is affected by the
other)
Also, state your α level here.
2. State and check the assumptions for the hypothesis test
a. A random sample is taken.
b. Expected frequencies for each cell are greater than or equal to 5 (the expected
frequencies, E, will be calculated later, and this assumption means E ≥ 5).
3. Find the test statistic and p-value
Finding the test statistic involves several steps. First the data is collected and
counted, and then it is organized into a table (each entry in a table is called a cell).
These values are known as the observed frequencies, and the symbol for an
observed frequency is O. Each table is made up of rows and columns. Each
row is totaled to give a row total and each column is totaled to give a column
total.

The null hypothesis is that the variables are independent. Using the multiplication rule for independent events, you can calculate the probability of an individual having one value of the first variable, A, and one value of the second variable, B (the probability of a particular cell, P(A and B)). Remember that in a hypothesis test you assume that Ho is true, so the two variables are assumed to be independent.

$$P(A \text{ and } B) = P(A) \cdot P(B) \quad \text{if } A \text{ and } B \text{ are independent}$$

$$= \frac{\text{number of ways } A \text{ can happen}}{\text{total number of individuals}} \cdot \frac{\text{number of ways } B \text{ can happen}}{\text{total number of individuals}}$$

$$= \frac{\text{row total}}{n} \cdot \frac{\text{column total}}{n}$$

Now you want to find out how many individuals you expect to be in a certain cell. To find the expected frequencies, multiply the probability of that cell by the total number of individuals. Do not round the expected frequencies.

$$\text{Expected frequency (cell } A \text{ and } B) = E(A \text{ and } B) = n \left( \frac{\text{row total}}{n} \cdot \frac{\text{column total}}{n} \right) = \frac{\text{row total} \cdot \text{column total}}{n}$$

If the variables are independent, the expected frequencies and the observed frequencies should be close to each other. The test statistic involves the difference between the observed frequency and the expected frequency for each cell. You then want the “total difference” over all of the cells. The larger this total, the smaller the chance of obtaining such a test statistic if the assumption of independence is true, so a large total is evidence that the variables are not independent. How do you find the test statistic? First find the differences between the observed and expected frequencies. Because some of these differences will be positive and some will be negative, you need to square them. These squares could be large just because the frequencies are large, so you divide each one by its expected frequency to scale it. Finally, add up all of these scaled values. This is the test statistic.

Test Statistic:

The symbol for Chi-Square is χ².

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed frequency and E is the expected frequency
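
The calculation described above can be sketched in R. This is only an illustration under stated assumptions: the 2 × 2 table of counts is hypothetical (made up for this sketch), and the names observed, expected, and chi_sq are chosen here and do not come from the notes.

```r
# Hypothetical 2 x 2 table of observed frequencies (made-up counts for illustration)
observed <- rbind(c(30, 20),
                  c(10, 40))

n <- sum(observed)  # total number of individuals

# Expected frequency of each cell: (row total * column total) / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Test statistic: sum over all cells of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)

expected
chi_sq
```

Here outer() builds every row total times column total product at once, which mirrors the cell-by-cell formula for E.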

Table #11.1.1: Autism Versus Breastfeeding

                Breast Feeding Timelines
Autism          None    Less than   2 to 6    More than   Row Total
                        2 months    months    6 months
Yes             241     198         164       215         818
No              20      25          27        44          116
Column Total    261     223         191       259         934

Solution:

  1. State the null and alternative hypotheses and the level of significance

Ho: Breastfeeding and autism are independent

HA: Breastfeeding and autism are dependent

α = 0.01

  2. State and check the assumptions for the hypothesis test

a. A random sample of breastfeeding time frames and autism incidence was taken.

b. Expected frequencies for each cell are greater than or equal to 5 (i.e., E ≥ 5). See step 3. All expected frequencies are more than 5.

  3. Find the test statistic and p-value

Test statistic:

First find the expected frequencies for each cell.

$$E(\text{Autism and no breastfeeding}) = \frac{818 \cdot 261}{934} \approx 228.585$$

$$E(\text{Autism and} < 2 \text{ months}) = \frac{818 \cdot 223}{934} \approx 195.304$$

$$E(\text{Autism and 2 to 6 months}) = \frac{818 \cdot 191}{934} \approx 167.278$$

$$E(\text{Autism and more than 6 months}) = \frac{818 \cdot 259}{934} \approx 226.833$$

Others are done similarly. It is easier to do the calculations for the test statistic with a table, so the others are in table #11.1.2 along with the calculation for the test statistic. (Note: the column of O − E should add to 0 or close to 0.)

Table #11.1.2: Calculations for Chi-Square Test Statistic

O    E    O − E    (O − E)²    (O − E)²/E

Total 0.000 1

11.2166432 = χ²

The test statistic formula is $\chi^2 = \sum \frac{(O - E)^2}{E}$, which is the total of the last column in table #11.1.2.

p-value:

df = (2 − 1) * (4 − 1) = 3

Using TI-83/84: χ²cdf(11.2166432, 1E99, 3) ≈ 0.0106

Using R: 1 − pchisq(11.2166432, 3) ≈ 0.0106

  4. Conclusion

Fail to reject Ho since the p-value is more than 0.01.

  5. Interpretation

There is not enough evidence to show that breastfeeding and autism are dependent. This means that you cannot say that whether a child is breastfed or not will indicate whether the child will be diagnosed with autism.
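
The hand calculation in this example can be checked with a short R sketch. It assumes the counts from table #11.1.1; the object names (observed, expected, chi_sq, df, p_value) are illustrative and not from the notes.

```r
# Observed counts from table #11.1.1 (rows: autism yes / no)
observed <- rbind(yes = c(241, 198, 164, 215),
                  no  = c(20, 25, 27, 44))

# Expected frequencies: (row total * column total) / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Assumption check: every expected frequency should be at least 5
all(expected >= 5)

# Test statistic and degrees of freedom
chi_sq <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Right-tail p-value; the same quantity as 1 - pchisq(chi_sq, df)
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

chi_sq
df
p_value
```

Comparing p_value with α = 0.01 then gives the same conclusion as step 4.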

Example #11.1.2: Hypothesis Test with Chi-Square Test Using Technology

Is there a relationship between autism and breastfeeding? To determine if there is, a researcher asked mothers of autistic and non-autistic children to say what time period they breastfed their children. The data is in table #11.1.1 (Schultz, Klonoff-Cohen, Wingard, Akshoomoff, Macera, Ji & Bacher, 2006). Do the data provide enough evidence to show that breastfeeding and autism are dependent? Test at the 1% level.

Solution:

  1. State the null and alternative hypotheses and the level of significance

Ho: Breastfeeding and autism are independent

HA: Breastfeeding and autism are dependent

α = 0.01

Now type the table in by pressing ENTER after each cell value. Figure #11.1.4 contains the complete table typed in. Once you have the data in, press QUIT.

Figure #11.1.4: Data Typed into Matrix

To run the test on the calculator, go into STAT, then move over to TEST and choose χ²-Test from the list. The setup for the test is in figure #11.1.5.

Figure #11.1.5: Setup for Chi-Square Test on TI-83/84

Once you press ENTER on Calculate you will see the results in figure #11.1.6.

Figure #11.1.6: Results for Chi-Square Test on TI-83/84

The test statistic is χ² ≈ 11.2167 and the p-value is p ≈ 0.01061. Notice that the calculator calculates the expected values for you and places them in matrix B. To review the expected values, go into MATRX and choose 2:[B]. Figure #11.1.7 shows the output. Press the right arrows to see the entire matrix.

Figure #11.1.7: Expected Frequency for Chi-Square Test on TI-83/84

To compute the test statistic and p-value with R, enter the following commands:

row1 = c(data from row 1 separated by commas)

row2 = c(data from row 2 separated by commas)

Keep going until you have all of your rows typed in.

data.table = rbind(row1, row2, …) – makes the data into a table. You can call it whatever you want. It does not have to be data.table.

data.table – use if you want to look at the table

chisq.test(data.table) – calculates the chi-squared test for independence

chisq.test(data.table)$expected – lets you see the expected values

For this example, the commands would be

row1 = c(241, 198, 164, 215)

row2 = c(20, 25, 27, 44)

data.table = rbind(row1, row2)

data.table

Output:

[,1] [,2] [,3] [,4]

row1 241 198 164 215

row2 20 25 27 44

chisq.test(data.table)
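
As a sketch of how the individual pieces of the result can be pulled out, the value returned by chisq.test can be stored and its components read one at a time. The name result is introduced here only for illustration; statistic, parameter, p.value, and expected are components of the object that chisq.test returns.

```r
# Same commands as above, with the test result stored in a variable
row1 <- c(241, 198, 164, 215)
row2 <- c(20, 25, 27, 44)
data.table <- rbind(row1, row2)

result <- chisq.test(data.table)

result$statistic  # chi-squared test statistic
result$parameter  # degrees of freedom
result$p.value    # p-value of the test
result$expected   # matrix of expected frequencies
```

Storing the result avoids running the test a second time just to look at the expected values.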

Table #11.1.3: Number of Leprosy Cases

                         World Bank Income Group
WHO Region               High     Upper Middle   Lower Middle   Low       Row Total
                         Income   Income         Income         Income
Americas                 174      36028          615            0         36817
Eastern Mediterranean    54       6              1883           604       2547
Europe                   10       0              0              0         10
Western Pacific          26       216            3689           1155      5086
Africa                   0        39             1986           15928     17953
South-East Asia          0        0              149896         10236     160132
Column Total             264      36289          158069         27923     222545

Solution:

  1. State the null and alternative hypotheses and the level of significance

Ho: WHO region and Income Level when dealing with the disease of leprosy are independent

HA: WHO region and Income Level when dealing with the disease of leprosy are dependent

α = 0.05

  2. State and check the assumptions for the hypothesis test

a. A random sample of incidence of leprosy was taken from different countries, and the income level and WHO region were recorded.

b. Expected frequencies for each cell are greater than or equal to 5 (i.e., E ≥ 5). See step 3. There are actually 4 expected frequencies that are less than 5, so the results of the test may not be valid. If you look at the expected frequencies you will notice that three of them are in Europe, because Europe didn’t have many cases in 2011, and the fourth is in the Eastern Mediterranean high income group.

  3. Find the test statistic and p-value

Test statistic:

First find the expected frequencies for each cell.

$$E(\text{Americas and High Income}) = \frac{36817 \cdot 264}{222545} \approx 43.675$$

$$E(\text{Americas and Upper Middle Income}) = \frac{36817 \cdot 36289}{222545} \approx 6003.514$$

$$E(\text{Americas and Lower Middle Income}) = \frac{36817 \cdot 158069}{222545} \approx 26150.335$$

$$E(\text{Americas and Low Income}) = \frac{36817 \cdot 27923}{222545} \approx 4619.475$$

Others are done similarly. It is easier to do the calculations for the test statistic with a table, and the others are in table #11.1.4 along with the calculation for the test statistic.

Table #11.1.4: Calculations for Chi-Square Test Statistic

O    E    O − E    (O − E)²    (O − E)²/E

Total 0.

328594.008 = χ²

The test statistic formula is $\chi^2 = \sum \frac{(O - E)^2}{E}$, which is the total of the last column in table #11.1.4.

p-value:

df = (6 − 1) * (4 − 1) = 15

Using the TI-83/84: χ²cdf(328594.008, 1E99, 15) ≈ 0

Using R: 1 − pchisq(328594.008, 15) ≈ 0

  4. Conclusion

Reject Ho since the p-value is less than 0.05.

Figure #11.1.8: Setup for Matrix on TI-83/84

Figure #11.1.9: Results for Chi-Square Test on TI-83/84

χ² ≈ 328594.008

Figure #11.1.10: Expected Frequency for Chi-Square Test on TI-83/84

Press the right arrow to look at the other expected frequencies.

p-value:

p-value ≈ 0

Using R:

row1 = c(174, 36028, 615, 0)

row2 = c(54, 6, 1883, 604)

row3 = c(10, 0, 0, 0)

row4 = c(26, 216, 3689, 1155)

row5 = c(0, 39, 1986, 15928)

row6 = c(0, 0, 149896, 10236)

data.table = rbind(row1, row2, row3, row4, row5, row6)

chisq.test(data.table)

Pearson's Chi-squared test

data: data.table

X-squared = 328590, df = 15, p-value < 2.2e-16

Warning message:

In chisq.test(data.table) : Chi-squared approximation may be incorrect

chisq.test(data.table)$expected

[,1] [,2] [,3] [,4]

row1 43.67515783 6003.514404 2.615034e+04 4619.

row2 3.02144735 415.323117 1.809080e+03 319.

row3 0.01186277 1.630637 7.102788e+00 1.

row4 6.03340448 829.341724 3.612478e+03 638.

row5 21.29722977 2927.481709 1.275164e+04 2252.

row6 189.96089780 26111.708410 1.137384e+05 20091.962686

Warning message:

In chisq.test(data.table) : Chi-squared approximation may be incorrect

χ² ≈ 328590 and p-value < 2.2 × 10⁻¹⁶

  4. Conclusion

Reject Ho since the p-value is less than 0.05.

  5. Interpretation

There is enough evidence to show that WHO region and income level are dependent when dealing with the disease of leprosy. WHO can decide how to focus their efforts based on region and income level. Do remember though that the results may not be valid because not all of the expected frequencies are more than 5.

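
The expected-frequency assumption in the leprosy example can also be checked programmatically. The following is a minimal sketch, assuming the counts from the R commands above; the row and column labels and the names observed, expected, and low are added here for readability and are not part of the original commands.

```r
# Observed counts from the leprosy example, one row per WHO region
observed <- rbind(c(174, 36028, 615, 0),
                  c(54, 6, 1883, 604),
                  c(10, 0, 0, 0),
                  c(26, 216, 3689, 1155),
                  c(0, 39, 1986, 15928),
                  c(0, 0, 149896, 10236))
rownames(observed) <- c("Americas", "Eastern Mediterranean", "Europe",
                        "Western Pacific", "Africa", "South-East Asia")
colnames(observed) <- c("High", "Upper Middle", "Lower Middle", "Low")

# Expected frequencies: (row total * column total) / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Flag the cells that violate the E >= 5 assumption
low <- expected < 5
sum(low)                    # how many cells have E < 5
which(low, arr.ind = TRUE)  # which region / income group they belong to
```

Computing the expected frequencies directly with outer(), rather than through chisq.test, avoids the approximation warning while still showing which cells cause it.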

3.) Is there a relationship between autism and what an infant is fed? To determine if there is, a researcher asked mothers of autistic and non-autistic children to say what they fed their infant. The data is in table #11.1.7 (Schultz, Klonoff-Cohen, Wingard, Akshoomoff, Macera, Ji & Bacher, 2006). Do the data provide enough evidence to show that what an infant is fed and autism are dependent? Test at the 1% level.

Table #11.1.7: Autism Versus Breastfeeding

                Feeding
Autism          Breast-    Formula with   Formula without   Row Total
                feeding    DHA/ARA        DHA/ARA
Yes             12         39             65                116
No              6          22             10                38
Column Total    18         61             75                154

4.) A person’s educational attainment and age group were collected by the U.S. Census Bureau in 1984 to see if age group and educational attainment are related. The counts in thousands are in table #11.1.8 ("Education by age," 2013). Do the data show that educational attainment and age are dependent? Test at the 5% level.

Table #11.1.8: Educational Attainment and Age Group

                           Age Group
Education                  25-34    35-44    45-54    55-64    >64      Row Total
Did not complete HS
Completed HS
College 1-3 years
College 4 or more years
Column Total               40173    20057    22240    22034    26290    130794

5.) Students at multiple grade schools were asked what their personal goal (get good grades, be popular, be good at sports) was and how important good grades were to them (1 very important and 4 least important). The data is in table #11.1.9 ("Popular kids datafile," 2013). Do the data provide enough evidence to show that goal attainment and importance of grades are dependent? Test at the 5% level.

Table #11.1.9: Personal Goal and Importance of Grades

                Grades Importance Rating
Goal            1      2      3      4      Row Total
Grades          70     66     55     56     247
Popular         14     33     45     49     141
Sports          10     24     33     23     90
Column Total    94     123    133    128    478

6.) Students at multiple grade schools were asked what their personal goal (get good grades, be popular, be good at sports) was and how important being good at sports was to them (1 very important and 4 least important). The data is in table #11.1.10 ("Popular kids datafile," 2013). Do the data provide enough evidence to show that goal attainment and importance of sports are dependent? Test at the 5% level.

Table #11.1.10: Personal Goal and Importance of Sports

                Sports Importance Rating
Goal            1      2      3      4      Row Total
Grades          83     81     55     28     247
Popular         32     49     43     17     141
Sports          50     24     14     2      90
Column Total    165    154    112    47     478

7.) Students at multiple grade schools were asked what their personal goal (get good grades, be popular, be good at sports) was and how important having good looks was to them (1 very important and 4 least important). The data is in table #11.1.11 ("Popular kids datafile," 2013). Do the data provide enough evidence to show that goal attainment and importance of looks are dependent? Test at the 5% level.

Table #11.1.11: Personal Goal and Importance of Looks

                Looks Importance Rating
Goal            1      2      3      4      Row Total
Grades          80     66     66     35     247
Popular         81     30     18     12     141
Sports          24     30     17     19     90
Column Total    185    126    101    66     478

Section 11.2: Chi-Square Goodness of Fit

In probability, you calculated probabilities using both experimental and theoretical

methods. There are times when it is important to determine how well the experimental

values match the theoretical values. An example of this is if you wish to verify if a die is

fair. To determine if observed values fit the expected values, you want to see if the

difference between observed values and expected values is large enough to say that the

test statistic is unlikely to happen if you assume that the observed values fit the expected

values. The test statistic in this case is also the chi-square. The process is the same as for

the chi-square test for independence.

Hypothesis Test for Goodness of Fit Test

  1. State the null and alternative hypotheses and the level of significance

Ho: The data are consistent with a specific distribution

HA: The data are not consistent with a specific distribution

Also, state your α level here.

  2. State and check the assumptions for the hypothesis test

a. A random sample is taken.

b. Expected frequencies for each cell are greater than or equal to 5 (the expected frequencies, E, will be calculated later, and this assumption means E ≥ 5).

  3. Find the test statistic and p-value

Finding the test statistic involves several steps. First the data is collected and counted, and then it is organized into a table (each entry in a table is called a cell). These values are known as the observed frequencies, and the symbol for an observed frequency is O. The table is made up of k entries. The total number of observed frequencies is n. The expected frequencies are calculated by multiplying the probability of each entry, p, times n.

$$\text{Expected frequency (entry } i) = E = n \cdot p$$

Test Statistic:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed frequency and E is the expected frequency

Again, the test statistic involves squaring the differences, so the test statistics are all positive. Thus a chi-squared test for goodness of fit is always right tailed.

p-value:

Using the TI-83/84: χ²cdf(lower limit, 1E99, df)

Using R: 1 − pchisq(χ², df)

where the degrees of freedom is df = k − 1

  4. Conclusion

This is where you write reject Ho or fail to reject Ho. The rule is: if the p-value < α, then reject Ho. If the p-value ≥ α, then fail to reject Ho.

  5. Interpretation

This is where you interpret in real world terms the conclusion to the test. The conclusion for a hypothesis test is that you either have enough evidence to show HA is true, or you do not have enough evidence to show HA is true.
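
The goodness of fit steps above can be sketched in R. This sketch assumes the die counts from table #11.2.1 in the example that follows and a fair-die distribution; the object names are illustrative only. The call chisq.test(observed, p = p) is R's built-in version of the same test and is included as a cross-check.

```r
# Observed counts for die faces 1 to 6 (from table #11.2.1)
observed <- c(78, 87, 87, 76, 85, 87)

# Probabilities under Ho: a fair die
p <- rep(1/6, 6)

n <- sum(observed)  # total number of rolls
expected <- n * p   # E = n * p for each entry

# Test statistic and degrees of freedom (df = k - 1)
chi_sq <- sum((observed - expected)^2 / expected)
df <- length(observed) - 1

# Right-tail p-value
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

# Built-in goodness of fit test for comparison
result <- chisq.test(observed, p = p)

chi_sq
p_value
result
```

If p_value is less than α, reject Ho; otherwise fail to reject Ho, exactly as in step 4.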

Example #11.2.1: Goodness of Fit Test Using the Formula

Suppose you have a die and you are curious whether it is fair. If it is fair then the proportion for each value should be the same. You need to find the observed frequencies, and to accomplish this you roll the die 500 times and count how often each side comes up. The data is in table #11.2.1. Do the data show that the die is fair? Test at the 5% level.

Table #11.2.1: Observed Frequencies of Die

Die values            1     2     3     4     5     6     Total
Observed Frequency    78    87    87    76    85    87    500

Solution:

  1. State the null and alternative hypotheses and the level of significance

Ho: The observed frequencies are consistent with the distribution for a fair die (the die is fair)

HA: The observed frequencies are not consistent with the distribution for a fair die (the die is not fair)

α = 0.05

  2. State and check the assumptions for the hypothesis test

a. A random sample is taken since each throw of a die is a random event.

b. Expected frequencies for each cell are greater than or equal to 5. See step 3.

  3. Find the test statistic and p-value

First you need to find the probability of rolling each side of the die. The sample space for rolling a die is {1, 2, 3, 4, 5, 6}. Since you are assuming that the die is fair, P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6.

Now you can find the expected frequency for each side of the die. Since all the probabilities are the same, each expected frequency is the same.

Expected frequency = E = n * p = 500 * (1/6) ≈ 83.33