Contingency Tables: Independence and Homogeneity - Prof. John Burke, Study notes of Statistics

An introduction to contingency tables, which are used to analyze the relationship between two categorical variables. The notes cover the concepts of independence and homogeneity and include examples and instructions for performing a chi-square test of independence. The document also discusses the assumptions and interpretation of the test results.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

Sierra College – Math 13
Spring 2009 – Class 30/32
Today: Sections 11-3; 10-1/10-2
Assignment: 11-3 {1, 3, 7, 9, 13, 17}
10-2 {1, 3, 5, 7, 9, 13, 17, 19, 23}
Next: Sections 10-2/10-3
Instructor: John Burke
Web Page: http://math.sierracollege.edu/Staff/JohnBurke/
Telephone: 916 337-0425
Office hours: (V-307) MW 2:35-5:00; M 2:45-3:45 (official)

11-3 Contingency Tables: Independence and Homogeneity

A contingency table (or two-way frequency table) is a table in which frequencies correspond to two variables. (One variable is used to categorize rows, and a second variable is used to categorize columns.)

  • Example: Titanic passengers categorized as (Survived, Died) by (Men, Women, Boys, and Girls).
  • Example: Male survey respondents to an abortion rights question categorized as (Agree, Disagree) by (Male Interviewer and Female Interviewer).
  • Example: Students categorized as (Male and Female) by (Non-Smoker, Smoker).
  • Example: Respondents to a question about use of the TV remote categorized as (Male and Female) by (Often, Sometimes, or Almost Never).

In these cases we are looking to determine a dependency relationship between the row variable and column variable, though it is important to note that dependency does NOT establish causality.

11-3 Contingency Tables: Independence

A test of independence tests the null hypothesis H0 against the alternative H1:

  • H0: There is no association between the row variable and the column variable; i.e., the row and column variables are independent.
  • H1: The variables in question are not independent.

χ^2 Test for Independence (assumptions)

  • The sample data are randomly selected.
  • For every cell, the expected frequency is at least 5.

χ^2 Test for Independence

Test Statistic:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Critical Values:

  • The critical values are found in Table A-4 by using degrees of freedom = (r – 1)(c – 1), where r is the number of rows and c is the number of columns.
  • In a test of independence with a contingency table, the critical region is located in the right tail only.
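As a minimal sketch of the computation, the Python function below derives each expected frequency from the table margins, E = (row total)(column total)/(grand total), which is the standard estimate under independence, and then forms the test statistic and the degrees of freedom. The function name `chi_square_statistic` and the 2×2 counts are made up for illustration; they do not come from the lecture.

```python
def chi_square_statistic(observed):
    """Return (chi2, df, expected) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # Expected count under independence: (row total)(column total) / (grand total)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Test statistic: sum of (O - E)^2 / E over every cell
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

    # Degrees of freedom: (rows - 1)(columns - 1)
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df, expected


# Illustrative counts only (not from the lecture)
chi2, df, expected = chi_square_statistic([[10, 20], [20, 10]])
print(round(chi2, 3), df)

# Assumption check: every expected frequency should be at least 5
print(all(e >= 5 for row in expected for e in row))
```

Here every expected count is 15, so the "at least 5" assumption holds, and the statistic would be compared with the right-tail critical value for 1 degree of freedom.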

Relationships Among Components in Independence Hypothesis Test

Compare the observed (O) values to the corresponding expected (E) values.

  • Os and Es are close → small χ^2 value → large p-value → fail to reject independence.
  • Os and Es are far apart → large χ^2 value → small p-value → reject independence.

χ^2 Test for Homogeneity

In a test of homogeneity, we test the claim that different populations have the same proportions of some characteristic.

Example: Male survey respondents to an abortion rights question categorized as (Agree, Disagree) by (Male Interviewer and Female Interviewer).

  • Use a 0.05 significance level.
  • H0: The proportions of agree/disagree responses are the same for the subjects interviewed by men and the subjects interviewed by women.
  • H1: The proportions are different.

| | Interviewer: Man | Interviewer: Woman |
|---|---|---|
| Men who agree | 560 | 308 |
| Men who disagree | 240 | 92 |
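For the interviewer example, the calculation can be sketched as follows. This assumes the table reads 560 and 308 men agreeing, and 240 and 92 men disagreeing, under male and female interviewers respectively. A 2×2 table has 1 degree of freedom, and for that case only, the right-tail area has the closed form erfc(√(χ²/2)), which stands in here for a table or calculator lookup. The variable names are illustrative.

```python
import math

# Rows: men who agree, men who disagree
# Columns: male interviewer, female interviewer
observed = [[560, 308], [240, 92]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total  # expected count
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Right-tail p-value; this closed form is valid only when df == 1
p_value = math.erfc(math.sqrt(chi2 / 2))

alpha = 0.05
reject_h0 = p_value < alpha
print(f"chi2 = {chi2:.3f}, df = {df}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

Under these assumptions the statistic comes out near 6.529, which exceeds the df = 1 critical value of 3.841 at the 0.05 level, so the hypothesis of equal proportions is rejected.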

10-1 / 10-3 Correlation and Regression

In Chapter 10, we examine relationships between paired quantitative data.

We use collected data to

  • Observe a pattern (correlation – 10-2)
  • Mathematically model the pattern (regression – 10-3)
  • When appropriate, use the mathematical model to make predictions.


Chapter 10 Problem:

Can We Predict the Time of the Next Eruption of Old Faithful?

Is there a relationship between any two variables?

Can we predict how long it will be to the next eruption based upon duration, interval before, or height?

Eruptions of the Old Faithful Geyser

| Variable | Values for the 8 eruptions |
|---|---|
| Duration (L1)* | 240 120 178 234 235 269 255 220 |
| Interval Before Eruption (L2)* | 98 90 92 98 93 105 81 108 |
| Interval After Eruption (L3)* | 92 65 72 94 83 94 101 87 |
| Height (L4)* | 140 110 125 120 140 120 125 150 |

  • Enter the data in your calculator/StatDisk

10-2 Correlation

Paired sample data is sometimes called bivariate data.

A correlation exists between two variables when one of them is related to the other in some way.

We can often see if a relationship exists by using a scatterplot (or scatter diagram), a graph in which the paired (x, y) sample data are plotted with each pair represented as a single point.

Assumptions: We will consider only linear relationships, which means that when graphed, the points approximate a straight line. (Recall slope and direction of line.)

Positive Linear Correlation

[Figure: three scatterplots of y against x showing (a) positive, (b) strong positive, and (c) perfect positive linear correlation.]

Scatter Plots for the Chapter Problem

StatDisk: Analysis → Correlation and Regression

[Figure: three scatterplots: Interval After (L3) vs. Duration (L1); Interval After (L3) vs. Height (L4); Interval After (L3) vs. Interval Before (L2).]

Linear Correlation Coefficient

The (Pearson) correlation coefficient r measures the strength of the linear relationship between the paired x- and y-quantitative values in a sample.

Assumptions:

  • The sample of paired data is a random sample.
  • The pairs of (x, y) data have a bivariate normal distribution.

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$

Notation for r

  • n = number of pairs of data presented.
  • Σ denotes the addition of the items indicated.
  • Σx denotes the sum of all x values.
  • Σx^2 indicates that each x score should be squared and then those squares added.
  • (Σx)^2 indicates that the x scores should be added and the total then squared.
  • Σxy indicates that each x score should be first multiplied by its corresponding y score. After obtaining all such products, find their sum.
  • r represents the linear correlation coefficient for a sample.
  • ρ (rho) represents the linear correlation coefficient for a population.
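The sums in this notation translate directly into code. The sketch below evaluates r for the chapter problem's Duration (L1) and Interval After Eruption (L3) lists; the function name `pearson_r` is an illustrative choice, not something the course materials define.

```python
import math


def pearson_r(x, y):
    """Linear correlation coefficient from the computational (sums) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)             # Σx^2: square first, then add
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))  # Σxy: multiply pairs, then add

    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


duration = [240, 120, 178, 234, 235, 269, 255, 220]   # L1
interval_after = [92, 65, 72, 94, 83, 94, 101, 87]    # L3

r = pearson_r(duration, interval_after)
print(round(r, 3))  # 0.926
```

Note the difference between `sum_x2` (Σx^2) and `sum_x ** 2` ((Σx)^2); confusing the two is an easy way to get a wrong value of r.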

Properties of r

  • The value of r is always between -1 and +1 inclusive.
  • The value of r does not change if all values of either variable are converted to a different scale.
  • The value of r is not affected by the choice of x or y. Interchange all x- and y-values and the value of r will not change.
  • r measures the strength of a linear relationship. It is not designed to measure the strength of a relationship that is not linear.
  • r^2 is the proportion of the variation in y that is explained by the linear relationship between x and y.
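These properties can be checked numerically. The sketch below reuses the Old Faithful Duration and Interval After lists; treating the durations as seconds and converting them to minutes is an assumption made only to illustrate a change of scale, and `pearson_r` is an illustrative helper name.

```python
import math


def pearson_r(x, y):
    """Linear correlation coefficient from the computational (sums) formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / (
        math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    )


duration_sec = [240, 120, 178, 234, 235, 269, 255, 220]   # assumed to be seconds
interval_after = [92, 65, 72, 94, 83, 94, 101, 87]

r = pearson_r(duration_sec, interval_after)

# Changing the scale of one variable (seconds to minutes) leaves r unchanged
duration_min = [d / 60 for d in duration_sec]
r_rescaled = pearson_r(duration_min, interval_after)

# Interchanging x and y leaves r unchanged
r_swapped = pearson_r(interval_after, duration_sec)

print(math.isclose(r, r_rescaled), math.isclose(r, r_swapped))
print(-1 <= r <= 1, round(r ** 2, 3))
```

For these data r^2 comes out near 0.857, so roughly 86% of the variation in the interval after an eruption is explained by its linear relationship with duration.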

Interpreting r using Table A-6:

If the absolute value of the computed value of r exceeds the value in Table A-6, conclude that there is a significant linear correlation.

Otherwise, there is not sufficient evidence to support the conclusion of a significant linear correlation.

Table A-6 (excerpt)

| n | α = .05 | α = .01 |
|---|---|---|
| 5 | .878 | .959 |
| 7 | .754 | .875 |
| 10 | .632 | .765 |
| 12 | .576 | .708 |
| 15 | .514 | .641 |
| 17 | .482 | .606 |
| 20 | .444 | .561 |
| 30 | .361 | .463 |
| 45 | .294 | .378 |
| 60 | .254 | .330 |
| 90 | .207 | .269 |
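The decision rule amounts to a table lookup and a comparison. The sketch below stores a subset of the α = .05 critical values keyed by n; the names `CRITICAL_R_05` and `is_significant` and the two values of r are hypothetical, chosen only to show both outcomes.

```python
# Subset of Table A-6 critical values of r at alpha = .05, keyed by sample size n
CRITICAL_R_05 = {
    5: 0.878, 7: 0.754, 10: 0.632, 12: 0.576, 15: 0.514, 17: 0.482,
    20: 0.444, 30: 0.361, 45: 0.294, 60: 0.254, 90: 0.207,
}


def is_significant(r, n, table=CRITICAL_R_05):
    """True when |r| exceeds the tabled critical value for sample size n."""
    if n not in table:
        raise KeyError(f"no critical value stored for n = {n}")
    return abs(r) > table[n]


# Hypothetical values of r, for illustration only
print(is_significant(0.70, 10))    # True: 0.70 exceeds .632
print(is_significant(-0.30, 20))   # False: 0.30 does not exceed .444
```

The comparison uses the absolute value of r, so a strong negative correlation is judged the same way as a strong positive one.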

Common Errors Involving Correlation

Causation: It is wrong to conclude that correlation implies causality. (Remember eating lobster and its “effect” on pregnancy.)

Averages: Averages suppress individual variation and may inflate the correlation coefficient.

Linearity: There may be some relationship between x and y even when there is no significant linear correlation.
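The linearity point can be illustrated with a small made-up data set in which y is completely determined by x (y = x^2), yet the linear correlation coefficient is zero. The data and the helper name `pearson_r` are illustrative assumptions.

```python
import math


def pearson_r(x, y):
    """Linear correlation coefficient from the computational (sums) formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / (
        math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    )


x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]   # exact quadratic relationship, not a linear one

print(pearson_r(x, y))  # 0.0
```

A scatterplot of these points would show a clear U-shaped pattern, which is why plotting the data is worthwhile before relying on r alone.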