Linear Regression: Health Spending vs. GDP in OECD Countries - Prof. Mary Kathryn Cowles, Study notes of Statistics

Lecture notes from a Bayesian statistics course (22S:138) at an unspecified university, focusing on linear regression analysis. The notes introduce the topic, review the frequentist approach, and examine the relationship between per capita health spending and per capita gross domestic product (GDP) in 24 OECD countries in 1989. They cover scatterplots, linear functions, error terms, and the calculation of predicted values and residuals.

Typology: Study notes

Pre 2010

Uploaded on 09/17/2009

Introduction to Linear Regression
22S:138 Bayesian Statistics
Lecture 14
October 16, 2006
Kate Cowles, Ph.D.
Review of Frequentist Approach to Linear Regression
Per Capita Health Spending and Per Capita Gross Domestic Product (GDP) in 24 OECD Countries, 1989
Schieber, Poullier, and Greenwald, Health Affairs, 1991
Country               Per Cap Hlth   Per Cap GDP
 1. united states         2051         18.1429
 2. canada                1483         17.2857
 3. iceland               1241         15.5714
 4. sweden                1233         13.8571
 5. switzerland           1225         15.8571
 6. norway                1149         15.5714
 7. france                1105         12.2857
 8. germany               1093         13.4286
 9. luxembourg            1050         14.8571
10. netherlands           1041         13.0000
11. austria                982         11.8571
12. finland                949         12.8571
13. australia              939         12.2857
14. japan                  915         13.4286
15. belgium                879         11.8571
16. italy                  841         12.4286
17. denmark                792         13.5714
18. united kingdom         758         12.4286
19. new zealand            733         10.8571
20. ireland                561          7.8571
21. spain                  521          8.8571
22. portugal               386          6.5714
23. greece                 337          6.4286
24. turkey                 148          4.4286
In regression analysis, we look at the conditional distribution of the response variable at different levels of a predictor variable.
• Response variable
  – also called “dependent” or “outcome” variable
  – what we want to explain or predict
  – in simple linear regression, response variable is continuous
• Predictor variables
  – also called “independent” variables or “covariates”
  – in simple linear regression, predictor variable usually is also continuous
  – How we define which variable is response and which is predictor depends on our research question.
[Figure: Per Capita Health Spending and Per Capita Gross Domestic Product (GDP) in 24 OECD Countries, 1989]

Scatterplots

  • response variable on Y axis
  • predictor variable on X axis
  • Relationship in this scatterplot looks roughly linear.
    • Makes sense to try to summarize the relationship between these two variables with a straight line.

Quick review of linear functions

$$Y = \beta_0 + \beta_1 X$$

  • $Y$ is a response variable that is a linear function of the predictor variable $X$
  • $\beta_0$: intercept; the value of $Y$ when $X = 0$
  • $\beta_1$: slope; how much $Y$ changes when $X$ increases by 1 unit
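The roles of the intercept and slope can be checked numerically. The sketch below is only an illustration: the function name `linear` and the coefficient values are hypothetical and do not come from the notes.

```python
def linear(x, beta0, beta1):
    """Evaluate the linear function Y = beta0 + beta1 * X."""
    return beta0 + beta1 * x

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = 2.0, 0.5

# The intercept is the value of Y at X = 0.
intercept_value = linear(0, beta0, beta1)

# The slope is the change in Y when X increases by 1 unit (here from 3 to 4).
change_per_unit = linear(4, beta0, beta1) - linear(3, beta0, beta1)

print(intercept_value, change_per_unit)
```

The same one-unit change results from any starting value of X, which is what makes the function linear.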

Linear regression

  • In linear regression analysis, $\beta_0 + \beta_1 X$ represents the mean value of all the $Y$'s for a given value of $X$.

$$E(Y \mid X) = \beta_0 + \beta_1 X$$

  • There is an entire distribution of $Y$ values for each value of $X$ (a conditional distribution)
    • Example: for any given value of per capita GDP, there is a distribution of values of per capita health spending among OECD countries
  • We say the relationship between $X$ and $Y$ is linear if the means of the conditional distributions of $Y \mid X$ lie on a straight line.

Error terms

  • In regression, we represent factors other than $X_i$ that affect $Y_i$ with an error term, $\epsilon_i$.
  • population model

$$\epsilon_i = Y_i - (\beta_0 + \beta_1 X_i)$$

$$\epsilon_i = Y_i - E[Y_i]$$

  • or, equivalently,

$$Y_i = (\beta_0 + \beta_1 X_i) + \epsilon_i$$

$$Y_i = E[Y_i] + \epsilon_i$$
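As a purely illustrative check that the two forms are equivalent, suppose (hypothetically; these values are not from the notes) that $\beta_0 = -400$, $\beta_1 = 100$, and one observation has $X_i = 10$ and $Y_i = 650$:

```latex
\begin{align*}
E[Y_i]     &= \beta_0 + \beta_1 X_i = -400 + 100(10) = 600 \\
\epsilon_i &= Y_i - E[Y_i] = 650 - 600 = 50 \\
Y_i        &= E[Y_i] + \epsilon_i = 600 + 50 = 650
\end{align*}
```

In practice the population coefficients are unknown, so $\epsilon_i$ cannot be observed directly; it is approximated by the residual from the fitted line.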

Calculating predicted values and residuals

Per capita health expenditures and per capita GDP

Obs  NAME            Dep Var PCH   Predict Value   Std Err Predict   Lower95% Mean   Upper95% Mean
  1  UnitedStates       2051.0        1558.9           63.075           1428.1          1689.
  2  Canada             1483.0        1467.0           56.251           1350.3          1583.
  3  Iceland            1241.0        1283.1           43.858           1192.1          1374.
  4  Sweden             1233.0        1099.2           34.641           1027.4          1171.
  5  Switzerland        1225.0        1313.7           45.765           1218.8          1408.
  6  Norway             1149.0        1283.1           43.858           1192.1          1374.
  7  France             1105.0         930.6           31.480            865.4           995.
  8  Germany            1093.0        1053.2           33.165            984.5          1122.
  9  Luxemborg          1050.0        1206.5           39.487           1124.6          1288.
 10  Netherland         1041.0        1007.3           32.127            940.6          1073.
 11  Austria             982.0         884.7           31.771            818.8           950.
 12  Finland             949.0         991.9           31.886            925.8          1058.
 13  Australia           939.0         930.6           31.480            865.4           995.
 14  Japan               915.0        1053.2           33.165            984.5          1122.
 15  Belgium             879.0         884.7           31.771            818.8           950.
 16  Italy               841.0         946.0           31.497            880.6          1011.
 17  Denmark             792.0        1068.5           33.611            998.8          1138.
 18  UnitedKingdom       758.0         946.0           31.497            880.6          1011.
 19  NewZealand          733.0         777.4           34.322            706.2           848.
 20  Ireland             561.0         455.6           52.341            347.1           564.
 21  Spain               521.0         562.9           45.201            469.1           656.
 22  Portugal            386.0         317.7           62.399            188.3           447.
 23  Greece              337.0         302.4           63.559            170.6           434.
 24  Turkey              148.0          87.8628        80.394            -78.8646        254.

Obs  NAME            Residual
  1  UnitedStates       492.
  2  Canada              16.
  3  Iceland            -42.
  4  Sweden             133.
  5  Switzerland        -88.
  6  Norway            -134.
  7  France             174.
  8  Germany             39.
  9  Luxemborg         -156.
 10  Netherland          33.
 11  Austria             97.
 12  Finland            -42.
 13  Australia            8.
 14  Japan             -138.
 15  Belgium             -5.
 16  Italy             -105.
 17  Denmark           -276.
 18  UnitedKingdom     -188.
 19  NewZealand         -44.
 20  Ireland            105.
 21  Spain              -41.
 22  Portugal            68.
 23  Greece              34.
 24  Turkey              60.
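The predicted values and residuals above can be reproduced approximately from the raw data. The sketch below assumes the output came from an ordinary least-squares fit, which the visible part of the notes does not state explicitly. The function and variable names are hypothetical; the data are transcribed from the 24-country table in these notes.

```python
# Per capita GDP and per capita health spending, in the order of the country table.
gdp = [18.1429, 17.2857, 15.5714, 13.8571, 15.8571, 15.5714, 12.2857, 13.4286,
       14.8571, 13.0000, 11.8571, 12.8571, 12.2857, 13.4286, 11.8571, 12.4286,
       13.5714, 12.4286, 10.8571, 7.8571, 8.8571, 6.5714, 6.4286, 4.4286]
health = [2051, 1483, 1241, 1233, 1225, 1149, 1105, 1093, 1050, 1041, 982, 949,
          939, 915, 879, 841, 792, 758, 733, 561, 521, 386, 337, 148]

def least_squares(x, y):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

b0_hat, b1_hat = least_squares(gdp, health)

# Predicted value: the fitted mean at each country's GDP.
predicted = [b0_hat + b1_hat * xi for xi in gdp]
# Residual: observed spending minus predicted spending.
residuals = [yi - pi for yi, pi in zip(health, predicted)]

print(f"intercept = {b0_hat:.2f}, slope = {b1_hat:.2f}")
print(f"United States: predicted = {predicted[0]:.1f}, residual = {residuals[0]:.1f}")
```

The first entries of `predicted` and `residuals` can be compared with the United States row of the output table.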


Estimating the common variance

  • One of the assumptions of linear regression is that the variance for each of the conditional distributions of $Y \mid X$ is the same at all values of $X$.
  • The estimate of this common variance is

$$\hat{\sigma}^2 = \frac{SSE}{n - 2}$$

  • analogous to estimate of variance in a normal sample
  • $n - 2$ in denominator is “degrees of freedom”
    • number of observations minus number of estimated regression coefficients
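As a rough numerical illustration, the variance estimate can be computed from the observed and predicted values listed in the output table of these notes. The sketch assumes that SSE denotes the sum of squared residuals (observed minus predicted). The variable names are hypothetical, and the predicted values are rounded as printed in the table.

```python
import math

# Observed per capita health spending (Dep Var PCH) and fitted values
# (Predict Value), transcribed from the output table in these notes.
observed = [2051.0, 1483.0, 1241.0, 1233.0, 1225.0, 1149.0, 1105.0, 1093.0,
            1050.0, 1041.0, 982.0, 949.0, 939.0, 915.0, 879.0, 841.0,
            792.0, 758.0, 733.0, 561.0, 521.0, 386.0, 337.0, 148.0]
fitted = [1558.9, 1467.0, 1283.1, 1099.2, 1313.7, 1283.1, 930.6, 1053.2,
          1206.5, 1007.3, 884.7, 991.9, 930.6, 1053.2, 884.7, 946.0,
          1068.5, 946.0, 777.4, 455.6, 562.9, 317.7, 302.4, 87.8628]

n = len(observed)

# SSE: sum of squared residuals (assumed meaning of SSE).
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, fitted))

# Degrees of freedom: n minus the two estimated coefficients (intercept, slope).
df = n - 2

sigma2_hat = sse / df               # estimate of the common variance
sigma_hat = math.sqrt(sigma2_hat)   # estimate of the common standard deviation

print(f"SSE = {sse:.1f}, df = {df}, sigma_hat = {sigma_hat:.1f}")
```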

Inferences for the Slope

  • So far, we’ve been describing the relationship between two continuous variables.
  • Now we want to perform a hypothesis test to determine whether there is a linear relationship between the two variables.
  • depends on assumptions of linear regression
  • Question: Does the value of $Y$ depend linearly on $X$?

$$E[Y_i] = \beta_0 + \beta_1 X_i$$

  • Answer: Yes, unless $\beta_1 = 0$, in which case

$$E[Y_i] = \beta_0$$

  • Hypotheses for test for linear relationship between $Y$ and $X$

$$H_0: \beta_1 = 0$$

$$H_A: \beta_1 \neq 0$$

• Test statistic

$$t = \frac{\hat{\beta}_1}{\hat{\sigma}_{\hat{\beta}_1}} \quad \text{where} \quad \hat{\sigma}_{\hat{\beta}_1} = \frac{\hat{\sigma}}{\sqrt{\sum (X_i - \bar{X})^2}}$$

– standard form of test statistic: estimate divided by its standard error
– standard error of $\hat{\beta}_1$ depends on
  ∗ variability of Ys
  ∗ how closely clustered the Xs are
– follows a t distribution with $n - 2$ degrees of freedom
  ∗ because we have to estimate 2 parameters ($\beta_0$ and $\beta_1$) to compute $\hat{\sigma}$
– p-value: the probability of obtaining a t statistic as extreme as, or more extreme than, what we got, if $H_0$ is true
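The test statistic can be computed directly from the 24-country data. This is a minimal sketch, assuming a least-squares fit. The helper name `slope_t_statistic` is hypothetical. The critical value 2.074 is the standard tabulated 0.975 quantile of the t distribution with 22 degrees of freedom and is not taken from the notes.

```python
import math

# Per capita GDP and per capita health spending, transcribed from these notes.
gdp = [18.1429, 17.2857, 15.5714, 13.8571, 15.8571, 15.5714, 12.2857, 13.4286,
       14.8571, 13.0000, 11.8571, 12.8571, 12.2857, 13.4286, 11.8571, 12.4286,
       13.5714, 12.4286, 10.8571, 7.8571, 8.8571, 6.5714, 6.4286, 4.4286]
health = [2051, 1483, 1241, 1233, 1225, 1149, 1105, 1093, 1050, 1041, 982, 949,
          939, 915, 879, 841, 792, 758, 733, 561, 521, 386, 337, 148]

def slope_t_statistic(x, y):
    """Return (slope estimate, its standard error, t statistic for H0: slope = 0)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(sse / (n - 2))   # estimated common standard deviation
    se_b1 = sigma_hat / math.sqrt(sxx)     # standard error of the slope
    return b1, se_b1, b1 / se_b1

b1_hat, se_b1, t_stat = slope_t_statistic(gdp, health)

T_CRIT_22 = 2.074   # tabulated 0.975 quantile of t with 22 df (two-sided, alpha = 0.05)
reject_h0 = abs(t_stat) > T_CRIT_22

print(f"slope = {b1_hat:.2f}, SE = {se_b1:.2f}, t = {t_stat:.2f}, reject H0: {reject_h0}")
```

The sketch compares the statistic with a critical value rather than computing a p-value, because the Python standard library has no t distribution function.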

• Confidence interval for the slope:
– A $100(1 - \alpha)\%$ confidence interval for the true slope $\beta_1$ is given by:

$$\hat{\beta}_1 \pm t_{1 - \alpha/2,\, df = n - 2} \; \hat{\sigma}_{\hat{\beta}_1}$$

– If this C.I. includes the value 0, we cannot reject the null hypothesis at significance level $\alpha$.
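A 95% interval for the slope can be sketched for the 24-country data as follows, again assuming a least-squares fit. The function name `slope_confidence_interval` is hypothetical, and the multiplier 2.074 is the standard tabulated 0.975 quantile of the t distribution with 22 degrees of freedom, not a value given in the notes.

```python
import math

# Per capita GDP and per capita health spending, transcribed from these notes.
gdp = [18.1429, 17.2857, 15.5714, 13.8571, 15.8571, 15.5714, 12.2857, 13.4286,
       14.8571, 13.0000, 11.8571, 12.8571, 12.2857, 13.4286, 11.8571, 12.4286,
       13.5714, 12.4286, 10.8571, 7.8571, 8.8571, 6.5714, 6.4286, 4.4286]
health = [2051, 1483, 1241, 1233, 1225, 1149, 1105, 1093, 1050, 1041, 982, 949,
          939, 915, 879, 841, 792, 758, 733, 561, 521, 386, 337, 148]

def slope_confidence_interval(x, y, t_crit):
    """Return (lower, upper) limits: slope estimate -/+ t_crit * standard error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)
    return b1 - t_crit * se_b1, b1 + t_crit * se_b1

# alpha = 0.05, df = 24 - 2 = 22, so the multiplier is t_{0.975, 22}.
lower, upper = slope_confidence_interval(gdp, health, t_crit=2.074)

# If the interval excludes 0, H0: slope = 0 is rejected at level alpha.
excludes_zero = lower > 0 or upper < 0

print(f"95% CI for slope: ({lower:.1f}, {upper:.1f}); excludes 0: {excludes_zero}")
```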

Interpreting the test for zero slope

• Failure to reject $H_0: \beta_1 = 0$ has several possible explanations:
– Type II error
– X and Y related in a nonlinear way
– X provides little help in predicting Y
• Rejecting $H_0: \beta_1 = 0$
– X provides significant information for predicting Y
– Although the data fit a linear model, some nonlinear model may do even better
• Important caveat regarding inferences on $\beta_1$: the best straight line may be terrible!
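The nonlinear case can be illustrated with made-up data (not from the notes) in which Y is completely determined by X through Y = X², yet the least-squares slope is exactly zero. The function name `least_squares_slope` is hypothetical.

```python
def least_squares_slope(x, y):
    """Return the ordinary least-squares slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx

# Made-up data: a perfect quadratic relationship, symmetric about X = 0.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

slope = least_squares_slope(x, y)
print(slope)
```

With these data, a test of zero slope would fail to reject $H_0$ even though X predicts Y perfectly. In this case the best straight line is a poor summary of the relationship.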

Inferences concerning the regression line

• Estimating the mean of the Y’s for a particular value of X, say $X_0$
– Example: what is the average per capita health spending for a country with per capita gross domestic product 10 PPP?

$$\hat{E}[Y \mid X_0] = \hat{Y}_{X_0} = \hat{\beta}_0 + \hat{\beta}_1 X_0$$

• estimated standard error of $\hat{E}[Y \mid X_0]$

$$\hat{\sigma}_{\hat{Y}_{X_0}} = \hat{\sigma} \sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X_i - \bar{X})^2}}$$

• A $100(1 - \alpha)\%$ confidence interval for $E[Y \mid X_0]$ is given by:

$$\hat{Y}_{X_0} \pm t_{1 - \alpha/2,\, df = n - 2} \; \hat{\sigma}_{\hat{Y}_{X_0}}$$
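The interval for the mean response can be sketched for the 24-country data, assuming a least-squares fit. The function name `mean_response_interval` is hypothetical, and 2.074 is the standard tabulated 0.975 quantile of the t distribution with 22 degrees of freedom. The example evaluates the interval at the United States' GDP value rather than at 10, so that the result can be compared with the first row of the output table in these notes.

```python
import math

# Per capita GDP and per capita health spending, transcribed from these notes.
gdp = [18.1429, 17.2857, 15.5714, 13.8571, 15.8571, 15.5714, 12.2857, 13.4286,
       14.8571, 13.0000, 11.8571, 12.8571, 12.2857, 13.4286, 11.8571, 12.4286,
       13.5714, 12.4286, 10.8571, 7.8571, 8.8571, 6.5714, 6.4286, 4.4286]
health = [2051, 1483, 1241, 1233, 1225, 1149, 1105, 1093, 1050, 1041, 982, 949,
          939, 915, 879, 841, 792, 758, 733, 561, 521, 386, 337, 148]

def mean_response_interval(x, y, x0, t_crit):
    """Return (fitted mean at x0, its standard error, lower limit, upper limit)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(sse / (n - 2))
    y0_hat = b0 + b1 * x0
    # Standard error grows as x0 moves away from the mean of the Xs.
    se = sigma_hat * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    return y0_hat, se, y0_hat - t_crit * se, y0_hat + t_crit * se

y0_hat, se0, lower, upper = mean_response_interval(gdp, health, x0=18.1429, t_crit=2.074)
print(f"fitted mean = {y0_hat:.1f}, SE = {se0:.3f}, 95% CI = ({lower:.1f}, {upper:.1f})")
```

Passing `x0=10` instead would address the example question posed in the notes.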

[Figure: plot showing 95% confidence limits and 95% prediction limits]