




























































































These are class notes for STAT 305, beautifully scribed by Eric Min and made available online for students. The notes cover various topics including ANOVA, hypothesis testing, lab reliability, contrasts, multiple comparisons, simple regression, errors in variables, and random effects. The notes were taken during the Autumn 2013 semester at the Department of Statistics, Stanford University. They include detailed explanations, examples, and formulas for each topic.





























































































(^1) The class notes were beautifully scribed by Eric Min. He has kindly allowed his notes to be placed online for STAT 305 students. Reading these at leisure, you will spot a few errors and omissions due to the hurried nature of scribing and probably my handwriting too. Reading them ahead of class will help you understand the material as the class proceeds.
(^2) Department of Statistics, Stanford University.
1.3: Linearity Chapter 1: Overview
Proof.
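$$
\begin{aligned}
E[(Y - m(x))^2 \mid X = x] &= E[(Y - \mu(x) + \mu(x) - m(x))^2 \mid X = x] \\
&= E[(Y - \mu(x))^2 \mid X = x] + (\mu(x) - m(x))^2
\end{aligned}
$$
Here $\mu(x) = E[Y \mid X = x]$, so the cross term $2(\mu(x) - m(x))\,E[Y - \mu(x) \mid X = x]$ vanishes. The first term is the conditional variance of $Y$ and the second is the squared bias of $m(x)$.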
This is the standard bias-variance trade-off. We cannot change variance. (We also assume that y has a finite variance.) However, our choice of m(x) can minimize the bias.
The proof above is slightly unsatisfactory since it already “knows” the conclusion. Instead, we could take a first-order condition:
$$\frac{d}{dm} E[(Y - m)^2 \mid X = x] = 0 \;\Rightarrow\; m(x) = E[Y \mid X = x]$$
This yields an extremum which must obviously be a minimum.
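As a worked version of the first-order condition, here is a short derivation. It assumes that we may differentiate under the conditional expectation, which the notes do not state explicitly.

```latex
\begin{aligned}
\frac{d}{dm} E[(Y - m)^2 \mid X = x]
  &= E[-2(Y - m) \mid X = x]
   = -2\left(E[Y \mid X = x] - m\right), \\
\frac{d^2}{dm^2} E[(Y - m)^2 \mid X = x] &= 2 > 0 .
\end{aligned}
```

Setting the first derivative to zero gives $m = E[Y \mid X = x]$, and the positive second derivative confirms that this extremum is a minimum.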
We were dealing with mean squared error above. What about absolute error? For that, $E[\,|Y - m|\,\mid X = x]$ is minimized by using $m = \mathrm{med}(Y \mid X = x)$ (the median). In some cases, we may want $V(Y \mid X = x)$, or a quantile like $Q_{0.999}(Y \mid X = x)$.
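The contrast between the two loss functions can be checked numerically. The sketch below uses a small made-up sample standing in for the values of Y at one fixed x, and a grid search over candidate predictions m. The data, the grid, and the function names are all illustrative assumptions.

```python
import statistics

# Hypothetical sample of Y values at one fixed x (illustrative numbers only).
y = [1.0, 2.0, 2.5, 3.0, 10.0]

def mean_squared_error(m):
    """Average squared error of predicting every y by the constant m."""
    return sum((yi - m) ** 2 for yi in y) / len(y)

def mean_absolute_error(m):
    """Average absolute error of predicting every y by the constant m."""
    return sum(abs(yi - m) for yi in y) / len(y)

# Candidate predictions m on a grid from 0.00 to 12.00 in steps of 0.01.
grid = [i / 100 for i in range(0, 1201)]

best_sq = min(grid, key=mean_squared_error)
best_abs = min(grid, key=mean_absolute_error)

print("squared-error minimizer:", best_sq, "sample mean:", statistics.mean(y))
print("absolute-error minimizer:", best_abs, "sample median:", statistics.median(y))
```

The outlier at 10 pulls the squared-error minimizer (the mean) upward, while the absolute-error minimizer (the median) stays with the bulk of the data.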
1.3 Linearity
Suppose we have data on boiling points of water at different levels of air pressure.
We can fit a line through this data and toss out all other potential information.
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
By drawing a single line, we also assume that the residual errors are independent and have the same variance. That is,
$$\varepsilon_i \sim (0, \sigma^2)$$
Figure 1.1: Sample data of boiling points at different air pressures, with and without linear fit. (Both panels plot boiling point against air pressure.)
Doing so overlooks the possibility that each point’s error may actually not be random, but the effect of other factors we do not analyze directly (such as whether the experiment was done in the sun, done indoors, used a faulty thermometer, etc.). Nonetheless, the linear model is powerful in that it summarizes the data using three values: $\beta_0$, $\beta_1$, and $\sigma^2$ (the variance of the errors $\varepsilon_i$).
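A minimal sketch of such a fit, using the standard closed-form least squares estimates for a single predictor. The pressure and boiling-point numbers are simulated and purely hypothetical, and the divisor n - 2 in the variance estimate is the usual convention rather than something stated in the notes.

```python
import random

random.seed(0)

# Hypothetical "true" values used only to simulate data.
beta0_true, beta1_true, sigma_true = 50.0, 2.0, 0.5

# Simulated air pressures and boiling points (made-up units).
pressure = [20 + 0.5 * i for i in range(21)]
boiling = [beta0_true + beta1_true * x + random.gauss(0, sigma_true)
           for x in pressure]

n = len(pressure)
x_bar = sum(pressure) / n
y_bar = sum(boiling) / n

# Closed-form least squares estimates for one predictor.
sxx = sum((x - x_bar) ** 2 for x in pressure)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(pressure, boiling))
beta1_hat = sxy / sxx
beta0_hat = y_bar - beta1_hat * x_bar

# Residuals and the usual variance estimate with n - 2 degrees of freedom.
residuals = [y - (beta0_hat + beta1_hat * x)
             for x, y in zip(pressure, boiling)]
sigma2_hat = sum(r ** 2 for r in residuals) / (n - 2)

print("intercept:", beta0_hat)
print("slope:", beta1_hat)
print("error variance:", sigma2_hat)
```

The whole data set is then summarized by the three printed numbers.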
But what about the data in Figure 1.2?
Figure 1.2: Other data with a fitted line.
This is also a simple linear model; the key difference is that the variance in the residuals here is much, much higher.
1.4 Beyond Simple Linearity
The definition of a linear model goes further than a straight line. More generally,
$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i$$
has p predictors and p + 1 parameters.
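The logic above extends to more than two groups. Let:
$$
x_1 = \begin{cases} 1 & \text{if group 2} \\ 0 & \text{otherwise} \end{cases}
\qquad
x_2 = \begin{cases} 1 & \text{if group 3} \\ 0 & \text{otherwise} \end{cases}
\qquad \dots \qquad
x_{k-1} = \begin{cases} 1 & \text{if group } k \\ 0 & \text{otherwise} \end{cases}
$$
where one group is chosen as a point of reference. Then, we get that
$$E[Y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{k-1} x_{k-1}$$
where group 1 has mean $\beta_0$ and group $j > 1$ has mean $\beta_0 + \beta_{j-1}$.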
Another way to consider k groups is through the cell mean model, which we express as the following:
$$E[Y] = \beta_1 \mathbf{1}\{x = 1\} + \beta_2 \mathbf{1}\{x = 2\} + \dots + \beta_k \mathbf{1}\{x = k\}$$
Note that the cell mean model has no intercept term.
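The two codings describe the same group means with different parameters. The sketch below uses made-up data for k = 3 groups and relies on the fact that least squares fits each group's sample mean under either coding. The helper names are hypothetical.

```python
# Hypothetical responses for k = 3 groups (illustrative numbers only).
groups = {1: [4.0, 5.0, 6.0], 2: [7.0, 9.0], 3: [1.0, 2.0, 3.0]}
k = len(groups)
means = {g: sum(v) / len(v) for g, v in groups.items()}

# Reference coding: beta_0 is the mean of group 1, and beta_{j-1} is the
# difference between the mean of group j and the mean of group 1.
beta_ref = [means[1]] + [means[j] - means[1] for j in range(2, k + 1)]

# Cell mean coding: beta_j is simply the mean of group j (no intercept).
beta_cell = [means[j] for j in range(1, k + 1)]

def row_reference(g):
    """Feature row (1, x_1, ..., x_{k-1}) for an observation in group g."""
    return [1.0] + [1.0 if g == j else 0.0 for j in range(2, k + 1)]

def row_cell(g):
    """Feature row (1{x=1}, ..., 1{x=k}) for an observation in group g."""
    return [1.0 if g == j else 0.0 for j in range(1, k + 1)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

for g in groups:
    print(g, dot(row_reference(g), beta_ref), dot(row_cell(g), beta_cell), means[g])
```

Each printed row shows the same expected value from both codings, equal to the group's sample mean.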
Suppose we are comparing a treatment and control group. It could be the case that both groups experience the same effect based on time, in which case the slopes of their two lines are the same (Figure 1.3a). But what if the two groups also have different slopes (Figure 1.3b)?
Figure 1.3: Two cases of different slopes, plotting E[y] against x for treatment (T) and control (C). (a) Same slopes can be dealt with using a dummy variable; the panel marks C's gain, T's gain, and T's overall gain. (b) Different slopes can be dealt with using interactions.
$$E[Y] = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 xz$$
where x is time and z is an indicator for treatment. This allows for the two groups to have different slopes.
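To see how the interaction term changes the slope, the sketch below evaluates the mean function for each group. The coefficient values are hypothetical, chosen only for illustration.

```python
def expected_y(x, z, beta):
    """Mean response with main effects for x and z plus their interaction."""
    b0, b1, b2, b3 = beta
    return b0 + b1 * x + b2 * z + b3 * x * z

# Hypothetical coefficients (beta_0, beta_1, beta_2, beta_3).
beta = (1.0, 0.5, 2.0, 0.25)

# Slope in x for each group: change in E[Y] when x goes from 0 to 1.
control_slope = expected_y(1, 0, beta) - expected_y(0, 0, beta)  # beta_1
treated_slope = expected_y(1, 1, beta) - expected_y(0, 1, beta)  # beta_1 + beta_3

# Intercept for each group: E[Y] at x = 0.
control_intercept = expected_y(0, 0, beta)  # beta_0
treated_intercept = expected_y(0, 1, beta)  # beta_0 + beta_2

print(control_intercept, control_slope)
print(treated_intercept, treated_slope)
```

With the interaction coefficient set to zero, the two slopes coincide and the model reduces to the dummy-variable case of parallel lines.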
There may be cases where the slope of the line changes at a certain point. For instance, the performance of an average human kidney begins to decline at age 40. See Figure 1.5b. How can we express these kinds of situations?
$$
E[Y] = \beta_0 + \beta_1 x + \beta_2 (x - t)_+
\quad \text{where} \quad
z_+ = \max(0, z) = \begin{cases} z & z \ge 0 \\ 0 & z < 0 \end{cases}
$$
Figure 1.4: Visualization of z_+, plotted against z.
Figure 1.5: Examples of two-phase regression models. (a) Accounting for sudden increase at a known t_0, with mean β_0 + β_1 t + β_2 (t − t_0)_+. (b) Accounting for decline in kidney performance at 40, with mean β_0 + β_1 (t − 40)_+.
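A small sketch of the two-phase mean function built from the positive-part function z_+. The coefficients and the change point are hypothetical, loosely modeled on the kidney example.

```python
def positive_part(z):
    """z_+ = max(0, z)."""
    return max(0.0, z)

def two_phase_mean(x, t, beta):
    """Mean response whose slope changes from b1 to b1 + b2 at x = t."""
    b0, b1, b2 = beta
    return b0 + b1 * x + b2 * positive_part(x - t)

# Hypothetical values: flat performance until age 40, then a decline.
beta = (100.0, 0.0, -1.5)
change_point = 40.0

for age in (30.0, 40.0, 50.0, 60.0):
    print(age, two_phase_mean(age, change_point, beta))
```

The function is continuous at the change point because the added term is zero there. Only the slope changes.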
What about cyclical data, such as calendar time? December 31 is not that distinct from January
This can approximate piecewise functions and may be preferable to methods such as (excessive) polynomial regression.
Figure 1.7: Example of multiphase regression.
1.5 Concluding Remarks
Despite these models’ differences, the mathematics underlying them is all linear and practically the same.
Then what are examples of non-linear models?
$$1 - e^{-\beta_1 t_i}$$
$$\sum_{j=1}^{k} \beta_j \, e^{-\frac{1}{2} \|x_i - \mu_j\|^2}$$
is almost linear but has a small Gaussian bump in the middle.
Last time, we discussed the big picture of applied statistics, with focus on the linear model.
Next, we will delve into more of the basic math, probability, computations, and geometry of the linear model. These components will be similar or the same across all forms of the model. We will then explore the actual ideas behind statistics; there, things will differ model by model.
There are six overall tasks we would like to perform:
Before anything, let’s set up some notation.
2.1 Linear Model Notation
$$X_i \in \mathbb{R}^d, \qquad Y_i \in \mathbb{R}$$
$$Y_i = \sum_{j=1}^{p} Z_{ij} \beta_j + \varepsilon_i \quad \text{where } Z_{ij} = j\text{th function of } X_i$$
We also call these functions features. Note that the dimension of X (d) does not have to equal the number of features (p). For example,
$$Z_i = \begin{pmatrix} 1 & x_{i1} & \cdots & x_{id} \end{pmatrix}$$
$$Z_i = \begin{pmatrix} 1 & x_{i1} & x_{i2} & x_{i1}^2 & x_{i2}^2 \end{pmatrix} \quad \text{(quadratic regression)}$$
In the first case, p = d + 1. In the second, p = 5 and d = 2. Features are not the same as variables.
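The distinction between variables and features can be made concrete with two feature maps. This is only a sketch; the function names and the input points are hypothetical.

```python
def linear_features(x):
    """Intercept plus the d raw variables: p = d + 1 features."""
    return [1.0] + list(x)

def quadratic_features(x):
    """Intercept, both variables, and their squares: p = 5 features for d = 2."""
    x1, x2 = x
    return [1.0, x1, x2, x1 ** 2, x2 ** 2]

# Hypothetical observations X_i in R^2 (so d = 2).
X = [(1.0, 2.0), (3.0, -1.0), (0.5, 4.0)]

Z_linear = [linear_features(xi) for xi in X]        # n rows, 3 columns
Z_quadratic = [quadratic_features(xi) for xi in X]  # n rows, 5 columns

for row in Z_quadratic:
    print(row)
```

Both models are linear in the coefficients even though the second is quadratic in the variables.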
Chapter 2: Setting Up the Linear Model 2.4: Math Review
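$$Y_i = \sum_{j=1}^{p} Z_{ij} \beta_j + \varepsilon_i \quad \text{where } \varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$$
Note that here the i.i.d. notation for $\varepsilon_i$ is interchangeable with “ind.”, since independent normal errors with a common mean and variance are identically distributed. However, both of these are distinct from $\varepsilon_i \overset{\text{ind.}}{\sim} N(0, \sigma_i^2)$ (where each error term has its own variance) and $\varepsilon_i \overset{\text{ind.}}{\sim} (0, \sigma^2)$ (where the distribution is unknown).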
Second,
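$$
Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
\qquad
Z = \begin{pmatrix} z_{11} & \cdots & z_{1p} \\ \vdots & & \vdots \\ z_{n1} & \cdots & z_{np} \end{pmatrix}
\qquad
\beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}
\qquad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}
$$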
Third,
$$Y = Z\beta + \varepsilon \quad \text{where } \varepsilon \sim N(0, \sigma^2 I)$$
We can also express this third version as $Y \sim N(Z\beta, \sigma^2 I)$.
Let’s dig deeper into the vector/matrix form of the regression model. But first, we require some probability and matrix algebra review.
2.4 Math Review
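Suppose that $X \in \mathbb{R}^{m \times n}$ is a random m × n matrix. Expected values on this matrix work component-wise:
$$
E[X] = E \begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & & \vdots \\ x_{m1} & \cdots & x_{mn} \end{pmatrix}
= \begin{pmatrix} E[x_{11}] & \cdots & E[x_{1n}] \\ \vdots & & \vdots \\ E[x_{m1}] & \cdots & E[x_{mn}] \end{pmatrix}
$$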
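Now, suppose there exist two non-random matrices A and B of conformable dimensions. Then, for the random matrix X with entries $x_{ij}$, linearity of expectation gives
$$E[AXB] = A \, E[X] \, B$$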
Say that $X \in \mathbb{R}^n$ and $Y \in \mathbb{R}^n$ are random column vectors. Then,
The variance matrix V(X) is positive semi-definite. That is, for a vector C,
$$0 \le V(C^T X) = C^T V(X) C$$
Indeed, V(X) is positive definite unless $V(C^T X) = 0$ for some $C \ne 0$.
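The same identity holds for sample variances, which allows a quick numerical check. The sketch below simulates hypothetical data whose third coordinate is an exact linear combination of the first two, so the sample variance matrix is positive semi-definite but not positive definite. All names and numbers are illustrative assumptions.

```python
import random

random.seed(1)
n = 500

# Hypothetical data: third coordinate equals 2 * second - 3 * first.
rows = []
for _ in range(n):
    a, b = random.gauss(0, 1), random.gauss(0, 1)
    rows.append([a, a + b, 2 * b - a])

def sample_cov(u, v):
    """Sample covariance of two equal-length sequences (divisor n - 1)."""
    u_bar, v_bar = sum(u) / len(u), sum(v) / len(v)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v)) / (len(u) - 1)

cols = list(zip(*rows))
d = len(cols)
V = [[sample_cov(cols[i], cols[j]) for j in range(d)] for i in range(d)]

def quad_form(c):
    """c^T V c for the sample variance matrix V."""
    return sum(c[i] * V[i][j] * c[j] for i in range(d) for j in range(d))

def variance_of_projection(c):
    """Sample variance of the scalar c^T x across observations."""
    proj = [sum(ci * xi for ci, xi in zip(c, row)) for row in rows]
    return sample_cov(proj, proj)

c = [1.0, -2.0, 0.5]       # a generic direction
c_null = [3.0, -2.0, 1.0]  # direction along which c^T x is identically zero

print(quad_form(c), variance_of_projection(c))
print(quad_form(c_null), variance_of_projection(c_null))
```

The first pair of numbers agrees and is positive. The second pair is zero up to rounding, which is the case where the variance matrix fails to be positive definite.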
Suppose that $A \in \mathbb{R}^{n \times n}$. Written explicitly, the quadratic form is a scalar value defined as
$$x^T A x = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j$$
We can assume symmetry for A; $A_{ij} = A_{ji}$. (If it is not symmetric, we can instead use $\frac{A + A^T}{2}$.)
This doesn’t look directly relevant to us. So why do we care about quadratic forms? Consider how we estimate variance. We know that
$$\hat{\sigma}^2 \propto \sum_{i=1}^{n} (y_i - \bar{y})^2 \quad \text{where} \quad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$
In matrix notation, this variance estimate can be written as:
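$$
\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}^{T}
\begin{pmatrix}
1 - \frac{1}{n} & -\frac{1}{n} & \cdots & -\frac{1}{n} \\
-\frac{1}{n} & 1 - \frac{1}{n} & \cdots & -\frac{1}{n} \\
\vdots & & \ddots & \vdots \\
-\frac{1}{n} & \cdots & \cdots & 1 - \frac{1}{n}
\end{pmatrix}
\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
= \sum_{i=1}^{n} y_i (y_i - \bar{y})
$$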
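As a numerical check of this matrix form, the sketch below builds the centering matrix (the identity minus 1/n in every entry) for a small made-up sample and compares the quadratic form with the two sums. The data values are hypothetical.

```python
# Hypothetical sample (illustrative numbers only).
y = [2.0, 4.0, 4.0, 6.0, 9.0]
n = len(y)
y_bar = sum(y) / n

# Centering matrix: 1 - 1/n on the diagonal, -1/n elsewhere.
M = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]

# Quadratic form y^T M y written as an explicit double sum.
quad = sum(y[i] * M[i][j] * y[j] for i in range(n) for j in range(n))

# The two equivalent sums.
sum_cross = sum(yi * (yi - y_bar) for yi in y)
sum_squares = sum((yi - y_bar) ** 2 for yi in y)

print(quad, sum_cross, sum_squares)
```

All three quantities agree, because the deviations from the mean sum to zero.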
(^1) This is true because
$$x^T A x = (x^T A x)^T = x^T A^T x = x^T \left( \tfrac{1}{2} A + \tfrac{1}{2} A^T \right) x$$
The first equality is true because the transpose of a scalar is itself. The last equality is from the fact that we are averaging two quantities, $x^T A x$ and $x^T A^T x$, which are themselves equal. Based on this, we know that only the symmetric part of A contributes to the quadratic form.
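A quick check of this claim on a small non-symmetric matrix. The matrix and vector below are hypothetical examples.

```python
def quadratic_form(A, x):
    """x^T A x written as an explicit double sum."""
    n = len(x)
    return sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

# Hypothetical non-symmetric matrix and vector.
A = [[1.0, 4.0],
     [0.0, 3.0]]
x = [2.0, -1.0]

# Symmetric part of A: (A + A^T) / 2.
n = len(A)
S = [[(A[i][j] + A[j][i]) / 2 for j in range(n)] for i in range(n)]

print(quadratic_form(A, x), quadratic_form(S, x))
```

Both quadratic forms give the same value, even though A and its symmetric part are different matrices.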