STATS 305 Notes1, Study notes of Calculus

These are class notes for STAT 305, beautifully scribed by Eric Min and made available online for students. The notes cover various topics including ANOVA, hypothesis testing, lab reliability, contrasts, multiple comparisons, simple regression, errors in variables, and random effects. The notes were taken during the Autumn 2013 semester at the Department of Statistics, Stanford University. detailed explanations, examples, and formulas for each topic.

Typology: Study notes

2012/2013

Uploaded on 05/11/2023

alpana
alpana 🇺🇸

4.9

(13)

249 documents

1 / 148

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
STATS 305 Notes1
Art Owen2
Autumn 2013
1The class notes were beautifully scribed by Eric Min. He has kindly allowed his notes to be placed online
for stat 305 students. Reading these at leasure, you will spot a few errors and omissions due to the hurried
nature of scribing and probably my handwriting too. Reading them ahead of class will help you understand
the material as the class proceeds.
2Department of Statistics, Stanford University.
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f
pf20
pf21
pf22
pf23
pf24
pf25
pf26
pf27
pf28
pf29
pf2a
pf2b
pf2c
pf2d
pf2e
pf2f
pf30
pf31
pf32
pf33
pf34
pf35
pf36
pf37
pf38
pf39
pf3a
pf3b
pf3c
pf3d
pf3e
pf3f
pf40
pf41
pf42
pf43
pf44
pf45
pf46
pf47
pf48
pf49
pf4a
pf4b
pf4c
pf4d
pf4e
pf4f
pf50
pf51
pf52
pf53
pf54
pf55
pf56
pf57
pf58
pf59
pf5a
pf5b
pf5c
pf5d
pf5e
pf5f
pf60
pf61
pf62
pf63
pf64

Partial preview of the text

Download STATS 305 Notes1 and more Study notes Calculus in PDF only on Docsity!

STATS 305 Notes^1

Art Owen^2

Autumn 2013

(^1) The class notes were beautifully scribed by Eric Min. He has kindly allowed his notes to be placed online for stat 305 students. Reading these at leasure, you will spot a few errors and omissions due to the hurried nature of scribing and probably my handwriting too. Reading them ahead of class will help you understand the material as the class proceeds. (^2) Department of Statistics, Stanford University.

0.0: Chapter 0:

  • 1 Overview
    • 1.1 The Math of Applied Statistics
    • 1.2 The Linear Model
      • 1.2.1 Other Extensions
    • 1.3 Linearity
    • 1.4 Beyond Simple Linearity
      • 1.4.1 Polynomial Regression
      • 1.4.2 Two Groups
      • 1.4.3 k Groups
      • 1.4.4 Different Slopes
      • 1.4.5 Two-Phase Regression
      • 1.4.6 Periodic Functions
      • 1.4.7 Haar Wavelets
      • 1.4.8 Multiphase Regression
    • 1.5 Concluding Remarks
  • 2 Setting Up the Linear Model
    • 2.1 Linear Model Notation
    • 2.2 Two Potential Models
      • 2.2.1 Regression Model
      • 2.2.2 Correlation Model
    • 2.3 TheLinear Model
    • 2.4 Math Review
      • 2.4.1 Quadratic Forms
  • 3 The Normal Distribution
    • 3.1 Friends of N (0, 1)
      • 3.1.1 χ
      • 3.1.2 t-distribution
      • 3.1.3 F -distribution
    • 3.2 The Multivariate Normal
      • 3.2.1 Linear Transformations
      • 3.2.2 Normal Quadratic Forms
      • 3.2.3 Rotation
      • 3.2.4 More on Independence
      • 3.2.5 Conditional Distributions
    • 3.3 Non-Central Distributions
      • 3.3.1 Non-Central χ 0.0: CONTENTS Chapter 0: CONTENTS
      • 3.3.2 Non-Central F Distribution
      • 3.3.3 Doubly Non-Central F Distribution
  • 4 Linear Least Squares
    • 4.1 The Best Estimator
      • 4.1.1 Calculus Approach
      • 4.1.2 Geometric Approach
      • 4.1.3 Distributional Theory of Least Squares
    • 4.2 The Hat Matrix and t-tests
    • 4.3 Distributional Results
      • 4.3.1 Distribution of ˆε
    • 4.4 Applications
      • 4.4.1 Another Approach to the t-statistic
    • 4.5 Examples of Non-Uniqueness
      • 4.5.1 The Dummy Variable Trap
      • 4.5.2 Correlated Data
    • 4.6 Extra Sum of Squares
    • 4.7 Gauss Markov Theorem
    • 4.8 Computing LS Solutions
      • 4.8.1 The SVD
    • 4.9 R^2 and ANOVA Decomposition
  • 5 Intro to Statistics
    • 5.1 Mean and Variance
    • 5.2 A Staircase
    • 5.3 Testing
      • 5.3.1 One Sample t-test
    • 5.4 Practical vs. Statistical Significance
    • 5.5 Confidence Intervals
  • 6 Power and Correlations
    • 6.1 Significance
      • 6.1.1 Finding Non-Significance
      • 6.1.2 Brief History of the t-test
    • 6.2 Power
      • 6.2.1 Power Example
    • 6.3 Variance Estimation
    • 6.4 Correlations in Yi
      • 6.4.1 Autoregressive Model
      • 6.4.2 Moving Average Model
      • 6.4.3 Non-Normality
      • 6.4.4 Outliers
  • 7 Two-Sample Tests
    • 7.1 Setup
    • 7.2 Welch’s t
    • 7.3 Permutation Tests
  • 8 k Groups Chapter 0: CONTENTS 0.0: CONTENTS
    • 8.1 ANOVA Revisited
      • 8.1.1 Cell Means Model
      • 8.1.2 Effects Model
    • 8.2 Hypothesis Testing
    • 8.3 Lab Reliability
    • 8.4 Contrasts
      • 8.4.1 Another Example
      • 8.4.2 The Most Sensitive Contrast
    • 8.5 Some Recap
      • 8.5.1 The “Cheat” Contrast
    • 8.6 Multiple Comparisons
      • 8.6.1 t-tests
      • 8.6.2 Fisher’s Least Significant Difference (LSD)
      • 8.6.3 Bonferroni’s Union Bound
      • 8.6.4 Tukey’s Standardized Range Test
      • 8.6.5 Scheff´e
      • 8.6.6 Benjamini and Hochberg
    • 8.7 Creating Groups
  • 9 Simple Regression
    • 9.1 Regression Fallacy
    • 9.2 The Linear Model
      • 9.2.1 The Easier Way
    • 9.3 Variance of Estimates
      • 9.3.1 βˆ
      • 9.3.2 βˆ
      • 9.3.3 βˆ 0 + βˆ 1 x
    • 9.4 Simultaneous Bands
    • 9.5 Calibration
    • 9.6 R^2 for x ∈ R
    • 9.7 Regression through the Origin
      • 9.7.1 The Futility of R
  • 10 Errors in Variables
    • 10.1 The Normal Case
  • 11 Random Effects
    • 11.1 The Model
    • 11.2 Estimation
    • 11.3 Two-Factor ANOVA
      • 11.3.1 Fixed × Fixed
      • 11.3.2 Random × Random
      • 11.3.3 Random × Fixed
      • 11.3.4 Other Remarks
  • 12 Multiple Regression
    • 12.1 Example: Body Fat
    • 12.2 Some Considerations 0.0: CONTENTS Chapter 0: CONTENTS
      • 12.2.1 A “True” βj
      • 12.2.2 Naive Face-Value Interpretation
      • 12.2.3 Wrong Signs
      • 12.2.4 Correlation, Association, Causality
  • 13 Interplay between Variables
    • 13.1 Simpson’s Paradox
      • 13.1.1 Hospitals
      • 13.1.2 Smoking and Cancer
    • 13.2 Competition and Collaboration
      • 13.2.1 Competition
      • 13.2.2 Collaboration
  • 14 Automatic Variable Selection
    • 14.1 Stepwise Methods
    • 14.2 Mallow’s Cp
    • 14.3 Least Squares Cross-Validation
      • 14.3.1 The Short Way
      • 14.3.2 Algebraic Aside
    • 14.4 Generalized CV
    • 14.5 AIC and BIC
      • 14.5.1 AIC vs. BIC
    • 14.6 Ridge Regression
      • 14.6.1 Optimization
      • 14.6.2 A Bayesian Connection
      • 14.6.3 An Alternative
    • 14.7 Principal Components
    • 14.8 L 1 Penalties
  • 15 Causality
    • 15.1 Regression Discontinuity Designs
  • 16 Violations of Assumptions
    • 16.1 Bias
      • 16.1.1 Detection
      • 16.1.2 Transformations
    • 16.2 Heteroskedasticity
      • 16.2.1 A Special Case
      • 16.2.2 Consequences
      • 16.2.3 Detection
    • 16.3 Non-Normality
      • 16.3.1 Detection
    • 16.4 Outliers
      • 16.4.1 Detection
      • 16.4.2 Potential Solutions
  • 17 Bootstrapped Regressions
    • 17.1 Bootstrapped Pairs Chapter 0: CONTENTS 0.0: CONTENTS
    • 17.2 Bootstrapped Residuals
      • 17.2.1 Unequal Variances
    • 17.3 Wild Bootstrap
    • 17.4 Weighted Likelihood Bootstrap
  • 18 Course Summary
    • 18.1 The Steps

0.0: CONTENTS Chapter 0: CONTENTS

1.3: Linearity Chapter 1: Overview

Proof.

E[(Y − m(x))^2 |X = x] =E[(Y − μ(x) + μ(x) − m(x))^2 |X = x] =E[(Y − μ(x))^2 |X = x]

  • 2E[(Y − μ(x))(μ(x) − m(x))|X = x]
  • (μ(x) − m(x))^2 =V(Y |X = x) + 2(0) + (μ(x) − m(x))^2 =V(Y |X = x) + Bias^2

This is the standard bias-variance trade-off. We cannot change variance. (We also assume that y has a finite variance.) However, our choice of m(x) can minimize the bias.

1.2.1 Other Extensions

The proof above is slightly unsatisfactory since it already “knows” the conclusion. Instead, we could take a first-order condition:

d dm E[(Y − m)^2 |X = x] = 0 ⇒ m(x) = E[Y |X = x]

This yields an extremum which must obviously be a minimum.

We were dealing with mean squared error above. What about absolute error? For that, E[(Y − m)|X = x] is minimized by using m = med(Y |X = x) (the median). In some cases, we may want V(Y |X = x), or a quantile like Q^0.^999 (Y |X = x).

1.3 Linearity

Suppose we have data on boiling points of water at different levels of air pressure.

We can fit a line through this data and toss out all other potential information.

yi = β 0 + β 1 xi + εi

By drawing a single line, we also assume that the residual errors are independent and have the same variance. That is,

ε ∼ (0, σ^2 )

Chapter 1: Overview 1.4: Beyond Simple Linearity

Boiling point

Air pressure

Boiling point

Air pressure

Figure 1.1: Sample data of boiling points at different air pressures, with and without linear fit.

Doing so overlooks the possibility that each point’s error may actually not be random, but the effect of other factors we do not analyze directly (such as whether the experiment was done in the sun, done indoors, used a faulty thermometer, etc.). Nonetheless, the linear model is powerful in that it summarizes the data using three values: β 0 , β 1 , εi.

But what about the data in Figure 1.2?

Figure 1.2: Other data with a fitted line.

This is also a simple linear model; the key difference is that the variance in the residuals here is much, much higher.

1.4 Beyond Simple Linearity

The definition of a linear model goes further than a straight line. More generally,

yi = β 0 + βixi 1 + ... + βpxip + εi has p predictors and p − 1 parameters

Chapter 1: Overview 1.4: Beyond Simple Linearity

1.4.3 k Groups

The logic above extends when having more than two groups. Let:

x 1 =

1 if group 2 0 otherwise

x 2 =

1 if group 3 0 otherwise

... xk− 1 =

1 if group k 0 otherwise

where one group is chosen as a point of reference. Then, we get that

E[Y ] = β 0 + β 1 x 1 + β 2 xx + ... + βk− 1 xk− 1

where group 1 has mean β 0 and group j > 1 has mean β 0 + βj− 1.

Another way to consider k groups is through the cell mean model, which we express as the following:

E[Y ] = β 11 {x = 1} + β 21 {x = 2} + ... + βk 1 {x = k}

Note that the cell mean model has no intercept term.

1.4.4 Different Slopes

Suppose we are comparing a treatment and control group. It could be the case that both groups experience the same effect based on time, in which case the slopes of their two lines are the same (Figure 1.3a). But what if the two groups also have different slopes (Figure 1.3b)?

𝐸[𝑦]

𝑥

T

C

T’s gain T’s overall gain

C’s gain

(a) Same slopes can be dealt with using a dummy variable.

𝐸[𝑦]

𝑥 (b) Different slopes can be dealt with using interac- tions.

Figure 1.3: Two cases of different slopes.

E[Y ] = β 0 + β 1 x + β 2 z + β 2 xz

1.4: Beyond Simple Linearity Chapter 1: Overview

where x is time and z is an indicator for treatment. This allows for the two groups to have different slopes.

1.4.5 Two-Phase Regression

There may be cases where the slope of the line changes at a certain point. For instance, the performance of an average human kidney begins to decline at age 40. See Figure 1.5b. How can we express these kinds of situations?

E[Y ] = β 0 + β 1 x + β 2 (x − t)+ + εi where z+ = max(0, z) =

z z ≥ 0 0 z < 0

𝑧+

𝑧

Figure 1.4: Visualization of z+.

𝑡 0

𝛽 0 + 𝛽 1 𝑡 + 𝛽 2 𝑡 − 𝑡 0 +

(a) Accounting for sudden increase at a known t 0.

40

𝛽 0 + 𝛽 1 𝑡 − (^40) +

(b) Accounting for decline in kidney performance at

Figure 1.5: Examples of two-phase regression models.

1.4.6 Periodic Functions

What about cyclical data, such as calendar time? December 31 is not that distinct from January

  1. How can we deal with an x that has a torus-like shape?

1.5: Concluding Remarks Chapter 1: Overview

This can approximate piecewise functions and may be preferable to methods such as (excessive) polynomial regression.

Figure 1.7: Example of multiphase regression.

1.5 Concluding Remarks

Despite these models’ differences, the mathematics underlying them is all linear and practically the same.

Then what are examples of non-linear models?

  • yi = β 0

1 − e−β^1 ti

  • εi is not linear because β 1 is in an exponent.
  • yi = β 0 + β 1 xi 1 + β 2 (xi − β 3 )+ + εi is not linear because of β 3.
  • yi +

∑k j=1 βj^ e

− 12 ||xi−μj ||^2 is almost linear but has a small Gaussian bump in the middle.

Chapter 2

Setting Up the Linear Model

Last time, we discussed the big picture of applied statistics, with focus on the linear model.

Next, we will delve into more of the basic math, probability, computations, and geometry of the linear model. These components will be similar or the same across all forms of the model. We will then explore the actual ideas behind statistics; there, things will differ model by model.

There are six overall tasks we would like to perform:

  1. Estimate β
  2. Test, e.g. β 7 = 0
  3. Predict yn+1 given xn+
  4. Estimate σ^2
  5. Check the model’s assumptions
  6. Make a choice among linear models

Before anything, let’s set up some notation.

2.1 Linear Model Notation

Xi ∈ Rd^ Yi ∈ R

Yi =

∑^ p

j=

Zij βj + εi where Zij = jth^ function of Xi

We also call these jth^ functions features. Note that the dimensions of X (d) do not have to be equal to the number of features (p). For example,

Zi =

[

1 xi 1 ... xip

]

Zi =

[

1 xi 1 xi 2 x^2 i 1 x^2 i 2

]

(quadratic regression)

In the first case, p = d + 1. In the second, p = 5 and d = 2. Features are not the same as variables.

Chapter 2: Setting Up the Linear Model 2.4: Math Review

Yi =

∑^ p

j=

Zij βj + εi where εi i.i.d. ∼ N (0, σ^2 )

Note that the i.i.d. notation for εi is interchangeable with “ind.” However, both of these are

distinct from εi i.i.d. ∼ N (0, σ^2 i ) (where each error term has its own variance) and εi ind. ∼ (0, σ^2 ) (where the distribution is unknown).

Second,

Y =

y 1 y 2 .. . yn

Z =

z 11 ... zip .. .

zn 1 ... znp

 β^ =

β 1 .. . βp

 ε^ =

ε 1 .. . εp

Third,

A = Zβ + ε where ε ∼ N (0, σ^2 I)

We can also express this third version as A ∼ N (Zβ, σ^2 I).

Let’s dig deeper into the vector/matrix form of the regression model. But first, we require some probability and matrix algebra review.

2.4 Math Review

Suppose that A ∈ Rm×n^ is a random m×n matrix. Expected values on this matrix work component- wise:

E

x 11 ... x 1 n .. .

xm 1 ... xmn

E[x 11 ] ... E[x 1 n] .. .

E[xm 1 ] ... E[xmn]

Now, suppose there exist two non-random matrices A and B. Then,

E[AX] = A E[X] E[XB] = E[X] B

Say that A ∈ Rn^ and A ∈ Rn^ are random column vectors. Then,

2.4: Math Review Chapter 2: Setting Up the Linear Model

  • Cov(X, Y ) ≡ E[(X − E[X])(Y − E[Y ])T^ ] ∈ Rn×m
  • Var(Y ) = Cov(Y, Y ) ∈ Rm×m
    • Variances are on the diagonal, and covariances are on the off-diagonal.
  • Cov(AX, BY ) = A Cov(X, Y ) BT
  • V(AX + b) = AV(X)AT^ (where A is random and b is non-random)

The variance matrix V(X) is positive semi-definite. That is, for a vector C,

0 ≤ V(CT^ X) = CT^ V(X)C

Indeed, V(X) is positive definite unless V(CT^ X) = 0 for some C 6 = 0.

2.4.1 Quadratic Forms

Suppose that A ∈ Rn×n. Written explicitly, the quadratic form is a scalar value defined as

xT^ Ax =

∑^ m

i=

∑^ n

j=

Aij xixj

We can assume symmetry for A; Aij = Aji. (If it is not symmetric, we can instead use A+A T

This doesn’t look directly relevant to us. So why do we care about quadratic forms? Consider how we estimate variance. We know that

σˆ^2 ∝

∑^ n

i=

(yi − Y )^2 where ¯y =

n

∑^ n

i=

yi

In matrix notation, this variance estimate can be written as:

  

y 1 .. . yn

T

1 − (^1) n − (^) n^1 ... − (^) n^1 − (^1) n 1 − (^) n^1

− (^1) n ... ... 1 − (^) n^1

y 1 .. . yn

∑^ n

i=

yi(yi − y¯)

(^1) This is true because xT^ Ax = (xT^ Ax)T^ = xT^ AT^ x = xT

( 1 2 A^ +

1 2 A

T

) x

The first equality is true because the transpose of a scalar is itself. The second equality is from the fact that we are averaging two quantities which are themselves equal. Based on this, we know that only the symmetric part of A contributes to the quadratic form.