Lecture Notes on Mathematical Statistics
Shuyang Ling
December 23, 2020

Contents

6.6.2 Inference in logistic regression

6.6.3 Hypothesis testing

6.6.4 Repeated observations - Binomial outcomes

6.6.5 General logistic regression

This draft of lecture notes is prepared for MATH-SHU 234 Mathematical Statistics, which I am teaching at NYU Shanghai. It covers the basics of mathematical statistics at the undergraduate level.

Chapter 1

Probability

1.1 Probability

Probability theory is the mathematical foundation of statistics. We will review the basic concepts of probability before we proceed to discuss mathematical statistics.

The core idea of probability theory is the study of randomness. Randomness is described by a random variable $X$, a function from the sample space to the real numbers. Each random variable $X$ is associated with a distribution function.

We define the cumulative distribution function (cdf) of X as:

$$F_X(x) = P(X \le x). \tag{1.1.1}$$

The cdf satisfies three properties:

  • $F_X(x)$ is non-decreasing.
  • $F_X(x)$ is right-continuous.
  • Limits at infinity: $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$.

A cdf uniquely determines the distribution of a random variable; it can be used to compute the probability that $X$ falls in a given range:

$$P(a < X \le b) = F_X(b) - F_X(a).$$
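As a small illustration of the interval formula, here is a sketch in Python. The fair six-sided die and the helper names `die_cdf` and `interval_probability` are assumptions made for this example and do not come from the notes.

```python
from fractions import Fraction
import math

def die_cdf(x) -> Fraction:
    """cdf F(x) = P(X <= x) of a fair six-sided die (illustrative example)."""
    k = min(max(math.floor(x), 0), 6)  # number of faces that are <= x
    return Fraction(k, 6)

def interval_probability(cdf, a, b) -> Fraction:
    """P(a < X <= b) = F(b) - F(a)."""
    return cdf(b) - cdf(a)

# Faces 3, 4 and 5 lie in the half-open interval (2, 5].
print(interval_probability(die_cdf, 2, 5))  # 1/2
```

Because the interval is open on the left, the face 2 is excluded, which is exactly what subtracting F(2) does.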

In many applications, we often encounter two important classes of random variables, discrete and continuous random variables.

We say $X$ is a discrete random variable if $X$ takes values in a countable set of numbers $\mathcal{X} = \{a_1, a_2, \dots, a_n, \dots\}$.

The probability of $X$ taking the value $a_i$ is given by

$$f_X(i) = p_i = P(X = a_i).$$

For a function $\varphi(x) : \mathcal{X} \to \mathbb{R}$, we have

$$E\,\varphi(X) = \int_{\mathbb{R}} \varphi(x) f_X(x)\,dx.$$

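The variance, as a measure of uncertainty, is

$$\operatorname{Var}(X) = E(X - E X)^2 = E\,\varphi(X), \qquad \varphi(x) = (x - E X)^2.$$

We sometimes use another form:

$$\operatorname{Var}(X) = E X^2 - (E X)^2.$$

Here $E X^2$ is referred to as the second moment. The $p$-th moment is defined as

$$E X^p = \int_{\mathbb{R}} x^p \, dF_X = \begin{cases} \sum_{i=1}^n p_i a_i^p, & \text{discrete}, \\ \int_{\mathbb{R}} x^p f_X(x)\,dx, & \text{continuous}. \end{cases}$$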
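The two variance formulas and the discrete moment formula can be checked exactly on a small example. The sketch below assumes a fair six-sided die ($a_i = i$, $p_i = 1/6$); the names `expectation` and `moment` are hypothetical helpers, not notation from the notes.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die, a_i = i and p_i = 1/6.
pmf = {a: Fraction(1, 6) for a in range(1, 7)}

def expectation(phi, pmf):
    """E phi(X) = sum_i p_i * phi(a_i) for a discrete random variable."""
    return sum(p * phi(a) for a, p in pmf.items())

def moment(order, pmf):
    """The moment E X^order of a discrete random variable."""
    return expectation(lambda x: x ** order, pmf)

mean = moment(1, pmf)
var_by_definition = expectation(lambda x: (x - mean) ** 2, pmf)
var_by_moments = moment(2, pmf) - mean ** 2

print(mean, var_by_definition, var_by_moments)  # 7/2 35/12 35/12
```

Exact rational arithmetic is used so that the two forms of the variance agree exactly rather than up to rounding.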

Independence: Independence is an important concept in probability. Two random variables $X$ and $Y$ are independent if

$$P(X \in A, Y \in B) = P(X \in A)\,P(Y \in B), \quad \forall A, B.$$

This is equivalent to

$$P(X \le x, Y \le y) = P(X \le x)\,P(Y \le y) = F_X(x) F_Y(y), \quad \forall x, y,$$

i.e., the joint cdf of $(X, Y)$ equals the product of its marginal cdfs.

If $X$ and $Y$ are independent, then $f(X)$ and $g(Y)$ are also independent for any two functions $f$ and $g$. As a result, we have

$$E f(X) g(Y) = E f(X) \, E g(Y).$$

Given a sequence of $n$ random variables $\{X_i\}_{i=1}^n$, they are independent if

$$P(X_i \le x_i, \ 1 \le i \le n) = \prod_{i=1}^n P(X_i \le x_i).$$

If $X_i$ is discrete or continuous, then independence can be characterized by using the pmf or pdf:

$$f_{X_1, \dots, X_n}(x_1, \dots, x_n) = \prod_{i=1}^n f_{X_i}(x_i).$$

The joint pdf/pmf is the product of the individual pdfs/pmfs (the marginal distributions).
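To make the product rule concrete, the following sketch builds a joint pmf as the product of two marginals and checks that $E f(X) g(Y) = E f(X)\,E g(Y)$. The two pmfs and the functions `f` and `g` are arbitrary choices made for illustration.

```python
from fractions import Fraction
from itertools import product

# Two hypothetical marginal pmfs.
pmf_x = {0: Fraction(1, 4), 1: Fraction(3, 4)}
pmf_y = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}

# Independence: the joint pmf is the product of the marginal pmfs.
joint = {(x, y): px * py
         for (x, px), (y, py) in product(pmf_x.items(), pmf_y.items())}

def f(x):
    return 2 * x + 1

def g(y):
    return y ** 2

# E f(X) g(Y) computed from the joint pmf.
lhs = sum(p * f(x) * g(y) for (x, y), p in joint.items())
# E f(X) * E g(Y) computed from the marginals.
rhs = (sum(p * f(x) for x, p in pmf_x.items())
       * sum(p * g(y) for y, p in pmf_y.items()))

print(lhs, rhs)  # 35/3 35/3
```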

In probability and statistics, we often study the sum of i.i.d. (independent and identically distributed) random variables $\sum_{i=1}^n X_i$.

Exercise: Denote by $Z_n = \sum_{i=1}^n X_i$ the sum of $n$ i.i.d. random variables, each with mean $\mu$ and variance $\sigma^2$. Show that $E Z_n = n\mu$ and $\operatorname{Var}(Z_n) = n\sigma^2$.

We will see more in the next few sections.

1.2 Important distributions

1.2.1 Uniform distribution

A random variable $X$ has the uniform distribution on $[a, b]$, denoted by $\text{Unif}[a, b]$, if its pdf satisfies

$$f_X(x) = \frac{1}{b - a}, \quad a \le x \le b.$$

Its cdf is

$$F_X(x) = \begin{cases} 0, & x \le a, \\ \frac{x - a}{b - a}, & a < x < b, \\ 1, & x \ge b. \end{cases}$$

Exercise: Show that E X = (a + b)/2 and Var(X) = (b − a)^2 / 12.

1.2.2 Normal distribution/Gaussian distribution

The normal distribution is the most important distribution in probability and statistics. It has extremely rich structure and connections with other distributions. A random variable $X$ is Gaussian with mean $\mu$ and variance $\sigma^2$, denoted by $N(\mu, \sigma^2)$, if its pdf is

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x - \mu)^2 / (2\sigma^2)}, \quad x \in \mathbb{R}.$$

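In particular, if $\mu = 0$ and $\sigma = 1$, we say $X$ is standard Gaussian. One can verify

$$\frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} e^{-x^2/2}\,dx = 1$$

by using the trick from multivariate calculus. Let’s verify $E X = 0$ and $\operatorname{Var}(X) = 1$.

$$E X = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} x e^{-x^2/2}\,dx = 0$$

since $x e^{-x^2/2}$ is an odd function. How about $E X^2$?

$$\begin{aligned} E X^2 &= \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} x^2 e^{-x^2/2}\,dx = -\frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} x \, d e^{-x^2/2} \\ &= -\frac{1}{\sqrt{2\pi}} \, x e^{-x^2/2} \Big|_{-\infty}^{\infty} + \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} e^{-x^2/2}\,dx = 1. \end{aligned}$$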

A Gaussian random variable is linearly invariant: if $X \sim N(\mu, \sigma^2)$, then $aX + b$ is still Gaussian with mean $a\mu + b$ and variance $a^2\sigma^2$, i.e., $aX + b \sim N(a\mu + b, a^2\sigma^2)$, where

$$E(aX + b) = a\mu + b, \quad \operatorname{Var}(aX + b) = \operatorname{Var}(aX) = a^2 \operatorname{Var}(X) = a^2\sigma^2.$$

Moreover, suppose $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$ are two independent random variables; then

$$X + Y \sim N(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2).$$

This can be extended to the sum of $n$ independent Gaussian random variables. For example,

$$\sum_{i=1}^n X_i \sim N(0, n)$$

if $X_i \sim N(0, 1)$ are i.i.d. random variables.
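A quick simulation can make these two facts plausible. The sketch below only compares sample means and variances with the stated parameters (it does not test Gaussianity); the constants `a`, `b`, `mu`, `sigma`, `n` and the number of trials are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is reproducible
trials = 20000

# Linear invariance: a*X + b with X ~ N(mu, sigma^2).
a, b, mu, sigma = 2.0, 1.0, 3.0, 0.5
linear = [a * rng.gauss(mu, sigma) + b for _ in range(trials)]

# Sum of n i.i.d. N(0, 1) random variables.
n = 4
sums = [sum(rng.gauss(0.0, 1.0) for _ in range(n)) for _ in range(trials)]

# Expect values close to a*mu + b = 7 and a^2 * sigma^2 = 1.
print(statistics.mean(linear), statistics.variance(linear))
# Expect values close to 0 and n = 4.
print(statistics.mean(sums), statistics.variance(sums))
```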

In particular, if $n$ is a positive integer, $\Gamma(n) = (n - 1)!$, and $\Gamma(1/2) = \sqrt{\pi}$.

The chi-squared distribution is closely connected to the normal distribution. Suppose $Z \sim N(0, 1)$. Now we take a look at $X = Z^2$: for $x \ge 0$,

$$P(X \le x) = P(Z^2 \le x) = P(-\sqrt{x} \le Z \le \sqrt{x}) = 2 P(0 \le Z \le \sqrt{x}) = \sqrt{\frac{2}{\pi}} \int_0^{\sqrt{x}} e^{-z^2/2}\,dz.$$

The pdf of $X$ is obtained by differentiating the cdf,

$$f_X(x) = \sqrt{\frac{2}{\pi}} \cdot \frac{1}{2\sqrt{x}} \cdot e^{-x/2} = \frac{1}{\sqrt{2\pi}} x^{-1/2} e^{-x/2}, \quad x > 0.$$
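The cdf and pdf of $X = Z^2$ derived above can be checked numerically. In this sketch the cdf is written with the error function, using $2P(0 \le Z \le \sqrt{x}) = \operatorname{erf}(\sqrt{x/2})$; the function names, the evaluation point and the step size are assumptions made for the example.

```python
import math
import random

def chi2_1_cdf(x: float) -> float:
    """P(Z^2 <= x) = 2 P(0 <= Z <= sqrt(x)) = erf(sqrt(x / 2)) for x >= 0."""
    return math.erf(math.sqrt(x / 2.0)) if x > 0 else 0.0

def chi2_1_pdf(x: float) -> float:
    """f_X(x) = x^(-1/2) e^(-x/2) / sqrt(2 pi) for x > 0."""
    return x ** -0.5 * math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)

# The pdf should match a central finite difference of the cdf.
x, h = 1.5, 1e-5
numeric_derivative = (chi2_1_cdf(x + h) - chi2_1_cdf(x - h)) / (2.0 * h)
print(numeric_derivative, chi2_1_pdf(x))

# Monte Carlo check of the cdf at x = 1 using squared standard normals.
rng = random.Random(0)
draws = [rng.gauss(0.0, 1.0) ** 2 for _ in range(20000)]
frequency = sum(d <= 1.0 for d in draws) / len(draws)
print(frequency, chi2_1_cdf(1.0))
```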

Now if $\{Z_i\}_{i=1}^n$ is a sequence of $n$ independent standard normal random variables, then

$$X = \sum_{i=1}^n Z_i^2 \sim \chi^2_n.$$

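The chi-squared distribution is a special case of the Gamma distribution $\Gamma(\alpha, \beta)$, whose pdf is

$$f_X(x; \alpha, \beta) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} x^{\alpha - 1} e^{-x/\beta}, \quad x > 0.$$

If $\beta = 2$ and $\alpha = n/2$, then $\Gamma(n/2, 2) = \chi^2_n$.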

Exercise: Show that $E e^{tX} = (1 - \beta t)^{-\alpha}$ for $t < 1/\beta$.

Exercise: Show that $\sum_{i=1}^n X_i \sim \Gamma\left(\sum_{i=1}^n \alpha_i, \beta\right)$ if $X_i \sim \Gamma(\alpha_i, \beta)$ are independent.

1.2.5 Exponential distribution

Exponential distribution: $X$ has an exponential distribution with parameter $\beta$, denoted by $\mathcal{E}(\beta)$, if its pdf is

$$f(x) = \beta^{-1} e^{-x/\beta}, \quad x \ge 0,$$

where $\beta > 0$.

  • The exponential distribution is used to model the waiting time of a certain event (lifetimes of electronic components).
  • The waiting time of a bus arriving at the station.

It is also a special case of the Gamma distribution, namely $\Gamma(1, \beta)$. The exponential distribution satisfies the so-called memoryless property:

$$P(X \ge t + s \mid X \ge t) = P(X \ge s), \quad \forall s, t \ge 0.$$

Recall that the left side involves conditional probability. For two events $A$ and $B$, the conditional probability of $A$ given $B$ is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

Here

$$P(X \ge t + s \mid X \ge t) = \frac{P(X \ge t + s, X \ge t)}{P(X \ge t)} = \frac{P(X \ge t + s)}{P(X \ge t)}$$

since $\{X \ge t + s\}$ is contained in $\{X \ge t\}$.
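A numerical spot check of the ratio above (not a proof) can be sketched as follows. It assumes the survival function $P(X \ge x) = e^{-x/\beta}$ for $x \ge 0$, which follows from integrating the pdf; the function names and the values of `t`, `s` and `beta` are illustrative.

```python
import math

def survival(x: float, beta: float) -> float:
    """P(X >= x) = exp(-x / beta) for X exponential with parameter beta, x >= 0."""
    return math.exp(-x / beta)

def conditional_survival(t: float, s: float, beta: float) -> float:
    """P(X >= t + s | X >= t) = P(X >= t + s) / P(X >= t)."""
    return survival(t + s, beta) / survival(t, beta)

t, s, beta = 2.0, 1.5, 3.0
# The two printed numbers should agree up to rounding.
print(conditional_survival(t, s, beta), survival(s, beta))
```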

Exercise: Verify the memoryless property and think about what it means.

Exercise: What is the distribution of $\sum_{i=1}^n X_i$ if $X_i \sim \mathcal{E}(\beta)$ are independent?

Exercise: Verify $E X = \beta$ and $\operatorname{Var}(X) = \beta^2$ for $X \sim \mathcal{E}(\beta)$.

1.2.6 Bernoulli distributions

Let X represent the outcome of a binary coin flip. Then its pmf is

P(X = 1) = p, P(X = 0) = 1 − p.

Sometimes, we also write the pmf in this way:

$$f_X(x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}.$$

The coin is fair if $p = 1/2$. The cdf is

$$F_X(x) = \begin{cases} 0, & x < 0, \\ 1 - p, & 0 \le x < 1, \\ 1, & x \ge 1. \end{cases}$$

In this case, we denote $X \sim \text{Bernoulli}(p)$. The mean and variance of $X$ are simple to obtain:

$$E(X) = 1 \cdot P(X = 1) + 0 \cdot P(X = 0) = p$$

and, since $X^2 = X$,

$$\operatorname{Var}(X) = E X^2 - (E X)^2 = E X - p^2 = p(1 - p).$$

1.2.7 Binomial distribution

Suppose we have a coin which falls heads up with probability p. Flip the coin n times and X is the number of heads. Each outcome is supposed to be independent.

If $X = k$, then there must be $k$ heads and $n - k$ tails:

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad \text{where } \binom{n}{k} = \frac{n!}{k!(n - k)!}.$$

Then its pmf is

$$f_X(k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k \in \{0, 1, \dots, n\}.$$

In this case, we denote $X \sim \text{Binomial}(n, p)$.
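The pmf translates directly into code. The sketch below uses `math.comb` for the binomial coefficient; the function name `binomial_pmf` and the choice $n = 4$, $p = 1/2$ are assumptions made for illustration.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """f_X(k) = C(n, k) p^k (1 - p)^(n - k) for k in {0, ..., n}."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 4, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(pmf)       # [0.0625, 0.25, 0.375, 0.25, 0.0625]
print(sum(pmf))  # 1.0
```

Summing the list is only a numerical sanity check that the probabilities add up to one for this particular $n$ and $p$.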

Exercise: Show that $\sum_{k=0}^n \binom{n}{k} p^k (1 - p)^{n - k} = 1$.

1.3 Limit theorems

1.3.1 Law of large numbers

Law of large numbers, along with central limit theorem (CLT), plays fundamental roles in statistical inference and hypothesis testing.

Theorem 1.3.1 (Weak law of large numbers). Let $X_i$ be a sequence of i.i.d. random variables with mean $\mu = E X_i$. Then, as $n \to \infty$,

$$\frac{1}{n} \sum_{i=1}^n X_i \xrightarrow{p} \mu,$$

i.e., the sample average converges to $\mu$ in probability.

We say a sequence of random variables $X_n$ converges to $X$ in probability if for any given $\epsilon > 0$,

$$\lim_{n \to \infty} P(|X_n - X| \ge \epsilon) = 0.$$

The law of large numbers basically says that the sample average converges to the expected value as the sample size grows to infinity.

We can prove the law of large numbers easily if we assume $X_i$ has a finite second moment. The proof relies on Chebyshev’s inequality.

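Theorem 1.3.2 (Chebyshev’s inequality). For a random variable $X$ with mean $\mu$, variance $\sigma^2$, and finite second moment, and for any $\epsilon > 0$,

$$P(|X - \mu| \ge \epsilon) \le \frac{E|X - \mu|^2}{\epsilon^2} = \frac{\sigma^2}{\epsilon^2}.$$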

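Proof of WLLN. Consider $\bar{X}_n = n^{-1} \sum_{i=1}^n X_i$. We aim to prove that $\bar{X}_n$ converges to $\mu$ in probability if $\operatorname{Var}(X_i) < \infty$. For $\epsilon > 0$, it holds that

$$P(|\bar{X}_n - \mu| \ge \epsilon) \le \frac{1}{\epsilon^2} E|\bar{X}_n - \mu|^2.$$

It suffices to compute the variance of $\bar{X}_n$:

$$\begin{aligned} E|\bar{X}_n - \mu|^2 &= E\left| n^{-1} \sum_{i=1}^n X_i - \mu \right|^2 = n^{-2} E\left[ \sum_{i=1}^n (X_i - \mu) \right]^2 \\ &= n^{-2} \cdot \sum_{i=1}^n E(X_i - \mu)^2 = n^{-2} \cdot n\sigma^2 = \frac{\sigma^2}{n}, \end{aligned}$$

where the cross terms $E(X_i - \mu)(X_j - \mu)$, $i \ne j$, vanish by independence. As a result,

$$P(|\bar{X}_n - \mu| \ge \epsilon) \le \frac{\sigma^2}{n\epsilon^2} \to 0, \quad n \to \infty.$$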
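The bound in the proof can be compared with simulated frequencies. This sketch assumes $X_i \sim \text{Unif}[0, 1]$, so that $\mu = 1/2$ and $\sigma^2 = 1/12$; the function names, $\epsilon = 0.1$ and the number of trials are illustrative choices.

```python
import random

def tail_frequency(n: int, eps: float, trials: int, rng: random.Random) -> float:
    """Fraction of trials with |sample mean - mu| >= eps for Unif[0, 1] samples."""
    mu = 0.5
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.random() for _ in range(n)) / n
        if abs(xbar - mu) >= eps:
            hits += 1
    return hits / trials

def chebyshev_bound(n: int, eps: float, sigma2: float = 1.0 / 12.0) -> float:
    """Upper bound sigma^2 / (n eps^2) from the proof of the WLLN."""
    return sigma2 / (n * eps ** 2)

rng = random.Random(0)
for n in (10, 100, 1000):
    # Simulated tail frequency next to its Chebyshev upper bound.
    print(n, tail_frequency(n, 0.1, 1000, rng), chebyshev_bound(n, 0.1))
```

The simulated frequencies are typically far below the bound, which is expected: Chebyshev’s inequality holds for every distribution with finite variance and is therefore conservative.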

1.3.2 Central limit theorem

Theorem 1.3.3 (Central limit theorem). Let $X_i$, $1 \le i \le n$, be a sequence of i.i.d. random variables with mean $\mu$ and finite variance $\sigma^2$; then

$$Z_n := \frac{\sum_{i=1}^n X_i - n\mu}{\sqrt{n\sigma^2}} \xrightarrow{d} N(0, 1),$$

i.e., convergence in distribution.

Sometimes, we also use

$$Z_n = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}.$$

We say a sequence of random variables $Z_n$ converges to $Z$ in distribution if

$$\lim_{n \to \infty} P(Z_n \le z) = P(Z \le z)$$

for any $z \in \mathbb{R}$ at which the cdf of $Z$ is continuous (every $z$ when $Z$ is Gaussian). In other words, the cdf of $Z_n$ converges to that of $Z$ pointwise,

$$\lim_{n \to \infty} F_{Z_n}(z) = F_Z(z).$$

What does it mean? For $Z \sim N(0, 1)$ and any $a \le b$,

$$\lim_{n \to \infty} P(a \le Z_n \le b) = P(a \le Z \le b) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-t^2/2}\,dt.$$
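The limit above can be illustrated by simulation. The sketch below assumes exponential $X_i$ with mean 1 and variance 1 (a deliberately non-Gaussian choice), $n = 50$, and the interval $[a, b] = [-1, 1]$; all of these, and the helper names, are assumptions made for the example.

```python
import math
import random

def standardized_sum(n: int, rng: random.Random) -> float:
    """Z_n = (sum X_i - n mu) / sqrt(n sigma^2) for X_i exponential with mu = sigma^2 = 1."""
    total = sum(rng.expovariate(1.0) for _ in range(n))
    return (total - n * 1.0) / math.sqrt(n * 1.0)

def normal_cdf(z: float) -> float:
    """cdf of N(0, 1), written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

rng = random.Random(0)
a, b = -1.0, 1.0
samples = [standardized_sum(50, rng) for _ in range(5000)]

empirical = sum(a <= z <= b for z in samples) / len(samples)
limit = normal_cdf(b) - normal_cdf(a)
# The simulated frequency should be close to the Gaussian limit (about 0.6827).
print(empirical, limit)
```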

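In statistics, one useful choice of $a$ and $b$ is

$$a = z_{\alpha/2}, \quad b = z_{1 - \alpha/2}.$$

Here $z_\alpha$ is defined as the $\alpha$-quantile of a standard normal random variable, i.e.,

$$P(Z \le z_\alpha) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z_\alpha} e^{-t^2/2}\,dt = \alpha, \quad 0 \le \alpha \le 1,$$

and by symmetry, we have $z_\alpha = -z_{1 - \alpha}$.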

In particular, $z_{0.975} \approx 1.96$.
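Quantiles of the standard normal distribution are available in the Python standard library through `statistics.NormalDist.inv_cdf`. The wrapper name `z` below is a hypothetical helper mirroring the notation $z_\alpha$; note that `inv_cdf` only accepts $0 < \alpha < 1$.

```python
from statistics import NormalDist

def z(alpha: float) -> float:
    """alpha-quantile of the standard normal distribution, for 0 < alpha < 1."""
    return NormalDist(mu=0.0, sigma=1.0).inv_cdf(alpha)

print(round(z(0.975), 2))   # 1.96
print(z(0.025) + z(0.975))  # approximately 0, by the symmetry z_alpha = -z_{1 - alpha}
```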

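In other words, when $n$ is sufficiently large, with probability approximately $1 - \alpha$, it holds that

$$z_{\alpha/2} \le \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \le z_{1 - \alpha/2} \iff |\bar{X}_n - \mu| \le \frac{\sigma z_{1 - \alpha/2}}{\sqrt{n}}, \qquad z_{\alpha/2} = -z_{1 - \alpha/2},$$

which implies that the “error” decays at the rate of $1/\sqrt{n}$.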

Theorem 1.3.4. Convergence in probability implies convergence in distribution.

Exercise: Show that E Zn = 0 and Var(Zn) = 1.

The proof of CLT relies on the moment generating function. We can show that the MGF of Zn converges to that of a standard normal. Here we provide a sketch of the proof.

Chapter 2

Introduction to statistics

2.1 Population

One core task of statistics is making inferences about an unknown parameter θ associated with a population. What is a population? In statistics, a population is the entire set of similar items we are interested in. For example, a population may refer to all the college students in Shanghai or all the residents in Shanghai. The choice of the population depends on the actual scientific problem.

Suppose we want to know the average height of all the college students in Shanghai or want to know the age distribution of residents in Shanghai. What should we do? Usually, a population is too large to deal with directly. Instead, we often draw samples from the population and then use the samples to estimate a population parameter θ such as mean, variance, median, or even the actual distribution.

This leads to several important questions in statistics:

  • How to design a proper sampling procedure to collect data? Statistical/Experimental design.
  • How to use the data to estimate a particular population parameter?
  • How to evaluate the quality of an estimator?

2.1.1 Important statistics

If we get a dataset, we usually compute the basic statistics to roughly describe the dataset.

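  • Sample mean/average: $\bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i$
  • Variance: $S_n^2 = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \bar{x}_n)^2$
  • Standard deviation: $S_n = \sqrt{\frac{1}{n - 1} \sum_{i=1}^n (x_i - \bar{x}_n)^2}$
  • Median: $\operatorname{median}(x_i) = x_{[(n+1)/2]}$, where the samples are sorted in increasing order and $[(n + 1)/2]$ means the closest integer to $(n + 1)/2$. More generally, the $\alpha$-quantile is $x_{[\alpha(n+1)]}$.
  • Range, max/min: $\text{Range} = \max_{1 \le i \le n} x_i - \min_{1 \le i \le n} x_i$.
  • Empirical cdf: $F_n(x) = \frac{1}{n} \sum_{i=1}^n 1\{x_i \le x\}$.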
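The statistics listed above can be computed in a few lines. This is a minimal sketch on a made-up dataset; the function names are hypothetical, and the median helper assumes an odd sample size so that $(n + 1)/2$ is already an integer (the tie for even $n$ is left open here).

```python
import math

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    """S_n^2 with the 1 / (n - 1) normalization."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def sample_median(xs):
    """x_[(n+1)/2] of the sorted sample; this sketch assumes n is odd."""
    ordered = sorted(xs)
    n = len(ordered)
    if n % 2 == 0:
        raise ValueError("this sketch only handles an odd sample size")
    return ordered[(n + 1) // 2 - 1]  # convert the 1-based index to 0-based

def empirical_cdf(xs, x):
    """F_n(x): fraction of samples that are <= x."""
    return sum(1 for xi in xs if xi <= x) / len(xs)

data = [2.0, 4.0, 4.0, 5.0, 10.0]  # made-up data for illustration
print(sample_mean(data))                 # 5.0
print(sample_variance(data))             # 9.0
print(math.sqrt(sample_variance(data)))  # 3.0
print(sample_median(data))               # 4.0
print(max(data) - min(data))             # 8.0
print(empirical_cdf(data, 4.0))          # 0.6
```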

Exercise: Show that $\bar{x}_n$ minimizes

$$f(z) = \sum_{i=1}^n (x_i - z)^2.$$

Exercise: Show that $\operatorname{median}(x_i)$ minimizes

$$f(z) = \sum_{i=1}^n |x_i - z|.$$

Is the global minimizer unique?

Notice that all the quantities above are based on the samples $\{x_1, \dots, x_n\}$. These quantities are called statistics.

Definition 2.1.1. A statistic is a deterministic function of the samples,

$$y = T(x_1, \dots, x_n),$$

which is used to estimate the value of a population parameter $\theta$.

Question: How to evaluate the quality of these estimators?

2.1.2 Probabilistic assumption

We assume the population has a probability distribution $F_X$ and each observed sample $x_i$ is a realization of a random variable $X_i$ obeying the population distribution $F_X$. A set of samples $\{x_1, \dots, x_n\}$ is treated as one realization of a random sequence $\{X_1, \dots, X_n\}$. From now on, we assume that all the random variables $X_i$ are i.i.d. (independent and identically distributed).

In other words, $T(x_1, \dots, x_n)$ is one realization of the random variable

$$\hat{\theta}_n = T(X_1, \dots, X_n),$$

which is a point estimator of $\theta$. We ask several questions:

  • Does $\hat{\theta}_n$ approximate the population parameter $\theta$ well?
  • How to evaluate the quality of the estimators?

Theorem 2.2.1 (Continuous mapping theorem). Suppose $g$ is a continuous function and $X_n \xrightarrow{p} X$; then $g(X_n) \xrightarrow{p} g(X)$. This also applies to convergence in distribution.

Remark: this is also true for random vectors. Suppose $X_n = (X_{n1}, \dots, X_{nd}) \in \mathbb{R}^d$ is a random vector and $X_n \xrightarrow{p} X$, i.e., for any $\epsilon > 0$,

$$\lim_{n \to \infty} P(\|X_n - X\| \ge \epsilon) = 0,$$

where $\|X_n - X\|$ denotes the Euclidean distance between $X_n$ and $X$. Then $g(X_n) \xrightarrow{p} g(X)$ for a continuous function $g$.

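This justifies $\bar{X}_n^2 \xrightarrow{p} \mu^2$. Now, we have

$$\lim_{n \to \infty} \frac{n}{n - 1} \left( n^{-1} \sum_{i=1}^n X_i^2 - \bar{X}_n^2 \right) = E X^2 - (E X)^2 = \sigma^2,$$

where the limit is in the sense of convergence in probability.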
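The expression inside the limit is the sample variance written with moments, and its convergence to $\sigma^2$ can be watched numerically. The sketch below assumes $X_i \sim \text{Unif}[0, 1]$, so $\sigma^2 = 1/12$; the function name `variance_via_moments` and the sample sizes are illustrative.

```python
import random
import statistics

def variance_via_moments(xs):
    """n / (n - 1) * (mean of squares - square of the mean)."""
    n = len(xs)
    mean_of_squares = sum(x * x for x in xs) / n
    mean = sum(xs) / n
    return n / (n - 1) * (mean_of_squares - mean ** 2)

rng = random.Random(0)
sigma2 = 1.0 / 12.0  # variance of Unif[0, 1]
for n in (10, 1000, 20000):
    xs = [rng.random() for _ in range(n)]
    # Both columns compute S_n^2; they should approach sigma2 as n grows.
    print(n, variance_via_moments(xs), statistics.variance(xs), sigma2)
```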

Exercise: Complete the proof to show that $S_n^2$ and $S_n$ are consistent estimators of $\sigma^2$ and $\sigma$, respectively.

Another commonly used quantity for evaluating the quality of an estimator is the MSE (mean squared error).

Definition 2.2.3 (MSE: mean squared error). The mean squared error is defined as

$$\operatorname{MSE}(\hat{\theta}_n) = E(\hat{\theta}_n - \theta)^2,$$

where the expectation is taken w.r.t. the joint distribution of $(X_1, \dots, X_n)$.

Recall that the pdf/pmf of $(X_1, \dots, X_n)$ is

$$f_{X_1, \dots, X_n}(x_1, \dots, x_n) = \prod_{i=1}^n f_{X_i}(x_i)$$

and the population parameter is associated with the actual distribution $f_X(x)$.

Note that by Chebyshev’s inequality, convergence in MSE implies convergence in probability:

$$P(|\hat{\theta}_n - \theta| \ge \epsilon) \le \frac{E(\hat{\theta}_n - \theta)^2}{\epsilon^2}.$$

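The MSE is closely related to the bias and variance of $\hat{\theta}_n$. In fact, we have the following famous bias-variance decomposition:

$$\operatorname{MSE}(\hat{\theta}_n) = \operatorname{bias}(\hat{\theta}_n)^2 + \operatorname{Var}(\hat{\theta}_n).$$

Proof: The proof is quite straightforward:

$$\begin{aligned} \operatorname{MSE}(\hat{\theta}_n) &= E(\hat{\theta}_n - \theta)^2 = E(\hat{\theta}_n - \mu + \mu - \theta)^2 \\ &= E(\hat{\theta}_n - \mu)^2 + 2 E(\hat{\theta}_n - \mu)(\mu - \theta) + (\mu - \theta)^2 \\ &= \underbrace{E(\hat{\theta}_n - \mu)^2}_{\operatorname{Var}(\hat{\theta}_n)} + \underbrace{(\mu - \theta)^2}_{\operatorname{bias}(\hat{\theta}_n)^2}, \end{aligned}$$

where $\mu = E \hat{\theta}_n$ and the second term equals 0 because $E(\hat{\theta}_n - \mu) = 0$.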
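The decomposition can be verified exactly on a small discrete model. The sketch below assumes $S \sim \text{Binomial}(n, \theta)$ and a deliberately biased, hypothetical estimator $\hat{\theta} = (S + 1)/(n + 2)$; neither the estimator nor the helper names come from the notes.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, theta):
    """P(S = k) for S ~ Binomial(n, theta)."""
    return comb(n, k) * theta ** k * (1 - theta) ** (n - k)

def summarize(estimator, n, theta):
    """Exact (MSE, bias, variance) of estimator(S) where S ~ Binomial(n, theta)."""
    pmf = [binomial_pmf(k, n, theta) for k in range(n + 1)]
    mean = sum(p * estimator(k) for k, p in enumerate(pmf))
    mse = sum(p * (estimator(k) - theta) ** 2 for k, p in enumerate(pmf))
    var = sum(p * (estimator(k) - mean) ** 2 for k, p in enumerate(pmf))
    return mse, mean - theta, var

n, theta = 10, Fraction(3, 10)

def shrunk(s):
    """Hypothetical biased estimator (S + 1) / (n + 2)."""
    return Fraction(s + 1, n + 2)

mse, bias, var = summarize(shrunk, n, theta)
print(mse, bias ** 2 + var)  # 113/7200 113/7200
```

Rational arithmetic makes the identity hold exactly, so the check does not depend on a rounding tolerance.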

Lemma 2.2.2. Convergence in MSE implies convergence in probability.

The proof of this lemma directly follows from Chebyshev’s inequality.

2.3 Confidence interval

All the aforementioned measures of estimators such as sample mean, variance, etc, are called point estimators. Can we provide an interval estimator for an unknown parameter? In other words, we are interested in finding a range of plausible values which contain an unknown parameter with reasonably large probability. This leads to the construction of confidence interval.

What is a confidence interval of θ?

Definition 2.3.1. A $1 - \alpha$ confidence interval for a parameter $\theta$ is an interval $C_n = (a, b)$, where $a = a(X_1, \dots, X_n)$ and $b = b(X_1, \dots, X_n)$ are two statistics of the data such that

$$P(\theta \in C_n) \ge 1 - \alpha,$$

i.e., the interval $(a, b)$ contains $\theta$ with probability at least $1 - \alpha$.

Note that we should not say that $\theta$ falls inside $(a, b)$ with probability $1 - \alpha$: the interval $C_n$ is random while $\theta$ is a fixed value.

Question: How to construct a confidence interval?

Let’s take a look at a simple yet important example. In fact, CLT is very useful in constructing a confidence interval for the mean.

We have shown that the sample mean $\bar{X}_n$ is a consistent estimator of the population mean $\mu$. By CLT, we have

$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} N(0, 1),$$

where $\sigma$ is the standard deviation.

For a sufficiently large $n$, CLT implies that

$$P\left( |\bar{X}_n - \mu| \le \frac{z_{1 - \alpha/2}\,\sigma}{\sqrt{n}} \right) \approx 1 - \alpha,$$

where $z_\alpha$ is the $\alpha$-quantile of the standard normal distribution.

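Note that

$$|\bar{X}_n - \mu| \le \frac{z_{1 - \alpha/2}\,\sigma}{\sqrt{n}} \iff \bar{X}_n - \frac{z_{1 - \alpha/2}\,\sigma}{\sqrt{n}} \le \mu \le \bar{X}_n + \frac{z_{1 - \alpha/2}\,\sigma}{\sqrt{n}}.$$

In other words, if $\sigma$ is known, the random interval

$$\left( \bar{X}_n - \frac{z_{1 - \alpha/2}\,\sigma}{\sqrt{n}}, \ \bar{X}_n + \frac{z_{1 - \alpha/2}\,\sigma}{\sqrt{n}} \right)$$

covers $\mu$ with probability approximately $1 - \alpha$.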
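The coverage claim can be explored by simulation. The following sketch builds the interval above with known $\sigma$ and counts how often it covers $\mu$; the Gaussian data, the parameter values and the function names are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist

def mean_confidence_interval(xs, sigma, alpha):
    """Interval xbar -/+ z_{1 - alpha/2} * sigma / sqrt(n), assuming sigma is known."""
    n = len(xs)
    xbar = sum(xs) / n
    half_width = NormalDist().inv_cdf(1.0 - alpha / 2.0) * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

def coverage(mu, sigma, n, alpha, trials, rng):
    """Fraction of simulated intervals that contain the true mean mu."""
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        lo, hi = mean_confidence_interval(xs, sigma, alpha)
        if lo <= mu <= hi:
            hits += 1
    return hits / trials

rng = random.Random(0)
# The printed frequency should be close to 1 - alpha = 0.95.
print(coverage(mu=5.0, sigma=2.0, n=30, alpha=0.05, trials=2000, rng=rng))
```

Here the randomness is in the interval, not in $\mu$: each simulated dataset produces a different interval, and roughly a fraction $1 - \alpha$ of them contain the fixed value $\mu$.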

A few remarks:

  • Suppose σ is known; then the confidence interval (CI) becomes narrower as the sample size n increases. A narrower interval is preferred since it means less uncertainty.