

























































































































































































6.6.2 Inference in logistic regression
6.6.3 Hypothesis testing
6.6.4 Repeated observations - Binomial outcomes
6.6.5 General logistic regression
This draft of lecture notes is prepared for MATH-SHU 234 Mathematical Statistics, which I am teaching at NYU Shanghai. It covers the basics of mathematical statistics at the undergraduate level.
Probability theory is the mathematical foundation of statistics. We will review the basic concepts of probability before we proceed to discuss mathematical statistics.
The core idea of probability theory is the study of randomness. Randomness is described by a random variable $X$, a function from the sample space to the real numbers. Each random variable $X$ is associated with a distribution function.
We define the cumulative distribution function (cdf) of $X$ as
$$F_X(x) = \mathbb{P}(X \le x). \tag{1.1.1}$$
The cdf satisfies three properties: $F_X$ is non-decreasing, $F_X$ is right-continuous, and
$$\lim_{x \to -\infty} F_X(x) = 0, \qquad \lim_{x \to \infty} F_X(x) = 1.$$
A cdf uniquely determines the distribution of a random variable; it can be used to compute the probability that $X$ falls in a given range:
$$\mathbb{P}(a < X \le b) = F_X(b) - F_X(a).$$
In many applications, we encounter two important classes of random variables: discrete and continuous random variables.
We say $X$ is a discrete random variable if $X$ takes values in a countable set of numbers $\mathcal{X} = \{a_1, a_2, \cdots, a_n, \cdots\}$.
The probability of $X$ taking the value $a_i$ is given by
$$f_X(i) = p_i = \mathbb{P}(X = a_i).$$
For a function $\varphi(x) : \mathcal{X} \to \mathbb{R}$, when $X$ is continuous with pdf $f_X$, we have
$$\mathbb{E}\,\varphi(X) = \int_{\mathbb{R}} \varphi(x) f_X(x)\,dx.$$
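The variance, as a measure of uncertainty, is
$$\operatorname{Var}(X) = \mathbb{E}(X - \mathbb{E}X)^2 = \mathbb{E}\,\varphi(X), \qquad \varphi(x) = (x - \mathbb{E}X)^2.$$
We sometimes use another form:
$$\operatorname{Var}(X) = \mathbb{E}X^2 - (\mathbb{E}X)^2.$$
Here $\mathbb{E}X^2$ is referred to as the second moment. The $p$-th moment is defined as
$$\mathbb{E}X^p = \int_{\mathbb{R}} x^p\,dF_X = \begin{cases} \sum_{i=1}^{n} p_i a_i^p, & \text{discrete}, \\ \int_{\mathbb{R}} x^p f_X(x)\,dx, & \text{continuous}. \end{cases}$$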
Independence: Independence is an important concept in probability. Two random variables $X$ and $Y$ are independent if
$$\mathbb{P}(X \in A, Y \in B) = \mathbb{P}(X \in A)\,\mathbb{P}(Y \in B), \quad \forall A, B.$$
This is equivalent to
$$\mathbb{P}(X \le x, Y \le y) = \mathbb{P}(X \le x)\,\mathbb{P}(Y \le y) = F_X(x) F_Y(y), \quad \forall x, y,$$
i.e., the joint cdf of $(X, Y)$ equals the product of its marginal cdfs.
Suppose $X$ and $Y$ are independent. Then $f(X)$ and $g(Y)$ are also independent for any two functions $f$ and $g$. As a result, we have
$$\mathbb{E}\,f(X)g(Y) = \mathbb{E}\,f(X)\,\mathbb{E}\,g(Y).$$
Given a sequence of $n$ random variables $\{X_i\}_{i=1}^n$, they are independent if
$$\mathbb{P}(X_i \le x_i,\ 1 \le i \le n) = \prod_{i=1}^{n} \mathbb{P}(X_i \le x_i).$$
If the $X_i$ are discrete or continuous, then independence can be characterized using the pmf or pdf:
$$f_{X_1,\cdots,X_n}(x_1, \cdots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i).$$
The joint pdf/pmf is the product of the individual (marginal) pdfs/pmfs.
In probability and statistics, we often study the sum of i.i.d. (independent and identically distributed) random variables $\sum_{i=1}^{n} X_i$.
Exercise: Denote by $Z_n = \sum_{i=1}^{n} X_i$ the sum of $n$ i.i.d. random variables with mean $\mu$ and variance $\sigma^2$. Show that $\mathbb{E}Z_n = n\mu$ and $\operatorname{Var}(Z_n) = n\sigma^2$.
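As a sanity check on the exercise (not a proof), the identities can be verified exactly for a small discrete example. The sketch below assumes a fair six-sided die as the common distribution; the helper names `convolve` and `mean_var` are illustrative and do not come from the notes.

```python
from fractions import Fraction

def convolve(pmf_a, pmf_b):
    """pmf of A + B for independent discrete A and B, given as {value: prob}."""
    out = {}
    for a, pa in pmf_a.items():
        for b, pb in pmf_b.items():
            out[a + b] = out.get(a + b, 0) + pa * pb
    return out

def mean_var(pmf):
    """Exact mean and variance of a discrete pmf."""
    m = sum(a * p for a, p in pmf.items())
    v = sum((a - m) ** 2 * p for a, p in pmf.items())
    return m, v

# Assumed example distribution: a fair die with values 1..6.
die = {k: Fraction(1, 6) for k in range(1, 7)}
mu, sigma2 = mean_var(die)

n = 4
z = die
for _ in range(n - 1):          # build the pmf of Z_n = X_1 + ... + X_n
    z = convolve(z, die)
mean_z, var_z = mean_var(z)

print(mean_z == n * mu, var_z == n * sigma2)
```

Exact rational arithmetic is used so that the comparison is not affected by floating-point rounding.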
We will see more in the next few sections.
1.2 Important distributions
A random variable $X$ has the uniform distribution on $[a, b]$, denoted by $\text{Unif}[a, b]$, if its pdf satisfies
$$f_X(x) = \frac{1}{b - a}, \quad a \le x \le b.$$
Its cdf is
$$F_X(x) = \begin{cases} 0, & x \le a, \\ \dfrac{x - a}{b - a}, & a < x < b, \\ 1, & x \ge b. \end{cases}$$
Exercise: Show that E X = (a + b)/2 and Var(X) = (b − a)^2 / 12.
The normal distribution is the most important distribution in probability and statistics. It has an extremely rich structure and many connections with other distributions. A random variable $X$ is Gaussian with mean $\mu$ and variance $\sigma^2$, denoted by $\mathcal{N}(\mu, \sigma^2)$, if its pdf is
$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/(2\sigma^2)}, \quad x \in \mathbb{R}.$$
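In particular, if $\mu = 0$ and $\sigma = 1$, we say $X$ is standard Gaussian. One can verify
$$\frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} e^{-x^2/2}\,dx = 1$$
by using the trick from multivariate calculus. Let’s verify $\mathbb{E}X = 0$ and $\operatorname{Var}(X) = 1$.
$$\mathbb{E}X = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} x e^{-x^2/2}\,dx = 0$$
since $x e^{-x^2/2}$ is an odd function. How about $\mathbb{E}X^2$?
$$\begin{aligned}
\mathbb{E}X^2 &= \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} x^2 e^{-x^2/2}\,dx \\
&= -\frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} x\,d e^{-x^2/2} \\
&= -\frac{1}{\sqrt{2\pi}}\, x e^{-x^2/2}\Big|_{-\infty}^{\infty} + \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} e^{-x^2/2}\,dx = 1.
\end{aligned}$$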
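The three integrals above can also be checked numerically. The following is a minimal sketch, assuming a composite trapezoidal rule on the truncated range $[-10, 10]$ (the tails beyond it are negligible); `integrate` and `std_normal_pdf` are illustrative helper names.

```python
import math

def std_normal_pdf(x):
    """pdf of N(0, 1)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def integrate(f, a=-10.0, b=10.0, steps=20000):
    """Composite trapezoidal rule on [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for k in range(1, steps):
        total += f(a + k * h)
    return total * h

total_mass = integrate(std_normal_pdf)                              # should be close to 1
first_moment = integrate(lambda x: x * std_normal_pdf(x))           # should be close to 0
second_moment = integrate(lambda x: x * x * std_normal_pdf(x))      # should be close to 1

print(total_mass, first_moment, second_moment)
```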
A Gaussian random variable is linearly invariant: suppose $X \sim \mathcal{N}(\mu, \sigma^2)$, then $aX + b$ is still Gaussian with mean $a\mu + b$ and variance $a^2\sigma^2$, i.e., $\mathcal{N}(a\mu + b, a^2\sigma^2)$:
$$\mathbb{E}(aX + b) = a\mu + b, \qquad \operatorname{Var}(aX + b) = \operatorname{Var}(aX) = a^2 \operatorname{Var}(X) = a^2\sigma^2.$$
Moreover, suppose $X \sim \mathcal{N}(\mu_X, \sigma_X^2)$ and $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$ are two independent random variables, then $X + Y \sim \mathcal{N}(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$.
This can be extended to the sum of $n$ independent Gaussian random variables. For example,
$$\sum_{i=1}^{n} X_i \sim \mathcal{N}(0, n)$$
if $X_i \sim \mathcal{N}(0, 1)$ are i.i.d. random variables.
In particular, $\Gamma(n) = (n - 1)!$ if $n$ is a positive integer, and $\Gamma(1/2) = \sqrt{\pi}$.
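The chi-squared distribution is closely connected to the normal distribution. Suppose $Z \sim \mathcal{N}(0, 1)$. Now we take a look at $X = Z^2$:
$$\begin{aligned}
\mathbb{P}(X \le x) &= \mathbb{P}(Z^2 \le x) = \mathbb{P}(-\sqrt{x} \le Z \le \sqrt{x}) = 2\,\mathbb{P}(0 \le Z \le \sqrt{x}) \\
&= \sqrt{\frac{2}{\pi}} \int_0^{\sqrt{x}} e^{-z^2/2}\,dz.
\end{aligned}$$
The pdf of $X$ is obtained by differentiating the cdf,
$$f_X(x) = \sqrt{\frac{2}{\pi}} \cdot \frac{1}{2\sqrt{x}} \cdot e^{-x/2} = \frac{1}{\sqrt{2\pi}}\, x^{-1/2} e^{-x/2}, \quad x > 0.$$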
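Now if $\{Z_i\}_{i=1}^n$ is a sequence of $n$ independent standard normal random variables, then
$$\sum_{i=1}^{n} Z_i^2 \sim \chi_n^2.$$
The chi-squared distribution is a special case of the Gamma distribution $\Gamma(\alpha, \beta)$, whose pdf is
$$f_X(x; \alpha, \beta) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\, x^{\alpha - 1} e^{-x/\beta}, \quad x > 0.$$
If $\beta = 2$ and $\alpha = n/2$, then $\Gamma(n/2, 2) = \chi_n^2$.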
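For $n = 1$ this identification can be checked directly: the pdf of $Z^2$ derived above should coincide with the $\Gamma(1/2, 2)$ pdf, because $\Gamma(1/2)\,2^{1/2} = \sqrt{2\pi}$. The sketch below compares the two formulas at a few arbitrarily chosen points; the function names are illustrative.

```python
import math

def gamma_pdf(x, alpha, beta):
    """Gamma(alpha, beta) pdf in the scale parametrization used in these notes."""
    return x ** (alpha - 1) * math.exp(-x / beta) / (math.gamma(alpha) * beta ** alpha)

def chi2_1_pdf(x):
    """pdf of X = Z^2 with Z ~ N(0, 1), as derived above."""
    return x ** -0.5 * math.exp(-x / 2) / math.sqrt(2 * math.pi)

for x in (0.1, 1.0, 3.5):       # arbitrary test points
    print(x, gamma_pdf(x, 0.5, 2.0), chi2_1_pdf(x))
```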
Exercise: Show that $\mathbb{E}\,e^{tX} = (1 - \beta t)^{-\alpha}$ for $t < 1/\beta$.
Exercise: Show that $\sum_{i=1}^{n} X_i \sim \Gamma\left(\sum_{i=1}^{n} \alpha_i, \beta\right)$ if $X_i \sim \Gamma(\alpha_i, \beta)$ are independent.
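Exponential distribution: $X$ has an exponential distribution with parameter $\beta$, denoted by $\mathcal{E}(\beta)$, if its pdf is
$$f(x) = \beta^{-1} e^{-x/\beta}, \quad x \ge 0,$$
where $\beta > 0$.
It is also a special case of the Gamma distribution, namely $\Gamma(1, \beta)$. The exponential distribution satisfies the so-called memoryless property:
$$\mathbb{P}(X \ge t + s \mid X \ge t) = \mathbb{P}(X \ge s), \quad \forall s, t \ge 0.$$
Recall that the left side involves conditional probability. For two events $A$ and $B$ with $\mathbb{P}(B) > 0$, the conditional probability of $A$ given $B$ is
$$\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}.$$
Here
$$\mathbb{P}(X \ge t + s \mid X \ge t) = \frac{\mathbb{P}(X \ge t + s, X \ge t)}{\mathbb{P}(X \ge t)} = \frac{\mathbb{P}(X \ge t + s)}{\mathbb{P}(X \ge t)}$$
since $\{X \ge t + s\}$ is contained in $\{X \ge t\}$.
Exercise: Verify the memoryless property and think about what it means.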
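A quick numerical illustration of the memoryless property, using the survival function $\mathbb{P}(X \ge t) = e^{-t/\beta}$ that follows from integrating the pdf above. The values of $\beta$, $t$ and $s$ below are arbitrary choices for the sketch.

```python
import math

def survival(t, beta):
    """P(X >= t) for X ~ E(beta) with pdf beta^{-1} exp(-x / beta), t >= 0."""
    return math.exp(-t / beta)

def conditional_survival(t, s, beta):
    """P(X >= t + s | X >= t) = P(X >= t + s) / P(X >= t)."""
    return survival(t + s, beta) / survival(t, beta)

beta, t, s = 2.0, 1.5, 0.7      # arbitrary illustrative values
print(conditional_survival(t, s, beta), survival(s, beta))
```

The two printed numbers agree up to rounding: having already waited $t$ units does not change the distribution of the remaining waiting time.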
Exercise: What is the distribution of $\sum_{i=1}^{n} X_i$ if $X_i \sim \mathcal{E}(\beta)$?
Exercise: Verify E X = β and Var(X) = β^2 for X ∼ E(β).
Let $X$ represent the outcome of a binary coin flip. Then its pmf is
$$\mathbb{P}(X = 1) = p, \qquad \mathbb{P}(X = 0) = 1 - p.$$
Sometimes, we also write the pmf in this way:
$$f_X(x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}.$$
The coin is fair if $p = 1/2$. The cdf is
$$F_X(x) = \begin{cases} 0, & x < 0, \\ 1 - p, & 0 \le x < 1, \\ 1, & x \ge 1. \end{cases}$$
In this case, we denote $X \sim \text{Bernoulli}(p)$. The mean and variance of $X$ are simple to obtain:
$$\mathbb{E}(X) = 1 \cdot \mathbb{P}(X = 1) + 0 \cdot \mathbb{P}(X = 0) = p$$
and
$$\operatorname{Var}(X) = \mathbb{E}X^2 - (\mathbb{E}X)^2 = \mathbb{E}X - p^2 = p(1 - p).$$
Suppose we have a coin which falls heads up with probability $p$. Flip the coin $n$ times and let $X$ be the number of heads. The flips are assumed to be independent.
If $X = k$, then there must be $k$ heads and $n - k$ tails:
$$\mathbb{P}(X = k) = \binom{n}{k} p^k (1 - p)^{n - k},$$
where
$$\binom{n}{k} = \frac{n!}{k!\,(n - k)!}.$$
Then its pmf is
$$f_X(k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k \in \{0, 1, \cdots, n\}.$$
In this case, we denote $X \sim \text{Binomial}(n, p)$.
Exercise: Show $\sum_{k=0}^{n} \binom{n}{k} p^k (1 - p)^{n - k} = 1$.
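The normalization in the exercise can be checked exactly for particular values before proving it in general. This sketch assumes $n = 10$ and $p = 3/10$ purely for illustration, and also evaluates the mean, which equals $np$ for a binomial random variable.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, Fraction(3, 10)      # illustrative values, not from the notes
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                  # exact, thanks to Fraction
mean = sum(k * q for k, q in enumerate(pmf))

print(total, mean)
```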
1.3 Limiting theorem
The law of large numbers, along with the central limit theorem (CLT), plays a fundamental role in statistical inference and hypothesis testing.
Theorem 1.3.1 (Weak law of large numbers). Let $X_i$ be a sequence of i.i.d. random variables. Then
$$\frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{p} \mu, \quad n \to \infty,$$
i.e., convergence in probability, where $\mu = \mathbb{E}X_i$.
We say a sequence of random variables $X_n$ converges to $X$ in probability if for any given $\varepsilon > 0$,
$$\lim_{n \to \infty} \mathbb{P}(|X_n - X| \ge \varepsilon) = 0.$$
The law of large numbers basically says that the sample average converges to the expected value as the sample size grows to infinity.
We can prove the law of large numbers easily if we assume $X_i$ has a finite second moment. The proof relies on Chebyshev’s inequality.
Theorem 1.3.2 (Chebyshev’s inequality). For a random variable $X$ with mean $\mu$, variance $\sigma^2$, and finite second moment, and for any $\varepsilon > 0$,
$$\mathbb{P}(|X - \mu| \ge \varepsilon) \le \frac{\mathbb{E}|X - \mu|^2}{\varepsilon^2} = \frac{\sigma^2}{\varepsilon^2}.$$
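Proof of WLLN. Consider $\bar{X}_n = n^{-1} \sum_{i=1}^{n} X_i$. We aim to prove that $\bar{X}_n$ converges to $\mu$ in probability if $\operatorname{Var}(X_i) < \infty$. For $\varepsilon > 0$, Chebyshev’s inequality gives
$$\mathbb{P}(|\bar{X}_n - \mu| \ge \varepsilon) \le \frac{\mathbb{E}|\bar{X}_n - \mu|^2}{\varepsilon^2}.$$
It suffices to compute the variance of $\bar{X}_n$:
$$\begin{aligned}
\mathbb{E}|\bar{X}_n - \mu|^2 &= \mathbb{E}\left| n^{-1} \sum_{i=1}^{n} X_i - \mu \right|^2 \\
&= n^{-2}\, \mathbb{E}\left[ \sum_{i=1}^{n} (X_i - \mu) \right]^2 \\
&= n^{-2} \cdot \sum_{i=1}^{n} \mathbb{E}(X_i - \mu)^2 \\
&= n^{-2} \cdot n\sigma^2 = \frac{\sigma^2}{n},
\end{aligned}$$
where the cross terms $\mathbb{E}(X_i - \mu)(X_j - \mu)$, $i \ne j$, vanish by independence. As a result,
$$\mathbb{P}(|\bar{X}_n - \mu| \ge \varepsilon) \le \frac{\sigma^2}{n\varepsilon^2} \to 0, \quad n \to \infty.$$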
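The bound in the proof can be compared with a simulation. The sketch below assumes $X_i \sim \text{Unif}[0, 1]$ (so $\mu = 1/2$ and $\sigma^2 = 1/12$) and estimates $\mathbb{P}(|\bar{X}_n - \mu| \ge \varepsilon)$ by Monte Carlo; the sample size, $\varepsilon$, number of trials and seed are arbitrary choices.

```python
import random

def tail_frequency(n, eps, trials=2000, seed=0):
    """Monte Carlo estimate of P(|Xbar_n - mu| >= eps) for Unif[0, 1] samples."""
    rng = random.Random(seed)
    mu = 0.5
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.random() for _ in range(n)) / n
        if abs(xbar - mu) >= eps:
            hits += 1
    return hits / trials

def chebyshev_bound(n, eps, sigma2=1 / 12):
    """Upper bound sigma^2 / (n * eps^2) from the proof of the WLLN."""
    return sigma2 / (n * eps ** 2)

for n in (10, 100, 1000):
    print(n, tail_frequency(n, 0.1), chebyshev_bound(n, 0.1))
```

The empirical frequency is typically far below the Chebyshev bound: the bound is crude, but it is enough to establish convergence in probability.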
Theorem 1.3.3 (Central limit theorem). Let $X_i$, $1 \le i \le n$, be a sequence of i.i.d. random variables with mean $\mu$ and finite variance $\sigma^2$. Then
$$Z_n := \frac{\sum_{i=1}^{n} X_i - n\mu}{\sqrt{n\sigma^2}} \xrightarrow{d} \mathcal{N}(0, 1),$$
i.e., convergence in distribution.
Sometimes, we also use
$$Z_n = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}.$$
We say a sequence of random variables $Z_n$ converges to $Z$ in distribution if
$$\lim_{n \to \infty} \mathbb{P}(Z_n \le z) = \mathbb{P}(Z \le z)$$
for any $z \in \mathbb{R}$ at which the cdf of $Z$ is continuous. In other words, the cdf of $Z_n$ converges to that of $Z$ pointwise,
$$\lim_{n \to \infty} F_{Z_n}(z) = F_Z(z).$$
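What does it mean?
$$\lim_{n \to \infty} \mathbb{P}(a \le Z_n \le b) = \mathbb{P}(a \le Z \le b) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-t^2/2}\,dt.$$
In statistics, one useful choice of $a$ and $b$ is
$$a = z_{\alpha/2}, \qquad b = z_{1-\alpha/2}.$$
Here $z_\alpha$ is defined as the $\alpha$-quantile of the standard normal random variable, i.e.,
$$\mathbb{P}(Z \le z_\alpha) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z_\alpha} e^{-t^2/2}\,dt = \alpha, \quad 0 \le \alpha \le 1,$$
and by symmetry, we have $z_\alpha = -z_{1-\alpha}$.
In particular, $z_{0.975} \approx 1.96$.
In other words, when $n$ is sufficiently large, with probability approximately $1 - \alpha$, it holds that
$$z_{\alpha/2} \le \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \le z_{1-\alpha/2} \iff |\bar{X}_n - \mu| \le \frac{\sigma z_{1-\alpha/2}}{\sqrt{n}}, \qquad z_{\alpha/2} = -z_{1-\alpha/2},$$
which implies that the “error” decays at the rate of $1/\sqrt{n}$.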
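The quantiles and the $1/\sqrt{n}$ error rate can be computed with Python's standard library. This is a small sketch; `z_quantile` and `margin` are illustrative names, and $\sigma = 1$ is an assumed value.

```python
import math
from statistics import NormalDist

def z_quantile(alpha):
    """alpha-quantile of the standard normal distribution, 0 < alpha < 1."""
    return NormalDist().inv_cdf(alpha)

def margin(sigma, n, alpha=0.05):
    """Half-width sigma * z_{1 - alpha/2} / sqrt(n) of the error bound above."""
    return sigma * z_quantile(1 - alpha / 2) / math.sqrt(n)

print(z_quantile(0.975))                 # approximately 1.96
print(z_quantile(0.025))                 # approximately -1.96, by symmetry
for n in (100, 400, 1600):
    print(n, margin(1.0, n))             # halves each time n is multiplied by 4
```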
Theorem 1.3.4. Convergence in probability implies convergence in distribution.
Exercise: Show that E Zn = 0 and Var(Zn) = 1.
The proof of CLT relies on the moment generating function. We can show that the MGF of Zn converges to that of a standard normal. Here we provide a sketch of the proof.
One core task of statistics is making inferences about an unknown parameter θ associated with a population. What is a population? In statistics, a population is the entire set of similar items we are interested in. For example, a population may refer to all the college students in Shanghai or all the residents in Shanghai. The choice of the population depends on the actual scientific problem.
Suppose we want to know the average height of all the college students in Shanghai or want to know the age distribution of residents in Shanghai. What should we do? Usually, a population is too large to deal with directly. Instead, we often draw samples from the population and then use the samples to estimate a population parameter θ such as mean, variance, median, or even the actual distribution.
This leads to several important questions in statistics:
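Given a dataset, we usually compute some basic statistics to roughly describe it.
Sample mean:
$$\bar{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Sample variance:
$$S_n^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2$$
Sample standard deviation:
$$S_n = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2}$$
Empirical cdf:
$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} 1\{x_i \le x\}.$$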
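These descriptive statistics are available in Python's standard `statistics` module, whose `variance` and `stdev` use the $n - 1$ denominator. The dataset below is made up for illustration, and `ecdf` is a hypothetical helper written for this sketch.

```python
import statistics

def ecdf(sample, x):
    """Empirical cdf F_n(x): fraction of observations that are <= x."""
    return sum(1 for xi in sample if xi <= x) / len(sample)

data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]    # made-up sample

xbar = statistics.mean(data)             # sample mean
s2 = statistics.variance(data)           # sample variance, 1/(n-1) normalization
s = statistics.stdev(data)               # sample standard deviation

print(xbar, s2, s, ecdf(data, 4.0))
```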
Exercise: Show that $\bar{x}_n$ minimizes
$$f(z) = \sum_{i=1}^{n} (x_i - z)^2.$$
Exercise: Show that $\text{median}(x_i)$ minimizes
$$f(z) = \sum_{i=1}^{n} |x_i - z|.$$
Is the global minimizer unique?
Notice that all the quantities above are based on the samples $\{x_1, \cdots, x_n\}$. These quantities are called statistics.
Definition 2.1.1. A statistic is a deterministic function of samples,
$$y = T(x_1, \cdots, x_n),$$
which is used to estimate the value of a population parameter θ.
Question: How to evaluate the quality of these estimators?
We assume the population has a probability distribution $F_X$ and each observed sample $x_i$ is a realization of a random variable $X_i$ obeying the population distribution $F_X$. A set of samples $\{x_1, \cdots, x_n\}$ is treated as one realization of a random sequence $\{X_1, \cdots, X_n\}$. From now on, we assume that all the random variables $X_i$ are i.i.d. (independent and identically distributed).
In other words, $T(x_1, \cdots, x_n)$ is one realization of the random variable
$$\hat{\theta}_n = T(X_1, \cdots, X_n),$$
which is a point estimator of θ. We ask several questions:
Theorem 2.2.1 (Continuous mapping theorem). Suppose $g$ is a continuous function and $X_n \xrightarrow{p} X$. Then $g(X_n) \xrightarrow{p} g(X)$. This also applies to convergence in distribution.
Remark: this is also true for random vectors. Suppose $X_n = (X_{n1}, \cdots, X_{nd}) \in \mathbb{R}^d$ is a random vector and $X_n \xrightarrow{p} X$, i.e., for any $\varepsilon > 0$,
$$\lim_{n \to \infty} \mathbb{P}(\|X_n - X\| \ge \varepsilon) = 0,$$
where $\|X_n - X\|$ denotes the Euclidean distance between $X_n$ and $X$. Then $g(X_n) \xrightarrow{p} g(X)$ for a continuous function $g$.
This justifies $\bar{X}_n^2 \xrightarrow{p} \mu^2$. Now, we have
$$\frac{n}{n - 1} \left( n^{-1} \sum_{i=1}^{n} X_i^2 - \bar{X}_n^2 \right) \xrightarrow{p} \mathbb{E}X^2 - (\mathbb{E}X)^2 = \sigma^2, \quad n \to \infty,$$
i.e., convergence in probability.
Exercise: Complete the proof to show that $S_n^2$ and $S_n$ are consistent estimators of $\sigma^2$ and $\sigma$.
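Another commonly used quantity for evaluating the quality of an estimator is the MSE (mean squared error).
Definition 2.2.3 (MSE: mean squared error). The mean squared error is defined as
$$\operatorname{MSE}(\hat{\theta}_n) = \mathbb{E}(\hat{\theta}_n - \theta)^2,$$
where the expectation is taken w.r.t. the joint distribution of $(X_1, \cdots, X_n)$.
Recall that the pdf/pmf for $(X_1, \cdots, X_n)$ is
$$f_{X_1,\cdots,X_n}(x_1, \cdots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i)$$
and the population parameter is associated with the actual distribution $f_X(x)$.
Note that by Chebyshev’s inequality, convergence in MSE implies convergence in probability:
$$\mathbb{P}(|\hat{\theta}_n - \theta| \ge \varepsilon) \le \frac{\mathbb{E}(\hat{\theta}_n - \theta)^2}{\varepsilon^2}.$$
The MSE is closely related to the bias and variance of $\hat{\theta}_n$. In fact, we have the following famous bias-variance decomposition:
$$\operatorname{MSE}(\hat{\theta}_n) = \operatorname{bias}(\hat{\theta}_n)^2 + \operatorname{Var}(\hat{\theta}_n).$$
Proof: The proof is quite straightforward:
$$\begin{aligned}
\operatorname{MSE}(\hat{\theta}_n) &= \mathbb{E}(\hat{\theta}_n - \theta)^2 = \mathbb{E}(\hat{\theta}_n - \mu + \mu - \theta)^2 \\
&= \mathbb{E}(\hat{\theta}_n - \mu)^2 + 2\,\mathbb{E}(\hat{\theta}_n - \mu)(\mu - \theta) + (\mu - \theta)^2 \\
&= \underbrace{\mathbb{E}(\hat{\theta}_n - \mu)^2}_{\operatorname{Var}(\hat{\theta}_n)} + \underbrace{(\mu - \theta)^2}_{\operatorname{bias}(\hat{\theta}_n)^2},
\end{aligned}$$
where $\mu = \mathbb{E}\hat{\theta}_n$, $\operatorname{bias}(\hat{\theta}_n) = \mu - \theta$, and the second term equals 0 since $\mathbb{E}(\hat{\theta}_n - \mu) = 0$.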
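The decomposition can be verified exactly on a small example. The sketch below assumes a hypothetical biased estimator $\hat{\theta} = (K + 1)/(n + 2)$ of a Bernoulli parameter $p$, where $K \sim \text{Binomial}(n, p)$ is the number of successes; this estimator and the values $n = 8$, $p = 1/4$ are chosen only for illustration and do not come from the notes.

```python
from fractions import Fraction
from math import comb

def estimator_moments(n, p, estimator):
    """Exact bias, variance and MSE of estimator(K, n) with K ~ Binomial(n, p)."""
    pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(estimator(k, n) * q for k, q in enumerate(pmf))
    var = sum((estimator(k, n) - mean) ** 2 * q for k, q in enumerate(pmf))
    mse = sum((estimator(k, n) - p) ** 2 * q for k, q in enumerate(pmf))
    return mean - p, var, mse

def shrunk(k, n):
    """Hypothetical biased estimator of p: (k + 1) / (n + 2)."""
    return Fraction(k + 1, n + 2)

bias, var, mse = estimator_moments(8, Fraction(1, 4), shrunk)
print(bias, var, mse, mse == bias ** 2 + var)
```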
Lemma 2.2.2. Convergence in MSE implies convergence in probability.
The proof of this lemma directly follows from Chebyshev’s inequality.
2.3 Confidence interval
All the aforementioned estimators, such as the sample mean and the sample variance, are called point estimators. Can we provide an interval estimator for an unknown parameter? In other words, we are interested in finding a range of plausible values which contains an unknown parameter with reasonably large probability. This leads to the construction of a confidence interval.
What is a confidence interval of θ?
Definition 2.3.1. A $1 - \alpha$ confidence interval for a parameter θ is an interval $C_n = (a, b)$ where $a = a(X_1, \cdots, X_n)$ and $b = b(X_1, \cdots, X_n)$ are two statistics of the data such that
$$\mathbb{P}(\theta \in C_n) \ge 1 - \alpha,$$
i.e., the interval $(a, b)$ contains θ with probability at least $1 - \alpha$.
Note that we cannot say the probability of θ falling inside (a, b) is 1 − α since Cn is random while θ is a fixed value.
Question: How to construct a confidence interval?
Let’s take a look at a simple yet important example. In fact, CLT is very useful in constructing a confidence interval for the mean.
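We have shown that the sample mean $\bar{X}_n$ is a consistent estimator of the population mean $\mu$. By the CLT, we have
$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1),$$
where $\sigma$ is the standard deviation.
For a sufficiently large $n$, the CLT implies that
$$\mathbb{P}\left( |\bar{X}_n - \mu| \le \frac{z_{1-\alpha/2}\,\sigma}{\sqrt{n}} \right) \approx 1 - \alpha,$$
where $z_\alpha$ is the $\alpha$-quantile of the standard normal distribution.
Note that
$$|\bar{X}_n - \mu| \le \frac{z_{1-\alpha/2}\,\sigma}{\sqrt{n}} \iff \bar{X}_n - \frac{z_{1-\alpha/2}\,\sigma}{\sqrt{n}} \le \mu \le \bar{X}_n + \frac{z_{1-\alpha/2}\,\sigma}{\sqrt{n}}.$$
In other words, if $\sigma$ is known, the random interval
$$\left( \bar{X}_n - \frac{z_{1-\alpha/2}\,\sigma}{\sqrt{n}},\ \bar{X}_n + \frac{z_{1-\alpha/2}\,\sigma}{\sqrt{n}} \right)$$
covers $\mu$ with probability approximately $1 - \alpha$.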
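The interval above translates directly into code. The following sketch assumes $\sigma$ is known and estimates the coverage by simulation with Gaussian data; the function names, the parameter values ($\mu = 1$, $\sigma = 2$, $n = 30$), the number of trials and the seed are all illustrative choices.

```python
import math
import random
from statistics import NormalDist, mean

def mean_confidence_interval(sample, sigma, alpha=0.05):
    """Return (lower, upper) = xbar -/+ z_{1 - alpha/2} * sigma / sqrt(n)."""
    n = len(sample)
    xbar = mean(sample)
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

def coverage(mu=1.0, sigma=2.0, n=30, alpha=0.05, trials=2000, seed=0):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = mean_confidence_interval(sample, sigma, alpha)
        hits += lower <= mu <= upper
    return hits / trials

print(mean_confidence_interval([1.0, 2.0, 3.0, 4.0], sigma=1.0))
print(coverage())    # expected to be close to 0.95
```

In each simulated experiment the interval is random while $\mu$ stays fixed, which matches the interpretation given after Definition 2.3.1.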
A few remarks: