introduction in mathematical statistics, Lecture notes of Mathematical Statistics

University of Swaziland Mathematical Statistics

mainly probability and stats mathematical formulas , statistical data chart etc

Typology: Lecture notes

2022/2023

Uploaded on 12/05/2022

nosmilo-mavimbela 🇸🇿

1 document


Lecture Notes on Mathematical Statistics

Shuyang Ling

December 23, 2020


6.6.2 Inference in logistic regression
6.6.3 Hypothesis testing
6.6.4 Repeated observations - Binomial outcomes
6.6.5 General logistic regression

These lecture notes are a draft prepared for MATH-SHU 234 Mathematical Statistics, which I am teaching at NYU Shanghai. They cover the basics of mathematical statistics at the undergraduate level.

Chapter 1 Probability

1.1 Probability

Probability theory is the mathematical foundation of statistics. We review the basic concepts of probability before we proceed to discuss mathematical statistics.

The core idea of probability theory is the study of randomness. Randomness is described by a random variable X, a function from the sample space to the real numbers. Each random variable X is associated with a distribution function.

We define the cumulative distribution function (cdf) of X as:

$$F_X(x) = P(X \le x). \tag{1.1.1}$$

The cdf satisfies three properties:

$F_X(x)$ is non-decreasing.
$F_X(x)$ is right-continuous.
Limits at infinity: $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$.

A cdf uniquely determines the distribution of a random variable; it can be used to compute the probability that X falls in a given range:

$$P(a < X \le b) = F_X(b) - F_X(a).$$

In many applications, we often encounter two important classes of random variables, discrete and continuous random variables.

We say X is a discrete random variable if X takes values in a countable set of numbers $\mathcal{X} = \{a_1, a_2, \cdots, a_n, \cdots\}$.

The probability of X taking the value $a_i$ is given by

$$f_X(i) = p_i = P(X = a_i).$$

For a function $\varphi(x) : \mathcal{X} \to \mathbb{R}$, we have

$$E\,\varphi(X) = \int_{\mathcal{X}} \varphi(x) f_X(x)\,dx.$$

The variance, as a measure of uncertainty, is

Var(X) = E(X − E X)^2 = E ϕ(X), ϕ(x) = (x − E X)^2.

We sometimes use another form

Var(X) = E X^2 − (E X)^2.
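As a quick sanity check, the two forms of the variance can be compared on a small discrete example. The sketch below assumes a fair six-sided die (an illustrative choice, not taken from the notes) and uses the discrete expectation, a sum of p_i ϕ(a_i) over the values a_i. The names `values`, `probs` and `expectation` are hypothetical.

```python
from fractions import Fraction

# Assumption for illustration: a fair six-sided die, P(X = a_i) = 1/6.
values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6

def expectation(phi):
    """E phi(X) = sum_i p_i * phi(a_i) for a discrete random variable."""
    return sum(p * phi(a) for a, p in zip(values, probs))

mean = expectation(lambda x: x)                       # E X
var_def = expectation(lambda x: (x - mean) ** 2)      # E (X - E X)^2
var_alt = expectation(lambda x: x ** 2) - mean ** 2   # E X^2 - (E X)^2

print(mean, var_def, var_alt)
```

Exact rational arithmetic is used so that the two expressions can be compared without rounding error.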

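Here $E X^2$ is referred to as the second moment. The p-th moment is defined as

$$E X^p = \int_{\mathbb{R}} x^p \, dF_X = \begin{cases} \sum_{i=1}^{n} p_i a_i^p, & \text{discrete}, \\ \int_{\mathbb{R}} x^p f_X(x)\,dx, & \text{continuous}. \end{cases}$$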

Independence: Independence is an important concept in probability. Two random variables X and Y are independent if

P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B), ∀A, B.

This is equivalent to

P(X ≤ x, Y ≤ y) = P(X ≤ x)P(Y ≤ y) = FX (x)FY (y), ∀x, y,

i.e., the joint cdf of (X, Y ) equals the product of its marginal distributions.

If X and Y are independent, then f(X) and g(Y) are also independent for any two functions f and g. As a result, we have

E f (X)g(Y ) = E f (X) E g(Y ).

A sequence of n random variables $\{X_i\}_{i=1}^n$ is independent if

$$P(X_i \le x_i,\ 1 \le i \le n) = \prod_{i=1}^{n} P(X_i \le x_i).$$

If the $X_i$ are discrete or continuous, then independence can be characterized in terms of the pmf or pdf:

$$f_{X_1,\cdots,X_n}(x_1, \cdots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i).$$

The joint pdf/pmf is the product of individual pdf/pmf’s (marginal distribution).

In probability and statistics, we often study the sum of i.i.d. (independent and identically distributed) random variables $\sum_{i=1}^{n} X_i$.

Exercise: Denote by $Z_n = \sum_{i=1}^{n} X_i$ the sum of n i.i.d. random variables with mean $\mu = E X_i$ and variance $\sigma^2 = \mathrm{Var}(X_i)$. Show that $E Z_n = n\mu$ and $\mathrm{Var}(Z_n) = n\sigma^2$.

We will see more in the next few sections.

1.2 Important distributions

1.2.1 Uniform distribution

A random variable X has the uniform distribution on [a, b], denoted by Unif[a, b], if its pdf satisfies

$$f_X(x) = \frac{1}{b - a}, \quad a \le x \le b.$$

Its cdf is

$$F_X(x) = \begin{cases} 0, & x \le a, \\ \frac{x - a}{b - a}, & a < x < b, \\ 1, & x \ge b. \end{cases}$$

Exercise: Show that E X = (a + b)/2 and Var(X) = (b − a)^2 / 12.
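A numerical check can complement the exercise without replacing the proof. The sketch below assumes the interval [2, 5] (an arbitrary choice) and approximates the integrals with a simple midpoint rule. The helper names `unif_pdf`, `unif_cdf` and `midpoint_integral` are hypothetical.

```python
def unif_pdf(x, a, b):
    """pdf of Unif[a, b]."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def unif_cdf(x, a, b):
    """cdf of Unif[a, b]."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

def midpoint_integral(f, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))

a, b = 2.0, 5.0  # assumed interval, for illustration only
mean = midpoint_integral(lambda x: x * unif_pdf(x, a, b), a, b)
second_moment = midpoint_integral(lambda x: x * x * unif_pdf(x, a, b), a, b)
var = second_moment - mean ** 2

print(mean, (a + b) / 2)             # numerical vs. closed-form mean
print(var, (b - a) ** 2 / 12)        # numerical vs. closed-form variance
print(unif_cdf(4.0, a, b) - unif_cdf(3.0, a, b))  # P(3 < X <= 4)
```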

1.2.2 Normal distribution/Gaussian distribution

The normal distribution is the most important distribution in probability and statistics. It has extremely rich structure and connections with other distributions. A random variable X is Gaussian with mean μ and variance σ^2, denoted by N(μ, σ^2), if its pdf is

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x - \mu)^2 / (2\sigma^2)}, \quad x \in \mathbb{R}.$$

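In particular, if μ = 0 and σ = 1, we say X is standard Gaussian. One can verify

$$\int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx = 1$$

by using the trick from multivariate calculus. Let’s verify E X = 0 and Var(X) = 1.

$$E X = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} x e^{-x^2/2}\,dx = 0$$

since $x e^{-x^2/2}$ is an odd function. How about $E X^2$?

$$\begin{aligned} E X^2 &= \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} x^2 e^{-x^2/2}\,dx \\ &= -\frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} x \, d e^{-x^2/2} \\ &= -\frac{1}{\sqrt{2\pi}} x e^{-x^2/2} \Big|_{-\infty}^{\infty} + \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} e^{-x^2/2}\,dx = 1. \end{aligned}$$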
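The three integrals for the standard Gaussian can also be checked numerically. The sketch below is only a sanity check under stated assumptions: it truncates the real line to [-10, 10], where the omitted tails are negligible, and uses a midpoint rule. The names `phi` and `integrate` are hypothetical.

```python
import math

def phi(x):
    """Standard normal pdf."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def integrate(f, lo=-10.0, hi=10.0, steps=200_000):
    """Midpoint rule on [lo, hi]; the tails beyond |x| > 10 are ignored."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))

total = integrate(phi)                           # should be close to 1
first = integrate(lambda x: x * phi(x))          # E X, close to 0
second = integrate(lambda x: x * x * phi(x))     # E X^2, close to 1

print(total, first, second)
```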

A Gaussian random variable is invariant under linear transformations: if X ∼ N(μ, σ^2), then aX + b is still Gaussian with mean aμ + b and variance a^2 σ^2, i.e., aX + b ∼ N(aμ + b, a^2 σ^2), where

$$E(aX + b) = a\mu + b, \quad \mathrm{Var}(aX + b) = \mathrm{Var}(aX) = a^2 \mathrm{Var}(X) = a^2\sigma^2.$$

Moreover, if $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$ are two independent random variables, then $X + Y \sim N(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$.

This can be extended to the sum of n independent Gaussian random variables. For example,

$$\sum_{i=1}^{n} X_i \sim N(0, n)$$

if $X_i \sim N(0, 1)$ are i.i.d. random variables.

In particular, if n is a positive integer, $\Gamma(n) = (n - 1)!$, and $\Gamma(1/2) = \sqrt{\pi}$.

The chi-squared distribution is closely connected to the normal distribution. Suppose Z ∼ N(0, 1). Now we take a look at $X = Z^2$:

$$P(X \le x) = P(Z^2 \le x) = P(-\sqrt{x} \le Z \le \sqrt{x}) = 2P(0 \le Z \le \sqrt{x}) = \sqrt{\frac{2}{\pi}} \int_0^{\sqrt{x}} e^{-z^2/2}\,dz.$$

The pdf of X is obtained by differentiating the cdf,

$$f_X(x) = \sqrt{\frac{2}{\pi}} \cdot \frac{1}{2\sqrt{x}} \cdot e^{-x/2} = \frac{1}{\sqrt{2\pi}} x^{-1/2} e^{-x/2}, \quad x > 0.$$

Now if $\{Z_i\}_{i=1}^n$ is a sequence of n independent standard normal random variables, then

$$X = \sum_{i=1}^{n} Z_i^2 \sim \chi_n^2.$$

The chi-squared distribution is a special case of the Gamma distribution Γ(α, β), whose pdf is

$$f_X(x; \alpha, \beta) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} x^{\alpha - 1} e^{-x/\beta}, \quad x > 0.$$

If β = 2 and α = n/2, then $\Gamma(n/2, 2) = \chi_n^2$.
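For n = 1 the identification can be checked directly: the Gamma pdf with α = 1/2 and β = 2 should agree with the pdf of Z^2 derived earlier. The sketch below evaluates both at a few arbitrary points. The function names `gamma_pdf` and `chi2_1_pdf` are hypothetical; `math.gamma` is the Gamma function from Python's standard library.

```python
import math

def gamma_pdf(x, alpha, beta):
    """pdf of Gamma(alpha, beta) in the scale parametrization of the notes."""
    return x ** (alpha - 1) * math.exp(-x / beta) / (math.gamma(alpha) * beta ** alpha)

def chi2_1_pdf(x):
    """pdf of Z^2 for Z ~ N(0, 1): (2*pi)^(-1/2) * x^(-1/2) * exp(-x/2)."""
    return x ** (-0.5) * math.exp(-x / 2) / math.sqrt(2 * math.pi)

points = (0.1, 1.0, 2.5, 7.0)  # arbitrary evaluation points
for x in points:
    print(x, gamma_pdf(x, 0.5, 2.0), chi2_1_pdf(x))
```

The agreement rests on Γ(1/2) = √π, so that Γ(1/2) · 2^(1/2) = √(2π).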

Exercise: Show that $E e^{tX} = (1 - \beta t)^{-\alpha}$ for $t < 1/\beta$.

Exercise: Show that $\sum_{i=1}^{n} X_i \sim \Gamma(\sum_{i=1}^{n} \alpha_i, \beta)$ if $X_i \sim \Gamma(\alpha_i, \beta)$ are independent.

1.2.5 Exponential distribution

Exponential distribution: X has an exponential distribution with parameter β, denoted by $\mathcal{E}(\beta)$, if its pdf is

$$f(x) = \beta^{-1} e^{-x/\beta}, \quad x \ge 0,$$

where β > 0.

The exponential distribution is used to model the waiting time until a certain event, for example:

The lifetime of an electronic component.
The waiting time for a bus to arrive at the station.

It is also a special case of the Gamma distribution, namely Γ(1, β). The exponential distribution satisfies the so-called memoryless property:

P(X ≥ t + s | X ≥ t) = P(X ≥ s), ∀s, t ≥ 0.

Recall that the left side involves conditional probability. For two events A and B, the conditional probability of A given B is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

Here

$$P(X \ge t + s \mid X \ge t) = \frac{P(X \ge t + s,\ X \ge t)}{P(X \ge t)} = \frac{P(X \ge t + s)}{P(X \ge t)}$$

since {X ≥ t + s} is contained in {X ≥ t}.
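The ratio above is easy to evaluate numerically. The sketch below uses the tail probability P(X ≥ t) = e^(−t/β), obtained by integrating the exponential pdf from t to infinity. The values of β, t and s are arbitrary and the name `survival` is hypothetical. It illustrates the property for one choice of parameters and is not a proof.

```python
import math

def survival(t, beta):
    """P(X >= t) for X with pdf (1/beta) * exp(-x/beta), x >= 0."""
    return math.exp(-t / beta) if t >= 0 else 1.0

beta, t, s = 3.0, 2.0, 1.5  # arbitrary illustrative values

# P(X >= t + s | X >= t) = P(X >= t + s) / P(X >= t)
conditional = survival(t + s, beta) / survival(t, beta)
unconditional = survival(s, beta)

print(conditional, unconditional)
```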

Exercise: Verify the memoryless property and think about what it means.

Exercise: What is the distribution of $\sum_{i=1}^{n} X_i$ if $X_i \sim \mathcal{E}(\beta)$?

Exercise: Verify $E X = \beta$ and $\mathrm{Var}(X) = \beta^2$ for $X \sim \mathcal{E}(\beta)$.

1.2.6 Bernoulli distributions

Let X represent the outcome of a binary coin flip. Then its pmf is

P(X = 1) = p, P(X = 0) = 1 − p.

Sometimes, we also write the pmf in this way:

$$f_X(x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}.$$

The coin is fair if p = 1/2. The cdf is

$$F_X(x) = \begin{cases} 0, & x < 0, \\ 1 - p, & 0 \le x < 1, \\ 1, & x \ge 1. \end{cases}$$

In this case, we write X ∼ Bernoulli(p). The mean and variance of X are simple to obtain:

$$E(X) = 1 \cdot P(X = 1) + 0 \cdot P(X = 0) = p$$

and

$$\mathrm{Var}(X) = E X^2 - (E X)^2 = E X - p^2 = p(1 - p),$$

where $E X^2 = E X$ because $X^2 = X$ for $X \in \{0, 1\}$.

1.2.7 Binomial distribution

Suppose we have a coin which falls heads up with probability p. Flip the coin n times and X is the number of heads. Each outcome is supposed to be independent.

If X = k, then there must be k heads and n − k tails:

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad \text{where} \quad \binom{n}{k} = \frac{n!}{k!(n - k)!}.$$

Then its pmf is

$$f_X(k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k \in \{0, 1, \cdots, n\}.$$

In this case, we write X ∼ Binomial(n, p).
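The pmf translates directly into code. The sketch below assumes n = 10 and p = 0.3 (arbitrary values) and uses `math.comb` from the standard library for the binomial coefficient; `binom_pmf` is a hypothetical name. It checks numerically that the probabilities sum to 1 and that the case n = 1 reduces to the Bernoulli pmf.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.3  # arbitrary illustrative values
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
total = sum(pmf)

print(total)                                        # close to 1
print(binom_pmf(1, 1, p), binom_pmf(0, 1, p))       # Bernoulli(p): p and 1 - p
```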

Exercise: Show that $\sum_{k=0}^{n} \binom{n}{k} p^k (1 - p)^{n - k} = 1$.

1.3 Limit theorems

1.3.1 Law of large numbers

The law of large numbers, along with the central limit theorem (CLT), plays a fundamental role in statistical inference and hypothesis testing.

Theorem 1.3.1 (Weak law of large numbers). Let $X_i$ be a sequence of i.i.d. random variables with mean $\mu = E X_i$. Then

$$\frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{p} \mu \quad \text{as } n \to \infty,$$

i.e., the sample average converges to μ in probability.

We say a sequence of random variables $X_n$ converges to X in probability if for any given ε > 0, $\lim_{n \to \infty} P(|X_n - X| \ge \epsilon) = 0$.

The law of large numbers basically says that the sample average converges to the expected value as the sample size grows to infinity.

We can prove the law of large numbers easily if we assume $X_i$ has a finite second moment. The proof relies on Chebyshev’s inequality.

Theorem 1.3.2 (Chebyshev’s inequality). For a random variable X with mean μ, variance σ^2, and finite second moment, and for any ε > 0,

$$P(|X - \mu| \ge \epsilon) \le \frac{E|X - \mu|^2}{\epsilon^2} = \frac{\sigma^2}{\epsilon^2}.$$

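Proof of WLLN. Consider $\overline{X}_n = n^{-1} \sum_{i=1}^{n} X_i$. We aim to prove that $\overline{X}_n$ converges to μ in probability if $\mathrm{Var}(X_i) = \sigma^2 < \infty$. For ε > 0, Chebyshev’s inequality gives

$$P(|\overline{X}_n - \mu| \ge \epsilon) \le \frac{1}{\epsilon^2} E|\overline{X}_n - \mu|^2.$$

It suffices to compute the variance of $\overline{X}_n$:

$$\begin{aligned} E|\overline{X}_n - \mu|^2 &= E\left| n^{-1} \sum_{i=1}^{n} X_i - \mu \right|^2 = n^{-2} E\left[ \sum_{i=1}^{n} (X_i - \mu) \right]^2 \\ &= n^{-2} \cdot \sum_{i=1}^{n} E(X_i - \mu)^2 = n^{-2} \cdot n\sigma^2 = \frac{\sigma^2}{n}, \end{aligned}$$

where the cross terms $E(X_i - \mu)(X_j - \mu)$, $i \ne j$, vanish by independence. As a result,

$$P(|\overline{X}_n - \mu| \ge \epsilon) \le \frac{\sigma^2}{n\epsilon^2} \to 0, \quad n \to \infty.$$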
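The bound in the proof can be compared with a small simulation. The sketch below assumes X_i ∼ Unif[0, 1] (so μ = 1/2 and σ^2 = 1/12), ε = 0.05, and 500 repetitions per sample size. All of these are illustrative choices, and the function names are hypothetical. It estimates P(|sample mean − μ| ≥ ε) by a relative frequency and prints it next to the Chebyshev bound σ^2/(nε^2).

```python
import random

def sample_mean(n, rng):
    """Average of n i.i.d. Unif[0, 1] draws."""
    return sum(rng.random() for _ in range(n)) / n

def deviation_frequency(n, eps, trials, rng, mu=0.5):
    """Fraction of trials in which |sample mean - mu| >= eps."""
    hits = sum(1 for _ in range(trials) if abs(sample_mean(n, rng) - mu) >= eps)
    return hits / trials

rng = random.Random(0)   # fixed seed so the run is reproducible
eps = 0.05
sigma2 = 1.0 / 12.0      # variance of Unif[0, 1]
results = {}
for n in (10, 100, 1000):
    freq = deviation_frequency(n, eps, trials=500, rng=rng)
    bound = min(1.0, sigma2 / (n * eps ** 2))  # Chebyshev bound, capped at 1
    results[n] = (freq, bound)
    print(n, freq, bound)
```

The observed frequencies are typically far below the bound: Chebyshev's inequality is crude, but it is enough to show convergence to 0.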

1.3.2 Central limit theorem

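Theorem 1.3.3 (Central limit theorem). Let $X_i$, 1 ≤ i ≤ n, be a sequence of i.i.d. random variables with mean μ and finite variance σ^2. Then

$$Z_n := \frac{\sum_{i=1}^{n} X_i - n\mu}{\sqrt{n\sigma^2}} \xrightarrow{d} N(0, 1),$$

i.e., convergence in distribution.

Sometimes, we also use

$$Z_n = \frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma},$$

where $\overline{X}_n$ denotes the sample mean.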

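We say a sequence of random variables $Z_n$ converges to Z in distribution if

$$\lim_{n \to \infty} P(Z_n \le z) = P(Z \le z)$$

for any z ∈ R at which the cdf of Z is continuous (every z when Z is standard normal). In other words, the cdf of $Z_n$ converges to that of Z pointwise,

$$\lim_{n \to \infty} F_{Z_n}(z) = F_Z(z).$$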

What does it mean?

$$\lim_{n \to \infty} P(a \le Z_n \le b) = P(a \le Z \le b) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-t^2/2}\,dt.$$

In statistics, one useful choice of a and b is

$$a = z_{\alpha/2}, \quad b = z_{1 - \alpha/2}.$$

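Here $z_\alpha$ is defined as the α-quantile of the standard normal random variable, i.e.,

$$P(Z \le z_\alpha) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z_\alpha} e^{-t^2/2}\,dt = \alpha, \quad 0 \le \alpha \le 1,$$

and by symmetry, we have $z_\alpha = -z_{1 - \alpha}$.

In particular, $z_{0.975} \approx 1.96$.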
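These quantiles can be computed with `statistics.NormalDist` from Python's standard library (Python 3.8 or later), whose `inv_cdf` method returns the quantile for a probability strictly between 0 and 1. The sketch below assumes α = 0.05, and the helper name `z` is hypothetical. It checks the symmetry relation and the probability mass between the two quantiles.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def z(alpha):
    """alpha-quantile of N(0, 1): P(Z <= z(alpha)) = alpha, for 0 < alpha < 1."""
    return std_normal.inv_cdf(alpha)

alpha = 0.05  # illustrative level
lower, upper = z(alpha / 2), z(1 - alpha / 2)

# P(lower <= Z <= upper) should be close to 1 - alpha.
coverage = std_normal.cdf(upper) - std_normal.cdf(lower)

print(lower, upper, coverage)
```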

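In other words, when n is sufficiently large, with probability approximately 1 − α, it holds that

$$z_{\alpha/2} \le \frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} \le z_{1 - \alpha/2} \iff |\overline{X}_n - \mu| \le \frac{\sigma z_{1 - \alpha/2}}{\sqrt{n}}, \quad z_{\alpha/2} = -z_{1 - \alpha/2},$$

which implies that the “error” decays at the rate of $1/\sqrt{n}$.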

Theorem 1.3.4. Convergence in probability implies convergence in distribution.

Exercise: Show that E Zn = 0 and Var(Zn) = 1.

The proof of CLT relies on the moment generating function. We can show that the MGF of Zn converges to that of a standard normal. Here we provide a sketch of the proof.

Chapter 2 Introduction to statistics

2.1 Population

One core task of statistics is making inferences about an unknown parameter θ associated with a population. What is a population? In statistics, a population is the set of all the similar items we are interested in. For example, a population may refer to all the college students in Shanghai or all the residents in Shanghai. The choice of the population depends on the actual scientific problem.

Suppose we want to know the average height of all the college students in Shanghai or want to know the age distribution of residents in Shanghai. What should we do? Usually, a population is too large to deal with directly. Instead, we often draw samples from the population and then use the samples to estimate a population parameter θ such as mean, variance, median, or even the actual distribution.

This leads to several important questions in statistics:

How to design a proper sampling procedure to collect data? Statistical/Experimental design.
How to use the data to estimate a particular population parameter?
How to evaluate the quality of an estimator?

2.1.1 Important statistics

If we get a dataset, we usually compute the basic statistics to roughly describe the dataset.

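Sample mean/average: $\overline{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_i$
Variance: $S_n^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \overline{x}_n)^2$
Standard deviation: $S_n = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \overline{x}_n)^2}$
Median: $\mathrm{median}(x_i) = x_{[(n+1)/2]}$, where the samples are sorted in increasing order and [(n + 1)/2] means the closest integer to (n + 1)/2. More generally, the α-quantile is $x_{[\alpha(n+1)]}$.
Range, max/min: $\mathrm{Range} = \max_{1 \le i \le n} x_i - \min_{1 \le i \le n} x_i$.
Empirical cdf: $F_n(x) = \frac{1}{n} \sum_{i=1}^{n} 1\{x_i \le x\}$.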
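The statistics in the list above can be computed in a few lines. The sketch below is a minimal illustration on a made-up sample of seven numbers; the names `describe` and `ecdf` are hypothetical. One assumption is labeled in the code: when n is even, (n + 1)/2 is halfway between two integers, and the sketch takes the lower of the two middle order statistics.

```python
import math

def describe(xs):
    """Basic statistics of a sample, following the formulas in the list above."""
    n = len(xs)
    ordered = sorted(xs)
    mean = sum(xs) / n
    variance = sum((x - mean) ** 2 for x in xs) / (n - 1)  # divides by n - 1
    # Median as the order statistic x_[(n+1)/2] (1-based index).
    # Assumption: for even n we take the lower of the two middle values.
    median = ordered[(n + 1) // 2 - 1]
    return {
        "mean": mean,
        "variance": variance,
        "std": math.sqrt(variance),
        "median": median,
        "range": ordered[-1] - ordered[0],
    }

def ecdf(xs, x):
    """Empirical cdf F_n(x): fraction of samples that are <= x."""
    return sum(1 for xi in xs if xi <= x) / len(xs)

sample = [7.0, 2.0, 4.0, 11.0, 4.0, 9.0, 5.0]  # made-up data
summary = describe(sample)
print(summary)
print(ecdf(sample, 5.0))
```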

Exercise: Show that $\overline{x}_n$ minimizes

$$f(z) = \sum_{i=1}^{n} (x_i - z)^2.$$

Exercise: Show that $\mathrm{median}(x_i)$ minimizes

$$f(z) = \sum_{i=1}^{n} |x_i - z|.$$

Is the global minimizer unique?

Notice that all the quantities above are based on the samples $\{x_1, \cdots, x_n\}$. These quantities are called statistics.

Definition 2.1.1. A statistic is a deterministic function of the samples,

$$y = T(x_1, \cdots, x_n),$$

which is used to estimate the value of a population parameter θ.

Question: How to evaluate the quality of these estimators?

2.1.2 Probabilistic assumption

We assume the population has a probability distribution $F_X$ and each observed sample $x_i$ is a realization of a random variable $X_i$ obeying the population distribution $F_X$. A set of samples $\{x_1, \cdots, x_n\}$ is treated as one realization of a random sequence $\{X_1, \cdots, X_n\}$. From now on, we assume that all the random variables $X_i$ are i.i.d. (independent and identically distributed).

In other words, $T(x_1, \cdots, x_n)$ is one realization of a random variable

$$\hat{\theta}_n = T(X_1, \cdots, X_n),$$

which is a point estimator of θ. We ask several questions:

Does $\hat{\theta}_n$ approximate the population parameter θ well?
How to evaluate the quality of the estimators?

Theorem 2.2.1 (Continuous mapping theorem). Suppose g is a continuous function and $X_n \xrightarrow{p} X$. Then $g(X_n) \xrightarrow{p} g(X)$. This also applies to convergence in distribution.

Remark: this is also true for random vectors. Suppose $X_n = (X_{n1}, \cdots, X_{nd}) \in \mathbb{R}^d$ is a random vector and $X_n \xrightarrow{p} X$, i.e., for any ε > 0,

$$\lim_{n \to \infty} P(\|X_n - X\| \ge \epsilon) = 0,$$

where $\|X_n - X\|$ denotes the Euclidean distance between $X_n$ and X. Then $g(X_n) \xrightarrow{p} g(X)$ for a continuous function g.

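This justifies $\overline{X}_n^2 \xrightarrow{p} \mu^2$. Now, we have

$$\lim_{n \to \infty} \frac{n}{n - 1} \left( n^{-1} \sum_{i=1}^{n} X_i^2 - \overline{X}_n^2 \right) = E X^2 - (E X)^2 = \sigma^2,$$

where the limit is in the sense of convergence in probability.

Exercise: Complete the proof to show that $S_n^2$ and $S_n$ are consistent estimators of σ^2 and σ.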

Another commonly used quantity to evaluate the quality of an estimator is the MSE (mean squared error).

Definition 2.2.3 (MSE: mean squared error). The mean squared error is defined as

$$\mathrm{MSE}(\hat{\theta}_n) = E(\hat{\theta}_n - \theta)^2,$$

where the expectation is taken w.r.t. the joint distribution of $(X_1, \cdots, X_n)$.

Recall that the pdf/pmf for $(X_1, \cdots, X_n)$ is

$$f_{X_1,\cdots,X_n}(x_1, \cdots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i)$$

and the population parameter is associated with the actual distribution $f_X(x)$.

Note that by Chebyshev’s inequality, convergence in MSE implies convergence in probability:

$$P(|\hat{\theta}_n - \theta| \ge \epsilon) \le \frac{E(\hat{\theta}_n - \theta)^2}{\epsilon^2}.$$

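The MSE is closely related to the bias and variance of $\hat{\theta}_n$. In fact, we have the following famous bias-variance decomposition:

$$\mathrm{MSE}(\hat{\theta}_n) = \mathrm{bias}(\hat{\theta}_n)^2 + \mathrm{Var}(\hat{\theta}_n).$$

Proof: The proof is quite straightforward:

$$\begin{aligned} \mathrm{MSE}(\hat{\theta}_n) &= E(\hat{\theta}_n - \theta)^2 = E(\hat{\theta}_n - \mu + \mu - \theta)^2 \\ &= E(\hat{\theta}_n - \mu)^2 + 2E(\hat{\theta}_n - \mu)(\mu - \theta) + (\mu - \theta)^2 \\ &= \underbrace{E(\hat{\theta}_n - \mu)^2}_{\mathrm{Var}(\hat{\theta}_n)} + \underbrace{(\mu - \theta)^2}_{\mathrm{bias}(\hat{\theta}_n)^2}, \end{aligned}$$

where $\mu = E\hat{\theta}_n$ and the cross term $2E(\hat{\theta}_n - \mu)(\mu - \theta)$ equals 0 because $E(\hat{\theta}_n - \mu) = 0$.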
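The decomposition can be illustrated by simulation. The sketch below assumes, purely for illustration, that the estimator is the variance estimator that divides by n instead of n − 1, applied to samples of size 10 from N(0, 1), so the true parameter is θ = 1. Expectations are replaced by averages over 20,000 simulated estimates, for which the same algebraic identity holds up to rounding. All names are hypothetical.

```python
import random

def biased_variance(xs):
    """Variance estimator dividing by n rather than n - 1 (illustrative choice)."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / n

rng = random.Random(0)   # fixed seed for reproducibility
theta = 1.0              # true variance of N(0, 1)
estimates = [
    biased_variance([rng.gauss(0.0, 1.0) for _ in range(10)])
    for _ in range(20_000)
]

k = len(estimates)
mean_est = sum(estimates) / k                               # plays the role of mu
mse = sum((e - theta) ** 2 for e in estimates) / k          # empirical MSE
bias = mean_est - theta                                     # empirical bias
variance = sum((e - mean_est) ** 2 for e in estimates) / k  # empirical variance

print(mse, bias ** 2 + variance, bias)
```

For this estimator the empirical bias should be close to −0.1, since dividing by n shrinks the expected value to (n − 1)/n times the true variance.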

Lemma 2.2.2. Convergence in MSE implies convergence in probability.

The proof of this lemma directly follows from Chebyshev’s inequality.

2.3 Confidence interval

All the aforementioned estimators, such as the sample mean and the sample variance, are called point estimators. Can we provide an interval estimator for an unknown parameter? In other words, we are interested in finding a range of plausible values which contains an unknown parameter with reasonably large probability. This leads to the construction of confidence intervals.

What is a confidence interval of θ?

Definition 2.3.1. A 1 − α confidence interval for a parameter θ is an interval $C_n = (a, b)$, where $a = a(X_1, \cdots, X_n)$ and $b = b(X_1, \cdots, X_n)$ are two statistics of the data such that

$$P(\theta \in C_n) \ge 1 - \alpha,$$

i.e., the interval (a, b) contains θ with probability at least 1 − α.

Note that we cannot say that θ falls inside (a, b) with probability 1 − α, since it is $C_n$ that is random while θ is a fixed value.

Question: How to construct a confidence interval?

Let’s take a look at a simple yet important example. In fact, CLT is very useful in constructing a confidence interval for the mean.

We have shown that the sample mean $\overline{X}_n$ is a consistent estimator of the population mean μ. By CLT, we have

$$\frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} \xrightarrow{d} N(0, 1),$$

where σ is the standard deviation.

For a sufficiently large n, CLT implies that

$$P\left( |\overline{X}_n - \mu| \le \frac{z_{1 - \alpha/2}\,\sigma}{\sqrt{n}} \right) \approx 1 - \alpha,$$

where $z_\alpha$ is the α-quantile of the standard normal distribution.

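Note that

$$|\overline{X}_n - \mu| \le \frac{z_{1 - \alpha/2}\,\sigma}{\sqrt{n}} \iff \overline{X}_n - \frac{z_{1 - \alpha/2}\,\sigma}{\sqrt{n}} \le \mu \le \overline{X}_n + \frac{z_{1 - \alpha/2}\,\sigma}{\sqrt{n}}.$$

In other words, if σ is known, the random interval

$$\left( \overline{X}_n - \frac{z_{1 - \alpha/2}\,\sigma}{\sqrt{n}},\ \overline{X}_n + \frac{z_{1 - \alpha/2}\,\sigma}{\sqrt{n}} \right)$$

covers μ with probability approximately 1 − α.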

A few remarks:

Suppose σ is known. Then the confidence interval (CI) becomes narrower as the sample size n increases. A narrower interval is preferred since it means less uncertainty.
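The interval for the mean with known σ can be sketched as follows. The data values, the choice σ = 0.2, and the function name `mean_confidence_interval` are all made up for illustration. The quantile comes from `statistics.NormalDist` in the standard library. Repeating the sample four times is only a device to show that a fourfold sample size halves the interval width.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(xs, sigma, alpha=0.05):
    """Approximate 1 - alpha CI for the mean, assuming sigma is known."""
    n = len(xs)
    xbar = sum(xs) / n
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{1 - alpha/2}
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

data = [4.9, 5.1, 5.3, 4.7, 5.0, 5.2, 4.8, 5.0]   # made-up measurements
lo, hi = mean_confidence_interval(data, sigma=0.2)

# Same sample mean, four times as many points: the width should halve.
lo4, hi4 = mean_confidence_interval(data * 4, sigma=0.2)

print(lo, hi, hi - lo)
print(lo4, hi4, hi4 - lo4)
```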
