The Multivariate Normal Distribution, Lecture notes of Mathematics

Joint Gaussian random variables arise from nonsingular linear transformations on independent normal random variables.

Typology: Lecture notes

2020/2021

Uploaded on 06/11/2021

paulina

Lecture 21. The Multivariate Normal Distribution

21.1 Definitions and Comments

The joint moment-generating function of $X_1,\dots,X_n$ [also called the moment-generating function of the random vector $(X_1,\dots,X_n)$] is defined by

$$M(t_1,\dots,t_n) = E[\exp(t_1X_1 + \cdots + t_nX_n)].$$

Just as in the one-dimensional case, the moment-generating function determines the density uniquely. The random variables $X_1,\dots,X_n$ are said to have the multivariate normal distribution or to be jointly Gaussian (we also say that the random vector $(X_1,\dots,X_n)$ is Gaussian) if

$$M(t_1,\dots,t_n) = \exp(t_1\mu_1 + \cdots + t_n\mu_n)\,\exp\Big(\frac{1}{2}\sum_{i,j=1}^{n} t_i a_{ij} t_j\Big)$$

where the $t_i$ and $\mu_j$ are arbitrary real numbers, and the matrix $A = [a_{ij}]$ is symmetric and positive definite.

Before we do anything else, let us indicate the notational scheme we will be using. Vectors will be written with an underbar, and are assumed to be column vectors unless otherwise specified. If $t$ is a column vector with components $t_1,\dots,t_n$, then to save space we write $t = (t_1,\dots,t_n)'$. The row vector with these components is the transpose of $t$, written $t'$. The moment-generating function of jointly Gaussian random variables has the form

$$M(t_1,\dots,t_n) = \exp(t'\mu)\,\exp\Big(\frac{1}{2}\,t'At\Big).$$

We can describe Gaussian random vectors much more concretely.

21.2 Theorem

Joint Gaussian random variables arise from nonsingular linear transformations on independent normal random variables.

Proof. Let $X_1,\dots,X_n$ be independent, with $X_i$ normal $(0,\lambda_i)$, and let $X = (X_1,\dots,X_n)'$. Let $Y = BX + \mu$ where $B$ is nonsingular. Then $Y$ is Gaussian, as can be seen by computing the moment-generating function of $Y$:

$$M_Y(t) = E[\exp(t'Y)] = E[\exp(t'BX)]\,\exp(t'\mu).$$

But

$$E[\exp(u'X)] = \prod_{i=1}^{n} E[\exp(u_iX_i)] = \exp\Big(\sum_{i=1}^{n} \lambda_i u_i^2/2\Big) = \exp\Big(\frac{1}{2}\,u'Du\Big)$$

where $D$ is a diagonal matrix with the $\lambda_i$ down the main diagonal. Set $u = B't$, $u' = t'B$; then

$$M_Y(t) = \exp(t'\mu)\,\exp\Big(\frac{1}{2}\,t'BDB't\Big)$$

and $BDB'$ is symmetric since $D$ is symmetric. Since $t'BDB't = u'Du$, which is greater than 0 except when $u = 0$ (equivalently when $t = 0$ because $B$ is nonsingular), $BDB'$ is positive definite, and consequently $Y$ is Gaussian.

Conversely, suppose that the moment-generating function of $Y$ is $\exp(t'\mu)\exp[(1/2)\,t'At]$ where $A$ is symmetric and positive definite. Let $L$ be an orthogonal matrix such that $L'AL = D$, where $D$ is the diagonal matrix of eigenvalues of $A$. Set $X = L'(Y - \mu)$, so that $Y = \mu + LX$. The moment-generating function of $X$ is

$$E[\exp(t'X)] = \exp(-t'L'\mu)\,E[\exp(t'L'Y)].$$

The last term is the moment-generating function of $Y$ with $t'$ replaced by $t'L'$, or equivalently, $t$ replaced by $Lt$. Thus the moment-generating function of $X$ becomes

$$\exp(-t'L'\mu)\,\exp(t'L'\mu)\,\exp\Big(\frac{1}{2}\,t'L'ALt\Big).$$

This reduces to

$$\exp\Big(\frac{1}{2}\,t'Dt\Big) = \exp\Big(\frac{1}{2}\sum_{i=1}^{n} \lambda_i t_i^2\Big).$$

Therefore the $X_i$ are independent, with $X_i$ normal $(0, \lambda_i)$. ♣
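The first half of the proof can be checked numerically on a small case. The sketch below uses a hypothetical 2 by 2 matrix B and hypothetical variances (none of these numbers come from the notes) and confirms that BDB' is symmetric and that t'BDB't equals u'Du with u = B't.

```python
def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def transpose(P):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*P)]


def quad_form(t, M):
    """Return t' M t for a vector t and a square matrix M."""
    n = len(t)
    return sum(t[i] * M[i][j] * t[j] for i in range(n) for j in range(n))


# Hypothetical example: B is nonsingular (det = 1) and the variances are positive.
B = [[2.0, 1.0], [1.0, 1.0]]
lam = [3.0, 0.5]
D = [[lam[0], 0.0], [0.0, lam[1]]]

A = matmul(matmul(B, D), transpose(B))  # A = B D B'

t = [0.7, -1.2]
u = [sum(B[i][j] * t[i] for i in range(2)) for j in range(2)]  # u = B' t
u_D_u = sum(lam[i] * u[i] ** 2 for i in range(2))              # u' D u

print(A)
print(quad_form(t, A), u_D_u)
```

The two printed quadratic forms agree up to rounding, and both are positive because t is nonzero and B is nonsingular.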

21.3 A Geometric Interpretation

Assume for simplicity that the random variables $X_i$ have zero mean. If $E(U) = E(V) = 0$ then the covariance of $U$ and $V$ is $E(UV)$, which can be regarded as an inner product. Then $Y_1 - \mu_1,\dots,Y_n - \mu_n$ span an $n$-dimensional space, and $X_1,\dots,X_n$ is an orthogonal basis for that space. We will see later in the lecture that orthogonality is equivalent to independence. (Orthogonality means that the $X_i$ are uncorrelated, i.e., $E(X_iX_j) = 0$ for $i \neq j$.)

21.4 Theorem

Let $Y = \mu + LX$ as in the proof of (21.2), and let $A$ be the symmetric, positive definite matrix appearing in the moment-generating function of the Gaussian random vector $Y$. Then $E(Y_i) = \mu_i$ for all $i$, and furthermore, $A$ is the covariance matrix of the $Y_i$, in other words, $a_{ij} = \mathrm{Cov}(Y_i, Y_j)$ (and $a_{ii} = \mathrm{Cov}(Y_i, Y_i) = \mathrm{Var}\,Y_i$).

It follows that the means of the $Y_i$ and their covariance matrix determine the moment-generating function, and therefore the density.
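As a rough empirical illustration of this theorem, the following sketch simulates Y = μ + LX for a hypothetical rotation matrix L, hypothetical variances λ and a hypothetical mean vector μ, and compares the sample mean and sample covariance of Y with μ and A = LDL'. The sample size, seed and all numbers are assumptions chosen for illustration only.

```python
import math
import random


def simulate(mu, L, lam, n_samples, seed=0):
    """Draw samples of Y = mu + L X with independent X_i ~ normal(0, lam[i])."""
    rng = random.Random(seed)
    n = len(mu)
    samples = []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, math.sqrt(lam[i])) for i in range(n)]
        samples.append([mu[i] + sum(L[i][j] * x[j] for j in range(n)) for i in range(n)])
    return samples


def sample_mean_cov(samples):
    """Return the sample mean vector and the sample covariance matrix."""
    m, n = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / m for i in range(n)]
    cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (m - 1)
            for j in range(n)] for i in range(n)]
    return mean, cov


# Hypothetical inputs: L is a rotation (hence orthogonal), lam holds the eigenvalues.
angle = math.pi / 6
L = [[math.cos(angle), -math.sin(angle)], [math.sin(angle), math.cos(angle)]]
lam = [4.0, 1.0]
mu = [1.0, -2.0]

# A = L D L', the matrix that appears in the moment-generating function of Y.
A = [[sum(L[i][k] * lam[k] * L[j][k] for k in range(2)) for j in range(2)] for i in range(2)]

mean, cov = sample_mean_cov(simulate(mu, L, lam, n_samples=20000))
print(mean, cov, A)
```

With 20000 samples the sample covariance should land close to A, though the agreement is only approximate because it is a Monte Carlo estimate.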

21.7 Theorem

If $X_1,\dots,X_n$ are jointly Gaussian and uncorrelated ($\mathrm{Cov}(X_i, X_j) = 0$ for all $i \neq j$), then the $X_i$ are independent.

Proof. The moment-generating function of $X = (X_1,\dots,X_n)$ is

$$M_X(t) = \exp(t'\mu)\,\exp\Big(\frac{1}{2}\,t'Kt\Big)$$

where $K$ is a diagonal matrix with entries $\sigma_1^2, \sigma_2^2,\dots,\sigma_n^2$ down the main diagonal, and 0's elsewhere. Thus

$$M_X(t) = \prod_{i=1}^{n} \exp(t_i\mu_i)\,\exp\Big(\frac{1}{2}\,\sigma_i^2 t_i^2\Big)$$

which is the joint moment-generating function of independent random variables $X_1,\dots,X_n$, where $X_i$ is normal $(\mu_i, \sigma_i^2)$. ♣
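The factorization used in this proof can be seen numerically. The sketch below evaluates the joint moment-generating function exp(t'μ) exp((1/2) t'Kt) for a hypothetical diagonal K and compares it with the product of one-dimensional normal moment-generating functions; it also shows that the product form fails once an off-diagonal covariance is introduced. All numerical values are illustrative assumptions.

```python
import math


def joint_mgf(t, mu, K):
    """Gaussian joint MGF: exp(t' mu) * exp(0.5 * t' K t)."""
    n = len(t)
    linear = sum(t[i] * mu[i] for i in range(n))
    quadratic = sum(t[i] * K[i][j] * t[j] for i in range(n) for j in range(n))
    return math.exp(linear + 0.5 * quadratic)


def normal_mgf(t, mu, var):
    """MGF of a single normal(mu, var) random variable."""
    return math.exp(t * mu + 0.5 * var * t * t)


# Hypothetical values for illustration.
t = [0.3, -0.2, 0.5]
mu = [1.0, 0.0, -1.0]
variances = [1.0, 2.0, 0.5]

K_diag = [[variances[i] if i == j else 0.0 for j in range(3)] for i in range(3)]
product = math.prod(normal_mgf(t[i], mu[i], variances[i]) for i in range(3))

# Same variances, but X_1 and X_2 are now correlated.
K_corr = [row[:] for row in K_diag]
K_corr[0][1] = K_corr[1][0] = 0.4

print(joint_mgf(t, mu, K_diag), product)  # equal up to rounding
print(joint_mgf(t, mu, K_corr))           # differs from the product
```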

21.8 A Conditional Density

Let $X_1,\dots,X_n$ be jointly Gaussian. We find the conditional density of $X_n$ given $X_1,\dots,X_{n-1}$:

$$f(x_n \mid x_1,\dots,x_{n-1}) = \frac{f(x_1,\dots,x_n)}{f(x_1,\dots,x_{n-1})}$$

with

$$f(x_1,\dots,x_n) = (2\pi)^{-n/2}(\det K)^{-1/2}\exp\Big[-\frac{1}{2}\sum_{i,j=1}^{n} y_i q_{ij} y_j\Big]$$

where $Q = K^{-1} = [q_{ij}]$, $y_i = x_i - \mu_i$. Also,

$$f(x_1,\dots,x_{n-1}) = \int_{-\infty}^{\infty} f(x_1,\dots,x_{n-1},x_n)\,dx_n = B(y_1,\dots,y_{n-1}).$$

Now

$$\sum_{i,j=1}^{n} y_i q_{ij} y_j = \sum_{i,j=1}^{n-1} y_i q_{ij} y_j + y_n\sum_{j=1}^{n-1} q_{nj} y_j + y_n\sum_{i=1}^{n-1} q_{in} y_i + q_{nn} y_n^2.$$

Thus the conditional density has the form

$$\frac{A(y_1,\dots,y_{n-1})}{B(y_1,\dots,y_{n-1})}\exp\big[-\big(Cy_n^2 + D(y_1,\dots,y_{n-1})\,y_n\big)\big]$$

with $C = (1/2)q_{nn}$, $D = \sum_{j=1}^{n-1} q_{nj} y_j = \sum_{i=1}^{n-1} q_{in} y_i$ since $Q = K^{-1}$ is symmetric. The conditional density may now be expressed as

$$\frac{A}{B}\exp\Big(\frac{D^2}{4C}\Big)\exp\Big[-C\Big(y_n + \frac{D}{2C}\Big)^2\Big].$$

We conclude that given $X_1,\dots,X_{n-1}$, $X_n$ is normal.

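The conditional variance of $X_n$ (the same as the conditional variance of $Y_n = X_n - \mu_n$) is

$$\frac{1}{2C} = \frac{1}{q_{nn}} \quad\text{because}\quad \frac{1}{2\sigma^2} = C,\ \sigma^2 = \frac{1}{2C}.$$

Thus

$$\mathrm{Var}(X_n \mid X_1,\dots,X_{n-1}) = \frac{1}{q_{nn}}$$

and the conditional mean of $Y_n$ is

$$-\frac{D}{2C} = -\frac{1}{q_{nn}}\sum_{j=1}^{n-1} q_{nj} Y_j$$

so the conditional mean of $X_n$ is

$$E(X_n \mid X_1,\dots,X_{n-1}) = \mu_n - \frac{1}{q_{nn}}\sum_{j=1}^{n-1} q_{nj}(X_j - \mu_j).$$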

Recall from Lecture 18 that $E(Y \mid X)$ is the best estimate of $Y$ based on $X$, in the sense that the mean square error is minimized. In the joint Gaussian case, the best estimate of $X_n$ based on $X_1,\dots,X_{n-1}$ is linear, and it follows that the best linear estimate is in fact the best overall estimate. This has important practical applications, since linear systems are usually much easier than nonlinear systems to implement and analyze.
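The formulas of (21.8) translate directly into a short computation: invert K to get Q, then read off the conditional variance 1/q_nn and the conditional mean from the last row of Q. The sketch below is a minimal illustration using a plain Gauss-Jordan inverse; the helper names, the 3 by 3 covariance matrix and the observed values are hypothetical.

```python
def inverse(M):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [[float(v) for v in M[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def conditional_of_last(mu, K, x_given):
    """Conditional mean and variance of X_n given X_1..X_{n-1} = x_given."""
    n = len(mu)
    Q = inverse(K)                      # Q = K^{-1}
    q_nn = Q[n - 1][n - 1]
    weighted = sum(Q[n - 1][j] * (x_given[j] - mu[j]) for j in range(n - 1))
    return mu[n - 1] - weighted / q_nn, 1.0 / q_nn


# Hypothetical covariance matrix, means and observed values.
K = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
mu = [0.0, 0.0, 0.0]
cond_mean, cond_var = conditional_of_last(mu, K, x_given=[1.0, 2.0])
print(cond_mean, cond_var)
```

For this K the conditional variance 1/q_nn works out to 4/3, which is smaller than the unconditional variance 2, as one would expect when the conditioning variables carry information about X_n.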

Problems

  1. Let $K$ be the covariance matrix of arbitrary random variables $X_1,\dots,X_n$. Assume that $K$ is nonsingular to avoid degenerate cases. Show that $K$ is symmetric and positive definite. What can you conclude if $K$ is singular?
  2. If $X$ is a Gaussian $n$-vector and $Y = AX$ with $A$ nonsingular, show that $Y$ is Gaussian.
  3. If $X_1,\dots,X_n$ are jointly Gaussian, show that $X_1,\dots,X_m$ are jointly Gaussian for $m \le n$.
  4. If $X_1,\dots,X_n$ are jointly Gaussian, show that $c_1X_1 + \cdots + c_nX_n$ is a normal random variable (assuming it is nondegenerate, i.e., not identically constant).

22.2 Example

Let X be the height of the father, Y the height of the son, in a sample of father-son pairs. Assume X and Y bivariate normal, as found by Karl Pearson around 1900. Assume E(X) = 68 (inches), E(Y ) = 69, σX = σY = 2, ρ = .5. (We expect ρ to be positive because on the average, the taller the father, the taller the son.)

Given $X = 80$ (6 feet 8 inches), $Y$ is normal with mean

$$\mu_Y + \frac{\rho\sigma_Y}{\sigma_X}(x - \mu_X) = 69 + .5(80 - 68) = 75$$

which is 6 feet 3 inches. The variance of $Y$ given $X = 80$ is

$$\sigma_Y^2(1 - \rho^2) = 4(3/4) = 3.$$

Thus the son will tend to be of above average height, but not as tall as the father. This phenomenon is often called regression, and the line y = μY + (ρσY /σX )(x − μX ) is called the line of regression or the regression line.
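The arithmetic of this example can be reproduced with a few lines of code. The function name below is a hypothetical helper; the inputs are the values stated in the example.

```python
def conditional_y_given_x(mu_x, mu_y, sigma_x, sigma_y, rho, x):
    """Mean and variance of Y given X = x for a bivariate normal pair."""
    mean = mu_y + rho * sigma_y / sigma_x * (x - mu_x)  # point on the regression line
    variance = sigma_y ** 2 * (1 - rho ** 2)            # does not depend on x
    return mean, variance


# Father's height X, son's height Y, in inches.
mean, variance = conditional_y_given_x(mu_x=68, mu_y=69, sigma_x=2, sigma_y=2, rho=0.5, x=80)
print(mean, variance)  # 75.0 3.0
```

Note that the conditional variance is the same for every value of x; only the conditional mean moves along the regression line.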

Problems

  1. Let $X$ and $Y$ have the bivariate normal distribution. The following facts are known: $\mu_X = -1$, $\sigma_X = 2$, and the best estimate of $Y$ based on $X$, i.e., the estimate that minimizes the mean square error, is given by $3X + 7$. The minimum mean square error is 28. Find $\mu_Y$, $\sigma_Y$ and the correlation coefficient $\rho$ between $X$ and $Y$.
  2. Show that the bivariate normal density belongs to the exponential class, and find the corresponding complete sufficient statistic.

Lecture 23. Cramér-Rao Inequality

23.1 A Strange Random Variable

Given a density $f_\theta(x)$, $-\infty < x < \infty$, $a < \theta < b$. We have found maximum likelihood estimates by computing $\frac{\partial}{\partial\theta}\ln f_\theta(x)$. If we replace $x$ by $X$, we have a random variable. To see what is going on, let's look at a discrete example. If $X$ takes on values $x_1, x_2, x_3, x_4$ with $p(x_1) = .5$, $p(x_2) = p(x_3) = .2$, $p(x_4) = .1$, then $p(X)$ is a random variable with the following distribution:

$$P\{p(X) = .5\} = .5,\quad P\{p(X) = .2\} = .4,\quad P\{p(X) = .1\} = .1$$

For example, if X = x 2 then p(X) = p(x 2 ) = .2, and if X = x 3 then p(X) = p(x 3 ) = .2. The total probability that p(X) = .2 is .4.
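The bookkeeping in this discrete example is easy to mechanize: every value x contributes its probability p(x) to the total probability of the event {p(X) = p(x)}. The sketch below does this with exact fractions; the function name and the labels "x1" to "x4" are illustrative.

```python
from collections import defaultdict
from fractions import Fraction


def distribution_of_pX(pmf):
    """Given a pmf {value: probability}, return the distribution of p(X)."""
    dist = defaultdict(Fraction)  # missing keys start at 0
    for prob in pmf.values():
        dist[prob] += prob        # values with equal probability are pooled
    return dict(dist)


pmf = {
    "x1": Fraction(1, 2),
    "x2": Fraction(1, 5),
    "x3": Fraction(1, 5),
    "x4": Fraction(1, 10),
}
dist = distribution_of_pX(pmf)
print(dist)
```

Here x2 and x3 share the probability .2, so the event {p(X) = .2} collects probability .2 + .2 = .4.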

The continuous case is, at first sight, easier to handle. If X has density f and X = x, then f (X) = f (x). But what is the density of f (X)? We will not need the result, but the question is interesting and is considered in Problem 1.

The following two lemmas will be needed to prove the Cramér-Rao inequality, which can be used to compute uniformly minimum variance unbiased estimates. In the calculations to follow, we are going to assume that all differentiations under the integral sign are legal.

23.2 Lemma

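$$E_\theta\Big[\frac{\partial}{\partial\theta}\ln f_\theta(X)\Big] = 0.$$

Proof. The expectation is

$$\int_{-\infty}^{\infty}\Big[\frac{\partial}{\partial\theta}\ln f_\theta(x)\Big] f_\theta(x)\,dx = \int_{-\infty}^{\infty}\frac{1}{f_\theta(x)}\,\frac{\partial f_\theta(x)}{\partial\theta}\,f_\theta(x)\,dx$$

which reduces to

$$\frac{\partial}{\partial\theta}\int_{-\infty}^{\infty} f_\theta(x)\,dx = \frac{\partial}{\partial\theta}(1) = 0. \quad ♣$$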

23.3 Lemma

Let Y = g(X) and assume Eθ (Y ) = k(θ). If k′(θ) = dk(θ)/dθ, then

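$$k'(\theta) = E_\theta\Big[Y\,\frac{\partial}{\partial\theta}\ln f_\theta(X)\Big].$$

Proof. We have

$$k'(\theta) = \frac{\partial}{\partial\theta}E_\theta[g(X)] = \frac{\partial}{\partial\theta}\int_{-\infty}^{\infty} g(x)f_\theta(x)\,dx = \int_{-\infty}^{\infty} g(x)\,\frac{\partial f_\theta(x)}{\partial\theta}\,dx$$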

Proof. Applying (23.5), we have a special case of the Cramér-Rao inequality (23.4) with $k(\theta) = \theta$, $k'(\theta) = 1$. ♣

The lower bound in (23.6) is $1/(nI(\theta))$, where

$$I(\theta) = E_\theta\Big[\Big(\frac{\partial}{\partial\theta}\ln f_\theta(X_i)\Big)^2\Big]$$

is called the Fisher information.

It follows from (23.6) that if $Y$ is an unbiased estimate that meets the Cramér-Rao inequality for all $\theta$ (an efficient estimate), then $Y$ must be a UMVUE of $\theta$.

23.7 A Computational Simplification

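From (23.2) we have

$$\int_{-\infty}^{\infty}\Big[\frac{\partial}{\partial\theta}\ln f_\theta(x)\Big] f_\theta(x)\,dx = 0.$$

Differentiate again to obtain

$$\int_{-\infty}^{\infty}\frac{\partial^2\ln f_\theta(x)}{\partial\theta^2}\,f_\theta(x)\,dx + \int_{-\infty}^{\infty}\frac{\partial\ln f_\theta(x)}{\partial\theta}\,\frac{\partial f_\theta(x)}{\partial\theta}\,dx = 0.$$

Thus

$$\int_{-\infty}^{\infty}\frac{\partial^2\ln f_\theta(x)}{\partial\theta^2}\,f_\theta(x)\,dx + \int_{-\infty}^{\infty}\frac{\partial\ln f_\theta(x)}{\partial\theta}\Big[\frac{\partial f_\theta(x)/\partial\theta}{f_\theta(x)}\Big] f_\theta(x)\,dx = 0.$$

But the term in brackets on the right is $\partial\ln f_\theta(x)/\partial\theta$, so we have

$$\int_{-\infty}^{\infty}\frac{\partial^2\ln f_\theta(x)}{\partial\theta^2}\,f_\theta(x)\,dx + \int_{-\infty}^{\infty}\Big(\frac{\partial}{\partial\theta}\ln f_\theta(x)\Big)^2 f_\theta(x)\,dx = 0.$$

Therefore

$$I(\theta) = E_\theta\Big[\Big(\frac{\partial}{\partial\theta}\ln f_\theta(X_i)\Big)^2\Big] = -E_\theta\Big[\frac{\partial^2\ln f_\theta(X_i)}{\partial\theta^2}\Big].$$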
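Both expressions for the Fisher information, together with Lemma (23.2), can be verified exactly in a simple case. The sketch below assumes a Bernoulli(θ) observation, so that expectations are finite sums over x = 0, 1 rather than integrals; the Bernoulli model and the value θ = 3/10 are illustrative choices, not taken from the derivation above.

```python
from fractions import Fraction


def bernoulli_pmf(x, theta):
    """f_theta(x) = theta^x (1 - theta)^(1 - x) for x in {0, 1}."""
    return theta if x == 1 else 1 - theta


def score(x, theta):
    """First derivative of ln f_theta(x) with respect to theta."""
    return Fraction(x) / theta - Fraction(1 - x) / (1 - theta)


def second_derivative(x, theta):
    """Second derivative of ln f_theta(x) with respect to theta."""
    return -Fraction(x) / theta ** 2 - Fraction(1 - x) / (1 - theta) ** 2


def expectation(g, theta):
    """E_theta[g(X, theta)] for a Bernoulli(theta) random variable X."""
    return sum(g(x, theta) * bernoulli_pmf(x, theta) for x in (0, 1))


theta = Fraction(3, 10)
mean_score = expectation(score, theta)                                   # Lemma (23.2)
info_from_square = expectation(lambda x, th: score(x, th) ** 2, theta)   # E[(d ln f)^2]
info_from_second = -expectation(second_derivative, theta)                # -E[d^2 ln f]
print(mean_score, info_from_square, info_from_second)
```

In this model both forms give 1/(θ(1 − θ)), and the second-derivative form is usually the quicker one to compute by hand.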

Problems

  1. If X is a random variable with density f (x), explain how to find the distribution of the random variable f (X).
  2. Use the Cramér-Rao inequality to show that the sample mean is a UMVUE of the true mean in the Bernoulli, normal (with $\sigma^2$ known) and Poisson cases.

Lecture 24. Nonparametric Statistics

We wish to make a statistical inference about a random variable X even though we know nothing at all about its underlying distribution.

24.1 Percentiles

Assume that $F$, the distribution function of $X$, is continuous and strictly increasing. If $0 < p < 1$, then the equation $F(x) = p$ has a unique solution $\xi_p$, so that $P\{X \le \xi_p\} = p$. When $p = 1/2$, $\xi_p$ is the median; when $p = .3$, $\xi_p$ is the 30th percentile, and so on.

Let $X_1,\dots,X_n$ be iid, each with distribution function $F$, and let $Y_1,\dots,Y_n$ be the order statistics. We will consider the problem of estimating $\xi_p$.

24.2 Point Estimates

On the average, $np$ of the observations will be less than $\xi_p$. (We have $n$ Bernoulli trials, with probability of success $P\{X_i < \xi_p\} = F(\xi_p) = p$.) It seems reasonable to use $Y_k$ as an estimate of $\xi_p$, where $k$ is approximately $np$. We can be a bit more precise. The random variables $F(X_1),\dots,F(X_n)$ are iid, uniform on (0,1) [see (8.5)]. Thus $F(Y_1),\dots,F(Y_n)$ are the order statistics from a uniform (0,1) sample. We know from Lecture 6 that the density of $F(Y_k)$ is

$$\frac{n!}{(k-1)!(n-k)!}\,x^{k-1}(1-x)^{n-k},\quad 0 < x < 1.$$

Therefore

$$E[F(Y_k)] = \int_0^1 \frac{n!}{(k-1)!(n-k)!}\,x^{k}(1-x)^{n-k}\,dx = \frac{n!}{(k-1)!(n-k)!}\,\beta(k+1, n-k+1).$$

Now $\beta(k+1, n-k+1) = \Gamma(k+1)\Gamma(n-k+1)/\Gamma(n+2) = k!(n-k)!/(n+1)!$, and consequently

$$E[F(Y_k)] = \frac{k}{n+1},\quad 1 \le k \le n.$$

Define $Y_0 = -\infty$ and $Y_{n+1} = \infty$, so that

$$E[F(Y_{k+1}) - F(Y_k)] = \frac{1}{n+1},\quad 0 \le k \le n.$$

(Note that when $k = n$, the expectation is $1 - [n/(n+1)] = 1/(n+1)$, as asserted.)

The key point is that on the average, each $[Y_k, Y_{k+1}]$ produces area $1/(n+1)$ under the density $f$ of the $X_i$. This is true because

$$\int_{Y_k}^{Y_{k+1}} f(x)\,dx = F(Y_{k+1}) - F(Y_k)$$

and we have just seen that the expectation of this quantity is $1/(n+1)$, $k = 0, 1,\dots,n$. If we want to accumulate area $p$, set $k/(n+1) = p$, that is, $k = (n+1)p$.
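The identity E[F(Y_k)] = k/(n + 1) can be confirmed exactly for small n by evaluating the beta integral with factorials. The helper names below are hypothetical, and `beta_integer` is valid only for positive integer arguments.

```python
from fractions import Fraction
from math import factorial


def beta_integer(a, b):
    """beta(a, b) = Gamma(a) Gamma(b) / Gamma(a + b) for positive integers a, b."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def expected_F_Yk(n, k):
    """E[F(Y_k)] for the k-th order statistic of a sample of size n."""
    coefficient = Fraction(factorial(n), factorial(k - 1) * factorial(n - k))
    return coefficient * beta_integer(k + 1, n - k + 1)


def percentile_index(n, p):
    """Solve k / (n + 1) = p for k; the result need not be an integer."""
    return (n + 1) * p


print(expected_F_Yk(9, 3))                  # 3/10
print(percentile_index(9, Fraction(3, 10))) # 3
```

When (n + 1)p is not an integer, some rounding or interpolation rule between neighboring order statistics is needed; the notes do not specify one in this passage.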

Since we are trying to determine whether $F(\xi)$ is equal to $p_0$ or greater than $p_0$, we may regard $\theta = F(\xi)$ as the unknown state of nature. The power function of the test is

$$K(\theta) = P_\theta\{Y \ge c\} = \sum_{k=c}^{n}\binom{n}{k}\theta^k(1-\theta)^{n-k}$$

and in particular, the significance level (probability of a type 1 error) is $\alpha = K(p_0)$.
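The power function is a binomial tail sum and is straightforward to evaluate. The sketch below treats Y as binomial (n, θ), which is what the displayed formula implies; the sample size n = 10, cutoff c = 8 and null value p_0 = 0.5 are hypothetical numbers chosen for illustration.

```python
from math import comb


def power(theta, n, c):
    """K(theta) = P_theta{Y >= c} for Y binomial(n, theta)."""
    return sum(comb(n, k) * theta ** k * (1 - theta) ** (n - k) for k in range(c, n + 1))


n, c, p0 = 10, 8, 0.5            # hypothetical test parameters
alpha = power(p0, n, c)          # significance level K(p0) = 56/1024
print(alpha)
print([round(power(th, n, c), 4) for th in (0.5, 0.6, 0.7, 0.8, 0.9)])
```

The power increases with θ, so larger departures from p_0 are detected with higher probability.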

The above confidence interval estimates and the sign test are distribution free, that is, independent of the underlying distribution function F.

Problems are deferred to Lecture 25.

Lecture 25. The Wilcoxon Test

We will need two formulas:

$$\sum_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6}$$

$$\sum_{k=1}^{n} k^3 = \Big[\frac{n(n+1)}{2}\Big]^2$$

For a derivation via the calculus of finite differences, see my on-line text “A Course in Commutative Algebra”, Section 5.1.

The hypothesis testing problem addressed by the Wilcoxon test is the same as that considered by the sign test, except that:

(1) We are restricted to testing the median $\xi_{.5}$.

(2) We assume that $X_1,\dots,X_n$ are iid and the underlying density is symmetric about the median (so we are not quite nonparametric). There are many situations where we suspect an underlying normal distribution but are not sure. In such cases, the symmetry assumption may be reasonable.

(3) We use the magnitudes as well as the signs of the deviations $X_i - \xi_{.5}$, so the Wilcoxon test should be more accurate than the sign test.

25.1 How The Test Works

Suppose we are testing $H_0: \xi_{.5} = m$ vs. $H_1: \xi_{.5} > m$ based on observations $X_1,\dots,X_n$. We rank the absolute values $|X_i - m|$ from smallest to largest. For example, let $n = 5$ and $X_1 - m = 2.7$, $X_2 - m = -1.3$, $X_3 - m = -0.3$, $X_4 - m = -3.2$, $X_5 - m = 2.4$. Then

$$|X_3 - m| < |X_2 - m| < |X_5 - m| < |X_1 - m| < |X_4 - m|.$$

Let $R_i$ be the rank of $|X_i - m|$, so that $R_3 = 1$, $R_2 = 2$, $R_5 = 3$, $R_1 = 4$, $R_4 = 5$. Let $Z_i$ be the sign of $X_i - m$, so that $Z_i = \pm 1$. Then $Z_3 = -1$, $Z_2 = -1$, $Z_5 = 1$, $Z_1 = 1$, $Z_4 = -1$. The Wilcoxon statistic is

$$W = \sum_{i=1}^{n} Z_iR_i.$$

In this case, $W = -1 - 2 + 3 + 4 - 5 = -1$. Because the density is symmetric about the median, if $R_i$ is given then $Z_i$ is still equally likely to be $\pm 1$, so $(R_1,\dots,R_n)$ and $(Z_1,\dots,Z_n)$ are independent. (Note that if $R_j$ is given, the odds about $Z_i$ ($i \neq j$) are unaffected since the observations $X_1,\dots,X_n$ are independent.) Now the $R_i$ are simply a permutation of $(1, 2,\dots,n)$, so

$W$ is a sum of independent random variables $V_i$ where $V_i = \pm i$ with equal probability.
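The computation of W in the example can be sketched in a few lines. The code below assumes there are no ties among the absolute deviations and no deviation equal to zero, which holds with probability 1 for a continuous distribution. It also enumerates all sign patterns to get the variance of W under the null hypothesis, using the representation of W as a sum of independent V_i = ±i; the function names are hypothetical.

```python
from itertools import product


def wilcoxon_statistic(deviations):
    """Return (W, ranks, signs) for the deviations X_i - m (no ties, no zeros)."""
    order = sorted(range(len(deviations)), key=lambda i: abs(deviations[i]))
    ranks = [0] * len(deviations)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank                       # R_i = rank of |X_i - m|
    signs = [1 if d > 0 else -1 for d in deviations]  # Z_i
    return sum(z * r for z, r in zip(signs, ranks)), ranks, signs


def null_variance(n):
    """Variance of W = sum of V_i, V_i = +i or -i with probability 1/2 each."""
    values = [sum(s * i for s, i in zip(pattern, range(1, n + 1)))
              for pattern in product((1, -1), repeat=n)]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


W, ranks, signs = wilcoxon_statistic([2.7, -1.3, -0.3, -3.2, 2.4])
print(W, ranks, signs)   # -1 [4, 2, 1, 5, 3] [1, -1, -1, -1, 1]
print(null_variance(5))  # 55.0
```

For n = 5 the enumerated variance is 55, which matches n(n+1)(2n+1)/6, the sum of squares formula stated at the start of the lecture.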