6.436J/15.085J Fall 2018 Lecture 15
1.1 The definitions
Recall the following three definitions from the previous lecture.
Definition 1. A random vector X has a nondegenerate (multivariate) normal distribution if it has a joint PDF of the form

fX(x) = (1 / √((2π)^n |V|)) exp( −(1/2) (x − μ) V^{-1} (x − μ)^T ),

for some real vector μ and some positive definite matrix V.
Definition 2. A random vector X has a (multivariate) normal distribution if it can be expressed in the form
X = DW + μ,
for some matrix D and some real vector μ , where W is a random vector whose components are independent N (0, 1) random variables.
Definition 3. A random vector X has a (multivariate) normal distribution if for every real vector a, the random variable a^T X is normal.
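As a quick numerical illustration of Definition 2 (a minimal numpy sketch; the matrix D and vector μ below are arbitrary choices), we can generate samples of X by applying the affine map to independent N(0, 1) components:

    import numpy as np

    rng = np.random.default_rng(0)

    # Arbitrary illustrative parameters: D is 3 x 2, so X is a
    # (degenerate) 3-dimensional normal vector.
    D = np.array([[1.0, 0.0],
                  [2.0, 1.0],
                  [1.0, 1.0]])
    mu = np.array([0.5, -1.0, 2.0])

    # W has independent N(0, 1) components; X = D W + mu (Definition 2).
    W = rng.standard_normal((2, 10_000))
    X = D @ W + mu[:, None]

    # The sample covariance of X should be close to D D^T.
    print(np.cov(X))
    print(D @ D.T)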
In the course of the proof of Theorem 1 in the previous lecture, we argued that if X is multivariate normal, in the sense of Definition 2, then:
(a) It also satisfies Definition 3: if X = DW + μ, where the Wi are independent, then a^T X is a linear function of independent normals, hence normal.
(b) As long as the matrix D is nonsingular (equivalently, if Cov(X, X) = DD^T is nonsingular), X also satisfies Definition 1. (We used the derived distributions formula.)
We complete the proof of equivalence by establishing converses of the above two statements.
Theorem 1.
(a) If X satisfies Definition 1, then it also satisfies Definition 2.
(b) If X satisfies Definition 3, then it also satisfies Definition 2.
Proof:
(a) Suppose that X satisfies Definition 1, so in particular, the matrix V is positive definite. Let D be a symmetric matrix such that D^2 = V. Since
(det(D))^2 = det(D^2 ) = det(V ) > 0 ,
we see that D is nonsingular, and therefore invertible. Let
W = D^{-1}(X − μ).
Note that E[W] = 0. Furthermore,
Cov(W, W) = E[D^{-1}(X − μ)(X − μ)^T D^{-1}] = D^{-1} E[(X − μ)(X − μ)^T] D^{-1} = D^{-1} V D^{-1} = I.
We have shown thus far that the Wi are normal and uncorrelated. We now proceed to show that they are independent. Using the formula for the PDF of X and the change of variables formula, we find that the PDF of W is of the form
c · exp{−w^T w/2} = c · exp{−(w1^2 + · · · + wn^2)/2},
for some normalizing constant c, which is the joint PDF of a vector of independent normal random variables. It follows that X = DW + μ is a multivariate normal in the sense of Definition 2.
(b) Suppose that X satisfies Definition 3, i.e., any linear function a^T X is normal. Let V = Cov(X, X), and let D be a symmetric matrix such that D^2 = V. We first give the proof for the easier case where V (and therefore D) is invertible. Let W = D^{-1}(X − μ). As before, E[W] = 0, and Cov(W, W) = I. Fix a vector s. Then, s^T W is a linear function of W, and is therefore normal. Note that
var(s^T W) = E[s^T W W^T s] = s^T Cov(W, W) s = s^T s.
Since s^T W is a scalar, zero mean, normal random variable, we know that
MW(s) = E[exp{s^T W}] = M_{s^T W}(1) = exp{var(s^T W)/2} = exp{s^T s/2}.
We recognize that this is the transform associated with a vector of independent standard normal random variables. By the inversion property of transforms, it follows that W is a vector of independent standard normal random variables. Therefore, X = DW + μ is multivariate normal in the sense of Definition 2.
(b)′ Suppose now that V is singular (as opposed to positive definite). For simplicity, we will assume that the mean of X is zero. Then, there exists some a ≠ 0 such that V a = 0, and a^T V a = 0. Note that a^T V a = E[(a^T X)^2].
This implies that a^T X = 0, with probability 1. Consequently, some component of X is a deterministic linear function of the remaining components. By possibly rearranging the components of X, let us assume that Xn is a linear function of (X1, ..., Xn−1). If the covariance matrix of (X1, ..., Xn−1) is also singular, we repeat the same argument, until eventually a nonsingular covariance matrix is obtained. At that point we have reached the situation where X is partitioned as X = (Y, Z), with Cov(Y, Y) > 0, and with Z a linear function of Y (i.e., Z = AY, for some matrix A, with probability 1). The vector Y also satisfies Definition 3. Since its covariance matrix is nonsingular, the previous part of the proof shows that it also satisfies Definition 2. Let k be the dimension of Y. Then, Y = DW, where W consists of k independent standard normals, and D is a k × k matrix. Let W′ be a vector of n − k independent standard normals. Then, we can write

X = ( Y )  =  ( D    0 ) ( W  ),
    ( Z )     ( AD   0 ) ( W′ )
which shows that X satisfies Definition 2. We should also consider the extreme possibility that in the process of eliminating components of X, a nonsingular covariance matrix is never obtained. But in that case, we have X = 0, which also satisfies Definition 2, with D = 0. (This is the most degenerate case of a multivariate normal.)
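As a numerical illustration of the construction in part (a) of the proof (a minimal numpy sketch; the positive definite V and the mean μ below are arbitrary choices), we can form the symmetric square root D of V and check that W = D^{-1}(X − μ) has covariance close to the identity:

    import numpy as np

    rng = np.random.default_rng(1)

    # Arbitrary positive definite covariance and mean (illustrative values).
    V = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
    mu = np.array([1.0, -2.0])

    # Symmetric square root D of V via the eigendecomposition: D @ D == V.
    lam, U = np.linalg.eigh(V)
    D = U @ np.diag(np.sqrt(lam)) @ U.T

    # Sample X according to Definition 2, then whiten: W = D^{-1}(X - mu).
    X = D @ rng.standard_normal((2, 100_000)) + mu[:, None]
    W = np.linalg.inv(D) @ (X - mu[:, None])

    print(np.allclose(D @ D, V))  # True: D is a square root of V
    print(np.cov(W))              # approximately the 2 x 2 identity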
The last part of the proof in the previous section provides some interesting intuition. Given a multivariate normal vector X, we can always perform a change of coordinates, and obtain a representation of that vector in terms of independent normal random variables. Our process of going from X to W involved factoring the covariance matrix V of X in the form V = D^2, where D was a symmetric square root of V. However, other factorizations are also possible. The most useful one is described below.
Let
W1 = X1,
W2 = X2 − E[X2 | X1],
W3 = X3 − E[X3 | X1, X2],
...
Wn = Xn − E[Xn | X1, ..., Xn−1].
(a) Each Wi can be interpreted as the new information provided by Xi, given the past, (X1, ..., Xi−1). The Wi are sometimes called the innovations.
(b) When we deal with multivariate normals, conditional expectations are linear functions of the conditioning variables. Thus, the Wi are linear functions of the Xi. Furthermore, we have W = LX, where L is a lower triangular matrix (all entries above the diagonal are zero). The diagonal entries of L are all equal to 1, so L is invertible. The inverse of L is also lower triangular. This means that the transformation from X to W is causal (Wi can be determined from X1, ..., Xi) and causally invertible (Xi can be determined from W1, ..., Wi). Engineers sometimes call this a “causal and causally invertible whitening filter.”
(c) The Wi are independent of each other. This is a consequence of the general fact E[(X − E[X | Y])Y] = 0, which shows that Wi is uncorrelated with X1, ..., Xi−1, hence uncorrelated with W1, ..., Wi−1. For multivariate normals, we know that zero correlation implies independence. As long as the Wi have nonzero variance, we can also normalize them so that their variance is equal to 1.
(d) The covariance matrix of W, call it B, is diagonal. An easy calculation shows that Cov(X, X) = L^{-1} B (L^{-1})^T. This kind of factorization into a product of a lower triangular (L^{-1} B^{1/2}) and upper triangular (B^{1/2} (L^{-1})^T) matrix is called a Cholesky factorization.
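As a numerical illustration of this factorization (a minimal numpy sketch; the covariance matrix below is an arbitrary example), numpy's Cholesky routine returns a lower triangular factor C with C C^T = V, which plays the role of L^{-1} B^{1/2} above, and solving the resulting triangular system whitens X causally:

    import numpy as np

    rng = np.random.default_rng(2)

    # Arbitrary positive definite covariance (illustrative values).
    V = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

    # Cholesky factorization: C is lower triangular with C @ C.T == V,
    # i.e. C plays the role of L^{-1} B^{1/2} in the text above.
    C = np.linalg.cholesky(V)

    # Whitening: W = C^{-1} X is causal (W_i depends only on X_1, ..., X_i)
    # because the inverse of a lower triangular matrix is lower triangular.
    X = C @ rng.standard_normal((3, 100_000))
    W = np.linalg.solve(C, X)

    print(np.cov(W))   # approximately the identity: the W_i are uncorrelated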
We have defined the moment generating function MX(s), for real values of s, and noted that it may be infinite for some values of s. In particular, if MX(s) = ∞ for every s ≠ 0, then the moment generating function does not provide enough information to determine the distribution of X. (As an example, consider a PDF of the form fX(x) = c/(1 + x^2), where c is a suitable normalizing constant.) A way out of this difficulty is to consider complex values of s, and in particular, the case where s is a purely imaginary number: s = it, where i = √−1, and t ∈ R. The resulting function is called the characteristic function, formally defined by
φX(t) = E[e^{itX}].
For example, when X is a continuous random variable with PDF f, we have

φX(t) = ∫ e^{itx} f(x) dx,

which is very similar to the Fourier transform of f (except for the absence of a minus sign in the exponent). Thus, the relation between moment generating functions and characteristic functions is of the same kind as the relation between Laplace and Fourier transforms. Note that e^{itX} is a complex-valued random variable, a new concept for us. However, using the relation e^{itX} = cos(tX) + i sin(tX), defining its expectation is straightforward:
φX (t) = E[cos(tX)] + iE[sin(tX)].
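As a quick illustration (a Monte Carlo sketch; the Uniform[0, 1] example and the sample size are arbitrary choices), the characteristic function can be estimated directly from samples via this real/imaginary decomposition:

    import numpy as np

    rng = np.random.default_rng(3)

    # Samples from an arbitrary example distribution: Uniform[0, 1].
    x = rng.uniform(0.0, 1.0, size=200_000)

    def char_fn(t, samples):
        """Monte Carlo estimate of E[cos(tX)] + i E[sin(tX)]."""
        return np.mean(np.cos(t * samples)) + 1j * np.mean(np.sin(t * samples))

    t = 2.0
    print(char_fn(t, x))                      # Monte Carlo estimate of phi_X(t)
    print((np.exp(1j * t) - 1.0) / (1j * t))  # exact CF of Uniform[0, 1]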
We make a few key observations:
(a) Because |e^{itX}| ≤ 1 for every t, its expectation φX(t) is well-defined and finite for every t ∈ R. In fact, |φX(t)| ≤ 1, for every t.
(b) The key properties of moment generating functions (cf. Lecture 14) are also valid for characteristic functions (same proof).
Theorem 2.
(a) If Y = aX + b, then φY(t) = e^{itb} φX(at).

(b) If X and Y are independent, then φX+Y(t) = φX(t) φY(t).

(c) Let X and Y be independent random variables. Let Z be equal to X, with probability p, and equal to Y, with probability 1 − p. Then,

φZ(t) = p φX(t) + (1 − p) φY(t).
(c) Inversion theorem: If two random variables have the same characteristic function, then their distributions are the same. We prove this result below.
(d) The above inversion theorem remains valid for multivariate characteristic functions, defined by φX(t) = E[e^{i t^T X}].
(e) For the univariate case, if X is a continuous random variable with PDF fX, there is an explicit inversion formula, namely

fX(x) = (1/(2π)) lim_{T→∞} ∫_{−T}^{T} e^{−itx} φX(t) dt,

for every x at which fX is differentiable. (Note the similarity with inversion formulas for Fourier transforms.)

(f) The dominated convergence theorem can be applied to complex random variables (simply apply the DCT separately to the real and imaginary parts). Thus, if lim_{n→∞} Xn = X, a.s., then, for every t ∈ R,

lim_{n→∞} φ_{Xn}(t) = lim_{n→∞} E[e^{itXn}] = E[lim_{n→∞} e^{itXn}] = E[e^{itX}] = φX(t).

The DCT applies here, because the random variables |e^{itXn}| are bounded by 1.
(g) If E[|X|^k] < ∞, then φX(t) is k-times continuously differentiable and also

(d^k/dt^k) φX(t) |_{t=0} = i^k E[X^k].

(This is plausible, by moving the differentiation inside the expectation, but a formal justification is needed; see the numerical sketch after this list.)
(h) If E[e^{ǫ|X|}] < ∞ for some ǫ > 0 (equivalently, if the MGF of X exists in a neighborhood of zero), then φX(t) is an analytic function of t, which extends to all complex z inside the strip {z : −ǫ < Im z < ǫ}.
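As a small numerical illustration of property (g) (a sketch that assumes X is standard normal, so that φX(t) = e^{−t^2/2}, as in example (b) below), a central finite difference of φX at t = 0 recovers i^2 E[X^2] = −1:

    import numpy as np

    # Characteristic function of a standard normal (see example (b) below).
    def phi(t):
        return np.exp(-t**2 / 2.0)

    h = 1e-4
    # Second derivative of phi at 0 by a central finite difference.
    phi_dd0 = (phi(h) - 2.0 * phi(0.0) + phi(-h)) / h**2

    # Property (g) with k = 2: phi''(0) = i^2 E[X^2] = -E[X^2] = -1.
    print(phi_dd0)   # approximately -1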
Two useful characteristic functions:
(a) Exponential: If fX(x) = λe^{−λx}, x ≥ 0, then

φX(t) = λ/(λ − it).

Note that this is the same as starting with MX(s) = λ/(λ − s) and replacing s by it; however, this is not a valid proof. One must either use tools from complex analysis (contour integration), or evaluate separately E[cos(tX)], E[sin(tX)], which can be done using integration by parts.
(b) Normal (scalar): If X is distributed as N(μ, σ^2), then

φX(t) = e^{itμ} e^{−t^2 σ^2/2}.
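As a quick Monte Carlo sanity check of these two formulas (not a substitute for the proofs indicated above; the parameter values and sample sizes below are arbitrary):

    import numpy as np

    rng = np.random.default_rng(4)
    t, lam, mu, sigma = 1.5, 2.0, 1.0, 0.5

    def mc_cf(samples, t):
        """Monte Carlo estimate of E[exp(itX)]."""
        return np.mean(np.exp(1j * t * samples))

    # Exponential(lambda): compare with lambda / (lambda - i t).
    x_exp = rng.exponential(scale=1.0 / lam, size=500_000)
    print(mc_cf(x_exp, t), lam / (lam - 1j * t))

    # Normal(mu, sigma^2): compare with exp(i t mu - t^2 sigma^2 / 2).
    x_norm = rng.normal(loc=mu, scale=sigma, size=500_000)
    print(mc_cf(x_norm, t), np.exp(1j * t * mu - t**2 * sigma**2 / 2.0))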
4.1 Inversion theorem
Theorem 3 (Inversion theorem). Let X and Y have the same characteristic function. Then PX = PY.
Proof. Let a > 1 and consider the following “trapezoidal function”:

fa(x) = 0, for |x| ≥ a;
fa(x) = (x + a)/(a − 1), for −a < x < −1;
fa(x) = 1, for −1 ≤ x ≤ 1;
fa(x) = (a − x)/(a − 1), for 1 < x < a.

Note that

lim_{a→1+} fa(x) = 1_{[−1,1]}(x). (1)
Furthermore, there is the identity

fa(x) = (4 / ((a − 1) 2π)) ∫_R e^{−itx} (1/t^2) ( sin^2(ta/2) − sin^2(t/2) ) dt. (2)

To show this you may either compute the integral directly, or use Fourier inversion and the observation that fa = (1/(a − 1)) (g ∗ g − h ∗ h), where g = 1_{[−a/2, a/2]}, h = 1_{[−1/2, 1/2]}, and ∗ is convolution. Note that the integral in (2) is absolutely convergent, since the absolute value of the integrand,

(1/t^2) | sin^2(ta/2) − sin^2(t/2) |,

is continuous at 0 and integrable at +∞. Thus, by Fubini we have

E[fa(X)] = (4 / ((a − 1) 2π)) ∫_R φX(−t) (1/t^2) ( sin^2(ta/2) − sin^2(t/2) ) dt.
Since φX = φY we have
E[fa(X)] = E[fa(Y )]
for every a > 1. Taking the limit as a ց 1 and applying the BCT to (1), we get

PX([−1, 1]) = PY([−1, 1]).

A similar argument (with shifted and scaled fa) shows that PX and PY coincide on every closed interval. Since the collection of closed intervals is a generating π-system, we have PX = PY.
4.2 Vector-valued random variables
A very useful extension is to define the characteristic function for a vector-valued random variable X = (X1, ..., Xd)^T ∈ R^d. In this case the characteristic function is defined on R^d as follows:

φX(t) = E[e^{i t^T X}],  t = (t1, ..., td)^T ∈ R^d,

where t^T X = Σ_{j=1}^{d} tj Xj denotes the standard scalar product on R^d. Most of the properties and results above (including the inversion theorem) carry over to the vector case. This leads to numerous useful implications, of which we discuss two:
(a) X1, ..., Xd are independent if and only if

φX(t) = φX1(t1) · · · φXd(td) for every t ∈ R^d. (3)

(b) If X is a multivariate normal random vector, then

φX(t) = exp( iμ^T t − (1/2) t^T V t ), (4)

where μ = E[X] and V = Cov(X, X). Since φX uniquely determines the distribution, property (4) is frequently taken as the definition of a multivariate normal. Most properties then follow immediately. For example, “uncorrelated implies independent” is just a consequence of (3).
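As a final numerical illustration (a Monte Carlo sketch; the μ, V, and t below are arbitrary choices), we can check formula (4) against an empirical estimate of E[e^{i t^T X}]:

    import numpy as np

    rng = np.random.default_rng(5)

    # Arbitrary mean and positive definite covariance (illustrative values).
    mu = np.array([1.0, -0.5])
    V = np.array([[1.5, 0.4],
                  [0.4, 0.8]])
    t = np.array([0.7, -1.2])

    # Sample X = D W + mu with D a square root of V (here via Cholesky).
    D = np.linalg.cholesky(V)
    X = D @ rng.standard_normal((2, 500_000)) + mu[:, None]

    # Empirical E[exp(i t^T X)] versus the closed form (4).
    empirical = np.mean(np.exp(1j * (t @ X)))
    closed_form = np.exp(1j * t @ mu - 0.5 * t @ V @ t)
    print(empirical, closed_form)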
where μ = E[X] and V = Cov(X, X). Since φ uniquely determines the distribution, property (4) is frequently taken as the definition of a multi- variate normal. Most properties then follow immediately. For example, “uncorrelated implies independent” is just a consequence of (3).