









These notes cover expectations of functions of two random variables, X and Y. They cover the definition of the joint probability density function (p.d.f.) and joint cumulative distribution function (c.d.f.), their properties, and the definition of independence of random variables. They also cover characteristic functions and their relationship to joint and marginal p.d.f.s. Lastly, they discuss the definition of conditional expectation and its relationship to conditional probability.
Typology: Study notes










Readings from G&S: Section 3.3, Section 3.4, Section 3.6, Section 3.7, Section 4.3, Section 4.6, Section 5.1, Section 5.2, Section 5.6, Section 5.7, Section 5.8.
When we say “expectation,” we mean “average,” the average being roughly what you would think of (i.e., the arithmetic average, as opposed to a median or mode). For a discrete r.v. X, we define the expectation as
$$E[X] = \sum_i x_i\, p_X(x_i).$$
For a continuous r.v., we define the expectation as
$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx.$$
Now a bit of technicality regarding integration, which introduces notation commonly used. When you integrate, you are typically doing a Riemann integral:
$$\int_a^b x f_X(x)\,dx = \lim_{\max_{1 \le i \le n-1} |x_{i+1} - x_i| \to 0} \sum_{i=1}^{n-1} z_i f_X(z_i)(x_{i+1} - x_i),$$
where $a = x_1 < x_2 < \cdots < x_n = b$ and $z_i \in (x_i, x_{i+1})$.
In other words, we break up the interval into little slices and add up the vertical rectangular pieces. Another way of writing this is to recognize that
$$z_i f_X(z_i)(x_{i+1} - x_i) \approx z_i P(x_i < X \le x_{i+1}) = z_i \left(F_X(x_{i+1}) - F_X(x_i)\right)$$
and that in the limit, the approximation becomes exact. Note, however, that this is expressed in terms of the c.d.f., not the p.d.f., and so exists for all random variables, not just continuous ones. This gives rise to what is known as the Riemann-Stieltjes Integral:
$$E[X] = \lim_{\max_{1 \le i \le n-1} (x_{i+1} - x_i) \to 0} \sum_{i=1}^{n-1} z_i \left[F_X(x_{i+1}) - F_X(x_i)\right].$$
We write the limit as
$$\int_a^b x\,dF_X(x).$$
This notation “describes” continuous, discrete, and mixed cases. That is,
$$E[X] = \int_{-\infty}^{\infty} x\,dF_X(x).$$
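The Riemann-Stieltjes sum above can be tried numerically. The sketch below is only an illustration under stated assumptions: the mixed random variable (X = 2 with probability 1/2, otherwise uniform on (0, 1)), the function names, the interval, and the number of slices are all hypothetical choices, not part of the notes. It shows that a single sum over c.d.f. increments handles the jump and the continuous part together.

```python
def mixed_cdf(x):
    """C.d.f. of an assumed mixed r.v.: X = 2 w.p. 1/2, else Uniform(0, 1)."""
    if x < 0.0:
        return 0.0
    if x < 1.0:
        return 0.5 * x      # continuous part carries total mass 1/2
    if x < 2.0:
        return 0.5
    return 1.0              # jump of size 1/2 at x = 2


def stieltjes_expectation(cdf, a, b, n):
    """Approximate the integral of x dF(x) over (a, b] using n equal slices."""
    width = (b - a) / n
    grid = [a + i * width for i in range(n + 1)]
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        z = 0.5 * (left + right)                 # evaluation point z_i
        total += z * (cdf(right) - cdf(left))    # z_i [F(x_{i+1}) - F(x_i)]
    return total


approx_mean = stieltjes_expectation(mixed_cdf, -1.0, 3.0, 40000)
print(approx_mean)  # close to the exact mean 0.5 * 2 + 0.5 * 0.5 = 1.25
```

Because the sum only uses increments of the c.d.f., no separate treatment of the discrete and continuous parts is needed.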
We have defined the Riemann-Stieltjes integral in the context of expectation. However, it has a more general definition:
$$\int_a^b f(x)\,dg(x) = \lim_{\max_i (x_{i+1} - x_i) \to 0} \sum_i f(z_i)\left[g(x_{i+1}) - g(x_i)\right].$$
When g(x) = x, this reduces to the ordinary Riemann integral. Sufficient conditions for existence:
or
The first case covers expectation. In a directly analogous way we define
$$\int_{-\infty}^{\infty} g(x)\,dF_X(x) = \lim \sum_{i=1}^{n-1} g(z_i)\left[F_X(x_{i+1}) - F_X(x_i)\right].$$
Now consider the r.v. Y = g(X). Its expectation is
$$E[Y] = \int_{-\infty}^{\infty} y\,dF_Y(y).$$
Note that dFY (y) is the representation of the limiting value
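$$F_Y(y_{i+1}) - F_Y(y_i) = \Pr(y_i < Y \le y_{i+1}) = \Pr(y_i < g(X) \le y_{i+1}) = \Pr(g^{-1}(y_i) < X \le g^{-1}(y_{i+1})) = \Pr(x_i < X \le x_{i+1}),$$
which, in the limit, is equal to $dF_X(x)$ when $y = g(x)$. (The third equality is written for an increasing, invertible $g$, with $x_i = g^{-1}(y_i)$.) Thus
$$\int_{-\infty}^{\infty} y\,dF_Y(y) = \int_{-\infty}^{\infty} g(x)\,dF_X(x).$$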
Let us put this in more familiar terms: if Y = g(X) and X is continuous with p.d.f. $f_X$, then
$$E[Y] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx. \qquad (1)$$
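A discrete analogue of (1) can be checked by hand or in a few lines of code. The sketch below assumes a hypothetical r.v. X that is uniform on {-2, -1, 0, 1, 2} and the function g(x) = x^2; neither comes from the notes. It computes E[g(X)] once by substituting g(x) into the sum over the p.m.f. of X, and once by first building the p.m.f. of Y = g(X).

```python
from collections import defaultdict
from fractions import Fraction

# Assumed example: X uniform on five points, g(x) = x^2.
p_x = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}


def g(x):
    return x * x


# Route 1: substitute g(x) for x in the expectation, no p.m.f. of Y needed.
lotus = sum(g(x) * p for x, p in p_x.items())

# Route 2: build the p.m.f. of Y = g(X) first, then average y against it.
p_y = defaultdict(Fraction)
for x, p in p_x.items():
    p_y[g(x)] += p
direct = sum(y * p for y, p in p_y.items())

print(lotus, direct)  # both equal 2
```

The second route has to merge the points x and -x into one value of Y; the first route never needs that bookkeeping.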
One might think that finding E[Y] would require finding $f_Y(y)$. However, as (1) shows, all that is necessary is to substitute g(x) for x in the expectation. This is sometimes called the law of the unconscious statistician, since it can be done nearly thoughtlessly.
An interesting result is obtained through the use of indicator functions. For $A \in \mathcal{F}$, let $I_A : \Omega \to \mathbb{R}$ be defined by
$$I_A(\omega) = \begin{cases} 1 & \omega \in A \\ 0 & \omega \notin A. \end{cases}$$
In other words, the indicator function indicates whether its argument is in the set named by its subscript. We define a simple function as one which is a linear combination of indicator functions: for some collection $A_1, A_2, \ldots, A_n \in \mathcal{F}$,
$$g(\omega) = \sum_{k=1}^{n} b_k I_{A_k}(\omega).$$
This gives us a piecewise-constant function on Ω. It also defines a random variable. Note that the collection need not be disjoint. However, we can shuffle things around to write the function as
$$g(\omega) = \sum_{k=1}^{n^*} b_k^* I_{A_k^*}(\omega),$$
where the sets $A_k^*$ are disjoint.
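The "shuffling" into a disjoint representation can be made concrete on a finite sample space. The sketch below is an assumption-laden illustration: the sample space {1, ..., 6}, the overlapping sets, the coefficients, and the helper names are all hypothetical. It groups outcomes by which of the original sets contain them, so each group becomes one disjoint set with the sum of the relevant coefficients.

```python
from collections import defaultdict

# Assumed example: two overlapping sets on a six-point sample space.
omega = range(1, 7)
sets = [{1, 2, 3}, {3, 4}]
coeffs = [2, 5]


def g(w):
    """Original simple function: sum of b_k over the sets A_k containing w."""
    return sum(b for b, a in zip(coeffs, sets) if w in a)


# Group outcomes by membership pattern; each group is one disjoint set A*_k.
atoms = defaultdict(set)
for w in omega:
    pattern = tuple(w in a for a in sets)
    atoms[pattern].add(w)

disjoint_terms = []
for pattern, atom in atoms.items():
    b_star = sum(b for b, inside in zip(coeffs, pattern) if inside)
    if b_star != 0:                      # zero terms can be dropped
        disjoint_terms.append((b_star, atom))


def g_star(w):
    """Same function written over the disjoint sets."""
    return sum(b for b, atom in disjoint_terms if w in atom)


print(disjoint_terms)  # [(2, {1, 2}), (7, {3}), (5, {4})]
```

The overlap {3} receives the coefficient 2 + 5 = 7, and the two representations agree at every outcome.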
If a = −b, then the integral is zero, no matter what. If you fix a or b, taking the limit of the other one, the result is ∞. That is, $E[X^+]$ exists and $E[X^-]$ exists (although both are ∞), but they can't be subtracted. □
E[X] acts kind of like an integral of X(ω) over Ω, weighted by P. One way that the expectation is expressed is
$$E[X] = \int_{\Omega} X(\omega)\,P(d\omega) = \int_{\Omega} X\,dP.$$
An integral in this form is said to be a Lebesgue-Stieltjes Integral. Since X induces a probability PX on (R, B), as we have observed we can also think of the probability space (R, B, PX ). We can write
$$E[X] = \int_{\mathbb{R}} x\,P_X(dx)$$
where now X is the “identity” r.v. on the real line. We thus have two equivalent definitions:
$$\int_{\Omega} X(\omega)\,P(d\omega) = \int_{\mathbb{R}} x\,P_X(dx).$$
Back to properties:
$$\int_{\Omega} (g \circ X)(\omega)\,P(d\omega) = \int_{\mathbb{R}} g(x)\,P_X(dx) = \int_{\mathbb{R}} y\,P_Y(dy).$$
Ultimately, we will be dealing with infinite sequences of random variables. As steps along the way, we will examine carefully pairs of random variables, then vectors of random variables. On R^2 , the smallest σ-field of interest is B^2 , which is the smallest σ-field containing all of the rectangles. This is the Borel σ-field of R^2.
Definition 1 A bivariate random variable (X, Y) is a measurable mapping from (Ω, F) to (R^2, B^2). □
That is, {ω ∈ Ω : (X, Y)(ω) ∈ B} ∈ F for all B ∈ B^2. Note that two r.v.s X, Y on (Ω, F) form a bivariate r.v.
Definition 2 The joint or bivariate distribution of (X, Y ) is
PXY (B) = P ({ω ∈ Ω : (X, Y )(ω) ∈ B})
for B ∈ B^2. □
Definition 3 The joint c.d.f. of (X, Y ) is defined as
$$F_{XY}(a, b) = P(X \le a, Y \le b) = P((X, Y) \in R_{a,b})$$
where $R_{a,b}$ is the semi-infinite rectangle
$$R_{a,b} = \{(x, y) \in \mathbb{R}^2 : x \le a,\ y \le b\}.$$
□
Properties of the joint c.d.f.:
Any function with these properties is a legitimate joint c.d.f.; the properties completely characterize the family of joint c.d.f.s.
Definition 4 If X, Y are discrete r.v.s taking values in sets $\{x_1, \ldots\}$ and $\{y_1, \ldots\}$, respectively, then (X, Y) forms a discrete bivariate r.v. and its joint p.m.f. is defined by
pXY (a, b) = P (X = a, Y = b)
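□
Properties of $p_{XY}$:
1. $p_{XY} \ge 0$, and $p_{XY}(a, b) = 0$ if $a \notin \{x_1, \ldots\}$ or $b \notin \{y_1, \ldots\}$.
2. $\sum_{i=1}^{\infty} \sum_{j=1}^{\infty} p_{XY}(x_i, y_j) = 1$.
3. $F_{XY}(a, b) = \sum_{\{(x_i, y_j) :\, x_i \le a,\ y_j \le b\}} p_{XY}(x_i, y_j)$.
4. The marginal p.m.f.s are
$$p_X(x_i) = \sum_j p_{XY}(x_i, y_j), \qquad p_Y(y_j) = \sum_i p_{XY}(x_i, y_j).$$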
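The joint p.m.f. properties above translate directly into small computations. The sketch below assumes a hypothetical joint p.m.f. on {0, 1} x {0, 1}; the table values and function names are illustrative only. It recovers both marginals by summing out the other variable and evaluates the joint c.d.f. by summing over the semi-infinite rectangle.

```python
from fractions import Fraction

# Assumed joint p.m.f. p_XY(x, y); the numbers are made up for illustration.
p_xy = {
    (0, 0): Fraction(1, 8),
    (0, 1): Fraction(3, 8),
    (1, 0): Fraction(2, 8),
    (1, 1): Fraction(2, 8),
}


def marginal(joint, axis):
    """Sum the joint p.m.f. over the other coordinate (axis 0 -> X, 1 -> Y)."""
    out = {}
    for point, p in joint.items():
        key = point[axis]
        out[key] = out.get(key, Fraction(0)) + p
    return out


def joint_cdf(joint, a, b):
    """F_XY(a, b): total mass on points with x <= a and y <= b."""
    return sum(p for (x, y), p in joint.items() if x <= a and y <= b)


p_x = marginal(p_xy, 0)
p_y = marginal(p_xy, 1)
print(p_x, p_y, joint_cdf(p_xy, 0, 1))
```

Here p_X puts mass 1/2 on each point, p_Y puts 3/8 on 0 and 5/8 on 1, and F_XY(0, 1) = 1/2.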
Definition 5 X and Y are jointly continuous r.v.s if there is a function $f_{XY} : \mathbb{R}^2 \to \mathbb{R}$ such that
$$F_{XY}(a, b) = \int_{-\infty}^{b} \int_{-\infty}^{a} f_{XY}(x, y)\,dx\,dy$$
for all (a, b) ∈ R^2. The function $f_{XY}$ is called the joint p.d.f. of X and Y (when it exists). □
Properties of joint p.d.f.:
1. $f_{XY} \ge 0$.
2. $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{XY}(x, y)\,dx\,dy = 1$.
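Let g : R^2 → R be measurable (i.e., {(x, y) ∈ R^2 : g(x, y) ∈ B} ∈ B^2 for all B ∈ B). Then for a bivariate r.v. (X, Y) we can define Z = g(X, Y), with expectation
$$E[Z] = \int_{-\infty}^{\infty} z\,dF_Z(z) = \int_{\mathbb{R}^2} g(x, y)\,P_{XY}(dx\,dy) = \begin{cases} \displaystyle\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f_{XY}(x, y)\,dx\,dy & (X, Y) \text{ jointly continuous} \\[2ex] \displaystyle\sum_i \sum_j g(x_i, y_j)\,p_{XY}(x_i, y_j) & (X, Y) \text{ discrete.} \end{cases}$$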
Properties:
If X and Y are independent, then $E[g_1(X) g_2(Y)] = E[g_1(X)]\,E[g_2(Y)]$ for all (measurable, well-defined) $g_1, g_2$.
Comments: If X and Y are independent, then
E[XY ] = E[X]E[Y ].
However, if E[X]E[Y] = E[XY], this does not mean that they are independent. (Being uncorrelated does not imply independence.) On the other hand, if $E[g_1(X) g_2(Y)] = E[g_1(X)]\,E[g_2(Y)]$ for all appropriate functions, then X and Y are independent. In fact, this condition is necessary and sufficient for independence.
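A standard counterexample makes the distinction concrete. The sketch below assumes a hypothetical pair: X uniform on {-1, 0, 1} and Y = X^2 (this example is not from the notes). The product rule holds for the identity functions, so X and Y are uncorrelated, but it fails for another choice of g_1, and the joint probabilities do not factor.

```python
from fractions import Fraction

# Assumed example: X uniform on {-1, 0, 1}, Y = X^2 (fully determined by X).
p_x = {x: Fraction(1, 3) for x in (-1, 0, 1)}


def expect(fn):
    """E[fn(X, Y)] with Y = X^2, computed as a sum over the p.m.f. of X."""
    return sum(fn(x, x * x) * p for x, p in p_x.items())


e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)
e_xy = expect(lambda x, y: x * y)

# Identity functions: the product rule holds, so X and Y are uncorrelated.
uncorrelated = (e_xy == e_x * e_y)

# Another pair, g1(x) = x^2 and g2(y) = y: the product rule fails.
lhs = expect(lambda x, y: (x * x) * y)
rhs = expect(lambda x, y: x * x) * e_y

# Direct check of the definition: P(X=0, Y=0) versus P(X=0) P(Y=0).
p_joint_00 = p_x[0]
p_prod_00 = p_x[0] * p_x[0]

print(uncorrelated, lhs, rhs, p_joint_00, p_prod_00)
```

Here E[XY] = E[X]E[Y] = 0, while E[X^2 Y] = 2/3 differs from E[X^2]E[Y] = 4/9, and P(X = 0, Y = 0) = 1/3 differs from 1/9.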
Definition 7 The covariance of X and Y is defined as
cov(X, Y ) = E[(X − E[X])(Y − E[Y ])]
The variance of X is defined as
var(X) = cov(X, X).
□
var(aX) = a^2 var(X).
Definition 8 If 0 < var(X) < ∞ and 0 < var(Y) < ∞, the correlation coefficient between X and Y is
$$\rho(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)\operatorname{var}(Y)}}.$$
This is a normalized version of the covariance. □
X = aY + b
for some constants (a, b) with a ≠ 0.
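To see the normalization in Definition 8 at work, the correlation coefficient can be computed for a linear relation X = aY + b and for a nonlinear one. The sketch below is a hypothetical illustration: Y is assumed uniform on {0, 1, 2, 3}, and the constants a = -2, b = 5 and the helper names are arbitrary choices.

```python
import math

# Assumed example: Y uniform on four points, each with probability 1/4.
y_values = [0, 1, 2, 3]
prob = 1 / len(y_values)


def mean(values):
    return sum(v * prob for v in values)


def cov(xs, ys):
    """cov(X, Y) = E[(X - E[X])(Y - E[Y])] for paired, equally likely values."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) * prob for x, y in zip(xs, ys))


def corr(xs, ys):
    """rho(X, Y) = cov(X, Y) / sqrt(var(X) var(Y)), with var(X) = cov(X, X)."""
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))


a, b = -2.0, 5.0
x_linear = [a * y + b for y in y_values]   # exact linear relation
x_square = [y * y for y in y_values]       # nonlinear relation

rho_linear = corr(x_linear, y_values)
rho_square = corr(x_square, y_values)
print(rho_linear, rho_square)
```

With the linear relation and a < 0 the result is -1; with X = Y^2 the magnitude falls strictly below 1 even though X is still a function of Y.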
Example 3 If $(X, Y) \sim N(\mu_x, \mu_y, \sigma_x^2, \sigma_y^2, \rho)$, then ρ(X, Y) = ρ. □
As we have observed before, if X, Y are jointly Gaussian and ρ = 0, then they are independent. For other joint distributions, ρ = 0 does not imply independence.
The characteristic function is essentially the Fourier transform of the p.d.f. or p.m.f. Characteristic functions are useful in practice not for the usual reasons engineers use Fourier transforms (e.g., frequency content), but because they can provide a means of computing moments (as we will see), and they are useful in finding distributions of sums of independent random variables.
Definition 9 Let X be a r.v. The characteristic function (ch.f.) of X is
$$\phi_X(u) = E[e^{iuX}]$$
for u ∈ R. (Here, $i = \sqrt{-1}$. We will not use $\sqrt{-1} = j$.) □
Let us write some more explicit formulas. Suppose X is a continuous random variable. Then (by the law of the unconscious statistician)
$$\phi_X(u) = \int_{-\infty}^{\infty} e^{iux} f_X(x)\,dx.$$
This may be recognized as the Fourier transform of $f_X(x)$, where u is the “frequency” variable. (Comment on sign of exponent.) Note that given $\phi_X$ we can determine $f_X$ by an inverse Fourier transform:
$$f_X(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-iux} \phi_X(u)\,du.$$
If X is a discrete r.v.,
$$\phi_X(u) = \sum_i e^{iux_i} p_X(x_i),$$
which we recognize as the discrete-time Fourier transform, and as before u is the “frequency” variable. (Comment on the sign of the exponent.) Given a $\phi_X$, we can find $p_X$ by the inverse discrete-time Fourier transform.
Properties:
fX ↔ φX.
Thus, φX provides yet another way of displaying the probability structure of X.
$$\phi_X(u) = \int_{-\infty}^{\infty} e^{iux}\,dF_X(x).$$
This is referred to as the Fourier-Stieltjes transform of $F_X$.
Let X and Y be independent r.v.s, and let
Z = X + Y.
Then
$$\phi_Z(u) = E[\exp(iuZ)] = E[\exp(iuX + iuY)] = \phi_{X,Y}(u, u).$$
But also
E[exp(iuX + iuY )] = E[exp(iuX) exp(iuY )] = φX (u)φY (u).
So φZ (u) = φX (u)φY (u)
If X and Y are continuous r.v.s, then so is Z.
$$f_Z(z) = \mathcal{F}^{-1}[\phi_Z(u)] = \mathcal{F}^{-1}[\phi_X(u)\phi_Y(u)] = (f_X * f_Y)(z)$$
by the convolution theorem. Thus, when continuous independent random variables are added, the p.d.f. of the sum is the convolution of the p.d.f.s (and likewise the p.m.f. of a sum of discrete independent r.v.s is the convolution of the p.m.f.s).
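The discrete version of this statement is easy to verify numerically. The sketch below assumes two independent fair six-sided dice as a hypothetical example; the function names and the test frequency u = 0.7 are arbitrary. It convolves the two p.m.f.s to get the p.m.f. of the sum and compares the characteristic function of the sum with the product of the individual characteristic functions.

```python
import cmath
from fractions import Fraction

# Assumed example: X and Y are independent fair dice.
die = {k: Fraction(1, 6) for k in range(1, 7)}


def convolve(p, q):
    """P.m.f. of X + Y for independent X ~ p and Y ~ q (discrete convolution)."""
    out = {}
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] = out.get(x + y, Fraction(0)) + px * qy
    return out


def char_fn(pmf, u):
    """phi(u) = sum_i exp(i u x_i) p(x_i)."""
    return sum(float(p) * cmath.exp(1j * u * x) for x, p in pmf.items())


p_sum = convolve(die, die)

u = 0.7
lhs = char_fn(p_sum, u)                     # phi_Z(u)
rhs = char_fn(die, u) * char_fn(die, u)     # phi_X(u) phi_Y(u)
print(p_sum[7], abs(lhs - rhs))
```

The convolution gives P(Z = 7) = 1/6, and the two characteristic function values agree up to floating-point rounding.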
If $(X, Y) \sim N(\mu_x, \mu_y, \sigma_x^2, \sigma_y^2, \rho)$, then
$$\phi_{X,Y}(u, v) = \exp\left[i(u\mu_x + v\mu_y) - \tfrac{1}{2}\left(u^2\sigma_x^2 + v^2\sigma_y^2 + 2uv\rho\sigma_x\sigma_y\right)\right].$$
We make an observation here: the “form” of the Gaussian p.d.f. is the exponential of a quadratic. The Fourier transform of the exponential of a quadratic is again the exponential of a quadratic. This little fact gives rise to much of the analytical and practical usefulness of Gaussian r.v.s.
We observe that φX,Y (u, 0) = φX (u)
In our Gaussian example, we have
$$\phi_X(u) = \phi_{X,Y}(u, 0) = \exp\left(iu\mu_x - \sigma_x^2 u^2 / 2\right),$$
which is the ch.f. for a Gaussian,
$$X \sim N(\mu_x, \sigma_x^2).$$
We could, of course, have obtained a similar result via integration, but this is much easier.
In general, when we observe an outcome of a random variable, we “expect” it to be near the mean (that is, near the expected value). Further, the farther the outcome is from the mean, the less likely we expect the outcome to be. There are some very useful probability bounds which quantify these intuitive “expectations.” These are the Markov inequality and its consequences, the Chebyshev inequality and the Chernoff bound. We will introduce these here.
Let B ∈ B be a Borel set. Recall that
$$I_B(x) = \begin{cases} 1 & x \in B \\ 0 & x \notin B. \end{cases}$$
Let X be a random variable, and let Y = IB (X). This is a measurable function, so Y is another random variable.
E[Y ] = PX (B) = P (X ∈ B).
We will use this “expectation as probability” idea to get a bound. Suppose g is a nonnegative, nondecreasing function. Suppose b ∈ R with g(b) > 0. Consider the function
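$$h(x) = \frac{g(x)}{g(b)}, \qquad \text{which satisfies } h(x) \ge 1 \text{ for } x \ge b \text{ and } h(x) \ge 0 \text{ for all } x.$$
Observe that
$$h(x) \ge I_{[b,\infty)}(x)$$
for all x, since
$$I_{[b,\infty)}(x) = \begin{cases} 1 & x \ge b \\ 0 & x < b. \end{cases}$$
Note that $h(b) = I_{[b,\infty)}(b)$. Now we have
$$E[h(X)] \ge E[I_{[b,\infty)}(X)] = P(X \ge b).$$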
Also,
$$E[h(X)] = \frac{E[g(X)]}{g(b)}.$$
Thus
$$P(X \ge b) \le \frac{E[g(X)]}{g(b)}.$$
A similar result can be established if g is nonnegative, nondecreasing on [0, ∞) and symmetric about 0. We can thus establish that
$$P(|X| \ge b) \le \frac{E[g(X)]}{g(b)}.$$
Special case: Assume X ≥ 0 , and let
$$g(x) = \begin{cases} x & x \ge 0 \\ 0 & x < 0. \end{cases}$$
The inequality above gives rise to the Markov inequality:
$$P(X \ge b) \le \frac{E[X]}{b}$$
for all b > 0. Somewhat more generally, the Markov inequality says
$$P(|X| \ge b) \le \frac{E[|X|]}{b}.$$
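The Markov inequality can be checked exactly on a small example. The sketch below assumes a hypothetical nonnegative r.v., the outcome of a fair six-sided die; the names and the range of thresholds b are illustrative choices. It compares the exact tail probability P(X >= b) with the bound E[X]/b.

```python
from fractions import Fraction

# Assumed example: X is the outcome of a fair six-sided die (so X >= 0).
pmf = {k: Fraction(1, 6) for k in range(1, 7)}
mean = sum(k * p for k, p in pmf.items())


def tail(b):
    """Exact tail probability P(X >= b)."""
    return sum(p for k, p in pmf.items() if k >= b)


# One row per threshold: (b, exact tail, Markov bound E[X]/b).
rows = [(b, tail(b), mean / b) for b in range(1, 7)]
markov_holds = all(t <= bound for _b, t, bound in rows)

for b, t, bound in rows:
    print(b, t, bound)
```

The bound is loose here (for b = 6 the exact tail is 1/6 against a bound of 7/12), which is typical: the Markov inequality uses only the mean of X.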
This is an inequality that holds in any Hilbert space. It more or less forms the theme for the first several weeks of 6030.
Theorem 3 If $E[X^2] < \infty$ and $E[Y^2] < \infty$ then
$$|E[XY]|^2 \le E[X^2]\,E[Y^2].$$
For example,
$$|\operatorname{cov}(X, Y)|^2 \le \operatorname{var}(X)\operatorname{var}(Y),$$
implying |ρ| ≤ 1. Observe that $\operatorname{var}(X) = E[X^2] - E[X]^2 \ge 0$, using the Schwarz and Jensen inequalities.
Suppose X is a discrete r.v. Now we define the conditional distribution of another r.v. Y given X = xk (at some point where P (X = xk ) > 0 ) by
$$F_{Y|X}(y|x_k) = P(Y \le y \mid X = x_k) = \frac{P(Y \le y, X = x_k)}{P(X = x_k)}.$$
By the law of total probability,
$$F_Y(y) = \sum_k F_{Y|X}(y|x_k)\,p_X(x_k).$$
As we discussed before, when we condition on an event, we are shrinking the sample space under consideration. So there is some normalization that takes place. We also define
$$E[Y \mid X = x_k] = \int_{-\infty}^{\infty} y\,dF_{Y|X}(y|x_k).$$
Note that this depends on the value of xk ; it is a function of xk. Let us now take the expectation with respect to X:
$$E_X\left[E[Y \mid X = x_k]\right] = \sum_k E[Y \mid X = x_k]\,p_X(x_k) = E[Y].$$
We can think of E[Y|X] as a discrete random variable that is a function of X, taking the value $E[Y \mid X = x_k]$ when $X = x_k$. For a discrete r.v. X, the function $F_{Y|X}(y|x_k)$ could be the c.d.f. of either a discrete or a continuous r.v.
Discrete: $p_{Y|X}(y|x_k) = P(Y = y \mid X = x_k)$.
Continuous: There exists a function $f_{Y|X}$ such that
$$F_{Y|X}(y|x_k) = \int_{-\infty}^{y} f_{Y|X}(z|x_k)\,dz.$$
We can also write $F_{Y|X}(y|x_k) = E[I_{(-\infty, y]}(Y) \mid X = x_k]$. If Y is discrete we have
$$p_{Y|X}(y|x_k) = \frac{P(Y = y, X = x_k)}{p_X(x_k)} = \frac{p_{XY}(x_k, y)}{p_X(x_k)}.$$
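For discrete X and Y, the conditional p.m.f., the conditional expectation, and the iterated-expectation identity can all be computed from a joint p.m.f. table. The sketch below assumes a hypothetical joint p.m.f. on {0, 1} x {0, 1}; the numbers and helper names are made up for illustration.

```python
from fractions import Fraction

# Assumed joint p.m.f. p_XY(x, y); the values are illustrative only.
p_xy = {
    (0, 0): Fraction(1, 8),
    (0, 1): Fraction(3, 8),
    (1, 0): Fraction(2, 8),
    (1, 1): Fraction(2, 8),
}

# Marginal p.m.f. of X, needed as the normalizing constant.
p_x = {}
for (x, _y), p in p_xy.items():
    p_x[x] = p_x.get(x, Fraction(0)) + p


def cond_pmf(y, x):
    """p_{Y|X}(y | x) = p_XY(x, y) / p_X(x)."""
    return p_xy.get((x, y), Fraction(0)) / p_x[x]


def cond_mean(x):
    """E[Y | X = x] as a sum over the values of Y paired with this x."""
    return sum(y * cond_pmf(y, x) for (xx, y) in p_xy if xx == x)


# Iterated expectation: average E[Y | X = x_k] against p_X(x_k).
iterated = sum(cond_mean(x) * p for x, p in p_x.items())
mean_y = sum(y * p for (_x, y), p in p_xy.items())
print(cond_mean(0), cond_mean(1), iterated, mean_y)
```

The conditional means are 3/4 and 1/2, and averaging them with weights p_X(0) = p_X(1) = 1/2 returns E[Y] = 5/8.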
When X is a continuous r.v., conditional probabilities and expectations are somewhat more complicated, because P(X = x) = 0 for any particular value of x. Recall that in the discrete case $E[Y \mid X = x_k] = g(x_k)$ for some function g, and E[g(X)] = E[Y].
Definition 14 Suppose Y is an r.v. on the probability space (Ω, F, P) with E[|Y|] < ∞. Then for A ∈ F, define
$$\int_A Y\,dP = E[I_A Y].$$
□
Definition 15 Suppose X and Y are random variables and E[|Y|] < ∞. The conditional expectation of Y given X = x is any measurable function g(x) = E[Y|X = x] of x satisfying
$$\int_B E[Y \mid X = x]\,P_X(dx) = \int_{X^{-1}(B)} Y\,dP \qquad (2)$$
for all B ∈ B, where $X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\}$. □
When a condition is true with probability 1, we say that it is true “almost surely,” or “a.s.” Once we have defined conditional expectation, we can define a conditional c.d.f.:
$$F_{Y|X}(y|x) = E[I_{(-\infty, y]}(Y) \mid X = x].$$
Properties:
$$F_Y(y) = \int_{\mathbb{R}} F_{Y|X}(y|x)\,P_X(dx).$$
$$f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)}.$$
There is another interpretation:
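$$\begin{aligned} F_{Y|X}(y|x) &= \lim_{\Delta x \to 0^+} P(Y \le y \mid x - \Delta x/2 < X \le x + \Delta x/2) \\ &= \frac{\lim_{\Delta x \to 0^+} P(Y \le y,\ x - \Delta x/2 < X \le x + \Delta x/2)/\Delta x}{\lim_{\Delta x \to 0^+} P(x - \Delta x/2 < X \le x + \Delta x/2)/\Delta x} \\ &= \frac{\frac{\partial}{\partial x} F_{XY}(x, y)}{\frac{\partial}{\partial x} F_X(x)} = \frac{\frac{\partial}{\partial x} F_{XY}(x, y)}{f_X(x)}. \end{aligned}$$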
Definition 16 If X is an r.v., define σ(X) (the σ-field generated by X) to be
{{ω ∈ Ω : X(ω) ∈ B} for B ∈ B}.
□
Fact: An r.v. Y is measurable with respect to σ(X) if and only if there is a measurable function g : R → R such that Y = g(X). We now define conditional expectation with respect to a σ-field:
Definition 17 If X and Y are r.v.s with E[|Y |] < ∞, we define
E[Y |X] = E[Y |σ(X)]
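□
Properties:
$$E[Y \mid X] = g(X)$$
for some function g, where g(x) = E[Y|X = x].
$$E\left[E[Y \mid \mathcal{E}] \mid \mathcal{G}\right] = E[Y \mid \mathcal{G}].$$
Idea: If you first condition on a field $\mathcal{E}$ that is less “coarse” than $\mathcal{G}$ (that is, $\mathcal{G} \subset \mathcal{E}$), you get an r.v. Then condition on $\mathcal{G}$.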
Definition 18 Two σ-fields $\mathcal{G}$ and $\mathcal{H}$ are independent if
$$P(G \cap H) = P(G)P(H) \quad \text{for all } G \in \mathcal{G} \text{ and } H \in \mathcal{H}.$$
□
Note: X and Y are independent r.v.s iff σ(X) and σ(Y) are independent σ-fields.
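On a finite sample space, σ(X) can be listed explicitly and Definition 18 checked by brute force. The sketch below is a hypothetical illustration: the sample space is two fair coin flips, and the random variables X (first flip), Y (second flip), and Z = X + Y, as well as all function names, are assumptions made for the example. Since each r.v. takes finitely many values, the preimages of all subsets of its range give every set in the generated σ-field.

```python
from fractions import Fraction
from itertools import chain, combinations, product

# Assumed sample space: two fair coin flips, each outcome with probability 1/4.
omega = list(product((0, 1), repeat=2))
prob = {w: Fraction(1, 4) for w in omega}


def P(event):
    return sum(prob[w] for w in event)


def sigma_field(rv):
    """All preimages {w : rv(w) in B}, with B ranging over subsets of the range."""
    values = sorted({rv(w) for w in omega})
    subsets = chain.from_iterable(
        combinations(values, r) for r in range(len(values) + 1)
    )
    return {frozenset(w for w in omega if rv(w) in b) for b in subsets}


def independent(field_g, field_h):
    """Check P(G and H) = P(G) P(H) for every pair of sets from the two fields."""
    return all(P(g & h) == P(g) * P(h) for g in field_g for h in field_h)


def X(w):
    return w[0]


def Y(w):
    return w[1]


def Z(w):
    return w[0] + w[1]


xy_independent = independent(sigma_field(X), sigma_field(Y))
xz_independent = independent(sigma_field(X), sigma_field(Z))
print(len(sigma_field(X)), xy_independent, xz_independent)
```

Here σ(X) has four sets, σ(X) and σ(Y) are independent, and σ(X) and σ(Z) are not, since for example P(X = 0, Z = 0) = 1/4 while P(X = 0)P(Z = 0) = 1/8.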