Lecture 2: More on Random Variables - Expectations of Functions of Two Random Variables, Study notes of Stochastic Processes

These notes cover expectations of functions of two random variables, X and Y. They cover the definitions of the joint probability density function (p.d.f.) and joint cumulative distribution function (c.d.f.), their properties, and the definition of independence of random variables. They also cover characteristic functions and their relationship to joint and marginal p.d.f.s. Lastly, they discuss the definition of conditional expectation and its relationship to conditional probability.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

koofers-user-dls


ECE 6010

Lecture 2 – More on Random Variables

Readings from G&S: Sections 3.3, 3.4, 3.6, 3.7, 4.3, 4.6, 5.1, 5.2, 5.6, 5.7, 5.8.

Expectation

When we say “expectation,” we mean “average,” the average being roughly what you would think of (i.e., the arithmetic average, as opposed to a median or mode). For a discrete r.v. X, we define the expectation as

$$E[X] = \sum_i x_i \, p_X(x_i).$$

For a continuous r.v., we define the expectation as

$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx.$$

Now a bit of technicality regarding integration, which introduces notation commonly used. When you integrate, you are typically doing a Riemann integral:

$$\int_a^b x f_X(x)\,dx = \lim_{\max_{1 \le i \le n-1} |x_{i+1} - x_i| \to 0} \sum_{i=1}^{n-1} z_i f_X(z_i)(x_{i+1} - x_i),$$

where $a = x_1 < x_2 < \cdots < x_n = b$ and $z_i \in (x_i, x_{i+1})$.

In other words, we break up the interval into little slices and add up the vertical rectangular pieces. Another way of writing this is to recognize that

$$z_i f_X(z_i)(x_{i+1} - x_i) \approx z_i P(x_i < X \le x_{i+1}) = z_i \big(F_X(x_{i+1}) - F_X(x_i)\big)$$

and that in the limit, the approximation becomes exact. Note, however, that this is expressed in terms of the c.d.f., not the p.d.f., and so exists for all random variables, not just continuous ones. This gives rise to what is known as the Riemann-Stieltjes integral:

$$E[X] = \lim_{\max_{1 \le i \le n-1} (x_{i+1} - x_i) \to 0} \sum_{i=1}^{n-1} z_i \big[F_X(x_{i+1}) - F_X(x_i)\big].$$

We write the limit as

$$\int_a^b x\,dF_X(x).$$

This notation “describes” continuous, discrete, and mixed cases. That is,

$$E[X] = \int_{-\infty}^{\infty} x\,dF_X(x).$$
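As a rough numerical illustration of the Riemann-Stieltjes sum, the sketch below approximates $E[X]$ directly from a c.d.f. The mixed r.v. used here (equal to 0 with probability 1/2, otherwise uniform on (0, 1)) and the function names `mixed_cdf` and `stieltjes_mean` are assumptions made for the example, not part of the notes. Its exact mean is 1/4.

```python
def mixed_cdf(x):
    """c.d.f. of a hypothetical mixed r.v.: X = 0 w.p. 1/2, else Uniform(0, 1)."""
    if x < 0.0:
        return 0.0
    if x <= 1.0:
        return 0.5 + 0.5 * x   # jump of 1/2 at x = 0, then a linear ramp
    return 1.0


def stieltjes_mean(cdf, a, b, n):
    """Approximate the integral of x dF(x) over [a, b] using n slices."""
    total = 0.0
    for i in range(n):
        lo = a + (b - a) * i / n
        hi = a + (b - a) * (i + 1) / n
        z = 0.5 * (lo + hi)                 # any point of the slice would do
        total += z * (cdf(hi) - cdf(lo))    # z_i [F(x_{i+1}) - F(x_i)]
    return total


approx_mean = stieltjes_mean(mixed_cdf, -1.0, 2.0, 3000)
print(approx_mean)  # close to the exact value 0.25
```

No p.d.f. is needed anywhere in the computation, which is the point of writing the expectation in terms of $dF_X$.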

We have defined the Riemann-Stieltjes integral in the context of expectation. However, it has a more general definition:

$$\int_a^b f(x)\,dg(x) = \lim_{\max_i (x_{i+1} - x_i) \to 0} \sum_i f(z_i)\big[g(x_{i+1}) - g(x_i)\big].$$

When g(x) = x, this reduces to the ordinary Riemann integral. Sufficient conditions for existence:

  • g(x) of bounded variation
  • and f (x) continuous on [a, b]

or

  • f (x) of bounded variation
  • g(x) continuous

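The first case covers the case of expectation. In a directly analogous way we define

$$\int_{-\infty}^{\infty} g(x)\,dF_X(x) = \lim \sum_{i=1}^{n-1} g(z_i)\big[F_X(x_{i+1}) - F_X(x_i)\big].$$

Now consider the r.v. $Y = g(X)$.

$$E[Y] = \int_{-\infty}^{\infty} y\,dF_Y(y)$$

Note that $dF_Y(y)$ is the representation of the limiting value of

$$F_Y(y_{i+1}) - F_Y(y_i) = \Pr(y_i < Y \le y_{i+1}) = \Pr(y_i < g(X) \le y_{i+1}) = \Pr(g^{-1}(y_i) < X \le g^{-1}(y_{i+1})) = \Pr(x_i < X \le x_{i+1}),$$

which, in the limit, is equal to $dF_X(x)$ when $y = g(x)$. (The step through $g^{-1}$ is written as if $g$ were increasing and invertible; the conclusion holds more generally.) Thus

$$E[Y] = \int_{-\infty}^{\infty} y\,dF_Y(y) = \int_{-\infty}^{\infty} g(x)\,dF_X(x).$$

Let us put this in more familiar terms: if $Y = g(X)$ and $X$ is continuous with p.d.f. $f_X$, then

$$E[Y] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx. \tag{1}$$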
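The discrete analogue of (1) can be checked exactly with a small sketch. The p.m.f. below (uniform on five integers) and the choice $g(x) = x^2$ are assumptions made only for illustration. The mean of $Y = g(X)$ is computed twice: once by first building the p.m.f. of $Y$, and once by summing $g(x)\,p_X(x)$ directly.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical p.m.f.: X uniform on {-2, -1, 0, 1, 2}.
p_x = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}


def g(x):
    return x * x


# Route 1: find the p.m.f. of Y = g(X) first, then average over y.
p_y = defaultdict(Fraction)
for x, p in p_x.items():
    p_y[g(x)] += p
mean_via_p_y = sum(y * p for y, p in p_y.items())

# Route 2: substitute g(x) for x in the expectation of X.
mean_via_substitution = sum(g(x) * p for x, p in p_x.items())

print(mean_via_p_y, mean_via_substitution)
```

Both routes give the same number; the second never needs the distribution of $Y$.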

One might think that finding $E[Y]$ would require finding $f_Y(y)$. However, as (1) shows, all that is necessary is to substitute $g(x)$ for $x$ in the expectation. This is sometimes called the law of the unconscious statistician, since it can be done nearly thoughtlessly.

An interesting result is obtained through the use of indicator functions. For a set $A \in \mathcal{F}$, let $I_A : \Omega \to \mathbb{R}$ be defined by

$$I_A(\omega) = \begin{cases} 1 & \omega \in A \\ 0 & \omega \notin A. \end{cases}$$

In other words, the indicator function indicates whether its argument is in the set named in the subscript. We define a simple function as one which is a linear combination of indicator functions: for some collection $A_1, A_2, \ldots, A_n \in \mathcal{F}$,

$$g(\omega) = \sum_{k=1}^{n} b_k I_{A_k}(\omega).$$

This gives us a piecewise-constant function on $\Omega$. It also defines a random variable. Note that the collection need not be disjoint. However, we can shuffle things around to write the function as

$$g(\omega) = \sum_{k=1}^{n^*} b_k^* I_{A_k^*}(\omega)$$

If $a = -b$, then the integral is zero, no matter what. If you fix $a$ or $b$, taking the limit of the other one, the result is $\infty$. That is, $E[X^+]$ exists and $E[X^-]$ exists (although both are $\infty$), but they can’t be subtracted. □

Properties of Expectations

  1. If X = c then E[X] = c.
  2. If Y = aX + b then E[Y ] = aE[X] + b.

E[X] acts kind of like an integral of X(ω) over Ω, weighted by P. One way that the expectation is expressed is

$$E[X] = \int_\Omega X(\omega)\,P(d\omega) = \int_\Omega X\,dP.$$

An integral in this form is said to be a Lebesgue-Stieltjes integral. Since $X$ induces a probability $P_X$ on $(\mathbb{R}, \mathcal{B})$, as we have observed, we can also think of the probability space $(\mathbb{R}, \mathcal{B}, P_X)$. We can write

$$E[X] = \int_{\mathbb{R}} x\,P_X(dx)$$

where now $X$ is the “identity” r.v. on the real line. We thus have two equivalent definitions:

$$\int_\Omega X(\omega)\,P(d\omega) = \int_{\mathbb{R}} x\,P_X(dx).$$

Back to properties:

  3. If $Y = g \circ X$ then

$$E[Y] = \int_\Omega (g \circ X)(\omega)\,P(d\omega) = \int_{\mathbb{R}} g(x)\,P_X(dx) = \int_{\mathbb{R}} y\,P_Y(dy).$$

Pairs of random variables

Ultimately, we will be dealing with infinite sequences of random variables. As steps along the way, we will examine carefully pairs of random variables, then vectors of random variables. On R^2 , the smallest σ-field of interest is B^2 , which is the smallest σ-field containing all of the rectangles. This is the Borel σ-field of R^2.

Definition 1 A bivariate random variable $(X, Y)$ is a measurable mapping from $(\Omega, \mathcal{F})$ to $(\mathbb{R}^2, \mathcal{B}^2)$. □

That is, $\{\omega \in \Omega : (X, Y)(\omega) \in B\} \in \mathcal{F}$ for all $B \in \mathcal{B}^2$. Note that two r.v.s $X, Y$ on $(\Omega, \mathcal{F})$ form a bivariate r.v.

Definition 2 The joint or bivariate distribution of $(X, Y)$ is

$$P_{XY}(B) = P(\{\omega \in \Omega : (X, Y)(\omega) \in B\})$$

for $B \in \mathcal{B}^2$. □

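Definition 3 The joint c.d.f. of $(X, Y)$ is defined as

$$F_{XY}(a, b) = P(X \le a, Y \le b) = P((X, Y) \in R_{a,b})$$

where $R_{a,b}$ is the semi-infinite rectangle

$$R_{a,b} = \{(x, y) \in \mathbb{R}^2 : x \le a,\ y \le b\}.$$

□

Properties of the joint c.d.f.:

  1. $\lim_{a, b \to \infty} F_{XY}(a, b) = 1$.
  2. $\lim_{a \to -\infty} F_{XY}(a, b) = 0 = \lim_{b \to -\infty} F_{XY}(a, b)$.
  3. $\lim_{a \to \infty} F_{XY}(a, b) = F_Y(b)$, the marginal c.d.f. of $Y$, and $\lim_{b \to \infty} F_{XY}(a, b) = F_X(a)$, the marginal c.d.f. of $X$.
  4. $F_{XY}(a, b)$ is continuous “from the northeast.”
  5. $F_{XY}(x, y)$ is monotonically increasing (or, more precisely, nondecreasing) in both variables.

Any function with these properties is a legitimate joint c.d.f.; these properties characterize the family of joint c.d.f.s. (Strictly, property 5 must be read in the rectangle sense, $F_{XY}(a_2, b_2) - F_{XY}(a_1, b_2) - F_{XY}(a_2, b_1) + F_{XY}(a_1, b_1) \ge 0$ for $a_1 \le a_2$ and $b_1 \le b_2$, so that every rectangle receives nonnegative probability.)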

Joint discrete r.v.s

Definition 4 If $X, Y$ are discrete r.v.s taking values in sets $\{x_1, x_2, \ldots\}$ and $\{y_1, y_2, \ldots\}$, respectively, then $(X, Y)$ forms a discrete bivariate r.v. and its joint p.m.f. is defined by

$$p_{XY}(a, b) = P(X = a, Y = b).$$

□

Properties of $p_{XY}$:

  1. $p_{XY} \ge 0$, and $p_{XY}(a, b) = 0$ if $a \notin \{x_1, x_2, \ldots\}$ or $b \notin \{y_1, y_2, \ldots\}$.
  2. $\sum_i \sum_j p_{XY}(x_i, y_j) = 1$.
  3. $F_{XY}(a, b) = \sum_{\{(x_i, y_j) :\, x_i \le a,\ y_j \le b\}} p_{XY}(x_i, y_j)$.
  4. Marginals: $p_X(x_i) = \sum_j p_{XY}(x_i, y_j)$ and $p_Y(y_j) = \sum_i p_{XY}(x_i, y_j)$.
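The properties of the joint p.m.f. are easy to exercise on a small table. The sketch below uses a made-up joint p.m.f. on $\{0, 1\}^2$ (the numbers and the helper names `marginal` and `joint_cdf` are assumptions for illustration) and computes the marginals and the joint c.d.f. by the sums given above.

```python
from fractions import Fraction as Fr

# Hypothetical joint p.m.f. of (X, Y), keyed by (x, y).
p_xy = {(0, 0): Fr(1, 8), (0, 1): Fr(3, 8), (1, 0): Fr(2, 8), (1, 1): Fr(2, 8)}


def marginal(axis):
    """Sum the joint p.m.f. over the other coordinate (axis 0 -> X, 1 -> Y)."""
    out = {}
    for point, p in p_xy.items():
        out[point[axis]] = out.get(point[axis], Fr(0)) + p
    return out


def joint_cdf(a, b):
    """F_XY(a, b): total mass on points with x <= a and y <= b."""
    return sum(p for (x, y), p in p_xy.items() if x <= a and y <= b)


p_x = marginal(0)
p_y = marginal(1)
print(p_x, p_y, joint_cdf(0, 1))
```

Note that `joint_cdf(0, 1)` equals `p_x[0]`, in line with the marginal c.d.f. property of the joint c.d.f. once $b$ is at or beyond the largest value of $Y$.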

Joint continuous r.v.s

Definition 5 $X$ and $Y$ are jointly continuous r.v.s if there is a function $f_{XY} : \mathbb{R}^2 \to \mathbb{R}$ such that

$$F_{XY}(a, b) = \int_{-\infty}^{b} \int_{-\infty}^{a} f_{XY}(x, y)\,dx\,dy$$

for all $(a, b) \in \mathbb{R}^2$. The function $f_{XY}$ is called the joint p.d.f. of $X$ and $Y$ (when it exists). □

Properties of the joint p.d.f.:

  1. $f_{XY} \ge 0$.
  2. $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{XY}(x, y)\,dx\,dy = 1$.

Expectations of functions of two r.v.s

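Let $g : \mathbb{R}^2 \to \mathbb{R}$ be measurable (i.e., $\{(x, y) \in \mathbb{R}^2 : g(x, y) \in B\} \in \mathcal{B}^2$ for all $B \in \mathcal{B}$). Then for a bivariate r.v. $(X, Y)$ we can define $Z = g(X, Y)$.

$$E[Z] = \int_{-\infty}^{\infty} z\,dF_Z(z) = \int_{\mathbb{R}^2} g(x, y)\,P_{XY}(dx\,dy) = \begin{cases} \displaystyle\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y) f_{XY}(x, y)\,dx\,dy & (X, Y) \text{ jointly continuous} \\ \displaystyle\sum_i \sum_j g(x_i, y_j)\, p_{XY}(x_i, y_j) & (X, Y) \text{ discrete.} \end{cases}$$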

Properties:

  1. E[X + Y ] = E[X] + E[Y ]
  2. If X ≥ Y then E[X] ≥ E[Y ].
  3. If X and Y are independent then

$$E[g_1(X) g_2(Y)] = E[g_1(X)]\,E[g_2(Y)] \quad \text{for all (measurable, well-defined) } g_1, g_2.$$

Comments: If X and Y are independent, then

E[XY ] = E[X]E[Y ].

However, if $E[X]E[Y] = E[XY]$, this does not mean that they are independent. (Uncorrelated does not imply independent.) On the other hand, if $E[g_1(X) g_2(Y)] = E[g_1(X)]E[g_2(Y)]$ for all appropriate functions, then $X$ and $Y$ are independent. In fact, this condition is necessary and sufficient for independence.
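A standard counterexample makes the distinction concrete. In the sketch below (the particular p.m.f. and test functions are assumptions chosen for illustration), $X$ is uniform on $\{-1, 0, 1\}$ and $Y = X^2$. The product rule holds for $g_1(x) = x$, $g_2(y) = y$, yet fails for $g_1(x) = x^2$, $g_2(y) = y$, so $X$ and $Y$ are uncorrelated but not independent.

```python
from fractions import Fraction as Fr

# Hypothetical p.m.f.: X uniform on {-1, 0, 1}; Y = X**2 is a function of X.
p_x = {-1: Fr(1, 3), 0: Fr(1, 3), 1: Fr(1, 3)}


def expect(h):
    """E[h(X)]; any function of (X, Y) is a function of X alone here."""
    return sum(h(x) * p for x, p in p_x.items())


e_x = expect(lambda x: x)
e_y = expect(lambda x: x**2)
e_xy = expect(lambda x: x * x**2)           # E[XY] = E[X^3]

# Try g1(x) = x^2 and g2(y) = y.
lhs = expect(lambda x: (x**2) * (x**2))     # E[g1(X) g2(Y)] = E[X^4]
rhs = expect(lambda x: x**2) * e_y          # E[g1(X)] E[g2(Y)]

print(e_xy == e_x * e_y, lhs, rhs)
```

The first comparison is true (zero covariance), while `lhs` and `rhs` differ, which rules out independence.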

Definition 7 The covariance of $X$ and $Y$ is defined as

$$\operatorname{cov}(X, Y) = E[(X - E[X])(Y - E[Y])].$$

The variance of $X$ is defined as

$$\operatorname{var}(X) = \operatorname{cov}(X, X).$$

□

  1. cov(X, Y ) = E[XY ] − E[X]E[Y ]. var(X) = E[X^2 ] − (E[X])^2.
  2. If X and Y are independent then cov(X, Y ) = 0. If cov(X, Y ) = 0, we say that X and Y are uncorrelated. Again, uncorrelated does not imply independence.
  3. var(X + Y ) = var(X) + var(Y ) + 2 cov(X, Y ). If cov(X, Y ) = 0 then var(X + Y ) = var(X) + var(Y ).
  4. cov(aX + b, cY + d) = ac cov(X, Y ) for all constants a, b, c, d ∈ R. Thus

var(aX) = a^2 var(X).
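The covariance identities above can be verified exactly on a small joint p.m.f. The table, the constants $a, b, c, d$, and the helper names in this sketch are assumptions made for illustration only.

```python
from fractions import Fraction as Fr

# Hypothetical joint p.m.f. of (X, Y), keyed by (x, y).
p_xy = {(0, 0): Fr(1, 8), (0, 1): Fr(3, 8), (1, 0): Fr(2, 8), (1, 1): Fr(2, 8)}


def expect(h):
    """E[h(X, Y)] under the joint p.m.f."""
    return sum(h(x, y) * p for (x, y), p in p_xy.items())


def cov(f1, f2):
    """cov(f1(X, Y), f2(X, Y)) = E[f1 f2] - E[f1] E[f2]."""
    return expect(lambda x, y: f1(x, y) * f2(x, y)) - expect(f1) * expect(f2)


def X(x, y):
    return x


def Y(x, y):
    return y


# var(X + Y) versus var(X) + var(Y) + 2 cov(X, Y)
var_sum = cov(lambda x, y: x + y, lambda x, y: x + y)
rhs = cov(X, X) + cov(Y, Y) + 2 * cov(X, Y)

# cov(aX + b, cY + d) versus ac cov(X, Y)
a, b, c, d = 3, -1, -2, 5
scaled = cov(lambda x, y: a * x + b, lambda x, y: c * y + d)

print(cov(X, Y), var_sum, rhs, scaled)
```

Exact rational arithmetic is used so that the identities hold with equality rather than up to rounding.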

Definition 8 If $0 < \operatorname{var}(X) < \infty$ and $0 < \operatorname{var}(Y) < \infty$, the correlation coefficient between $X$ and $Y$ is

$$\rho(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)\operatorname{var}(Y)}}.$$

This is a normalized version of the covariance. □

  1. $|\rho| \le 1$. This can be shown using the Cauchy-Schwarz inequality. $|\rho| = 1$ iff $X$ and $Y$ are linearly related,

$$X = aY + b$$

for some constants $(a, b)$ with $a \ne 0$.

Example 3 If $(X, Y) \sim N(\mu_x, \mu_y, \sigma_x^2, \sigma_y^2, \rho)$, then $\rho(X, Y) = \rho$. □

As we have observed before, if $X, Y$ are jointly Gaussian and $\rho = 0$, then they are independent. For r.v.s that are not jointly Gaussian, $\rho = 0$ does not imply independence.

Characteristic functions

The characteristic function is essentially the Fourier transform of the p.d.f. or p.m.f. Characteristic functions are useful in practice not for the usual reasons engineers use Fourier transforms (e.g., frequency content), but because they provide a means of computing moments (as we will see), and they are useful in finding distributions of sums of independent random variables.

Definition 9 Let $X$ be a r.v. The characteristic function (ch.f.) of $X$ is

$$\phi_X(u) = E[e^{iuX}]$$

for $u \in \mathbb{R}$. (Here, $i = \sqrt{-1}$. We will not use $\sqrt{-1} = j$.) □

Let us write some more explicit formulas. Suppose $X$ is a continuous random variable. Then (by the law of the unconscious statistician)

$$\phi_X(u) = \int_{-\infty}^{\infty} e^{iux} f_X(x)\,dx.$$

This may be recognized as the Fourier transform of $f_X(x)$, where $u$ is the “frequency” variable. (Comment on sign of exponent.) Note that given $\phi_X$ we can determine $f_X$ by an inverse Fourier transform:

$$f_X(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi_X(u) e^{-iux}\,du.$$

If $X$ is a discrete r.v.,

$$\phi_X(u) = \sum_i e^{iux_i} p_X(x_i),$$

which we recognize as the discrete-time Fourier transform, and as before $u$ is the “frequency” variable. (Comment on the sign of the exponent.) Given a $\phi_X$, we can find $p_X$ by the inverse discrete-time Fourier transform. Properties:

  1. $\phi_X(0) = 1$. (Why?)
  2. $|\phi_X(u)| \le 1$ for all $u$. (Why?)
  3. $\phi_X$ and $f_X$ form a unique Fourier transform pair, $f_X \leftrightarrow \phi_X$. Thus, $\phi_X$ provides yet another way of displaying the probability structure of $X$.
  4. $\phi_X(u) = \int_{-\infty}^{\infty} e^{iux}\,dF_X(x)$. This is referred to as the Fourier-Stieltjes transform of $F_X$.
  5. $\phi_X$ is uniformly continuous.
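The first two properties and the inversion claim can be tried out numerically for an integer-valued r.v. The p.m.f. below and the names `char_fn` and `recover_pmf` are assumptions made for this sketch. The inverse discrete-time Fourier transform $p_X(k) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \phi_X(u) e^{-iuk}\,du$ is approximated by an equally spaced sum over one period.

```python
import cmath
import math

# Hypothetical integer-valued r.v.
pmf = {0: 0.2, 1: 0.5, 2: 0.3}


def char_fn(u):
    """phi_X(u) = sum_i exp(i u x_i) p_X(x_i)."""
    return sum(p * cmath.exp(1j * u * x) for x, p in pmf.items())


def recover_pmf(k, n_points=64):
    """Approximate (1 / 2pi) * integral over [-pi, pi) of phi(u) e^{-iuk} du."""
    step = 2 * math.pi / n_points
    total = 0j
    for j in range(n_points):
        u = -math.pi + j * step
        total += char_fn(u) * cmath.exp(-1j * u * k)
    return (total * step / (2 * math.pi)).real


print(char_fn(0.0))                               # property 1
print(max(abs(char_fn(0.1 * m)) for m in range(-50, 51)))  # property 2
print([recover_pmf(k) for k in range(4)])         # p.m.f. recovered from phi_X
```

For a p.m.f. supported on a few integers the equally spaced sum is exact up to rounding, so the recovered values match the original table.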

Sums of independent random variables

Let X and Y be independent r.v.s, and let

Z = X + Y.

Then

$$\phi_Z(u) = E[\exp(iuZ)] = E[\exp(iuX + iuY)] = \phi_{X,Y}(u, u).$$

But also, by independence,

$$E[\exp(iuX + iuY)] = E[\exp(iuX)\exp(iuY)] = \phi_X(u)\phi_Y(u).$$

So

$$\phi_Z(u) = \phi_X(u)\phi_Y(u).$$

If $X$ and $Y$ are continuous r.v.s, then so is $Z$, and

$$f_Z(z) = \mathcal{F}^{-1}[\phi_Z(u)] = \mathcal{F}^{-1}[\phi_X(u)\phi_Y(u)] = f_X(z) * f_Y(z)$$

by the convolution theorem. Thus, when independent continuous random variables are added, the p.d.f. of the sum is the convolution of the p.d.f.s (and likewise the p.m.f. of the sum is the convolution of the p.m.f.s for independent discrete r.v.s).
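The discrete version of this statement can be sketched with two independent fair dice (an example chosen here for illustration; the helper names are hypothetical). The p.m.f. of the sum is built by convolution, and its ch.f. is compared with the product of the individual ch.f.s at an arbitrary frequency.

```python
import cmath


def char_fn(pmf, u):
    """phi(u) = sum_x exp(i u x) p(x) for a p.m.f. given as a dict."""
    return sum(p * cmath.exp(1j * u * x) for x, p in pmf.items())


def convolve(pmf_a, pmf_b):
    """p.m.f. of A + B for independent A and B."""
    out = {}
    for a, pa in pmf_a.items():
        for b, pb in pmf_b.items():
            out[a + b] = out.get(a + b, 0.0) + pa * pb
    return out


die = {k: 1 / 6 for k in range(1, 7)}
p_z = convolve(die, die)          # Z = X + Y, two independent dice

u = 0.7                           # arbitrary test frequency
lhs = char_fn(p_z, u)             # phi_Z(u)
rhs = char_fn(die, u) ** 2        # phi_X(u) phi_Y(u)

print(p_z[7], abs(lhs - rhs))
```

The familiar triangular p.m.f. of the sum of two dice appears in `p_z`, and the two ch.f. values agree up to rounding.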

An example: Jointly Gaussian

If $(X, Y) \sim N(\mu_x, \mu_y, \sigma_x^2, \sigma_y^2, \rho)$, then

$$\phi_{X,Y}(u, v) = \exp\left[i(u\mu_x + v\mu_y) - \tfrac{1}{2}\left(u^2\sigma_x^2 + v^2\sigma_y^2 + 2uv\rho\sigma_x\sigma_y\right)\right].$$

We make an observation here: the “form” of the Gaussian p.d.f. is the exponential of a quadratic. The Fourier transform of the exponential of a quadratic is again the exponential of a quadratic. This little fact gives rise to much of the analytical and practical usefulness of Gaussian r.v.s.

Characteristic functions and marginals

We observe that

$$\phi_{X,Y}(u, 0) = \phi_X(u).$$

In our Gaussian example, we have

$$\phi_X(u) = \phi_{X,Y}(u, 0) = \exp(iu\mu_x - \sigma_x^2 u^2/2),$$

which is the ch.f. for a Gaussian,

$$X \sim N(\mu_x, \sigma_x^2).$$

We could, of course, have obtained a similar result via integration, but this is much easier.

Some important inequalities

In general, when we observe an outcome of a random variable, we “expect” it to be near the mean (that is, near the expected value). Further, the farther the outcome is from the mean, the less likely we expect the outcome to be. There are some very useful inequalities which quantify these intuitive “expectations.” These are the Markov inequality and its consequences, the Chebyshev inequality and the Chernoff bound. We will introduce these here.

Let $B \in \mathcal{B}$ be a Borel set. Recall that

$$I_B(x) = \begin{cases} 1 & x \in B \\ 0 & x \notin B. \end{cases}$$

Let X be a random variable, and let Y = IB (X). This is a measurable function, so Y is another random variable.

E[Y ] = PX (B) = P (X ∈ B).

We will use this “expectation as probability” idea to get a bound. Suppose $g$ is a nonnegative, nondecreasing function. Suppose $b \in \mathbb{R}$ with $g(b) > 0$. Consider the function

$$h(x) = \frac{g(x)}{g(b)} \quad \begin{cases} \ge 1 & x \ge b \\ \ge 0 & \forall x. \end{cases}$$

Observe that

$$h(x) \ge I_{[b,\infty)}(x)$$

for all $x$, since

$$I_{[b,\infty)}(x) = \begin{cases} 1 & x \ge b \\ 0 & x < b. \end{cases}$$

Note that $h(b) = I_{[b,\infty)}(b)$. Now we have

$$E[h(X)] \ge E[I_{[b,\infty)}(X)] = P(X \ge b).$$

Also,

$$E[h(X)] = \frac{E[g(X)]}{g(b)}.$$

Thus

$$P(X \ge b) \le \frac{E[g(X)]}{g(b)}.$$

A similar result can be established if $g$ is nonnegative, nondecreasing on $[0, \infty)$, and symmetric about 0. We can thus establish that

$$P(|X| \ge b) \le \frac{E[g(X)]}{g(b)}.$$

Special case: Assume $X \ge 0$, and let

$$g(x) = \begin{cases} x & x \ge 0 \\ 0 & x < 0. \end{cases}$$

The inequality above gives rise to the Markov inequality:

$$P(X \ge b) \le \frac{E[X]}{b}$$

for all $b > 0$. Somewhat more generally, the Markov inequality says

$$P(|X| \ge b) \le \frac{E[|X|]}{b}.$$
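To get a feel for how loose or tight the Markov inequality is, the sketch below tabulates the exact tail probability $P(X \ge b)$ against the bound $E[X]/b$ for a fair die. The die is simply a convenient nonnegative r.v. assumed for this example.

```python
from fractions import Fraction as Fr

# Hypothetical nonnegative r.v.: the face of a fair die.
die = {k: Fr(1, 6) for k in range(1, 7)}
mean = sum(x * p for x, p in die.items())


def tail(b):
    """Exact P(X >= b)."""
    return sum(p for x, p in die.items() if x >= b)


# For each threshold b: (exact tail probability, Markov bound E[X]/b).
bounds = {b: (tail(b), mean / b) for b in range(1, 7)}

for b, (exact, bound) in bounds.items():
    print(b, exact, bound)
```

The bound always holds but is far from tight here; for small $b$ it even exceeds 1 and so carries no information.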

Cauchy-Schwarz inequality

This is an inequality that holds in any Hilbert space. It more or less forms the theme for the first several weeks of 6030.

Theorem 3 If $E[X^2] < \infty$ and $E[Y^2] < \infty$ then

$$|E[XY]|^2 \le E[X^2]\,E[Y^2].$$

For example,

$$|\operatorname{cov}(X, Y)|^2 \le \operatorname{var}(X)\operatorname{var}(Y),$$

implying $|\rho| \le 1$. Observe that $\operatorname{var}(X) = E[X^2] - E[X]^2 \ge 0$ using the Schwarz and Jensen inequalities.

Conditional Expectations and Distributions

Suppose X is a discrete r.v. Now we define the conditional distribution of another r.v. Y given X = xk (at some point where P (X = xk ) > 0 ) by

$$F_{Y|X}(y|x_k) = P(Y \le y \mid X = x_k) = \frac{P(Y \le y, X = x_k)}{P(X = x_k)}.$$

By the law of total probability,

$$F_Y(y) = \sum_k F_{Y|X}(y|x_k)\, p_X(x_k).$$

As we discussed before, when we condition on an event, we are shrinking the sample space under consideration. So there is some normalization that takes place. We also define

$$E[Y \mid X = x_k] = \int_{-\infty}^{\infty} y\,dF_{Y|X}(y|x_k).$$

Note that this depends on the value of $x_k$; it is a function of $x_k$. Let us now take the expectation with respect to $X$:

$$E_X[E[Y \mid X = x_k]] = \sum_k E[Y \mid X = x_k]\, p_X(x_k) = E[Y].$$
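The iterated-expectation identity can be checked exactly on a small joint p.m.f. The table and the helper names in this sketch are assumptions for illustration.

```python
from fractions import Fraction as Fr

# Hypothetical joint p.m.f. of (X, Y), keyed by (x, y).
p_xy = {(0, 0): Fr(1, 8), (0, 1): Fr(3, 8), (1, 0): Fr(2, 8), (1, 1): Fr(2, 8)}


def p_x(x):
    """Marginal P(X = x)."""
    return sum(p for (xi, _), p in p_xy.items() if xi == x)


def cond_mean_y(x):
    """E[Y | X = x] = sum_y y P(Y = y, X = x) / P(X = x)."""
    return sum(y * p for (xi, y), p in p_xy.items() if xi == x) / p_x(x)


x_values = sorted({x for x, _ in p_xy})

# Average the conditional means over the distribution of X ...
iterated = sum(cond_mean_y(x) * p_x(x) for x in x_values)
# ... and compare with E[Y] computed directly from the joint p.m.f.
direct = sum(y * p for (_, y), p in p_xy.items())

print({x: cond_mean_y(x) for x in x_values}, iterated, direct)
```

The conditional means differ with $x$, but their $p_X$-weighted average reproduces $E[Y]$.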

We can think of $E[Y \mid X = x_k]$ as a discrete random variable that is a function of $X$. For a discrete r.v. $X$, the conditional c.d.f. $F_{Y|X}(y|x_k)$ could describe either a discrete or a continuous r.v. $Y$.

Discrete: $p_{Y|X}(y|x_k) = P(Y = y \mid X = x_k)$.

Continuous: There exists a function $f_{Y|X}$ such that

$$F_{Y|X}(y|x_k) = \int_{-\infty}^{y} f_{Y|X}(z|x_k)\,dz.$$

We can also write

$$F_{Y|X}(y|x_k) = E[I_{(-\infty, y]}(Y) \mid X = x_k].$$

If $Y$ is discrete we have

$$p_{Y|X}(y|x_k) = \frac{P(Y = y, X = x_k)}{p_X(x_k)} = \frac{p_{XY}(x_k, y)}{p_X(x_k)}.$$

When $X$ is a continuous r.v., conditional probabilities and expectations are somewhat more complicated, because $P(X = x) = 0$ for any particular value of $x$. Recall that in the discrete case $E[Y \mid X = x_k] = g(x_k)$ for some function $g$, and $E[g(X)] = E[Y]$.

Definition 14 Suppose $Y$ is an r.v. on the probability space $(\Omega, \mathcal{F}, P)$ with $E[|Y|] < \infty$. Then for $A \in \mathcal{F}$, define

$$\int_A Y\,dP = E[I_A Y].$$

□

Definition 15 Suppose $X$ and $Y$ are random variables and $E[|Y|] < \infty$. The conditional expectation of $Y$ given $X = x$ is any measurable function $g(x) = E[Y \mid X = x]$ of $x$ satisfying

$$\int_B E[Y \mid X = x]\,P_X(dx) = \int_{X^{-1}(B)} Y\,dP \tag{2}$$

for all $B \in \mathcal{B}$, where $X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\}$. □

  1. It can be shown that under the stated conditions, such a function always exists.
  2. If X is discrete then E[Y |X = xk ] as defined earlier satisfies the property.
  3. $E[Y \mid X = x]$ is unique, in the sense that if there are two functions $g(x)$ and $h(x)$ both satisfying (2) then $P(g(X) = h(X)) = 1$.

When a condition is true with probability 1, we say that it is true “almost surely,” or “a.s.” Once we have defined conditional expectation, we can define a conditional c.d.f.:

$$F_{Y|X}(y|x) = E[I_{(-\infty, y]}(Y) \mid X = x].$$

Properties:

  1. This definition agrees with the previous one when $X$ is discrete.
  2. $F_Y(y) = \int_{\mathbb{R}} F_{Y|X}(y|x)\,P_X(dx)$.
  3. $F_{Y|X}$ is a c.d.f. as a function of $y$ because it satisfies all the properties of a c.d.f.
  4. If $X$ and $Y$ are jointly continuous then $F_{Y|X}(y|x)$ has a density for every $x$,

$$f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)}.$$

There is another interpretation:

$$F_{Y|X}(y|x) = \lim_{\Delta x \to 0^+} P(Y \le y \mid x - \Delta x/2 < X \le x + \Delta x/2) = \frac{\lim_{\Delta x \to 0^+} P(Y \le y,\ x - \Delta x/2 < X \le x + \Delta x/2)/\Delta x}{\lim_{\Delta x \to 0^+} P(x - \Delta x/2 < X \le x + \Delta x/2)/\Delta x} = \frac{\frac{\partial}{\partial x} F_{XY}(x, y)}{\frac{\partial}{\partial x} F_X(x)} = \frac{\frac{\partial}{\partial x} F_{XY}(x, y)}{f_X(x)}.$$
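For a jointly continuous pair, the conditional density and the resulting conditional mean can be explored numerically. The joint p.d.f. $f_{XY}(x, y) = x + y$ on the unit square, the midpoint-rule integrator, and all function names below are assumptions chosen for this sketch. For this density $f_X(x) = x + 1/2$ and $E[Y] = 7/12$.

```python
def f_xy(x, y):
    """Hypothetical joint p.d.f.: x + y on the unit square, 0 elsewhere."""
    return x + y if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 else 0.0


def midpoint_integral(func, n=200):
    """Midpoint-rule approximation of the integral of func over [0, 1]."""
    h = 1.0 / n
    return sum(func((k + 0.5) * h) for k in range(n)) * h


def f_x(x):
    """Marginal p.d.f. of X: integrate the joint p.d.f. over y."""
    return midpoint_integral(lambda y: f_xy(x, y))


def cond_mean(x):
    """E[Y | X = x] = integral of y f_XY(x, y) / f_X(x) dy."""
    return midpoint_integral(lambda y: y * f_xy(x, y)) / f_x(x)


# E[Y] computed directly from the joint p.d.f. ...
mean_y = midpoint_integral(lambda y: y * midpoint_integral(lambda x: f_xy(x, y)))
# ... and by averaging E[Y | X = x] against f_X.
iterated = midpoint_integral(lambda x: cond_mean(x) * f_x(x))

print(f_x(0.25), cond_mean(0.25), mean_y, iterated)
```

Both routes to $E[Y]$ agree to within the accuracy of the numerical integration.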

Definition 16 If $X$ is an r.v., define $\sigma(X)$ (the σ-field generated by $X$) to be

$$\sigma(X) = \{\{\omega \in \Omega : X(\omega) \in B\} : B \in \mathcal{B}\}.$$

□

Fact: A r.v. $Y$ is measurable with respect to $\sigma(X)$ if and only if there is a measurable function $g : \mathbb{R} \to \mathbb{R}$ such that $Y = g(X)$.

We now define conditional expectation with respect to a σ-field:

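Definition 17 If $X$ and $Y$ are r.v.s with $E[|Y|] < \infty$, we define

$$E[Y \mid X] = E[Y \mid \sigma(X)].$$

□

Properties:

  1. By the fact stated above, we can write $E[Y \mid X] = g(X)$ for some function $g$, where $g(x) = E[Y \mid X = x]$.
  2. $E[Y] = E[E[Y \mid X]]$.
  3. If $Y$ itself is $\mathcal{G}$-measurable, then $E[Y \mid \mathcal{G}] = Y$.
  4. $E[\alpha Y_1 + \beta Y_2 \mid \mathcal{G}] = \alpha E[Y_1 \mid \mathcal{G}] + \beta E[Y_2 \mid \mathcal{G}]$.
  5. If $Y \ge 0$, then $E[Y \mid \mathcal{G}] \ge 0$.
  6. If $E[|Y|] < \infty$ and $\mathcal{G} \subset \mathcal{E} \subset \mathcal{F}$, then

$$E[E[Y \mid \mathcal{E}] \mid \mathcal{G}] = E[Y \mid \mathcal{G}].$$

Idea: If you first condition on a field $\mathcal{E}$ that is less “coarse” than $\mathcal{G}$, you get a r.v.; then conditioning that r.v. on $\mathcal{G}$ gives the same result as conditioning on $\mathcal{G}$ directly.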

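Definition 18 Two σ-fields $\mathcal{G}$ and $\mathcal{H}$ are independent if

$$P(G \cap H) = P(G)P(H) \quad \text{for all } G \in \mathcal{G} \text{ and } H \in \mathcal{H}.$$

□

Note: $X$ and $Y$ are independent r.v.s iff $\sigma(X)$ and $\sigma(Y)$ are independent σ-fields.

  1. If $\sigma(Y)$ is independent of $\mathcal{G}$ then $E[Y \mid \mathcal{G}] = E[Y]$.
  2. If $Y$ is $\mathcal{G}$-measurable then $E[Y \mid \mathcal{G}] = Y$. So, for example, if $\mathcal{G} = \sigma(X)$, and $Y = g(X)$ for some $g : \mathbb{R} \to \mathbb{R}$ (that is, $Y$ is $\mathcal{G}$-measurable) then $E[Y \mid X] = Y = g(X)$. More informally, $E[g(X) \mid X] = g(X)$.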