




























Material Type: Notes; Professor: Rey-Bellet; Class: ST-Lie Groups; Subject: Mathematics; University: University of Massachusetts - Amherst; Term: Fall 2007;
Typology: Study notes





























by the multiparameter analogue of the p.d.f. For example, if there is a function $f_X : \mathbb{R}^d \to [0, \infty)$ such that
$$P(X \in A) = \int \cdots \int_A f_X(x_1, \dots, x_d)\, dx_1 \cdots dx_d ,$$
then $X$ is a continuous random vector with p.d.f. $f_X$. Similarly, a discrete random vector $X$ taking values $i = (i_1, \dots, i_d)$ is described by
$$p(i_1, \dots, i_d) = P(X_1 = i_1, \dots, X_d = i_d).$$
A collection of random variables $X_1, \dots, X_d$ is independent if
$$f_X(x) = f_{X_1}(x_1) \cdots f_{X_d}(x_d) \quad \text{(continuous case)}, \qquad p_X(i) = p_{X_1}(i_1) \cdots p_{X_d}(i_d) \quad \text{(discrete case)}. \tag{1.1}$$
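If $X$ is a random vector and $g : \mathbb{R}^d \to \mathbb{R}$ is a function, then $Y = g(X)$ is a real random variable. The mean or expectation of a real random variable $X$ is defined by
$$E[X] = \begin{cases} \int_{-\infty}^{\infty} x f_X(x)\, dx & \text{if } X \text{ is continuous} \\ \sum_{i \in S} i\, p_X(i) & \text{if } X \text{ is discrete} \end{cases}$$
More generally, if $Y = g(X)$ then
$$E[Y] = E[g(X)] = \begin{cases} \int_{\mathbb{R}^d} g(x) f_X(x)\, dx & \text{if } X \text{ is continuous} \\ \sum_{i} g(i)\, p_X(i) & \text{if } X \text{ is discrete} \end{cases}$$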
The variance of a random variable $X$, denoted by $\mathrm{var}(X)$, is given by
$$\mathrm{var}(X) = E\left[(X - E[X])^2\right] = E[X^2] - E[X]^2.$$
The mean of a random variable $X$ measures the average value of $X$, while its variance is a measure of the spread of the distribution of $X$. Also commonly used is the standard deviation $\sqrt{\mathrm{var}(X)}$. Let $X$ and $Y$ be two random variables; then we have
$$E[X + Y] = E[X] + E[Y].$$
For the variance a simple computation shows that
$$\mathrm{var}(X + Y) = \mathrm{var}(X) + 2\,\mathrm{cov}(X, Y) + \mathrm{var}(Y),$$
where $\mathrm{cov}(X, Y)$ is the covariance of $X$ and $Y$, defined by
$$\mathrm{cov}(X, Y) = E\left[(X - E[X])(Y - E[Y])\right].$$
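The "simple computation" can be written out as follows. The shorthand $\mu_X = E[X]$ and $\mu_Y = E[Y]$ is introduced here only for readability; the only ingredient is the linearity of expectation.

```latex
\begin{align*}
\mathrm{var}(X+Y)
  &= E\big[(X + Y - \mu_X - \mu_Y)^2\big] \\
  &= E\big[(X-\mu_X)^2\big] + 2\,E\big[(X-\mu_X)(Y-\mu_Y)\big] + E\big[(Y-\mu_Y)^2\big] \\
  &= \mathrm{var}(X) + 2\,\mathrm{cov}(X,Y) + \mathrm{var}(Y).
\end{align*}
```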
In particular, if $X$ and $Y$ are independent then $E[XY] = E[X]E[Y]$, so $\mathrm{cov}(X, Y) = 0$ and thus $\mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y)$. Another important and useful object is the moment generating function (m.g.f.) of a random variable $X$, given by
$$M_X(t) = E\left[e^{tX}\right].$$
Whenever we use an m.g.f. we will always assume that $M_X(t)$ is finite at least in an interval around 0. Note that this is not always the case. If the moment generating function of $X$ is known then one can compute all moments of $X$, i.e. $E[X^n]$, by repeated differentiation of the function $M_X(t)$ with respect to $t$. The $n$th derivative of $M_X(t)$ is given by
$$M_X^{(n)}(t) = E\left[X^n e^{tX}\right]$$
and therefore $E[X^n] = M_X^{(n)}(0)$.
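The identity $E[X^n] = M_X^{(n)}(0)$ can be checked numerically. The sketch below is an illustration only: it assumes the m.g.f. $\lambda/(\lambda - t)$ of an exponential random variable with $\lambda = 2$, and it approximates the derivatives at 0 by central finite differences with an arbitrarily chosen step `h`. The function names are hypothetical.

```python
def mgf_exponential(t, lam=2.0):
    """M.g.f. of an exponential random variable with rate lam, valid for t < lam."""
    return lam / (lam - t)

def derivative_at_zero(mgf, order, h=1e-3):
    """Central finite-difference estimate of the first or second derivative at t = 0."""
    if order == 1:
        return (mgf(h) - mgf(-h)) / (2 * h)
    if order == 2:
        return (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / (h * h)
    raise ValueError("only orders 1 and 2 are implemented")

mean = derivative_at_zero(mgf_exponential, 1)            # approximates E[X] = 1/lam
second_moment = derivative_at_zero(mgf_exponential, 2)   # approximates E[X^2] = 2/lam^2
variance = second_moment - mean ** 2                     # approximates var(X) = 1/lam^2
print(mean, variance)
```

Finite differences are used here only to keep the example free of dependencies; a symbolic differentiation tool would give the moments exactly.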
In particular $E[X] = M_X'(0)$ and $\mathrm{var}(X) = M_X''(0) - (M_X'(0))^2$. It is often very convenient to compute the mean and variance of $X$ using these formulas (see the examples below). An important fact is the following (its proof is not that easy!):
Theorem 1.1.1 Let $X$ and $Y$ be two random variables and suppose that $M_X(t) = M_Y(t)$ for all $t \in (-\delta, \delta)$, for some $\delta > 0$. Then $X$ and $Y$ have the same distribution.
Another important property of the m.g.f. is
Proposition 1.1.2 If $X$ and $Y$ are independent random variables then the m.g.f. of $X + Y$ satisfies
$$M_{X+Y}(t) = M_X(t)\, M_Y(t),$$
i.e., the m.g.f. of a sum of independent random variables is the product of their m.g.f.'s.
Proof: We have
$$E\left[e^{t(X+Y)}\right] = E\left[e^{tX} e^{tY}\right] = E\left[e^{tX}\right] E\left[e^{tY}\right],$$
since $e^{tX}$ and $e^{tY}$ are independent.
The moment generating function is
$$E\left[e^{tX}\right] = \frac{1}{b-a}\int_a^b e^{tx}\, dx = \frac{e^{tb} - e^{ta}}{t(b - a)}$$
and the mean and variance are
$$E[X] = \frac{a + b}{2}, \qquad \mathrm{var}(X) = \frac{(b - a)^2}{12}.$$
We write $X = U[a, b]$ to denote this random variable.
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$
The moment generating function is (see below for a proof)
$$E\left[e^{tX}\right] = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tx}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx = e^{\mu t + \frac{\sigma^2 t^2}{2}} \tag{1.2}$$
and the mean and variance are
E[X] = μ , var(X) = σ^2.
We write $X = N(\mu, \sigma^2)$ to denote this random variable. The standard normal random variable is the normal random variable with $\mu = 0$ and $\sigma = 1$, i.e., $N(0, 1)$. The normal random variable has the following property: $X = N(0, 1)$ if and only if $\sigma X + \mu = N(\mu, \sigma^2)$.
To see this one applies Proposition 1.2.1 (i) and (ii), which tells us that the density of $\sigma X + \mu$ is $\frac{1}{\sigma} f\left(\frac{x - \mu}{\sigma}\right)$, where $f$ is the density of $X$. To show the formula for the moment generating function we consider first $X = N(0, 1)$. Then by completing the square we have
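$$\begin{aligned}
M_X(t) &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tx}\, e^{-\frac{x^2}{2}}\, dx \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{\frac{t^2}{2}}\, e^{-\frac{(x-t)^2}{2}}\, dx \\
&= e^{\frac{t^2}{2}}\, \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{(x-t)^2}{2}}\, dx \\
&= e^{\frac{t^2}{2}}\, \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{y^2}{2}}\, dy \\
&= e^{\frac{t^2}{2}}.
\end{aligned} \tag{1.3}$$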
This proves the formula for $N(0, 1)$. Since $N(\mu, \sigma^2) = \sigma N(0, 1) + \mu$, by Proposition 1.2.1 (iii) the moment generating function of $N(\mu, \sigma^2)$ is $e^{t\mu}\, e^{\frac{\sigma^2 t^2}{2}}$, as claimed.
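As a sanity check on formula (1.2), one can compare the closed form with a direct numerical integration of $e^{tx}$ against the normal density. The sketch below uses a plain midpoint rule; the function names, the integration window and the test values of $t$, $\mu$, $\sigma$ are arbitrary choices made for this illustration.

```python
import math

def normal_mgf_numeric(t, mu, sigma, n_steps=20_000, width=12.0):
    """Midpoint-rule approximation of E[exp(tX)] for X = N(mu, sigma^2)."""
    # The integrand exp(tx) f(x) is a Gaussian bump centered at mu + sigma^2 t,
    # so the integration window is centered there.
    center = mu + sigma * sigma * t
    lo = center - width * sigma
    dx = 2 * width * sigma / n_steps
    total = 0.0
    for k in range(n_steps):
        x = lo + (k + 0.5) * dx
        density = math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        total += math.exp(t * x) * density * dx
    return total

def normal_mgf_closed_form(t, mu, sigma):
    """Right-hand side of (1.2): exp(mu t + sigma^2 t^2 / 2)."""
    return math.exp(mu * t + sigma ** 2 * t ** 2 / 2)

numeric = normal_mgf_numeric(0.7, mu=1.0, sigma=2.0)
closed = normal_mgf_closed_form(0.7, mu=1.0, sigma=2.0)
print(numeric, closed)
```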
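$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$
The moment generating function is
$$E\left[e^{tX}\right] = \lambda \int_0^{\infty} e^{tx}\, e^{-\lambda x}\, dx = \begin{cases} \frac{\lambda}{\lambda - t} & \text{if } t < \lambda \\ +\infty & \text{otherwise} \end{cases}$$
and the mean and variance are
$$E[X] = \frac{1}{\lambda}, \qquad \mathrm{var}(X) = \frac{1}{\lambda^2}.$$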
We write X = Exp(λ) to denote this random variable. This random variable will play an important role in the construction of continuous-time Markov chains. It often has the interpretation of a waiting time until the occurrence of an event.
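$$f(x) = \begin{cases} \lambda e^{-\lambda x}\, \frac{(\lambda x)^{n-1}}{(n-1)!} & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$
The moment generating function is
$$E\left[e^{tX}\right] = \int_0^{\infty} e^{tx}\, \lambda e^{-\lambda x}\, \frac{(\lambda x)^{n-1}}{(n-1)!}\, dx = \begin{cases} \left(\frac{\lambda}{\lambda - t}\right)^n & \text{if } t < \lambda \\ +\infty & \text{otherwise} \end{cases}$$
and the mean and variance are
$$E[X] = \frac{n}{\lambda}, \qquad \mathrm{var}(X) = \frac{n}{\lambda^2}.$$
We write $X = \mathrm{Gamma}(n, \lambda)$ to denote this random variable. To compute the m.g.f. note that for any $\alpha > 0$
$$\int_0^{\infty} e^{-\alpha x}\, dx = \frac{1}{\alpha}$$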
and the mean and the variance are
E[X] = np , var(X) = np(1 − p).
We write X = B(n, p) to denote this random variable. The formula for the m.g.f can be obtained directly using the binomial theorem, or simply by noting that by construction B(n, p) is a sum of n independent Bernoulli random variables.
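$$p(n) = (1 - p)^{n-1}\, p, \qquad n = 1, 2, 3, \dots$$
The moment generating function is
$$E\left[e^{tX}\right] = \sum_{n=1}^{\infty} e^{tn} (1 - p)^{n-1}\, p = \begin{cases} \frac{p e^t}{1 - e^t(1 - p)} & \text{if } e^t(1 - p) < 1 \\ +\infty & \text{otherwise} \end{cases}$$
The mean and the variance are
$$E[X] = \frac{1}{p}, \qquad \mathrm{var}(X) = \frac{1 - p}{p^2}.$$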
We write X = Geometric(p) to denote this random variable.
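$$p(n) = e^{-\lambda}\, \frac{\lambda^n}{n!}, \qquad n = 0, 1, 2, \dots$$
The moment generating function is
$$E\left[e^{tX}\right] = \sum_{n=0}^{\infty} e^{tn}\, \frac{\lambda^n}{n!}\, e^{-\lambda} = e^{\lambda(e^t - 1)}.$$
The mean and the variance are
$$E[X] = \lambda, \qquad \mathrm{var}(X) = \lambda.$$
We write $X = \mathrm{Poisson}(\lambda)$ to denote this random variable.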
In this section we discuss a few techniques to simulate a given random variable on a computer. The first step, which is built into any computer, is the simulation of a random number, i.e., the simulation of a uniform random variable $U([0, 1])$, rounded off to the nearest $1/10^n$. In principle this is not difficult: take ten slips of paper numbered $0, 1, \dots, 9$, place them in a hat and select successively $n$ slips, with replacement, from the hat. The sequence of digits obtained (with a decimal point in front) is the value of a uniform random variable rounded off to the nearest $1/10^n$. In pre-computer times, tables of random numbers were produced in that way and can still be found. This is of course not the way an actual computer generates a random number. A computer will usually generate a random number by using a deterministic algorithm which produces a pseudo-random number which "looks like" a random number. For example, choose positive integers $a$, $c$ and $m$ and set
$$X_{n+1} = (a X_n + c) \bmod m.$$
The number $X_n$ is one of $0, 1, \dots, m - 1$ and the quantity $X_n/m$ is taken to be an approximation of a uniform random variable. One can show that for suitable $a$, $c$ and $m$ this is a good approximation. This algorithm is just one of many possible algorithms used in practice. The issue of actually generating a good random number is a nice, interesting, and classical problem in computer science. For our purposes we will simply content ourselves with assuming that there is a "black box" in your computer which generates $U([0, 1])$ in a satisfying manner. We start with a very easy example, namely simulating a discrete random variable $X$.
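The recursion $X_{n+1} = (aX_n + c) \bmod m$ can be sketched in a few lines. The notes do not specify $a$, $c$ and $m$; the constants below ($a = 1664525$, $c = 1013904223$, $m = 2^{32}$) are one commonly quoted choice and are an assumption of this example, as are the function name and the seed.

```python
def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Yield X_n / m in [0, 1), where X_{n+1} = (a X_n + c) mod m."""
    x = seed % m
    while True:
        x = (a * x + c) % m
        yield x / m

gen = lcg(seed=12345)  # the seed is arbitrary
sample = [next(gen) for _ in range(10_000)]
print(sum(sample) / len(sample))  # should be near 1/2 for a reasonable generator
```

In practice one would rely on the generator shipped with the language (Python's `random` module uses a different algorithm) rather than a hand-written recursion like this one.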
Algorithm 1.3.1 (Discrete random variable) Let $X$ be a discrete random variable taking the values $x_1, x_2, \dots$ with p.d.f. $p(j) = P\{X = x_j\}$. To simulate $X$, generate a random number $U$ and set
$$X = \begin{cases} x_1 & \text{if } U < p(1) \\ x_2 & \text{if } p(1) < U < p(1) + p(2) \\ \ \vdots & \\ x_n & \text{if } p(1) + \cdots + p(n-1) < U < p(1) + \cdots + p(n) \\ \ \vdots & \end{cases}$$
Then $X$ has the desired distribution.
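A minimal sketch of Algorithm 1.3.1, assuming finitely many values and using Python's `random.random()` as the "black box" for $U([0, 1])$. The function name and the optional argument `u` (which lets the caller supply the random number, useful for checking the case distinctions) are choices made for this example.

```python
import random

def simulate_discrete(values, probs, u=None):
    """Return values[j] for the first j with u < probs[0] + ... + probs[j]."""
    if u is None:
        u = random.random()  # plays the role of the random number U
    cumulative = 0.0
    for value, p in zip(values, probs):
        cumulative += p
        if u < cumulative:
            return value
    # Only reached if rounding makes the probabilities sum to slightly less than 1
    return values[-1]

print(simulate_discrete(["x1", "x2", "x3"], [0.2, 0.5, 0.3]))
```

The loop scans the cumulative sums in order, so the expected running time is shortest when the most likely values are listed first.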
We discuss next two general methods for simulating continuous random variables. The first is called the inverse transformation method and is based on the following
Algorithm 1.3.5 (Rejection method for continuous random variables). Let $X$ be a random variable with p.d.f. $f(x)$ and let $Y$ be a random variable with p.d.f. $g(x)$. Furthermore assume that there exists a constant $C$ such that
$$\frac{f(y)}{g(y)} \le C \quad \text{for all } y.$$
To simulate $X$:
1. Simulate $Y$ with p.d.f. $g$ and generate a random number $U$, independent of $Y$.
2. If $U \le \frac{f(Y)}{C g(Y)}$ set $X = Y$. Otherwise return to Step 1.
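The rejection method can be sketched generically as below. The function `rejection_sample` and its signature are hypothetical, and the target used for the demonstration ($f(x) = 2x$ on $(0, 1)$ with uniform proposals, so that $C = 2$) is an illustrative choice that does not come from the notes.

```python
import random

def rejection_sample(f, g, sample_g, C, rng):
    """Return (X, iterations): X has density f, proposals have density g, f <= C * g."""
    iterations = 0
    while True:
        iterations += 1
        y = sample_g(rng)            # proposal Y with p.d.f. g
        u = rng.random()             # independent random number U
        if u <= f(y) / (C * g(y)):   # accept with probability f(Y) / (C g(Y))
            return y, iterations

# Illustrative target (an assumption): f(x) = 2x on (0, 1), g = 1 on (0, 1), C = 2
rng = random.Random(0)  # fixed seed, chosen arbitrarily
draws = [rejection_sample(lambda x: 2 * x, lambda x: 1.0, lambda r: r.random(), 2.0, rng)
         for _ in range(20_000)]
mean_x = sum(x for x, _ in draws) / len(draws)       # E[X] = 2/3 for this target
mean_iters = sum(k for _, k in draws) / len(draws)   # number of proposals per accepted value
print(mean_x, mean_iters)
```

Returning the iteration count alongside the sample makes it easy to observe how the constant $C$ affects the cost of the method.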
That the algorithm does the job is the object of the following proposition.
Proposition 1.3.6 The random variable X generated by the rejection method has p.d.f f (x).
Proof: To obtain a value of $X$ we will need in general to iterate the algorithm a random number of times. We generate random variables $Y_1, \dots, Y_N$ until $Y_N$ is accepted and then set $X = Y_N$. We need to verify that the p.d.f. of $X$ is actually $f(x)$. We have
$$\begin{aligned}
P(X \le x) &= P(Y_N \le x) \\
&= P\left(Y \le x \,\middle|\, U \le \frac{f(Y)}{C g(Y)}\right) \\
&= \frac{P\left(Y \le x,\ U \le \frac{f(Y)}{C g(Y)}\right)}{P\left(U \le \frac{f(Y)}{C g(Y)}\right)} \\
&= \frac{\int_{-\infty}^{\infty} P\left(Y \le x,\ U \le \frac{f(Y)}{C g(Y)} \,\middle|\, Y = y\right) g(y)\, dy}{P\left(U \le \frac{f(Y)}{C g(Y)}\right)} \\
&= \frac{\int_{-\infty}^{x} P\left(U \le \frac{f(y)}{C g(y)}\right) g(y)\, dy}{P\left(U \le \frac{f(Y)}{C g(Y)}\right)} \\
&= \frac{\int_{-\infty}^{x} \frac{f(y)}{C g(y)}\, g(y)\, dy}{P\left(U \le \frac{f(Y)}{C g(Y)}\right)} \\
&= \frac{\int_{-\infty}^{x} f(y)\, dy}{C\, P\left(U \le \frac{f(Y)}{C g(Y)}\right)}.
\end{aligned}$$
If we let $x \to \infty$ we obtain that $C\, P\left(U \le \frac{f(Y)}{C g(Y)}\right) = 1$ and thus
$$P(X \le x) = \int_{-\infty}^{x} f(y)\, dy,$$
and this shows that $X$ has p.d.f. $f(x)$.
In order to decide whether this method is efficient or not, we need to ensure that rejections occur with small probability. The above proof shows that at each iteration the probability that the result is accepted is
$$P\left(U \le \frac{f(Y)}{C g(Y)}\right) = \frac{1}{C},$$
independently of the other iterations. Therefore the number of iterations needed is $\mathrm{Geom}(\frac{1}{C})$, with mean $C$. The ability to choose a reasonably small $C$ will thus ensure that the method is efficient.
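Example 1.3.7 Let $X$ be the random variable with p.d.f.
$$f(x) = 20x(1 - x)^3, \qquad 0 < x < 1.$$
Since the p.d.f. is concentrated on $[0, 1]$ let us take
$$g(x) = 1, \qquad 0 < x < 1.$$
To determine $C$ such that $f(x)/g(x) \le C$ we need to maximize the function $h(x) \equiv f(x)/g(x) = 20x(1 - x)^3$. Differentiating gives $h'(x) = 20\left((1 - x)^3 - 3x(1 - x)^2\right) = 20(1 - x)^2(1 - 4x)$, and thus the maximum is attained at $x = 1/4$. Thus
$$\frac{f(x)}{g(x)} \le 20 \cdot \frac{1}{4} \left(\frac{3}{4}\right)^3 = \frac{135}{64} \equiv C.$$
We obtain
$$\frac{f(x)}{C g(x)} = \frac{256}{27}\, x(1 - x)^3$$
and the rejection method is
1. Generate random numbers $U_1$ and $U_2$.
2. If $U_2 \le \frac{256}{27}\, U_1 (1 - U_1)^3$ set $X = U_1$. Otherwise return to Step 1.
The average number of iterations needed is $C = 135/64$.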
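Example 1.3.7 can be tried out directly. The sketch below assumes the density $f(x) = 20x(1 - x)^3$ (the function $h$ maximized in the example) with uniform proposals, and it records the number of iterations per accepted value so that the average can be compared with $C = 135/64 \approx 2.11$. The seed and the sample size are arbitrary.

```python
import random

C = 135 / 64  # maximum of f(x) / g(x) = 20 x (1 - x)^3, attained at x = 1/4

def sample_example(rng):
    """Return (X, iterations) for the density f(x) = 20 x (1 - x)^3 on (0, 1)."""
    iterations = 0
    while True:
        iterations += 1
        u1 = rng.random()  # proposal Y = U1, with density g = 1 on (0, 1)
        u2 = rng.random()  # independent random number for the acceptance test
        if u2 <= (256 / 27) * u1 * (1 - u1) ** 3:  # f(U1) / (C g(U1))
            return u1, iterations

rng = random.Random(2007)  # fixed seed, chosen arbitrarily
draws = [sample_example(rng) for _ in range(50_000)]
mean_x = sum(x for x, _ in draws) / len(draws)
mean_iterations = sum(k for _, k in draws) / len(draws)
print(mean_x, mean_iterations)
```

Since $\int_0^1 x \cdot 20x(1 - x)^3\, dx = 1/3$, the sample mean should be close to $1/3$ and the average iteration count close to $135/64$.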
Algorithm 1.3.10 (Geometric random variable)
Then X = Geom(p).
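The steps of Algorithm 1.3.10 are not reproduced in this copy of the notes. One standard way to simulate $\mathrm{Geom}(p)$, assumed here and not necessarily the author's, is inversion: since $P(X > n) = (1 - p)^n$, setting $X = \lfloor \log U / \log(1 - p) \rfloor + 1$ gives $P(X = n) = (1 - p)^{n-1} p$. The function name is hypothetical.

```python
import math
import random

def simulate_geometric(p, rng):
    """Return X with P(X = n) = (1 - p)^(n - 1) p for n = 1, 2, ..., where 0 < p < 1."""
    u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is finite
    # X = n exactly when (1 - p)^n < u <= (1 - p)^(n - 1)
    return int(math.log(u) / math.log(1.0 - p)) + 1

rng = random.Random(10)  # fixed seed, chosen arbitrarily
geo_sample = [simulate_geometric(0.25, rng) for _ in range(40_000)]
print(sum(geo_sample) / len(geo_sample))  # should be near E[X] = 1/p = 4
```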
Example 1.3.11 (Simulating the Gamma random variable) Using the fact that $\mathrm{Gamma}(n, \lambda)$ is a sum of $n$ independent $\mathrm{Exp}(\lambda)$ random variables, one immediately obtains
Algorithm 1.3.12 (Gamma random variable)
Then $X = \mathrm{Gamma}(n, \lambda)$.
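The steps of Algorithm 1.3.12 are missing from this copy. A sketch consistent with Example 1.3.11 is given below, under the assumption that each $\mathrm{Exp}(\lambda)$ variable is produced by inversion as $-\log(U)/\lambda$; summing $n$ of them gives $-\log(U_1 \cdots U_n)/\lambda$. The function name is hypothetical.

```python
import math
import random

def simulate_gamma(n, lam, rng):
    """Return a sample of Gamma(n, lam) as a sum of n independent Exp(lam) variables."""
    product = 1.0
    for _ in range(n):
        product *= 1.0 - rng.random()  # random number in (0, 1]
    # -log(U) / lam is Exp(lam), and the sum of the logs is the log of the product
    return -math.log(product) / lam

rng = random.Random(3)  # fixed seed, chosen arbitrarily
gamma_sample = [simulate_gamma(3, 2.0, rng) for _ in range(40_000)]
gamma_mean = sum(gamma_sample) / len(gamma_sample)
gamma_var = sum((x - gamma_mean) ** 2 for x in gamma_sample) / len(gamma_sample)
print(gamma_mean, gamma_var)  # compare with n/lam = 1.5 and n/lam^2 = 0.75
```

Taking a single logarithm of the product, rather than $n$ separate logarithms, is a small efficiency choice; for very large $n$ the product could underflow and summing the logarithms would be safer.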
Finally we give an elegant algorithm which generates 2 independent normal random variables.
Example 1.3.13 (Simulating a normal random variable: Box-Müller) We show a simple way to generate 2 independent standard normal random variables $X$ and $Y$. The joint p.d.f. of $X$ and $Y$ is given by
$$f(x, y) = \frac{1}{2\pi}\, e^{-\frac{x^2 + y^2}{2}}.$$
Let us change into polar coordinates $(r, \theta)$ with $r^2 = x^2 + y^2$ and $\tan(\theta) = y/x$. The change of variables formula gives
$$f(x, y)\, dx\, dy = r e^{-\frac{r^2}{2}}\, dr\, \frac{1}{2\pi}\, d\theta.$$
Consider further the change of variables $s = r^2$, so that
$$f(x, y)\, dx\, dy = \frac{1}{2}\, e^{-\frac{s}{2}}\, ds\, \frac{1}{2\pi}\, d\theta.$$
The right-hand side is easily seen to be the joint p.d.f. of the two independent random variables $S = \mathrm{Exp}(1/2)$ and $\Theta = U([0, 2\pi])$. Therefore we obtain
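Algorithm 1.3.14 (Standard normal random variable)
1. Generate two independent random numbers $U_1$ and $U_2$.
2. Set
$$X = \sqrt{-2\log(U_1)}\, \cos(2\pi U_2) \tag{1.4}$$
$$Y = \sqrt{-2\log(U_1)}\, \sin(2\pi U_2) \tag{1.5}$$
Then $X$ and $Y$ are 2 independent $N(0, 1)$.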
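Formulas (1.4) and (1.5) translate directly into code. The sketch below assumes Python's `random` module as the source of random numbers; the function name, seed and sample size are choices made for this example.

```python
import math
import random

def box_muller(rng):
    """Return two independent N(0, 1) samples built from two random numbers."""
    u1 = 1.0 - rng.random()  # in (0, 1], avoids log(0)
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))  # R = sqrt(S) with S = Exp(1/2)
    angle = 2.0 * math.pi * u2               # Theta = U([0, 2 pi])
    return radius * math.cos(angle), radius * math.sin(angle)

rng = random.Random(42)  # fixed seed, chosen arbitrarily
pairs = [box_muller(rng) for _ in range(50_000)]
mean_x = sum(x for x, _ in pairs) / len(pairs)
mean_y = sum(y for _, y in pairs) / len(pairs)
var_x = sum(x * x for x, _ in pairs) / len(pairs) - mean_x ** 2
var_y = sum(y * y for _, y in pairs) / len(pairs) - mean_y ** 2
cov_xy = sum(x * y for x, y in pairs) / len(pairs) - mean_x * mean_y
print(mean_x, mean_y, var_x, var_y, cov_xy)
```

The sample means and covariance should be close to 0 and the sample variances close to 1. A vanishing sample covariance is only a necessary check for independence, not a proof of it.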
We start by deriving simple techniques for bounding the tail distribution of a random variable, i.e., bounding the probability that the random variable takes values far from its mean. Our first inequality, called Markov's inequality, simply assumes that we know the mean of $X$.
Proposition 1.4.1 (Markov's Inequality) Let $X$ be a random variable which assumes only nonnegative values, i.e. $P(X \ge 0) = 1$. Then for any $a > 0$ we have
$$P(X \ge a) \le \frac{E[X]}{a}.$$
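Proof: For $a > 0$ let us define the random variable
$$I_a = \begin{cases} 1 & \text{if } X \ge a \\ 0 & \text{otherwise} \end{cases}$$
Note that, since $X \ge 0$, we have
$$I_a \le \frac{X}{a} \tag{1.6}$$
and that, since $I_a$ takes only the values 0 and 1,
$$E[I_a] = P(X \ge a).$$
Taking expectations in the inequality (1.6) gives
$$P(X \ge a) = E[I_a] \le E\left[\frac{X}{a}\right] = \frac{E[X]}{a}.$$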
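To get a feeling for how loose Markov's inequality can be, the sketch below compares the bound $E[X]/a$ with an exact tail probability. The choice $X = \mathrm{Exp}(1)$, for which $E[X] = 1$ and $P(X \ge a) = e^{-a}$, is an illustration picked for this example, as is the function name.

```python
import math

def markov_bound(mean, a):
    """Upper bound E[X] / a on P(X >= a) for a nonnegative X and a > 0."""
    return mean / a

# X = Exp(1): E[X] = 1 and P(X >= a) = exp(-a) exactly
comparison = []
for a in (0.5, 1.0, 2.0, 5.0, 10.0):
    exact = math.exp(-a)
    bound = markov_bound(1.0, a)
    comparison.append((a, exact, bound))
    print(a, exact, bound)
```

The bound decays only like $1/a$ while the true tail decays exponentially, which motivates the sharper inequalities that use more information about the distribution.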
This is significantly better than the bound provided by Markov's inequality! Note also that we can do a bit better by noting that the distribution of $S_n$ is symmetric around its mean, and thus we can replace $4/n$ by $2/n$.
We can do better if we know all moments of the random variable $X$, for example if we know the moment generating function $M_X(t)$ of the random variable $X$. We have
Proposition 1.4.5 (Chernoff's bounds) Let $X$ be a random variable with moment generating function $M_X(t) = E[e^{tX}]$. Then for any $a$
$$P(X \ge a) \le \min_{t \ge 0} \frac{E[e^{tX}]}{e^{ta}},$$
$$P(X \le a) \le \min_{t \le 0} \frac{E[e^{tX}]}{e^{ta}}.$$
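Proof: This follows from Markov's inequality. For $t > 0$ we have
$$P(X \ge a) = P(e^{tX} \ge e^{ta}) \le \frac{E[e^{tX}]}{e^{ta}}.$$
Since $t > 0$ is arbitrary (and the bound is trivial for $t = 0$) we obtain
$$P(X \ge a) \le \min_{t \ge 0} \frac{E[e^{tX}]}{e^{ta}}.$$
Similarly for $t < 0$ we have
$$P(X \le a) = P(e^{tX} \ge e^{ta}) \le \frac{E[e^{tX}]}{e^{ta}}$$
and thus
$$P(X \le a) \le \min_{t \le 0} \frac{E[e^{tX}]}{e^{ta}}.$$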
Let us consider again our coin-flipping example.
Example 1.4.6 (Flipping coins, cont'd) Since $S_n$ is a binomial $B(n, \frac{1}{2})$ random variable its moment generating function is given by $M_{S_n}(t) = \left(\frac{1}{2} + \frac{1}{2} e^t\right)^n$. To estimate $P(S_n \ge 3n/4)$ we apply the Chernoff bound with $t > 0$ and obtain
$$P\left(S_n \ge \frac{3n}{4}\right) \le \frac{\left(\frac{1}{2} + \frac{1}{2} e^t\right)^n}{e^{\frac{3nt}{4}}} = \left(\frac{1}{2}\, e^{-\frac{3t}{4}} + \frac{1}{2}\, e^{\frac{t}{4}}\right)^n.$$
To find the optimal bound we minimize the function $f(t) = \frac{1}{2}\, e^{-\frac{3t}{4}} + \frac{1}{2}\, e^{\frac{t}{4}}$. The minimum is at $t = \log 3$ and
$$f(\log 3) = \frac{1}{2}\left(e^{-\frac{3}{4}\log 3} + e^{\frac{1}{4}\log 3}\right) = \frac{1}{2}\, e^{\frac{1}{4}\log 3}\left(e^{-\log 3} + 1\right) = \frac{2}{3}\, 3^{\frac{1}{4}} \simeq 0.877$$
and thus we obtain
$$P\left(S_n \ge \frac{3n}{4}\right) \le 0.877^n.$$
This is of course much better than $2/n$. For $n = 100$, Chebyshev's inequality tells us that the probability of obtaining at least 75 heads is not bigger than 0.02, while the Chernoff bound tells us that it is actually not greater than $2.09 \times 10^{-6}$.
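The three quantities compared in this example can be computed side by side. The sketch below evaluates the exact binomial tail $P(S_n \ge 3n/4)$ from the p.m.f., the Chernoff bound $f(\log 3)^n$, and the bound $2/n$ quoted above; the function names are hypothetical and `math.comb` requires Python 3.8 or later.

```python
import math

def exact_tail(n):
    """P(S_n >= 3n/4) for S_n = B(n, 1/2), summed from the binomial p.m.f."""
    threshold = math.ceil(3 * n / 4)
    return sum(math.comb(n, k) for k in range(threshold, n + 1)) / 2 ** n

def chernoff_bound(n):
    """f(log 3)^n with f(t) = (exp(-3t/4) + exp(t/4)) / 2."""
    t = math.log(3.0)
    return (0.5 * math.exp(-0.75 * t) + 0.5 * math.exp(0.25 * t)) ** n

n = 100
print(exact_tail(n), chernoff_bound(n), 2 / n)
```

For $n = 100$ the exact tail lies below the Chernoff bound, which in turn lies far below $2/n = 0.02$.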
In this section we study the behavior, for large $n$, of a sum of independent identically distributed random variables. That is, let $X_1, X_2, \dots$ be a sequence of independent random variables where all the $X_i$'s have the same distribution. We denote by $S_n$ the sum
$$S_n = X_1 + \cdots + X_n.$$
Under suitable conditions $S_n$ will exhibit a universal behavior which does not depend on all the details of the distribution of the $X_i$'s but only on a few of its characteristics, like the mean or the variance. The first result is the weak law of large numbers. It tells us that if we perform a large number of independent trials, the average value of our trials is close to the mean with probability close to 1. The proof is not very difficult, but it is a very important result!
Theorem 1.5.1 (The weak Law of Large Numbers) Let $X_1, X_2, \dots$ be a sequence of independent identically distributed random variables with mean $\mu$ and variance $\sigma^2$. Let $S_n = X_1 + \cdots + X_n$.
Then for any $\epsilon > 0$
$$\lim_{n \to \infty} P\left(\left|\frac{S_n}{n} - \mu\right| \ge \epsilon\right) = 0.$$
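The statement of the theorem can be illustrated by simulation. The sketch below takes the $X_i$ to be uniform on $[0, 1]$ (so $\mu = 1/2$), which is an arbitrary choice for this example, and estimates $P(|S_n/n - \mu| \ge \epsilon)$ by repeating the experiment many times. The function name, seed, $\epsilon$ and number of trials are likewise arbitrary.

```python
import random

def fraction_far_from_mean(n, eps, trials, rng):
    """Estimate P(|S_n / n - 1/2| >= eps) for i.i.d. X_i uniform on [0, 1]."""
    far = 0
    for _ in range(trials):
        s_n = sum(rng.random() for _ in range(n))
        if abs(s_n / n - 0.5) >= eps:
            far += 1
    return far / trials

rng = random.Random(1)  # fixed seed, chosen arbitrarily
estimates = {n: fraction_far_from_mean(n, eps=0.05, trials=2000, rng=rng)
             for n in (10, 100, 1000)}
for n, estimate in estimates.items():
    print(n, estimate)
```

The estimated probabilities should decrease toward 0 as $n$ grows, as the theorem predicts.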
Proof: By the linearity of expectation we have
$$E\left[\frac{S_n}{n}\right] = \frac{1}{n}\, E[X_1 + \cdots + X_n] = \frac{n\mu}{n} = \mu.$$