






These notes cover the definition of probability and independence in the context of set theory. They define sample space, event, and sigma algebra, introduce the probability axioms and probability measures, and present examples and theorems related to probability.
Typology: Lecture notes







To define probability, we rely on the set theory we learned in the first chapter of this course. In particular, we consider an experiment (or trial) and its result, called an outcome. Tossing a coin is a simple experiment anyone can do, but more complicated phenomena such as elections can also be considered experiments.
Definition 1 The set of all possible outcomes of an experiment is called the sample space of the experiment and denoted by Ω. Any subset of Ω is called an event.
That is, an event is any collection of possible outcomes of an experiment.
Definition 2 A collection of subsets of Ω is called a sigma algebra (or sigma field) and denoted by F if it satisfies the following properties
1. F is nonempty.
2. If A ∈ F, then its complement A^c ∈ F.
3. If A_n ∈ F for n = 1, 2, ..., then $\bigcup_{n=1}^{\infty} A_n \in F$.
The definition implies that $\bigcap_{n=1}^{\infty} A_n$ also belongs to the sigma algebra and that ∅ and Ω belong to any sigma algebra (why?). Given a particular Ω, we have many sigma algebras, the smallest of which is {∅, Ω} and is called the trivial sigma algebra. To sum up, any experiment is associated with a pair (Ω, F) called a measurable space, where Ω is the set of all possible outcomes and F contains all events whose occurrence we are interested in. An example may help in understanding these concepts.
Example 1 What is the sample space of flipping a fair coin twice? What is the sigma algebra which consists of all subsets of the sample space?
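The objects asked for in Example 1 are small enough to enumerate by machine. The following is a minimal sketch, assuming outcomes are written as ordered strings such as "HT"; the names `omega`, `power_set`, and `sigma_algebra` are illustrative and do not come from the notes.

```python
from itertools import chain, combinations, product

# Sample space for two flips of a coin: ordered pairs of H/T.
omega = ["".join(flips) for flips in product("HT", repeat=2)]

def power_set(outcomes):
    """Return every subset of `outcomes` as a frozenset."""
    items = list(outcomes)
    subsets = chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]

# The largest sigma algebra on omega: all of its subsets.
sigma_algebra = power_set(omega)

print(omega)               # ['HH', 'HT', 'TH', 'TT']
print(len(sigma_algebra))  # 16
```

With four outcomes there are 2^4 = 16 events, including ∅ and Ω itself.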
Now, we are ready to define probability. The following is called the Probability Axiom (or Kolmogorov’s Axiom).
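Axiom 1 Given an experiment with a measurable space (Ω, F), a probability measure P : F → [0, 1] is a function satisfying
1. P(Ω) = 1.
2. If A_1, A_2, ... ∈ F are mutually disjoint, then
$$P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(A_n).$$
(Ω, F, P) is called a probability space.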
Given an event A, if P (A) = 0, then A is called null. If P (A) = 1, we say A occurs almost surely. Note that null events are not necessarily impossible. In fact, they occur all the time (why?).
Example 2 What is the probability that we get heads twice in the experiment of Example 1?
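One way to check an answer to Example 2 is to build the probability measure explicitly. The sketch below assumes a fair coin, so every outcome in Ω is equally likely; the helper `prob` is a hypothetical name introduced only for this illustration.

```python
from fractions import Fraction
from itertools import product

omega = ["".join(flips) for flips in product("HT", repeat=2)]

def prob(event):
    """Uniform probability measure on omega: |event| / |omega|."""
    return Fraction(len(set(event) & set(omega)), len(omega))

p_two_heads = prob({"HH"})
print(p_two_heads)  # 1/4

# Additivity for disjoint events, as required by the axiom.
additive = prob({"HH"}) + prob({"HT"}) == prob({"HH", "HT"})
```

Exact fractions are used so that the checks do not depend on floating-point rounding.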
We derive some familiar (and maybe unfamiliar) properties of probability.
Theorem 1 (Probability) Let P be a probability measure and A, B ∈ F.
Example 3 P(A ∩ B) ≥ P(A) + P(B) − 1 is a special case of Bonferroni’s inequality and can be used to bound the probability of a simultaneous event when that probability is unknown but the probabilities of the individual events are known.
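A quick numerical check of the Bonferroni bound P(A ∩ B) ≥ P(A) + P(B) − 1 can make it concrete. The sketch assumes one roll of a fair six-sided die; the events `A` and `B` are chosen arbitrarily for illustration.

```python
from fractions import Fraction

omega = set(range(1, 7))  # one roll of a fair six-sided die (assumed)

def prob(event):
    """Uniform probability of an event, given as a set of faces."""
    return Fraction(len(event & omega), len(omega))

A = {1, 2, 3, 4}
B = {2, 3, 4, 5}

lower_bound = prob(A) + prob(B) - 1  # 2/3 + 2/3 - 1 = 1/3
actual = prob(A & B)                 # P({2, 3, 4}) = 1/2

print(lower_bound, actual)  # 1/3 1/2
```

The bound uses only P(A) and P(B), which is what makes it useful when the joint behavior of the two events is unknown.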
We can extend these properties to a sequence of sets.
Theorem 2 (Probability and Sequence of Sets) Let P be a probability measure.
$$P(A) = \lim_{n \to \infty} P(A_n)$$
$$P\left(\bigcup_{n=1}^{\infty} A_n\right) \le \sum_{n=1}^{\infty} P(A_n)$$
Next, we study the conditional probability and independence.
Definition 3 If P(B) > 0, then the conditional probability that A occurs given that B occurs is defined as
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
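The definition can be applied mechanically to a finite sample space. The following sketch reuses the two-coin-flip experiment of Example 1 under the assumption of a fair coin; `cond_prob` and the event names are hypothetical.

```python
from fractions import Fraction
from itertools import product

omega = ["".join(flips) for flips in product("HT", repeat=2)]

def prob(event):
    """Uniform probability of an event (a set of outcomes in omega)."""
    return Fraction(len(event), len(omega))

def cond_prob(a, b):
    """P(A | B) = P(A ∩ B) / P(B), defined only when P(B) > 0."""
    if prob(b) == 0:
        raise ValueError("P(B) must be positive")
    return prob(a & b) / prob(b)

both_heads = {"HH"}
at_least_one_head = {w for w in omega if "H" in w}
first_is_head = {w for w in omega if w[0] == "H"}

print(cond_prob(both_heads, at_least_one_head))  # 1/3
print(cond_prob(both_heads, first_is_head))      # 1/2
```

The two answers differ even though both conditioning events sound like "there is a head", which shows how sensitive a conditional probability is to the exact conditioning event.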
Conditional probability can be very tricky, as the following example shows.
Example 4 A couple is expecting twins.
2 Random Variables and Probability Distributions
Often, we are more interested in some consequences of experiments than experiments themselves. For example, a gambler is more interested in how much they win or lose than the games they play. Formally, this is a function which maps the sample space into R or its subset.
Definition 5 A random variable is a function X : Ω → R satisfying A(x) = {ω ∈ Ω : X(ω) ≤ x} ∈ F for all x ∈ R. Such a function is said to be F-measurable.
After an experiment is done, the outcome ω ∈ Ω is revealed and a random variable X takes some value. The distribution function of a random variable describes how likely it is for X to take a particular value.
Definition 6 The distribution function of a random variable X is the function F : R → [0, 1] given by F(x) = P(A(x)) where A(x) = {ω ∈ Ω : X(ω) ≤ x}, or equivalently F(x) = P(X ≤ x).
Now, we understand why the technical condition A(x) ∈ F in Definition 5 was necessary. We sometimes write F_X and P_X in order to emphasize that these functions are defined for the random variable X. Two random variables, X and Y, are said to be identically distributed if P(X ∈ A) = P(Y ∈ A) for any (measurable) set A ⊆ R. This implies in turn that F_X(x) = F_Y(x) for any x. Finally, a distribution function has the following properties.
Theorem 6 (Distribution Function) A distribution function F(x) of a random variable X satisfies the following properties.
Given this theorem, one can prove the following: P(X > x) = 1 − F(x), P(x < X ≤ y) = F(y) − F(x), and $P(X = x) = F(x) - \lim_{y \uparrow x} F(y)$.
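These three identities can be checked on a small discrete example. The sketch assumes X is the number of heads in two flips of a fair coin; the helper `F_left` (the left limit of F) is an illustrative name.

```python
from fractions import Fraction

# Assumed example: X = number of heads in two fair coin flips.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def F(x):
    """Distribution function F(x) = P(X <= x)."""
    return sum(p for v, p in pmf.items() if v <= x)

def F_left(x):
    """Left limit lim_{y↑x} F(y), i.e. the mass strictly below x."""
    return sum(p for v, p in pmf.items() if v < x)

p_greater_0 = 1 - F(0)         # P(X > 0)
p_between = F(2) - F(0)        # P(0 < X <= 2)
p_equal_1 = F(1) - F_left(1)   # P(X = 1), the jump of F at 1

print(p_greater_0, p_between, p_equal_1)  # 3/4 3/4 1/2
```

The jump of F at a point recovers the probability mass there, which is why F is a step function for discrete random variables.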
Example 6 Two examples of random variables and their distribution functions.
One can classify random variables into two classes based on the probability function.
Definition 7 Let X be a random variable.
$$F(x) = \int_{-\infty}^{x} f(t)\,dt$$
for x ∈ R for some integrable function f : R → [0, ∞) called the probability density function.
We may write f_X(x) to stress the role of X. For discrete distributions, the distribution function and probability mass function are related by
$$F(x) = \sum_{x_n \le x} f(x_n) \quad \text{and} \quad f(x) = F(x) - \lim_{y \uparrow x} F(y).$$
The mass function has the following property:
$$\sum_{n} f(x_n) = 1.$$
Example 7 Two examples of discrete random variables. What real world phenomena can we model using these random variables?
$$f(x) = \binom{n}{x} p^{x} (1 - p)^{n - x}$$
$$f(x) = \frac{\lambda^{x}}{x!} e^{-\lambda}$$
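The two mass functions in Example 7 have the binomial and Poisson forms, and it can be useful to confirm numerically that each sums to one. This is a rough sketch; the parameter values (n = 10, p = 0.3, λ = 2.5) and the truncation of the Poisson sum at 60 terms are arbitrary assumptions.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """f(x) = C(n, x) p^x (1 - p)^(n - x) for x = 0, 1, ..., n."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """f(x) = lam^x e^(-lam) / x! for x = 0, 1, 2, ..."""
    return lam**x * exp(-lam) / factorial(x)

# Total mass over the full binomial support.
binom_total = sum(binomial_pmf(x, 10, 0.3) for x in range(11))

# The Poisson support is infinite; 60 terms is ample for lam = 2.5.
poisson_total = sum(poisson_pmf(x, 2.5) for x in range(60))
```

Both totals agree with 1 up to floating-point error, in line with the property of mass functions stated above.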
For continuous distributions, if F is differentiable at x, the fundamental theorem of calculus implies f(x) = F′(x). The density function has the following properties: $\int_{-\infty}^{\infty} f(x)\,dx = 1$, P(X = x) = 0 for all x ∈ R, and $P(a \le X \le b) = \int_{a}^{b} f(x)\,dx$.
Example 8 Five examples of continuous distributions.
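$$f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x} \quad \text{where} \quad \Gamma(\alpha) = \int_{0}^{\infty} t^{\alpha - 1} e^{-t}\,dt$$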
The exponential distribution is a special case of the Gamma distribution with α = 1, i.e., f(x) = βe^{−βx}. This distribution has the “memoryless” property. Another important special case occurs when α = p/2 and β = 1/2, and is called the χ^2 distribution with p degrees of freedom.
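Both claims about the exponential distribution can be checked numerically. The sketch below assumes the rate parametrization of the Gamma density (shape α, rate β), which is the form consistent with f(x) = βe^{−βx} at α = 1, and uses the standard survival function P(X > x) = e^{−βx}; the numbers β = 0.7, s = 3.0, t = 1.2 are arbitrary.

```python
from math import exp, gamma, isclose

def gamma_pdf(x, alpha, beta):
    """Gamma density with shape alpha and rate beta, for x > 0."""
    return beta**alpha / gamma(alpha) * x ** (alpha - 1) * exp(-beta * x)

def exponential_pdf(x, beta):
    """Exponential density f(x) = beta * exp(-beta * x), for x > 0."""
    return beta * exp(-beta * x)

def exponential_survival(x, beta):
    """P(X > x) = exp(-beta * x) for the exponential distribution."""
    return exp(-beta * x)

beta, s, t = 0.7, 3.0, 1.2

# Gamma with alpha = 1 reduces to the exponential density.
same_density = isclose(gamma_pdf(2.0, 1.0, beta), exponential_pdf(2.0, beta))

# Memoryless property: P(X > s | X > t) = P(X > s - t) for s > t.
lhs = exponential_survival(s, beta) / exponential_survival(t, beta)
rhs = exponential_survival(s - t, beta)
memoryless = isclose(lhs, rhs)
```

The conditional probability on the left is computed as P(X > s) / P(X > t), since the event {X > s} is contained in {X > t} when s > t.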
$$f(x) = \frac{1}{\sqrt{(2\pi)^{n} |\Sigma|}} \exp\left\{ -\frac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right\}$$
where μ is an n × 1 mean vector and Σ is an n × n positive definite covariance matrix.
In addition to joint and marginal distributions, conditional distributions are often of interest.
Definition 9 Let X and Y be random variables with marginal probability mass (density) functions, f_X(x) and f_Y(y), and joint probability mass (density) function, f(x, y).
$$f(y \mid x) = \frac{f(x, y)}{f_X(x)}$$
If X and Y are not independent, the direct way to obtain the marginal density of X from the joint density of (X, Y) is to integrate out Y. That is, $f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$. If Y is a discrete random variable, one needs to sum over Y. That is, $f_X(x) = \sum_{y \in \mathbb{R}} f(x, y)$. We end this section with the following examples.
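For a discrete pair, summing out Y and forming the conditional mass function takes only a few lines. The joint table below is a made-up illustration and is not taken from Example 9; the function names are likewise hypothetical.

```python
from fractions import Fraction

# Hypothetical joint mass function f(x, y) on {0, 1} x {0, 1, 2}.
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 4), (0, 2): Fraction(1, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 8), (1, 2): Fraction(1, 8),
}

def marginal_x(x):
    """f_X(x): sum the joint mass function over all values of y."""
    return sum(p for (xv, _), p in joint.items() if xv == x)

def conditional_y_given_x(y, x):
    """f(y | x) = f(x, y) / f_X(x), defined when f_X(x) > 0."""
    return joint.get((x, y), Fraction(0)) / marginal_x(x)

print(marginal_x(0))                # 1/2
print(conditional_y_given_x(1, 0))  # 1/2
```

For each fixed x with f_X(x) > 0, the conditional masses f(y | x) sum to one over y, so the conditional mass function is itself a valid mass function.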
Example 10 Consider two multivariate random variables defined in Example 9.
3 Expectations and Functions of Random Variables
We have studied the behavior of random variables. In this section, we are concerned with their expectation (or mean value, expected value).
Definition 10 Let X be a discrete (continuous) random variable with probability mass (density) function f. Then, the expectation of X is defined as
$$E(X) = \sum_{x} x f(x) \quad \text{if } X \text{ is discrete,}$$
$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx \quad \text{if } X \text{ is continuous.}$$
If E(|X|) = ∞, then we say the expectation E(X) does not exist. One sometimes writes E_X to emphasize that the expectation is taken with respect to a particular random variable X. The expectation has the following properties, all of which follow directly from the properties of summation and integration. In particular, the expectation is a linear operator, so it does not in general pass through nonlinear functions (e.g., 1/E(X) ≠ E(1/X)).
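The warning that 1/E(X) and E(1/X) differ is easy to see on a concrete case. The sketch assumes X is the face shown by a fair six-sided die; `expectation` is an illustrative helper implementing the discrete sum Σ g(x) f(x).

```python
from fractions import Fraction

# Assumed example: X is the outcome of one roll of a fair die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(g=lambda x: x):
    """E[g(X)] = sum over x of g(x) f(x) for a discrete X."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation()                                 # 7/2
inverse_of_mean = 1 / mean                           # 2/7
mean_of_inverse = expectation(lambda x: Fraction(1, x))  # 49/120

# Linearity does hold for affine functions: E(2X + 3) = 2 E(X) + 3.
linear_ok = expectation(lambda x: 2 * x + 3) == 2 * mean + 3

print(inverse_of_mean, mean_of_inverse)  # 2/7 49/120
```

Here E(1/X) = 49/120 is larger than 1/E(X) = 2/7, while the affine case matches exactly.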
Theorem 7 (Expectation) Let X and Y be random variables with probability mass (density) functions f_X and f_Y, respectively. Assume that their expectations exist, and let g be any function.
$E[g(X)] = \sum_{x} g(x) f_X(x)$ if X is discrete, and $E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx$ if X is continuous.
Example 11 Calculate the expectations of the following random variables if they exist.
$$f(x) = \binom{r + x - 1}{x} p^{r} (1 - p)^{x}$$
where p is the probability of a success. A special case of the negative binomial distribution when r = 1 is the geometric distribution. Note that the geometric distribution has the memoryless property that we studied for the exponential distribution, i.e., P(X > s | X > t) = P(X > s − t) for s > t. You should be able to prove this by now.
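The memoryless identity can be checked numerically. The sketch below assumes X counts the number of trials up to and including the first success (support 1, 2, ...), for which P(X > k) = (1 − p)^k and the identity holds exactly as written. If X instead counts failures before the first success (support 0, 1, ...), as in the r = 1 mass function above, the same property reads P(X ≥ s | X ≥ t) = P(X ≥ s − t). The values p = 0.3, s = 7, t = 4 are arbitrary.

```python
from math import isclose

def geometric_pmf(x, p):
    """P(X = x) when X counts trials until the first success, x = 1, 2, ..."""
    return p * (1 - p) ** (x - 1)

def geometric_survival(k, p):
    """P(X > k) = (1 - p)^k under the same trial-count convention."""
    return (1 - p) ** k

p, s, t = 0.3, 7, 4

# Sanity check of the survival formula against a long partial sum of the pmf.
tail_sum = sum(geometric_pmf(x, p) for x in range(t + 1, 2000))
survival_ok = isclose(tail_sum, geometric_survival(t, p))

# Memoryless property: P(X > s | X > t) = P(X > s - t) for s > t.
lhs = geometric_survival(s, p) / geometric_survival(t, p)
rhs = geometric_survival(s - t, p)
memoryless = isclose(lhs, rhs)
```

Having already waited t trials without a success does not change the distribution of the remaining waiting time.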
$$f(x) = \frac{1}{\pi [1 + (x - \mu)^{2}]}$$
Theorem 13 (Hölder’s Inequality) Let X and Y be random variables and p, q > 1 satisfying 1/p + 1/q = 1. Then, |E(XY)| ≤ E(|XY|) ≤ [E(|X|^p)]^{1/p} [E(|Y|^q)]^{1/q}. When p = q = 2, the inequality is called the Cauchy-Schwarz inequality. When Y = 1, it is Liapounov’s inequality.
Theorem 14 (Chebychev’s Inequality) Let X be a random variable and let g(x) be a nonnegative function. Then, for any ε > 0,
$$P(g(X) \ge \varepsilon) \le \frac{E[g(X)]}{\varepsilon}$$
Example 13 Let X be any random variable with mean μ and variance σ^2. Use Chebychev’s inequality to show that P(|X − μ| ≥ 2σ) ≤ 1/4.
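The bound in Example 13 follows from taking g(x) = (x − μ)^2 and ε = 4σ^2 in Chebychev’s inequality, since |X − μ| ≥ 2σ is the same event as (X − μ)^2 ≥ 4σ^2. The sketch below checks it on a made-up three-point distribution chosen so that the bound is attained; the distribution is an assumption for illustration only.

```python
from fractions import Fraction

# Hypothetical distribution: X is -2, 0, or 2 with masses 1/8, 3/4, 1/8.
pmf = {-2: Fraction(1, 8), 0: Fraction(3, 4), 2: Fraction(1, 8)}

mean = sum(x * p for x, p in pmf.items())                     # 0
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())   # 1

# |X - mu| >= 2 sigma  is equivalent to  (X - mu)^2 >= 4 sigma^2.
tail = sum(p for x, p in pmf.items() if (x - mean) ** 2 >= 4 * variance)

# Chebychev with g(x) = (x - mu)^2 and epsilon = 4 sigma^2.
bound = variance / (4 * variance)

print(tail, bound)  # 1/4 1/4
```

Working with squared deviations avoids taking a square root, so the whole check stays in exact rational arithmetic.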
Next, we study functions of random variables and their distributions. For example, we want to answer the question: what is the distribution of Y = g(X) given a random variable X with distribution function F(x)?
Theorem 15 (Transformation of Univariate Random Variables) Let X be a continuous random variable with distribution function F_X and probability density function f_X(x). Define Y = g(X) where g is a monotone function. Let also $\mathcal{X}$ and $\mathcal{Y}$ denote the supports of the distributions of X and Y, respectively.
$$f_Y(y) = \begin{cases} f_X(g^{-1}(y)) \left| \dfrac{d}{dy} g^{-1}(y) \right| & \text{for } y \in \mathcal{Y} \\ 0 & \text{otherwise} \end{cases}$$
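The density formula can be tried on a case with a known answer. The sketch assumes X is uniform on (0, 1) and Y = g(X) = −log X, so that g is monotone decreasing with inverse g^{-1}(y) = e^{−y} and the support of Y is (0, ∞); the derivative is approximated by a central difference with a hypothetical step size h.

```python
from math import exp, isclose

def f_X(x):
    """Density of the assumed Uniform(0, 1) random variable."""
    return 1.0 if 0.0 < x < 1.0 else 0.0

def g_inverse(y):
    """Inverse of g(x) = -log(x)."""
    return exp(-y)

def f_Y(y, h=1e-6):
    """f_Y(y) = f_X(g^{-1}(y)) * |d/dy g^{-1}(y)| on the support y > 0."""
    if y <= 0:
        return 0.0
    derivative = (g_inverse(y + h) - g_inverse(y - h)) / (2 * h)
    return f_X(g_inverse(y)) * abs(derivative)

# The formula should reproduce the exponential density exp(-y).
matches = isclose(f_Y(1.5), exp(-1.5), rel_tol=1e-6)
```

The absolute value matters here: the derivative of g^{-1} is negative because g is decreasing, yet the resulting density is positive.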
Example 14 Derive the probability function for the following transformations of a random variable.
This rule can be generalized to the transformation of multivariate random variables.
Theorem 16 (Transformation of Bivariate Random Variables) Let (X, Y) be a vector of two continuous random variables with joint probability density function f_{X,Y}(x, y). Consider a bijective transformation U = g_1(X, Y) and V = g_2(X, Y). Define the inverse of this transformation as X = h_1(U, V) and Y = h_2(U, V). Then, the joint probability density function for (U, V) is given by
$$f_{U,V}(u, v) = f_{X,Y}(h_1(u, v), h_2(u, v)) \, |J|$$
where
$$J = \begin{vmatrix} \dfrac{\partial x}{\partial u} & \dfrac{\partial x}{\partial v} \\ \dfrac{\partial y}{\partial u} & \dfrac{\partial y}{\partial v} \end{vmatrix}$$
is called the Jacobian.
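A linear transformation gives a simple test of the bivariate formula. The sketch assumes X and Y are independent standard normal variables and takes U = X + Y, V = X − Y, so that h_1(u, v) = (u + v)/2 and h_2(u, v) = (u − v)/2; the function names are illustrative. The known result is that U and V are then independent normals with mean 0 and variance 2.

```python
from math import exp, isclose, pi, sqrt

def normal_pdf(z, var=1.0):
    """Density of a normal distribution with mean 0 and variance var."""
    return exp(-z * z / (2 * var)) / sqrt(2 * pi * var)

def f_XY(x, y):
    """Joint density of two independent standard normals (assumed)."""
    return normal_pdf(x) * normal_pdf(y)

def h1(u, v):
    return (u + v) / 2

def h2(u, v):
    return (u - v) / 2

# Jacobian: dx/du = 1/2, dx/dv = 1/2, dy/du = 1/2, dy/dv = -1/2.
J = 0.5 * (-0.5) - 0.5 * 0.5  # determinant = -1/2

def f_UV(u, v):
    """f_{U,V}(u, v) = f_{X,Y}(h1(u, v), h2(u, v)) * |J|."""
    return f_XY(h1(u, v), h2(u, v)) * abs(J)

u, v = 0.8, -1.3
matches = isclose(f_UV(u, v), normal_pdf(u, 2.0) * normal_pdf(v, 2.0))
```

Because the transformed joint density factors into a function of u times a function of v, the check also illustrates that U and V are independent in this particular case.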
Example 15 Derive the distribution of the following random variables.