POL502: Probability, Lecture notes of Probability and Statistics

These notes cover the definition of probability and independence in the context of set theory. They define sample space, event, and sigma algebra, introduce the probability axioms and probability measures, and present examples and theorems related to probability.

POL502: Probability

Kosuke Imai

Department of Politics, Princeton University

December 12, 2003

1 Probability and Independence

To define probability, we rely on the set theory we learned in the first chapter of this course. In particular, we consider an experiment (or trial) and its result, called an outcome. Tossing a coin is a simple experiment anyone can do, but more complicated phenomena such as elections can also be considered experiments.

Definition 1 The set of all possible outcomes of an experiment is called the sample space of the experiment and denoted by Ω. Any subset of Ω is called an event.

That is, an event is any collection of possible outcomes of an experiment.

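Definition 2 A collection of subsets of Ω is called a sigma algebra (or sigma field) and denoted by $\mathcal{F}$ if it satisfies the following properties:

  1. If $\{A_n\}_{n=1}^{\infty}$ is a sequence of sets such that $A_n \in \mathcal{F}$ for any $n \in \mathbb{N}$, then $\bigcup_{n=1}^{\infty} A_n \in \mathcal{F}$.
  2. If $A \in \mathcal{F}$, then $A^C \in \mathcal{F}$.
  3. $\emptyset \in \mathcal{F}$.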

The definition implies that $\bigcap_{n=1}^{\infty} A_n$ also belongs to $\mathcal{F}$ for any sequence of sets $A_n \in \mathcal{F}$, and that ∅ and Ω belong to any sigma algebra (why?). Given a particular Ω, we have many sigma algebras, the smallest of which is {∅, Ω} and is called the trivial sigma algebra. To sum up, any experiment is associated with a pair $(\Omega, \mathcal{F})$ called a measurable space, where Ω is the set of all possible outcomes and $\mathcal{F}$ contains all events in whose occurrence we are interested. An example may help to understand these concepts.

Example 1 What is the sample space of flipping a fair coin twice? What is the sigma algebra which consists of all subsets of the sample space?
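One way to explore Example 1 is to enumerate everything by machine. The sketch below assumes that outcomes are recorded as ordered pairs of H and T; the helper names `power_set` and `is_sigma_algebra` are our own and only illustrate Definition 2 on a finite sample space.

```python
from itertools import chain, combinations, product

# Sample space for two coin flips: ordered pairs of H/T (an assumed encoding).
omega = frozenset(product("HT", repeat=2))

def power_set(s):
    """Return the set of all subsets of s, each subset as a frozenset."""
    items = list(s)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(c) for c in subsets}

def is_sigma_algebra(collection, omega):
    """Check the three properties of Definition 2 on a finite collection."""
    has_empty = frozenset() in collection
    closed_complement = all(omega - a in collection for a in collection)
    # On a finite space, closure under pairwise unions already gives
    # closure under countable unions.
    closed_union = all(a | b in collection for a in collection for b in collection)
    return has_empty and closed_complement and closed_union

F = power_set(omega)
print(sorted("".join(w) for w in omega))               # the four outcomes
print(len(F))                                          # number of events
print(is_sigma_algebra(F, omega))                      # power set
print(is_sigma_algebra({frozenset(), omega}, omega))   # trivial sigma algebra
```

The power set has 2^4 = 16 members, and both it and the trivial collection {∅, Ω} pass the check.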

Now, we are ready to define probability. The following is called the Probability Axiom (or Kolmogorov’s Axiom).

Axiom 1 Given an experiment with a measurable space $(\Omega, \mathcal{F})$, a probability measure $P : \mathcal{F} \to [0, 1]$ is a function satisfying

  1. $P(\emptyset) = 0$
  2. $P(\Omega) = 1$
  3. If $A_1, A_2, \ldots \in \mathcal{F}$ is a collection of disjoint sets (i.e., $A_n \cap A_m = \emptyset$ for all pairs of $n$ and $m$ with $n \neq m$), then

$$P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(A_n)$$

$(\Omega, \mathcal{F}, P)$ is called a probability space.

Given an event A, if P (A) = 0, then A is called null. If P (A) = 1, we say A occurs almost surely. Note that null events are not necessarily impossible. In fact, they occur all the time (why?).

Example 2 What is the probability that we get heads twice in the experiment of Example 1?
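A small sketch of how Example 2 could be computed. It assumes the coin is fair and the two flips do not influence each other, so that each of the four outcomes has mass 1/4; the function `P` below is our own illustrative uniform measure, not a general construction.

```python
from fractions import Fraction
from itertools import product

omega = frozenset(product("HT", repeat=2))

def P(event):
    """Uniform probability measure: each of the four outcomes has mass 1/4."""
    return Fraction(len(event & omega), len(omega))

two_heads = frozenset({("H", "H")})
at_least_one_tail = omega - two_heads

print(P(two_heads))                    # 1/4
print(P(frozenset()), P(omega))        # 0 1
# Additivity for the two disjoint events above
print(P(two_heads | at_least_one_tail) == P(two_heads) + P(at_least_one_tail))
```

Exact fractions are used so that the axioms can be checked with equality rather than floating-point tolerance.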

We derive some familiar (and maybe unfamiliar) properties of probability.

Theorem 1 (Probability) Let $P$ be a probability measure and $A, B \in \mathcal{F}$.

  1. $P(A^C) = 1 - P(A)$.
  2. If $A \subset B$, then $P(B) = P(A) + P(B \setminus A) \geq P(A)$.
  3. $P(A^C \cap B) = P(B) - P(A \cap B)$.
  4. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

Example 3 $P(A \cap B) \geq P(A) + P(B) - 1$ is a special case of Bonferroni’s inequality. It follows from item 4 of Theorem 1 because $P(A \cup B) \leq 1$, and it can be used to bound the probability of a simultaneous event when that probability is unknown but the probabilities of the individual events are known.
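The identities in Theorem 1 and the Bonferroni bound $P(A \cap B) \geq P(A) + P(B) - 1$ can be checked on a toy case. The sketch below assumes one roll of a fair die with the uniform measure; the events `A` and `B` are arbitrary illustrative choices.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))      # one roll of a fair die (illustrative)

def P(event):
    """Uniform probability measure on the six faces."""
    return Fraction(len(event), len(omega))

A = frozenset({1, 2, 3, 4, 5})      # roll is at most 5
B = frozenset({2, 3, 4, 5, 6})      # roll is at least 2

# Theorem 1, items 1, 3 and 4
assert P(omega - A) == 1 - P(A)
assert P((omega - A) & B) == P(B) - P(A & B)
assert P(A | B) == P(A) + P(B) - P(A & B)

# Bonferroni lower bound versus the exact probability of the intersection
lower_bound = P(A) + P(B) - 1
print(lower_bound, P(A & B))        # 2/3 2/3
```

Here the bound is attained because A ∪ B is the whole sample space; for events whose union is smaller, the bound is strict.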

We can extend these properties to a sequence of sets.

Theorem 2 (Probability and Sequence of Sets) Let P be a probability measure.

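  1. If $\{A_n\}_{n=1}^{\infty}$ is an increasing (decreasing) sequence of events such that $A_1 \subset A_2 \subset \cdots$ ($A_1 \supset A_2 \supset \cdots$), then for $A = \lim_{n \to \infty} A_n$ we have

$$P(A) = \lim_{n \to \infty} P(A_n)$$

  2. (Boole’s Inequality) If $\{A_n\}_{n=1}^{\infty}$ is a sequence of sets, then

$$P\left(\bigcup_{n=1}^{\infty} A_n\right) \leq \sum_{n=1}^{\infty} P(A_n)$$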

Next, we study the conditional probability and independence.

Definition 3 If $P(B) > 0$, then the conditional probability that $A$ occurs given that $B$ occurs is defined as

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
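Definition 3 can be applied mechanically on a finite sample space. The sketch below reuses the two-coin experiment with a uniform measure (an assumption on our part) and shows how the answer depends on the exact wording of the conditioning event; `cond_P` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import product

omega = frozenset(product("HT", repeat=2))

def P(event):
    """Uniform probability measure on the four outcomes."""
    return Fraction(len(event), len(omega))

def cond_P(A, B):
    """P(A | B) = P(A ∩ B) / P(B), defined only when P(B) > 0."""
    if P(B) == 0:
        raise ValueError("conditioning event has probability zero")
    return P(A & B) / P(B)

both_heads = frozenset({("H", "H")})
at_least_one_head = frozenset(w for w in omega if "H" in w)
first_is_head = frozenset(w for w in omega if w[0] == "H")

print(cond_P(both_heads, at_least_one_head))   # 1/3
print(cond_P(both_heads, first_is_head))       # 1/2
```

Conditioning on "at least one head" and on "the first flip is a head" gives different answers, even though the two statements sound similar.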

The conditional probability can be very tricky as the following example shows.

Example 4 A couple is expecting twins.

2 Random Variables and Probability Distributions

Often, we are more interested in some consequences of experiments than experiments themselves. For example, a gambler is more interested in how much they win or lose than the games they play. Formally, this is a function which maps the sample space into R or its subset.

Definition 5 A random variable is a function $X : \Omega \to \mathbb{R}$ satisfying $A(x) = \{\omega \in \Omega : X(\omega) \leq x\} \in \mathcal{F}$ for all $x \in \mathbb{R}$. Such a function is said to be $\mathcal{F}$-measurable.

After an experiment is done, the outcome ω ∈ Ω is revealed and a random variable X takes some value. The distribution function of a random variable describes how likely it is for X to take a particular value.

Definition 6 The distribution function of a random variable $X$ is the function $F : \mathbb{R} \to [0, 1]$ given by $F(x) = P(A(x))$ where $A(x) = \{\omega \in \Omega : X(\omega) \leq x\}$, or equivalently $F(x) = P(X \leq x)$.

Now, we understand why the technical condition $A(x) \in \mathcal{F}$ in Definition 5 was necessary. We sometimes write $F_X$ and $P_X$ in order to emphasize that these functions are defined for the random variable $X$. Two random variables, $X$ and $Y$, are said to be identically distributed if $P(X \in A) = P(Y \in A)$ for any set $A \subset \mathbb{R}$ for which these probabilities are defined. This implies in turn that $F_X(x) = F_Y(x)$ for any $x$. Finally, a distribution function has the following properties.

Theorem 6 (Distribution Function) A distribution function $F(x)$ of a random variable $X$ satisfies the following properties.

  1. $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.
  2. If $x < y$, then $F(x) \leq F(y)$.
  3. $F$ is right-continuous: that is, $\lim_{x \downarrow c} F(x) = F(c)$ for any $c \in \mathbb{R}$.

Given this theorem, one can prove the following: P (X > x) = 1 − F (x), P (x < X ≤ y) = F (y) − F (x), and P (X = x) = F (x) − limy↑x F (y).

Example 6 Two examples of random variables and their distribution functions.

  1. Bernoulli, Geometric. In a coin toss experiment, a Bernoulli random variable can be defined as X(head) = 1 and X(tail) = 0. What is the distribution function? What about the distribution function of a random variable which represents the number of tosses required to get a head?
  2. Logistic. A special case of the logistic distribution is given by $F(x) = \frac{1}{1 + e^{-x}}$. Confirm that this satisfies Theorem 6.
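Before attempting the proofs asked for in Example 6, it can help to look at the three distribution functions numerically. The sketch below is only a sanity check on a finite grid, not a proof; it assumes a fair coin (`p = 0.5`) by default, and the function names are our own.

```python
import math

def bernoulli_cdf(x, p=0.5):
    """F(x) = P(X <= x) for X(head) = 1, X(tail) = 0, with P(head) = p."""
    if x < 0:
        return 0.0
    if x < 1:
        return 1.0 - p
    return 1.0

def geometric_cdf(x, p=0.5):
    """F(x) for the number of tosses required to get the first head."""
    k = math.floor(x)
    return 0.0 if k < 1 else 1.0 - (1.0 - p) ** k

def logistic_cdf(x):
    """F(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

grid = [i / 10 for i in range(-300, 301)]
values = [logistic_cdf(x) for x in grid]
print(values[0] < 1e-12, 1 - values[-1] < 1e-12)         # tails approach 0 and 1
print(all(a <= b for a, b in zip(values, values[1:])))   # nondecreasing on the grid
print(bernoulli_cdf(0) - bernoulli_cdf(-1e-9))           # size of the jump at x = 0
```

The jump of the Bernoulli distribution function at 0 equals P(X = 0), in line with the identity P(X = x) = F(x) − lim y↑x F(y) stated after Theorem 6.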

One can classify random variables into two classes based on the probability function.

Definition 7 Let X be a random variable.

  1. $X$ is said to be discrete if it takes values in a countable subset $\{x_1, x_2, \ldots\}$ of $\mathbb{R}$. The discrete random variable has probability mass function $f : \mathbb{R} \to [0, 1]$ given by $f(x) = P(X = x)$.
  2. $X$ is said to be continuous if its distribution function can be expressed as

$$F(x) = \int_{-\infty}^{x} f(t) \, dt$$

for $x \in \mathbb{R}$ for some integrable function $f : \mathbb{R} \to [0, \infty)$ called the probability density function.

We may write $f_X(x)$ to stress the role of $X$. For discrete distributions, the distribution function and probability mass function are related by

$$F(x) = \sum_{x_n \leq x} f(x_n) \quad \text{and} \quad f(x) = F(x) - \lim_{y \uparrow x} F(y)$$

The mass function has the following property:

$$\sum_{n} f(x_n) = 1.$$

Example 7 Two examples of discrete random variables. What real world phenomena can we model using these random variables?

  1. Binomial. The sum of $n$ independent and identically distributed Bernoulli random variables with probability of success $p$ is a Binomial random variable, which takes values in the set $\{0, 1, 2, \ldots, n\}$. The probability mass function with parameters $p$ and $n$ is

$$f(x) = \binom{n}{x} p^x (1 - p)^{n - x}$$

  2. Poisson. A Poisson random variable $X$ takes values in the set $\{0, 1, 2, \ldots\}$ with the following probability mass function with parameter $\lambda$:

$$f(x) = \frac{\lambda^x e^{-\lambda}}{x!}$$

  3. Poisson approximation to Binomial. Show that if $n$ is large and $p$ is small, the Poisson pmf can approximate the Binomial pmf.
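The approximation in the last item can be seen numerically before it is proved. The sketch below compares the two mass functions for one illustrative pair of values, n = 1000 and p = 0.003, with the Poisson parameter set to λ = np so that the means match; these numbers are our own choice.

```python
import math

def binomial_pmf(x, n, p):
    """Binomial mass function: C(n, x) p^x (1 - p)^(n - x)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """Poisson mass function: lam^x exp(-lam) / x!."""
    return lam**x * math.exp(-lam) / math.factorial(x)

n, p = 1000, 0.003      # illustrative values: large n, small p
lam = n * p             # match the means

max_gap = max(
    abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(0, 21)
)
print(f"largest pmf gap for x = 0..20: {max_gap:.2e}")
```

Rerunning with a larger p (say 0.3 with n = 10) shows the gap widening, which is consistent with the requirement that p be small.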

For continuous distributions, if $F$ is differentiable at $x$, the fundamental theorem of calculus implies $f(x) = F'(x)$. The density function has the following properties: $\int_{-\infty}^{\infty} f(x) \, dx = 1$, $P(X = x) = 0$ for all $x \in \mathbb{R}$, and $P(a \leq X \leq b) = \int_a^b f(x) \, dx$.

Example 8 Five examples of continuous distributions.

  1. Gamma, Exponential, $\chi^2$. A gamma random variable takes non-negative values and has the following density function with the parameters $\alpha > 0$ (shape parameter) and $\beta > 0$ (rate, or inverse scale, parameter):

$$f(x) = \frac{\beta^{\alpha} x^{\alpha - 1} e^{-\beta x}}{\Gamma(\alpha)} \quad \text{where} \quad \Gamma(\alpha) = \int_0^{\infty} t^{\alpha - 1} e^{-t} \, dt$$

The exponential distribution is a special case of the Gamma distribution with $\alpha = 1$, i.e., $f(x) = \beta e^{-\beta x}$. This distribution has the “memoryless” property. Another important special case occurs when $\alpha = p/2$ and $\beta = 1/2$, and is called the $\chi^2$ distribution with $p$ degrees of freedom.
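The memoryless property of the exponential distribution says that P(X > s | X > t) = P(X > s − t) for s > t. A minimal numerical check, assuming the density f(x) = βe^{−βx} so that P(X > x) = e^{−βx}, is sketched below; the values of `beta`, `s`, and `t` are arbitrary.

```python
import math

def exp_survival(x, beta):
    """P(X > x) for the density f(x) = beta * exp(-beta * x), x >= 0."""
    return math.exp(-beta * x) if x >= 0 else 1.0

def gamma_pdf(x, alpha, beta):
    """Gamma density with shape alpha and rate beta, for x > 0."""
    return beta**alpha * x ** (alpha - 1) * math.exp(-beta * x) / math.gamma(alpha)

beta, s, t = 0.7, 5.0, 2.0      # illustrative values with s > t

lhs = exp_survival(s, beta) / exp_survival(t, beta)   # P(X > s | X > t)
rhs = exp_survival(s - t, beta)                       # P(X > s - t)
print(math.isclose(lhs, rhs))

# With alpha = 1 the Gamma density reduces to the exponential density.
print(math.isclose(gamma_pdf(1.5, 1.0, beta), beta * math.exp(-beta * 1.5)))
```

The conditional probability is computed as a ratio because {X > s} is contained in {X > t} when s > t.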

  1. Multivariate Normal. An $n$-dimensional multivariate normal random vector $X = (X_1, \ldots, X_n)$ has the following density function

$$f(x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left[-\frac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu)\right]$$

where $\mu$ is an $n \times 1$ mean vector and $\Sigma$ is an $n \times n$ positive definite covariance matrix.

In addition to joint and marginal distributions, conditional distributions are often of interest.

Definition 9 Let $X$ and $Y$ be random variables with marginal probability mass (density) functions, $f_X(x)$ and $f_Y(y)$, and joint probability mass (density) function, $f(x, y)$.

  1. The conditional mass (density) function of $Y$ given $X$ is defined by

$$f(y \mid x) = \frac{f(x, y)}{f_X(x)}$$

  2. $X$ and $Y$ are said to be independent if $f(x, y) = f_X(x) f_Y(y)$.

If $X$ and $Y$ are not independent, the direct way to obtain the marginal density of $X$ from the joint density of $(X, Y)$ is to integrate out $Y$. That is, $f_X(x) = \int_{-\infty}^{\infty} f(x, y) \, dy$. If $Y$ is a discrete random variable, one needs to sum over $Y$. That is, $f_X(x) = \sum_{y \in \mathbb{R}} f(x, y)$. We end this section with the following examples.

Example 10 Consider two multivariate random variables defined in Example 9.

  1. Let $(X_1, X_2, \ldots, X_n)$ be a multinomial random vector. Show that the marginal distribution of $X_i$ for any $i \in \{1, \ldots, n\}$ is a Binomial distribution. Also, show that $(X_1, X_2, \ldots, X_{n-1})$ conditional on $X_n = x_n$ follows a multinomial distribution.
  2. Rewrite the bivariate Normal density function using means $\mu_1, \mu_2$, variances $\sigma_1^2, \sigma_2^2$ and the correlation $\rho$.
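The mechanics of Definition 9 and of summing out a variable are easiest to see on a small discrete case. The joint mass function below is entirely made up for illustration; the sketch computes both marginals, one conditional probability, and checks the independence condition.

```python
from fractions import Fraction

# Hypothetical joint pmf f(x, y) on {0, 1} x {0, 1, 2}; the numbers are made up.
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 4), (0, 2): Fraction(1, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 8), (1, 2): Fraction(1, 8),
}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

f_X = {x: sum(joint[(x, y)] for y in ys) for x in xs}   # sum out y
f_Y = {y: sum(joint[(x, y)] for x in xs) for y in ys}   # sum out x

def conditional_y_given_x(y, x):
    """f(y | x) = f(x, y) / f_X(x)."""
    return joint[(x, y)] / f_X[x]

independent = all(
    joint[(x, y)] == f_X[x] * f_Y[y] for x in xs for y in ys
)
print(f_X[0], f_Y[2], conditional_y_given_x(1, 0), independent)
```

For this table the joint mass function does not factor into the product of the marginals, so X and Y are not independent.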

3 Expectations and Functions of Random Variables

We have studied the behavior of random variables. In this section, we are concerned about their expectation (or mean value, expected value).

Definition 10 Let $X$ be a discrete (continuous) random variable with probability mass (density) function $f$. Then, the expectation of $X$ is defined as

  1. $E(X) = \sum_x x f(x)$ if $X$ is discrete.
  2. $E(X) = \int_{-\infty}^{\infty} x f(x) \, dx$ if $X$ is continuous.

If $E(|X|) = \infty$, then we say the expectation $E(X)$ does not exist. One sometimes writes $E_X$ to emphasize that the expectation is taken with respect to a particular random variable $X$. The expectation has the following properties, all of which follow directly from the properties of summation and integral. In particular, the expectation is a linear operator, so it does not in general commute with nonlinear functions (e.g., $1/E(X) \neq E(1/X)$).

Theorem 7 (Expectation) Let X and Y be random variables with probability mass (density) function fX and fY , respectively. Assume that their expectations exist, and let g be any function.

  1. $E[g(X)] = \sum_x g(x) f_X(x)$ if $X$ is discrete, and $E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) \, dx$ if $X$ is continuous.
  2. If $g_1(x) \geq g_2(x)$ for all $x$ and any functions $g_1$ and $g_2$, then $E[g_1(X)] \geq E[g_2(X)]$.
  3. $E(aX + bY) = aE(X) + bE(Y)$ for any $a, b \in \mathbb{R}$, and in particular $E(a) = a$.
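A short sketch of Theorem 7 on a concrete discrete case. It assumes a fair six-sided die (our own choice of example) and uses exact fractions to show that expectation respects linear maps but not the nonlinear map x ↦ 1/x.

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}   # fair die, illustrative

def E(g):
    """E[g(X)] = sum over x of g(x) f(x), as in item 1 of Theorem 7."""
    return sum(g(x) * p for x, p in pmf.items())

mean = E(lambda x: x)
print(mean)                                      # 7/2
print(E(lambda x: 3 * x + 2) == 3 * mean + 2)    # linearity
print(1 / mean, E(lambda x: Fraction(1, x)))     # 2/7 versus 49/120
```

The last line shows 1/E(X) = 2/7 while E(1/X) = 49/120, so the two quantities differ.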

Example 11 Calculate the expectations of the following random variables if they exist.

  1. Bernoulli random variable.
  2. Binomial random variable.
  3. Poisson random variable.
  4. Negative binomial. A negative binomial random variable is defined as the number of failures in Bernoulli trials that occur before $r$ successes are obtained. Its probability mass function is

$$f(x) = \binom{r + x - 1}{x} p^r (1 - p)^x$$

where $p$ is the probability of a success. The special case of the negative binomial distribution with $r = 1$ is the geometric distribution. Note that the geometric distribution has the memoryless property that we studied for the exponential distribution: i.e., with $X$ counting failures as above, $P(X \geq s \mid X \geq t) = P(X \geq s - t)$ for integers $s > t \geq 0$. You should be able to prove this by now.

  5. Gamma random variable.
  6. Beta random variable.
  7. Normal random variable.
  8. Cauchy. A Cauchy random variable takes a value in $(-\infty, \infty)$ with the following symmetric and bell-shaped density function.

$$f(x) = \frac{1}{\pi [1 + (x - \mu)^2]}$$
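For the Cauchy case, the phrase "if they exist" matters. One way to build intuition, sketched below under the assumption μ = 0, is to integrate |x| f(x) over [−T, T] and watch what happens as T grows. For this density the truncated integral has the closed form log(1 + T²)/π, which the code compares against a midpoint-rule approximation; the function names are our own.

```python
import math

def cauchy_pdf(x, mu=0.0):
    """Cauchy density 1 / (pi * (1 + (x - mu)^2))."""
    return 1.0 / (math.pi * (1.0 + (x - mu) ** 2))

def truncated_abs_moment(T, steps=200_000):
    """Midpoint-rule approximation of the integral of |x| f(x) over [-T, T]."""
    h = 2.0 * T / steps
    total = 0.0
    for i in range(steps):
        x = -T + (i + 0.5) * h
        total += abs(x) * cauchy_pdf(x) * h
    return total

for T in (10.0, 100.0, 1000.0):
    closed_form = math.log(1.0 + T * T) / math.pi
    print(T, round(truncated_abs_moment(T), 3), round(closed_form, 3))
```

The truncated integral keeps growing with T instead of settling down, which is the numerical face of E(|X|) = ∞.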

Theorem 13 (Hölder’s Inequality) Let $X$ and $Y$ be random variables and $p, q > 1$ satisfying $1/p + 1/q = 1$. Then, $|E(XY)| \leq E(|XY|) \leq [E(|X|^p)]^{1/p} [E(|Y|^q)]^{1/q}$. When $p = q = 2$, the inequality is called the Cauchy-Schwarz inequality. When $Y = 1$, it is Liapounov’s inequality.

Theorem 14 (Chebychev’s Inequality) Let $X$ be a random variable and let $g(x)$ be a nonnegative function. Then, for any $\epsilon > 0$,

$$P(g(X) \geq \epsilon) \leq \frac{E[g(X)]}{\epsilon}$$

Example 13 Let X be any random variable with mean μ and variance σ^2. Use Chebychev’s inequality to show that P (|X − μ| ≥ 2 σ) ≤ 1 / 4.
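The idea behind Example 13 is to apply Theorem 14 with g(x) = (x − μ)² and ε = 4σ², since the event {|X − μ| ≥ 2σ} is the same as {g(X) ≥ 4σ²}. The sketch below checks this on a hypothetical three-point distribution chosen by us so that the bound 1/4 is attained exactly.

```python
from fractions import Fraction

# Hypothetical three-point distribution chosen so that the bound is attained.
pmf = {-1: Fraction(1, 8), 0: Fraction(3, 4), 1: Fraction(1, 8)}

mu = sum(x * p for x, p in pmf.items())
var = sum((x - mu) ** 2 * p for x, p in pmf.items())

# Apply Theorem 14 with g(x) = (x - mu)^2 and epsilon = 4 * var.
eps = 4 * var
tail = sum(p for x, p in pmf.items() if (x - mu) ** 2 >= eps)
bound = var / eps          # E[g(X)] / epsilon = 1/4

print(mu, var, tail, bound)   # 0 1/4 1/4 1/4
```

Working with squared deviations avoids taking a square root, so the comparison stays exact in rational arithmetic.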

Next, we study functions of random variables and their distributions. For example, we want to answer the question: what is the distribution of $Y = g(X)$ given a random variable $X$ with distribution function $F(x)$?

Theorem 15 (Transformation of Univariate Random Variables) Let $X$ be a continuous random variable with distribution function $F_X$ and probability density function $f_X(x)$. Define $Y = g(X)$ where $g$ is a monotone function. Let also $\mathcal{X}$ and $\mathcal{Y}$ denote the supports of the distributions of $X$ and $Y$, respectively.

  1. If $g$ is an increasing (decreasing) function, the distribution function of $Y$ is given by $F_Y(y) = F_X(g^{-1}(y))$ ($F_Y(y) = 1 - F_X(g^{-1}(y))$) for $y \in \mathcal{Y}$.
  2. The probability density function of $Y$ is given by

$$f_Y(y) = \begin{cases} f_X(g^{-1}(y)) \left| \frac{d}{dy} g^{-1}(y) \right| & \text{for } y \in \mathcal{Y} \\ 0 & \text{otherwise} \end{cases}$$

Example 14 Derive the probability function for the following transformations of a random variable.

  1. Uniform. $Y = -\log(X)$ where $X \sim \text{Unif}(0, 1)$.
  2. Inverse Gamma. $Y = 1/X$ where $X \sim \text{Gamma}(\alpha, \beta)$.
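The density formula in Theorem 15 can be evaluated numerically as a check on a hand derivation. The sketch below treats the Uniform case: with g(x) = −log(x) the inverse is g⁻¹(y) = e^{−y}, and the derivative is approximated by a central difference. The helper `transformed_pdf` and the step size `h` are our own choices. The output is compared with e^{−y}, the density one obtains by carrying out the derivation for y > 0.

```python
import math

def transformed_pdf(y, f_X, g_inv, h=1e-6):
    """f_Y(y) = f_X(g^{-1}(y)) * |d/dy g^{-1}(y)|, derivative by central difference."""
    derivative = (g_inv(y + h) - g_inv(y - h)) / (2.0 * h)
    return f_X(g_inv(y)) * abs(derivative)

def uniform_pdf(x):
    """Density of Unif(0, 1)."""
    return 1.0 if 0.0 < x < 1.0 else 0.0

def g_inv(y):
    """Inverse of g(x) = -log(x)."""
    return math.exp(-y)

for y in (0.5, 1.0, 2.0):
    print(y, transformed_pdf(y, uniform_pdf, g_inv), math.exp(-y))
```

For y < 0 the inverse image e^{−y} falls outside (0, 1), so the function returns 0, matching the "0 otherwise" branch of the theorem.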

This rule can be generalized to the transformation of multivariate random variables.

Theorem 16 (Transformation of Bivariate Random Variables) Let $(X, Y)$ be a vector of two continuous random variables with joint probability density function $f_{X,Y}(x, y)$. Consider a bijective transformation $U = g_1(X, Y)$ and $V = g_2(X, Y)$. Define the inverse of this transformation as $X = h_1(U, V)$ and $Y = h_2(U, V)$. Then, the joint probability density function for $(U, V)$ is given by

$$f_{U,V}(u, v) = f_{X,Y}(h_1(u, v), h_2(u, v)) \, |J|$$

where

$$J = \begin{vmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{vmatrix}$$

is called the Jacobian.

Example 15 Derive the distribution of the following random variables.

  1. The joint distribution of U = X + Y and V = X − Y where both X and Y are independent standard normal random variables.
  2. The joint distribution of U = X + Y and V = X/Y where both X and Y are independent exponential random variables.
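For the first item, Theorem 16 can be applied step by step. The inverse transformation is x = (u + v)/2 and y = (u − v)/2, whose Jacobian determinant is −1/2. The sketch below evaluates the resulting joint density at one arbitrary point and compares it with the product of two normal densities with mean 0 and variance 2, which is what the derivation suggests; the function names are our own.

```python
import math

def std_normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def joint_xy(x, y):
    """Joint density of independent standard normals X and Y."""
    return std_normal_pdf(x) * std_normal_pdf(y)

def joint_uv(u, v):
    """Joint density of U = X + Y and V = X - Y via Theorem 16."""
    # Inverse transformation: x = (u + v) / 2, y = (u - v) / 2.
    x, y = (u + v) / 2.0, (u - v) / 2.0
    # Jacobian determinant: (1/2)(-1/2) - (1/2)(1/2) = -1/2, so |J| = 1/2.
    abs_jacobian = 0.5
    return joint_xy(x, y) * abs_jacobian

def normal_pdf(z, var):
    """Normal density with mean 0 and variance var."""
    return math.exp(-z * z / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

u, v = 0.8, -1.3     # arbitrary evaluation point
print(joint_uv(u, v), normal_pdf(u, 2.0) * normal_pdf(v, 2.0))
```

The two printed numbers agree, and the fact that the joint density factors into a function of u times a function of v indicates that U and V are independent in this case.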