Probability Theory: De Morgan's Laws, Probability Axioms, and Conditional Probability, Study notes of Calculus

Fundamental concepts in probability theory, including De Morgan's Laws, the axioms of probability, and conditional probability. Topics include set operations, probability measures, and the concept of conditioning. Students will gain a solid understanding of the basic principles of probability theory and its applications.

Typology: Study notes

2021/2022

Uploaded on 07/05/2022

barbara_gr

Probability Cheatsheet v2.0
Original work at http://github.com/wzchen/probability_cheatsheet,
modified by Pablo Angulo for [email protected]
Last Updated August 3, 2020

Counting

Multiplication Rule

[Figure: tree diagrams for a two-component experiment, with branches labeled cake / waffle and S / V / C.]

Let’s say we have a compound experiment (an experiment with multiple components). If the 1st component has $n_1$ possible outcomes, the 2nd component has $n_2$ possible outcomes, ..., and the $r$th component has $n_r$ possible outcomes, then overall there are $n_1 n_2 \cdots n_r$ possibilities for the whole experiment.

Sampling Table


The sampling table gives the number of possible samples of size k out of a population of size n, under various assumptions about how the sample is collected.

                       Order Matters          Order Does Not Matter
With Replacement       $n^k$                  $\binom{n+k-1}{k}$
Without Replacement    $\frac{n!}{(n-k)!}$    $\binom{n}{k}$
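The four entries of the sampling table can be computed directly. The sketch below is only an illustration: the function name `sampling_counts` and the values n = 5, k = 3 are assumptions chosen for the example, not part of the original cheatsheet.

```python
from math import comb, perm


def sampling_counts(n: int, k: int) -> dict:
    """Number of samples of size k from a population of size n (hypothetical helper)."""
    return {
        ("with replacement", "order matters"): n ** k,
        ("with replacement", "order does not matter"): comb(n + k - 1, k),
        ("without replacement", "order matters"): perm(n, k),  # n! / (n - k)!
        ("without replacement", "order does not matter"): comb(n, k),
    }


counts = sampling_counts(n=5, k=3)
for assumptions, value in counts.items():
    print(assumptions, value)
```

Dividing the ordered count without replacement by k! gives the unordered count, since each unordered sample corresponds to k! orderings.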

Cardano’s Definition of Probability

If the number of outcomes is finite and all outcomes are equally likely, the probability of an event A happening is:

$P_{\text{Cardano}}(A) = \frac{\text{number of outcomes favorable to } A}{\text{number of outcomes}}$

Set algebra

Unions, Intersections, and Complements

Complements - The following are true.

$A \cup A^c = \Omega$
$A \cap A^c = \emptyset$

De Morgan’s Laws

$(A \cup B)^c = A^c \cap B^c$
$(A \cap B)^c = A^c \cup B^c$
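De Morgan’s laws can be checked on a small finite example with Python sets. This is a minimal sketch: the sample space `omega` and the events `A` and `B` are arbitrary choices made for illustration.

```python
# Hypothetical sample space and events, chosen only for illustration.
omega = set(range(1, 11))                  # {1, ..., 10}
A = {x for x in omega if x % 2 == 0}       # even numbers
B = {x for x in omega if x > 6}            # numbers greater than 6


def complement(event: set) -> set:
    """Complement relative to the sample space omega."""
    return omega - event


# Complement laws
union_with_complement = A | complement(A)          # should be omega
intersection_with_complement = A & complement(A)   # should be empty

# De Morgan's laws
lhs_union = complement(A | B)
rhs_union = complement(A) & complement(B)
lhs_inter = complement(A & B)
rhs_inter = complement(A) | complement(B)

print(lhs_union == rhs_union, lhs_inter == rhs_inter)
```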

Probability

Axioms of probability

Any assignment $P$ from subsets of the sample space $E$ to real numbers is a probability measure if the following hold:

Probabilities are nonnegative: $P(A) \ge 0$.

The probability of the whole space is 1: $P(E) = 1$.

Probabilities of a union of disjoint sets: $P(A \cup B) = P(A) + P(B)$, provided $A \cap B = \emptyset$.

Consequences

For any probability measure, the following are true:

Probability of the empty set P (∅) = 0.

Probability of the complement $P(A^c) = 1 - P(A)$.

Conditional probability

Conditional Probability

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$

Probability of $A$, given that $B$ occurred (defined when $P(B) > 0$).

Conditional Probability is Probability P (A|B) is a probability function for any fixed B. Any theorem that holds for probability also holds for conditional probability.

Probability of an Intersection or Union

Intersections via Conditioning

P (A, B) = P (A)P (B|A)
P (A, B, C) = P (A)P (B|A)P (C|A, B)

Unions via Inclusion-Exclusion

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$
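Conditional probability, the intersection formula, and inclusion-exclusion can all be verified by brute-force enumeration on a finite sample space with equally likely outcomes. The sketch below assumes two fair dice; the events `A` (sum is 8) and `B` (first die is even) are hypothetical choices for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered outcomes of two fair dice (equally likely).
omega = list(product(range(1, 7), repeat=2))


def prob(event) -> Fraction:
    """Probability as favorable outcomes over total outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))


def A(w):
    return w[0] + w[1] == 8   # the sum is 8


def B(w):
    return w[0] % 2 == 0      # the first die is even


p_A, p_B = prob(A), prob(B)
p_A_and_B = prob(lambda w: A(w) and B(w))
p_A_or_B = prob(lambda w: A(w) or B(w))
p_A_given_B = p_A_and_B / p_B   # definition of conditional probability

print(p_A_given_B)   # conditional probability of A given B
print(p_A_or_B)      # probability of the union
```

Using `Fraction` keeps every probability exact, so the identities hold with equality rather than up to rounding.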

Law of Total Probability

Assume the $n$ events $A_i$ are pairwise disjoint ($A_i \cap A_j = \emptyset$ for any $i \ne j$) and their union is the whole sample space, and let $B$ be any event. Then:

$P(B) = P(B \mid A_1)P(A_1) + \cdots + P(B \mid A_n)P(A_n) = \sum_{i=1}^{n} P(B \mid A_i)P(A_i)$

Bayes’ Rule

$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$
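Bayes’ rule is usually applied together with the Law of Total Probability, which supplies the denominator. The sketch below uses a diagnostic-test setting; the prevalence, sensitivity, and false-positive rate are made-up numbers for illustration, not data from the original notes.

```python
from fractions import Fraction

# Hypothetical inputs (assumptions for this example only).
p_disease = Fraction(1, 100)             # P(D)
p_pos_given_disease = Fraction(95, 100)  # P(+ | D)
p_pos_given_healthy = Fraction(5, 100)   # P(+ | D^c)

# Law of Total Probability over the partition {D, D^c}.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' rule: P(D | +) = P(+ | D) P(D) / P(+).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(p_pos, p_disease_given_pos, float(p_disease_given_pos))
```

With these assumed numbers the posterior probability stays well below the test’s sensitivity, because the prior P(D) is small.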

Independence

2 Independent Events A and B are independent if knowing whether A occurred gives no information about whether B occurred. More formally, A and B (which have nonzero probability) are independent if and only if one of the following equivalent statements holds:

P (A ∩ B) = P (A)P (B), P (A|B) = P (A), P (B|A) = P (B)

3 Independent Events $A$, $B$ and $C$ are independent if information about two of them gives no information about whether the third one occurred. In other words, $P(A \mid E_B \cap E_C) = P(A)$, where $E_B$ is either $B$, $B^c$, or $E$, and $E_C$ is either $C$, $C^c$, or $E$. The relations obtained by permuting $A$, $B$ and $C$ must also hold.

Conditional Independence A and B are conditionally independent given C if P (A ∩ B|C) = P (A|C)P (B|C). Conditional independence does not imply independence, and independence does not imply conditional independence.

Random Variables

A Random Variable (RV) is a function from a probability space into the real numbers: $X : \Omega \to \mathbb{R}$.

The support of $X$ is the smallest closed set $S$ such that $P(X \in S) = 1$ (morally, the set of values that $X$ can take).

The distribution of the RV is not the probability distribution on $\Omega$, but the induced distribution on the real numbers, $A \mapsto P(X \in A)$. For example, suppose $\Omega$ has a continuous uniform distribution on the interval $[0, 1]$, and $X(w)$ is 1 if $w > 1/2$ and 0 otherwise. Then $X$ has a Bernoulli distribution with $p = 1/2$.

Discrete Random Variables

A RV is discrete if its support is finite or countably infinite (the integers, the positive integers, etc.). Its support is $\{x \in \mathbb{R} : P(X = x) > 0\}$.

PMF, CDF, and Independence

Probability Mass Function (PMF) Gives the probability that a discrete random variable takes on the value x.

pX (x) = P (X = x)

[Figure: PMF of a discrete random variable, plotted for x = 0, 1, 2, 3, 4.]

The PMF satisfies

$p_X(x) \ge 0$ and $\sum_x p_X(x) = 1$

Cumulative Distribution Function (CDF) Gives the probability that a random variable is less than or equal to x.

FX (x) = P (X ≤ x)

[Figure: CDF of a discrete random variable, plotted for x from 0 to 4.]

The CDF is a nondecreasing, right-continuous function with

FX (x) → 0 as x → −∞ and FX (x) → 1 as x → ∞
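The PMF and CDF definitions can be made concrete with a small discrete example. The sketch below assumes X is the number of Heads in 4 fair coin flips (an illustrative choice, taking values 0 to 4) and builds the CDF by summing the PMF.

```python
from fractions import Fraction
from math import comb

n = 4  # number of fair coin flips (assumed for this example)

# PMF: p_X(x) = P(X = x) for x = 0, ..., n.
pmf = {x: Fraction(comb(n, x), 2 ** n) for x in range(n + 1)}


def cdf(t) -> Fraction:
    """F_X(t) = P(X <= t): add up the PMF over support points not exceeding t."""
    return sum((p for x, p in pmf.items() if x <= t), Fraction(0))


total_mass = sum(pmf.values())
cdf_values = [cdf(t) for t in (-1, 0, 1.5, 3, 4, 10)]
print(total_mass, cdf_values)
```

The CDF is a step function: it is flat between support points (so `cdf(1.5)` equals `cdf(1)`) and jumps by $p_X(x)$ at each support point x.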

Independence Intuitively, two random variables are independent if knowing the value of one gives no information about the other. Discrete r.v.s X and Y are independent if for all values of x and y

P (X = x, Y = y) = P (X = x)P (Y = y)

Continuous Random Variables (CRVs)

Probability density function (PDF)

What’s the probability that a CRV is in an interval? Take the difference in CDF values (or use the PDF as described later).

P (a ≤ X ≤ b) = P (X ≤ b) − P (X ≤ a) = FX (b) − FX (a)

For $X \sim N(\mu, \sigma^2)$, this becomes

$P(a \le X \le b) = \Phi\left(\frac{b - \mu}{\sigma}\right) - \Phi\left(\frac{a - \mu}{\sigma}\right)$

What is the Probability Density Function (PDF)? The PDF f is the derivative of the CDF F.

F ′(x) = f (x)

A PDF is nonnegative and integrates to 1. By the fundamental theorem of calculus, to get from PDF back to CDF we can integrate:

$F(x) = \int_{-\infty}^{x} f(t)\,dt$

[Figure: a PDF and the corresponding CDF, each plotted for x from −4 to 4.]

To find the probability that a CRV takes on a value in an interval, integrate the PDF over that interval.

$F(b) - F(a) = \int_a^b f(x)\,dx$
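The two routes to an interval probability (difference of CDF values, or integral of the PDF) can be compared numerically. This sketch assumes a standard Normal; the helper names and the midpoint-rule integrator are illustrative choices, and the CDF is written with `math.erf` using the identity $\Phi(z) = \frac{1}{2}\left(1 + \operatorname{erf}(z/\sqrt{2})\right)$.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2.0 * pi))


def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))


def integrate_midpoint(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


a, b = -1.0, 2.0
by_cdf = normal_cdf(b) - normal_cdf(a)          # F(b) - F(a)
by_pdf = integrate_midpoint(normal_pdf, a, b)   # integral of the PDF over [a, b]
print(by_cdf, by_pdf)
```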

Expected Value and Indicators

Expected Value and Linearity

Expected Value (a.k.a. mean, expectation, or average) is a weighted average of the possible outcomes of our random variable. Mathematically, if $x_1, x_2, x_3, \ldots$ are all of the distinct possible values that a discrete random variable $X$ can take, the expected value of $X$ is

$E(X) = \sum_i x_i P(X = x_i)$

Expected value of a CRV Analogous to the discrete case, where you sum x times the PMF, for CRVs you integrate x times the PDF.

$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$

[Figure: columns of sampled values of X, Y, and X + Y, illustrating that $\frac{1}{n}\sum_{i=1}^{n} x_i + \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\sum_{i=1}^{n} (x_i + y_i)$, that is, $E(X) + E(Y) = E(X + Y)$.]

Linearity For any r.v.s X and Y , and constants a, b, c,

E(aX + bY + c) = aE(X) + bE(Y ) + c

Same distribution implies same mean If X and Y have the same distribution, then E(X) = E(Y ) and, more generally,

E(g(X)) = E(g(Y ))

Conditional Expected Value is defined like expectation, only conditioned on any event A.

$E(X \mid A) = \sum_x x P(X = x \mid A)$

Indicator Random Variables

Indicator Random Variable An indicator random variable takes on the value 1 or 0. It is always an indicator of some event: if the event occurs, the indicator is 1; otherwise it is 0. Indicators are useful for many problems about counting how many events of some kind occur. Write

$I_A = 1$ if $A$ occurs, $I_A = 0$ if $A$ does not occur.

Note that $I_A^2 = I_A$, $I_A I_B = I_{A \cap B}$, and $I_{A \cup B} = I_A + I_B - I_A I_B$.

Distribution $I_A \sim \text{Bern}(p)$ where $p = P(A)$.

Fundamental Bridge The expectation of the indicator for event A is the probability of event A: E(IA) = P (A).
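Indicators combine with linearity to count events. As an illustrative sketch (the example itself is an assumption, not taken from the original notes), let X be the number of fixed points of a uniformly random permutation of n items. Writing X as a sum of indicators $I_{A_j}$, where $A_j$ is the event that position j is a fixed point, gives $E(X) = \sum_j P(A_j)$, which can be checked by enumeration for small n.

```python
from fractions import Fraction
from itertools import permutations

n = 4
perms = list(permutations(range(n)))  # all n! equally likely permutations


def indicator(event_occurs: bool) -> int:
    return 1 if event_occurs else 0


# Fundamental bridge: E(I_{A_j}) = P(A_j), with A_j = "position j is a fixed point".
p_fixed = [
    Fraction(sum(indicator(p[j] == j) for p in perms), len(perms))
    for j in range(n)
]

# Linearity: E(X) is the sum of the indicator expectations.
expected_by_bridge = sum(p_fixed)

# Direct computation: average the count of fixed points over all permutations.
expected_direct = Fraction(
    sum(sum(indicator(p[j] == j) for j in range(n)) for p in perms),
    len(perms),
)
print(expected_by_bridge, expected_direct)
```

Linearity does not require the indicators to be independent, which is why the approach works here even though the events $A_j$ are dependent.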

Variance and Standard Deviation

$\text{Var}(X) = E\left((X - E(X))^2\right) = E(X^2) - (E(X))^2$

$\text{SD}(X) = \sqrt{\text{Var}(X)}$

LOTUS, UoU

LOTUS

Expected value of a function of an r.v. The expected value of $X$ is defined this way:

$E(X) = \sum_x x P(X = x)$ (for discrete $X$)

$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$ (for continuous $X$)

The Law of the Unconscious Statistician (LOTUS) states that you can find the expected value of a function of a random variable, $g(X)$, in a similar way, by replacing the $x$ in front of the PMF/PDF by $g(x)$ but still working with the PMF/PDF of $X$:

$E(g(X)) = \sum_x g(x) P(X = x)$ (for discrete $X$)

$E(g(X)) = \int_{-\infty}^{\infty} g(x) f(x)\,dx$ (for continuous $X$)

What’s a function of a random variable? A function of a random variable is also a random variable. For example, if $X$ is the number of bikes you see in an hour, then $g(X) = 2X$ is the number of bike wheels you see in that hour and $h(X) = \binom{X}{2} = \frac{X(X-1)}{2}$ is the number of pairs of bikes such that you see both of those bikes in that hour.

What’s the point? You don’t need to know the PMF/PDF of g(X) to find its expected value. All you need is the PMF/PDF of X.
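LOTUS can be checked against the long way round, which first builds the PMF of the transformed variable. The sketch below assumes a hypothetical PMF for the number of bikes X (the number of Heads in 4 fair flips, chosen only for convenience) and uses $h(x) = x(x-1)/2$, the number of pairs of bikes.

```python
from collections import defaultdict
from fractions import Fraction
from math import comb

# Hypothetical PMF of X on {0, ..., 4}; any valid PMF would do.
pmf_X = {x: Fraction(comb(4, x), 16) for x in range(5)}


def h(x: int) -> int:
    """Number of pairs of bikes: x choose 2."""
    return x * (x - 1) // 2


# LOTUS: weight h(x) by the PMF of X.
lotus = sum(h(x) * p for x, p in pmf_X.items())

# The long way: build the PMF of h(X), then take its expectation.
pmf_hX = defaultdict(Fraction)
for x, p in pmf_X.items():
    pmf_hX[h(x)] += p   # x = 0 and x = 1 both map to h(x) = 0
direct = sum(v * p for v, p in pmf_hX.items())

print(lotus, direct)
```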

Universality of Uniform (UoU)

When you plug any CRV into its own CDF, you get a Uniform(0,1) random variable. When you plug a Uniform(0,1) r.v. into an inverse CDF, you get an r.v. with that CDF. For example, let’s say that a random variable X has CDF

$F(x) = 1 - e^{-x}$, for $x > 0$

By UoU, if we plug $X$ into this function then we get a uniformly distributed random variable.

$F(X) = 1 - e^{-X} \sim \text{Unif}(0, 1)$

Similarly, if $U \sim \text{Unif}(0, 1)$ then $F^{-1}(U)$ has CDF $F$. The key point is that for any continuous random variable $X$, we can transform it into a Uniform random variable and back by using its CDF.
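The second half of UoU is the basis of inverse-transform sampling. For the CDF $F(x) = 1 - e^{-x}$ the inverse is $F^{-1}(u) = -\ln(1 - u)$. The sketch below is an illustration under assumed choices (seed, sample size): it draws Uniform(0, 1) values, maps them through $F^{-1}$, and checks that the sample mean is near 1, the mean of this distribution.

```python
import math
import random


def F(x: float) -> float:
    """CDF from the example above: F(x) = 1 - exp(-x) for x > 0."""
    return 1.0 - math.exp(-x)


def F_inv(u: float) -> float:
    """Inverse CDF: solve u = 1 - exp(-x) for x."""
    return -math.log(1.0 - u)


rng = random.Random(0)                      # fixed seed, chosen arbitrarily
us = [rng.random() for _ in range(100_000)]  # U ~ Unif(0, 1)
xs = [F_inv(u) for u in us]                  # F^{-1}(U) has CDF F

sample_mean = sum(xs) / len(xs)
round_trip_error = max(abs(F(F_inv(u)) - u) for u in us[:1000])
print(sample_mean, round_trip_error)
```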

Moments

Moments describe the shape of a distribution. Let $X$ have mean $\mu$ and standard deviation $\sigma$, and $Z = (X - \mu)/\sigma$ be the standardized version of $X$. The $k$th moment of $X$ is $\mu_k = E(X^k)$ and the $k$th standardized moment of $X$ is $m_k = E(Z^k)$. The mean, variance, skewness, and kurtosis are important summaries of the shape of a distribution.

Mean $E(X) = \mu_1$

Variance $\text{Var}(X) = \mu_2 - \mu_1^2$

Skewness $\text{Skew}(X) = m_3$

Kurtosis $\text{Kurt}(X) = m_4 - 3$

Binomial Distribution

[Figure: PMF of Bin(10, 1/2), plotted for x from 0 to 10.]

Let us say that X is distributed Bin(n, p). We know the following:

Story X is the number of “successes” that we will achieve in n independent trials, where each trial is either a success or a failure, each with the same probability p of success. We can also write X as a sum of multiple independent Bern(p) random variables. Let X ∼ Bin(n, p) and Xj ∼ Bern(p), where all of the Bernoullis are independent. Then

$X = X_1 + X_2 + X_3 + \cdots + X_n$

Example If Jeremy Lin shoots 10 free throws and each one independently has a 3/4 chance of going in, then the number of free throws he makes is distributed Bin(10, 3/4).

Properties Let X ∼ Bin(n, p), Y ∼ Bin(m, p) with X ⊥⊥ Y.

  • Redefine success n − X ∼ Bin(n, 1 − p)
  • Sum X + Y ∼ Bin(n + m, p)
  • Conditional X|(X + Y = r) ∼ HGeom(n, m, r)
  • Binomial-Poisson Relationship Bin(n, p) is approximately Pois(λ) with λ = np if n is large and p is small.
  • Binomial-Normal Relationship Bin(n, p) is approximately N(np, np(1 − p)) if n is large and p is not near 0 or 1.

Geometric Distribution

Let us say that X is distributed Geom(p). We know the following:

Story X is the number of “trials” that we will repeat before we observe our first success. Our successes have probability p.

Example If each pokeball we throw has probability 1/10 to catch Mew, the number of pokeballs thrown will be distributed Geom(1/10).

Poisson Distribution

Let us say that X is distributed Pois(λ). We know the following:

Story There are rare events (low probability events) that can occur in many different ways (many opportunities to occur) at an average rate of λ occurrences per unit space or time. The number of events that occur in that unit of space or time is X.

Example A certain busy intersection has an average of 2 accidents per month. Since an accident is a low probability event that can happen many different ways, it is reasonable to model the number of accidents in a month at that intersection as Pois(2). Then the number of accidents that happen in two months at that intersection is distributed Pois(4).

Properties Let $X \sim \text{Pois}(\lambda_1)$ and $Y \sim \text{Pois}(\lambda_2)$, with X ⊥⊥ Y.

  1. Sum $X + Y \sim \text{Pois}(\lambda_1 + \lambda_2)$
  2. Conditional $X \mid (X + Y = n) \sim \text{Bin}\left(n, \frac{\lambda_1}{\lambda_1 + \lambda_2}\right)$
  3. Chicken-egg If there are $Z \sim \text{Pois}(\lambda)$ items and we randomly and independently “accept” each item with probability p, then the number of accepted items $Z_1 \sim \text{Pois}(\lambda p)$, and the number of rejected items $Z_2 \sim \text{Pois}(\lambda(1 - p))$, and $Z_1$ ⊥⊥ $Z_2$.

Continuous Distributions

Uniform Distribution

Let us say that U is distributed Unif(a, b). We know the following:

Properties of the Uniform For a Uniform distribution, the probability of a draw from any interval within the support is proportional to the length of the interval. See Universality of Uniform and Order Statistics for other properties.

Example William throws darts really badly, so his darts are uniform over the whole room because they’re equally likely to appear anywhere. William’s darts have a Uniform distribution on the surface of the room. The Uniform is the only distribution where the probability of hitting in any specific region is proportional to the length/area/volume of that region, and where the density of occurrence in any one specific spot is constant throughout the whole support.

Normal Distribution

Let us say that X is distributed $N(\mu, \sigma^2)$. We know the following:

Central Limit Theorem The Normal distribution is ubiquitous because of the Central Limit Theorem, which states that the sample mean of i.i.d. r.v.s with finite variance will approach a Normal distribution as the sample size grows, regardless of the initial distribution.

Location-Scale Transformation Every time we shift a Normal r.v. (by adding a constant) or rescale a Normal (by multiplying by a nonzero constant), we change it to another Normal r.v. For any Normal $X \sim N(\mu, \sigma^2)$, we can transform it to the standard N(0, 1) by the following transformation:

$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$

Standard Normal The Standard Normal, Z ∼ N (0, 1), has mean 0 and variance 1. Its CDF is denoted by Φ.

Exponential Distribution

Let us say that X is distributed Expo(λ). We know the following:

Story You’re sitting on an open meadow right before the break of dawn, wishing that airplanes in the night sky were shooting stars, because you could really use a wish right now. You know that shooting stars come on average every 15 minutes, but a shooting star is not “due” to come just because you’ve waited so long. Your waiting time is memoryless; the additional time until the next shooting star comes does not depend on how long you’ve waited already.

Example The waiting time until the next shooting star is distributed Expo(4) hours. Here λ = 4 is the rate parameter, since shooting stars arrive at a rate of 1 per 1/4 hour on average. The expected time until the next shooting star is 1/λ = 1/4 hour.

Expos as a rescaled Expo(1) $Y \sim \text{Expo}(\lambda) \to X = \lambda Y \sim \text{Expo}(1)$

Memorylessness The Exponential Distribution is the only continuous memoryless distribution. The memoryless property says that for X ∼ Expo(λ) and any positive numbers s and t,

$P(X > s + t \mid X > s) = P(X > t)$

Equivalently,

$X - a \mid (X > a) \sim \text{Expo}(\lambda)$

For example, a product with an Expo(λ) lifetime is always “as good as new” (it doesn’t experience wear and tear). Given that the product has survived a years, the additional time that it will last is still Expo(λ).

Min of Expos If we have independent $X_i \sim \text{Expo}(\lambda_i)$, then $\min(X_1, \ldots, X_k) \sim \text{Expo}(\lambda_1 + \lambda_2 + \cdots + \lambda_k)$.

Max of Expos If we have i.i.d. $X_i \sim \text{Expo}(\lambda)$, then $\max(X_1, \ldots, X_k)$ has the same distribution as $Y_1 + Y_2 + \cdots + Y_k$, where $Y_j \sim \text{Expo}(j\lambda)$ and the $Y_j$ are independent.

Gamma Distribution

[Figure: PDFs of Gamma(3, 1), Gamma(3, 0.5), Gamma(10, 1), and Gamma(5, 0.5), each plotted for x from 0 to 20.]

Let us say that X is distributed Gamma(a, λ). We know the following:

Story You sit waiting for shooting stars, where the waiting time for a star is distributed Expo(λ). You want to see n shooting stars before you go home. The total waiting time for the nth shooting star is Gamma(n, λ).

Example You are at a bank, and there are 3 people ahead of you. The serving time for each person is Exponential with mean 2 minutes. Only one person at a time can be served. The distribution of your waiting time until it’s your turn to be served is Gamma(3, 1/2).

χ² (Chi-Square) Distribution

Let us say that X is distributed $\chi^2_n$. We know the following:

Story A Chi-Square(n) is the sum of the squares of n independent standard Normal r.v.s.

Properties and Representations

X is distributed as $Z_1^2 + Z_2^2 + \cdots + Z_n^2$ for i.i.d. $Z_i \sim N(0, 1)$

$X \sim \text{Gamma}(n/2, 1/2)$

LLN, CLT

Law of Large Numbers (LLN)

Let $X_1, X_2, X_3, \ldots$ be i.i.d. with mean $\mu$. The sample mean is

$\bar{X}_n = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n}$

The Law of Large Numbers states that as $n \to \infty$, $\bar{X}_n \to \mu$ with probability 1. For example, in flips of a coin with probability p of Heads, let $X_j$ be the indicator of the jth flip being Heads. Then LLN says the proportion of Heads converges to p (with probability 1).

Central Limit Theorem (CLT)

Approximation using CLT

We use $\dot{\sim}$ to denote is approximately distributed. We can use the Central Limit Theorem to approximate the distribution of a random variable $Y = X_1 + X_2 + \cdots + X_n$ that is a sum of n i.i.d. random variables $X_i$. Let $E(Y) = \mu_Y$ and $\text{Var}(Y) = \sigma_Y^2$. The CLT says

$Y \mathrel{\dot{\sim}} N(\mu_Y, \sigma_Y^2)$

If the $X_i$ are i.i.d. with mean $\mu_X$ and variance $\sigma_X^2$, then $\mu_Y = n\mu_X$ and $\sigma_Y^2 = n\sigma_X^2$. For the sample mean $\bar{X}_n$, the CLT says

$\bar{X}_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n) \mathrel{\dot{\sim}} N(\mu_X, \sigma_X^2/n)$

Asymptotic Distributions using CLT

We use $\xrightarrow{D}$ to denote converges in distribution to as $n \to \infty$. The CLT says that if we standardize the sum $X_1 + \cdots + X_n$ then the distribution of the standardized sum converges to N(0, 1) as $n \to \infty$:

$\frac{1}{\sigma_X \sqrt{n}} (X_1 + \cdots + X_n - n\mu_X) \xrightarrow{D} N(0, 1)$

In other words, the CDF of the left-hand side goes to the standard Normal CDF, Φ. In terms of the sample mean, the CLT says

$\frac{\sqrt{n}(\bar{X}_n - \mu_X)}{\sigma_X} \xrightarrow{D} N(0, 1)$
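A quick simulation makes the standardization concrete. The sketch below assumes i.i.d. Unif(0, 1) summands (mean 1/2, variance 1/12) and arbitrary choices of n, number of repetitions, and seed. It standardizes each simulated sum and checks that roughly 95% of the values fall within ±1.96, as they would for N(0, 1).

```python
import random
from math import sqrt

rng = random.Random(0)          # fixed seed, chosen arbitrarily
n, reps = 30, 5000              # summands per sum, number of simulated sums
mu_X, sigma_X = 0.5, sqrt(1 / 12)   # mean and SD of Unif(0, 1)


def standardized_sum() -> float:
    """(X_1 + ... + X_n - n * mu_X) / (sigma_X * sqrt(n)) for Unif(0, 1) draws."""
    total = sum(rng.random() for _ in range(n))
    return (total - n * mu_X) / (sigma_X * sqrt(n))


zs = [standardized_sum() for _ in range(reps)]
frac_within = sum(1 for z in zs if abs(z) <= 1.96) / reps
mean_z = sum(zs) / reps
print(frac_within, mean_z)
```

The agreement is approximate for finite n; how large n must be depends on the shape of the summand distribution.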

Continuous Multivariate Distributions

Joint Probability density $f(x, y)$; $P((X, Y) \in A) = \iint_A f(x, y)\,dx\,dy$.

Marginal density $f_X(x) = \int_{\mathbb{R}} f(x, y)\,dy$; $P(X \in C) = \int_C f_X(x)\,dx$.

Multivariate Uniform Distribution

See the univariate Uniform for stories and examples. For the 2D Uniform on some region, probability is proportional to area. Every point in the support has equal density, of value $\frac{1}{\text{area of region}}$. For the 3D Uniform, probability is proportional to volume.

Multivariate Normal (MVN) Distribution

A vector $\vec{X} = (X_1, X_2, \ldots, X_d)$ is Multivariate Normal if every linear combination is Normally distributed, i.e., $t_1 X_1 + t_2 X_2 + \cdots + t_d X_d$ is Normal for any constants $t_1, t_2, \ldots, t_d$. The parameters of the Multivariate Normal are the mean vector $\vec{\mu} = (\mu_1, \mu_2, \ldots, \mu_d)$ and the covariance matrix $\Sigma$ where the $(i, j)$ entry is $\text{Cov}(X_i, X_j)$.

Properties The Multivariate Normal has the following properties.

  • Any subvector is also MVN.
  • If any two elements within an MVN are uncorrelated, then they are independent.
  • The joint PDF of a Multivariate Normal (with invertible $\Sigma$) is: $f(x) = \left((2\pi)^d \det\Sigma\right)^{-1/2} \exp\left(-\frac{1}{2}(x - \mu)' \Sigma^{-1} (x - \mu)\right)$

Distribution Properties

Convolutions of Random Variables

A convolution of n random variables is simply their sum. For the following results, let X and Y be independent.

  1. $X \sim \text{Pois}(\lambda_1)$, $Y \sim \text{Pois}(\lambda_2)$ −→ $X + Y \sim \text{Pois}(\lambda_1 + \lambda_2)$
  2. $X \sim \text{Bin}(n_1, p)$, $Y \sim \text{Bin}(n_2, p)$ −→ $X + Y \sim \text{Bin}(n_1 + n_2, p)$. Bin(n, p) can be thought of as a sum of i.i.d. Bern(p) r.v.s.
  3. $X \sim \text{Gamma}(a_1, \lambda)$, $Y \sim \text{Gamma}(a_2, \lambda)$ −→ $X + Y \sim \text{Gamma}(a_1 + a_2, \lambda)$. Gamma(n, λ) with n an integer can be thought of as a sum of i.i.d. Expo(λ) r.v.s.
  4. $X \sim N(\mu_1, \sigma_1^2)$, $Y \sim N(\mu_2, \sigma_2^2)$ −→ $X + Y \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$

Special Cases of Distributions

  1. Bin(1, p) ∼ Bern(p)
  2. Beta(1, 1) ∼ Unif(0, 1)
  3. Gamma(1, λ) ∼ Expo(λ)

Inequalities

  1. Cauchy-Schwarz $|E(XY)| \le \sqrt{E(X^2)E(Y^2)}$
  2. Markov $P(X \ge a) \le \frac{E|X|}{a}$ for $a > 0$
  3. Chebyshev $P(|X - \mu| \ge a) \le \frac{\sigma^2}{a^2}$ for $E(X) = \mu$, $\text{Var}(X) = \sigma^2$
  4. Jensen $E(g(X)) \ge g(E(X))$ for g convex; reverse if g is concave

Miscellaneous Definitions

Precision The precision of a distribution is the inverse of the variance: $\tau = \frac{1}{\sigma^2}$

Mode The mode of a discrete distribution is the point in the support that maximizes the PMF. The mode of a continuous distribution is the point in the support that maximizes the PDF.

Medians and Quantiles Let X have CDF F. Then X has median m if $F(m) \ge 0.5$ and $P(X \ge m) \ge 0.5$. For X continuous, m satisfies $F(m) = 1/2$. In general, the ath quantile of X is $\min\{x : F(x) \ge a\}$; the median is the case a = 1/2.

log Statisticians generally use log to refer to natural log (i.e., base e).

i.i.d r.v.s Independent, identically-distributed random variables.

Gamma and Beta Integrals

You can sometimes solve complicated-looking integrals by pattern-matching to a gamma or beta integral:

$\int_0^\infty x^{t-1} e^{-x}\,dx = \Gamma(t)$

$\int_0^1 x^{a-1}(1 - x)^{b-1}\,dx = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}$

Also, Γ(a + 1) = aΓ(a), and Γ(n) = (n − 1)! if n is a positive integer.

Maximum likelihood

The RV X follows a parametric distribution X ∼ D(λ). We don’t know λ, but we have n independent observations $\{x_j\}_{j=1}^n$ from X. The likelihood of λ is:

  • If D(λ) is discrete with mass function $p_\lambda(x)$:

$L\left(\lambda \mid \{x_j\}_{j=1}^n\right) = P\left(\{x_j\}_{j=1}^n \mid \lambda\right) = \prod_{j=1}^n p_\lambda(x_j)$

  • If D(λ) is continuous with density $f_\lambda(x)$:

$L\left(\lambda \mid \{x_j\}_{j=1}^n\right) = \prod_{j=1}^n f_\lambda(x_j)$

The maximum likelihood estimator of λ is the value λ* that maximizes the likelihood:

$\lambda^* = \operatorname{argmax}_\lambda L\left(\lambda \mid \{x_j\}_{j=1}^n\right)$
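As a worked illustration, take D(λ) = Expo(λ) with density $f_\lambda(x) = \lambda e^{-\lambda x}$. The log-likelihood is $n \ln\lambda - \lambda \sum_j x_j$, and setting its derivative to zero gives $\lambda^* = n / \sum_j x_j$. The sketch below compares this closed form with a brute-force grid search; the data values and the grid are made up for the example.

```python
from math import log

# Hypothetical observations, assumed to come from an Expo(lambda) distribution.
data = [0.2, 1.1, 0.5, 0.9, 0.3]


def log_likelihood(lam: float, xs) -> float:
    """Log of the product of densities lam * exp(-lam * x) over the sample."""
    return sum(log(lam) - lam * x for x in xs)


# Closed-form maximizer: n / sum(x_j), i.e. one over the sample mean.
closed_form = len(data) / sum(data)

# Brute-force check on a grid of candidate values 0.01, 0.02, ..., 5.00.
grid = [i / 100 for i in range(1, 501)]
grid_mle = max(grid, key=lambda lam: log_likelihood(lam, data))

print(closed_form, grid_mle)
```

Maximizing the log-likelihood instead of the likelihood gives the same λ*, because the logarithm is increasing, and it avoids underflow when n is large.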

Conjugate families

In the Bayesian approach to statistics, parameters are uncertain, so we assign a probability distribution to them. The prior for a parameter is its distribution before observing data. The posterior is the distribution for the parameter after observing data.

The Beta family The Beta is a parametric family of distributions depending on two parameters a, b, used to represent uncertainty about a real number p known to lie in the interval [0, 1] (for instance, a probability).

$f(x) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} x^{a-1}(1 - x)^{b-1}$, $x \in (0, 1)$

[Figure: PDFs of Beta(0.5, 0.5), Beta(2, 1), Beta(2, 8), and Beta(5, 5), each plotted for x from 0 to 1.]

Beta is the Conjugate Prior of Bernoulli experiments Beta is the conjugate prior of the Binomial because if you have a Beta-distributed prior on p in a Binomial, then the posterior distribution on p given the Binomial data is also Beta-distributed. Consider the following two-level model:

X|p ∼ Bin(n, p)
p ∼ Beta(a, b)

Then after observing X = x, we get the posterior distribution

p|(X = x) ∼ Beta(a + x, b + n − x).
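The update rule can be sanity-checked numerically: multiplying the prior density by the Binomial likelihood on a grid and normalizing should give the same posterior mean as the Beta(a + x, b + n − x) formula, whose mean is $(a + x)/(a + b + n)$. The prior parameters and data below are hypothetical, and the helper names are illustrative.

```python
def beta_binomial_update(a, b, n, x):
    """Posterior Beta parameters after observing x successes in n trials."""
    return a + x, b + n - x


def grid_posterior_mean(a, b, n, x, cells=10_000):
    """Posterior mean from prior * likelihood on a midpoint grid over (0, 1).

    Normalizing constants cancel in the ratio, so they are omitted.
    """
    h = 1.0 / cells
    num = den = 0.0
    for i in range(cells):
        p = (i + 0.5) * h
        weight = p ** (a - 1) * (1 - p) ** (b - 1) * p ** x * (1 - p) ** (n - x)
        num += p * weight
        den += weight
    return num / den


# Hypothetical prior Beta(2, 2) and data: 7 successes in 10 trials.
a_post, b_post = beta_binomial_update(2, 2, 10, 7)
posterior_mean = a_post / (a_post + b_post)
numeric_mean = grid_posterior_mean(2, 2, 10, 7)
print((a_post, b_post), posterior_mean, numeric_mean)
```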

Beta is also the conjugate prior of the Geometric: if you have a Beta-distributed prior on p, and the experiment follows a Geometric distribution based on p:

Y|p ∼ Geom(p)
p ∼ Beta(a, b)

Then after observing Y = y (the number of trials up to and including the first success), we get the posterior distribution

p|(Y = y) ∼ Beta(a + 1, b + y − 1).

Gamma is the Conjugate Prior of a Poisson Process If our uncertainty for the rate λ of a Poisson process is modelled with a Gamma(α, β), and we count x observations on a time interval of length T, then our posterior follows λ ∼ Gamma(α + x, β + T).

Maximum A Posteriori (MAP) The MAP estimator is the mode of the posterior. It can be regarded as a smoothed version of the maximum likelihood estimator.

Objective priors

In absence of prior information, it is customary to use a prior that carries as little information as possible. These are called objective priors. There are several notions of objective prior, but most of them are improper: they are not true probability distributions, so we don’t really have prior probabilistic information. However, an improper prior can be updated with data to provide a proper posterior : a true probability distribution that can answer probabilistic questions and give expected values. Usually, the objective prior can be interpreted as a limiting case of the conjugate family, and the updating rule for the conjugate family still holds.

scipy.stats

A frozen distribution N = scipy.stats.norm(loc=mean, scale=std)

Random sample of size M N.rvs(M)

Mean N.mean()

Variance N.var()

Distribution function at points xs (array) N.cdf(xs)

Density function at points xs (if continuous) N.pdf(xs)

Mass function at points xs (if discrete) N.pmf(xs)

Percentiles ps N.ppf(ps)

pandas

Create a dataframe:

import pandas as pd

df = pd.DataFrame(
    data={
        "calculus": [10, 5, 8, 7],
        "algebra": [8, 7, 6, 5],
        "probability": [7, 6, 6, 8],
    },
    index=["Jaimita", "Fulanito", "Menganito", "Zutanita"],
)

Browse first rows df.head(2)

Summary of column types df.info()

Column statistics df.describe(include="all")

Selecting a column df["calculus"] (the result is a Series)

max of a Series df["calculus"].max()

mean of a Series df["calculus"].mean()

std of a Series df["calculus"].std()

Selecting columns df[["calculus", "probability"]]

Selecting rows by index df.loc[["Jaimita", "Fulanito"]]

Selecting rows by row number df.iloc[1:3]

Selecting rows and columns df.loc[ list of indices, list of columns]

Selecting rows by condition df[df["calculus"]>7]

Plot histogram df["calculus"].hist()

Scatter plot df.plot.scatter("algebra", "calculus")

Drop rows df.drop(["Jaimita", "Fulanito"], inplace=True)

Drop columns df.drop(columns=["calculus", "probability"], inplace=True)

Read a csv file advertising = pd.read_csv("advertising.csv", usecols=[1,2,3,4])

scikit-learn

Fit a linear model, print R^2 score:

import sklearn.linear_model as skl_lm

regr = skl_lm.LinearRegression()
X = advertising[["TV", "Radio", "Newspaper"]]
y = advertising["Sales"]
regr.fit(X, y)
print(regr.score(X, y))  # R^2 on the data used for fitting

Make predictions

advertising_future = pd.DataFrame(
    [
        [100, 30, 30],
        [100, 40, 30],
    ],
    columns=["TV", "Radio", "Newspaper"],
)
regr.predict(advertising_future)

Fit a polynomial model, split randomly into train and test sets:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# auto: a DataFrame with "horsepower" and "mpg" columns, assumed to be loaded already
poly = PolynomialFeatures(degree=2)
X = poly.fit_transform(auto[["horsepower"]])
y = auto["mpg"]
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25)
regr = skl_lm.LinearRegression()
regr.fit(Xtrain, ytrain)
print(regr.score(Xtest, ytest))  # R^2 on the held-out test set
regr.predict(poly.transform([[250]]))  # reuse the fitted transformer for new data

statsmodels

import statsmodels.formula.api as smf

regr = smf.ols("Sales ~ TV + Radio", advertising).fit()
regr.predict(advertising_future)
regr.summary()

R-squared $R^2 = 1 - \frac{RSS}{TSS}$, where $RSS = \sum_j (y_j - f(x_j))^2$ and $TSS = \sum_j (y_j - \bar{y})^2$. Never larger than 1. The larger the better.

adjusted R-squared Adjusted-$R^2 = 1 - \frac{RSS/(n - p - 1)}{TSS/(n - 1)}$, where n is the number of data points, and p is the number of explanatory variables. The larger the better.

AIC Akaike Information Criterion. The smaller the better. The absolute value is not important. A difference of ≈ 1.4 between the AIC of model A and the AIC of model B means that the model with the smaller AIC is twice as likely as the other to minimize information loss, regardless of the magnitude of the AIC.

BIC Bayes Information Criterion. The smaller the better. As for AIC, only the differences in BIC matter, and not their absolute value.

Intercept Independent term in the linear model.

P>|t| p-value for the t-statistic of each coefficient. If one of them is greater than 0.05, you should consider removing that explanatory variable.

[0.025 0.975] confidence interval for each coefficient. Values in the interval are not “unreasonable”.
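The two formulas can be computed by hand to see what `score` and `summary` report. The sketch below uses made-up observations and predictions; the function names are illustrative and not part of scikit-learn or statsmodels.

```python
def rss_tss(y, y_hat):
    """Residual sum of squares and total sum of squares."""
    y_bar = sum(y) / len(y)
    rss = sum((yj - fj) ** 2 for yj, fj in zip(y, y_hat))
    tss = sum((yj - y_bar) ** 2 for yj in y)
    return rss, tss


def r_squared(y, y_hat):
    rss, tss = rss_tss(y, y_hat)
    return 1 - rss / tss


def adjusted_r_squared(y, y_hat, p):
    """p is the number of explanatory variables (intercept not counted)."""
    n = len(y)
    rss, tss = rss_tss(y, y_hat)
    return 1 - (rss / (n - p - 1)) / (tss / (n - 1))


# Hypothetical observations and model predictions.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r_squared(y, y_hat)
adj_r2 = adjusted_r_squared(y, y_hat, p=1)
print(r2, adj_r2)
```

The adjusted value is never above the plain $R^2$, since it penalizes each additional explanatory variable.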

Computations with one-dimensional Random Variables

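Each object of interest below is given for four cases: finite, infinite discrete, continuous, and sample.

Support
  Finite: a finite set $F$.
  Infinite discrete: an infinite but countable set $I$, e.g. $\mathbb{N}, \mathbb{Z}, \dots$
  Continuous: a subset $S$ of $\mathbb{R}$, e.g. $\mathbb{R}$, $(0, \infty)$, $(a, b)$.
  Sample: a sample of size $N$, $\{x_1, \dots, x_N\}$.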
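
$P(X \in A)$, the probability of $A$
  Finite: $\sum_{k \in A \cap F} p_X(k)$, where $p_X$ is the mass function.
  Infinite discrete: $\sum_{k \in A \cap I} p_X(k)$, where $p_X$ is the mass function.
  Continuous: $\int_{A \cap S} f_X(x)\,dx$, where $f_X$ is the density function.
  Sample: $P(A) \approx P_{\text{sample}}(A) = \frac{\text{number of } x_i \text{ that lie in } A}{N}$.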

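$P(X \le t)$
  Finite, infinite discrete, and continuous: $F_X(t)$, where $F_X$ is the distribution function.
  Sample: $P(X \le t) \approx F_{\text{sample}}(t) = \frac{\text{number of } x_i \text{ not larger than } t}{N}$. This is faster to compute if the sample is ordered.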
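
To illustrate why an ordered sample helps, here is a small sketch of the sample distribution function using binary search. The function name `empirical_cdf` and the data are assumptions made for the example.

```python
from bisect import bisect_right


def empirical_cdf(sorted_sample, t):
    """Fraction of sample points x_i with x_i <= t; sorted_sample must be in ascending order."""
    # bisect_right counts the points <= t in O(log N) instead of scanning all N points
    return bisect_right(sorted_sample, t) / len(sorted_sample)


sample = sorted([3.1, 0.4, 2.2, 5.0, 1.7])   # sort once, then query many times
value = empirical_cdf(sample, 2.2)           # three of the five points are <= 2.2
```

Sorting costs $O(N \log N)$ once; after that each evaluation of the sample distribution function is a binary search rather than a full pass over the data.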
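
$g(X)$, the transformation of $X$ by $g$, when $g$ is injective
  Finite: $g(X)$ is finite, with $p_{g(X)}(k) = p_X(g^{-1}(k))$.
  Infinite discrete: $g(X)$ is discrete infinite, with $p_{g(X)}(k) = p_X(g^{-1}(k))$.
  Continuous: $g(X)$ is continuous if $g$ is smooth, with $f_{g(X)}(x) = f_X(g^{-1}(x)) \left| (g^{-1})'(x) \right|$.
  Sample: $\{g(x_1), \dots, g(x_N)\}$ is a sample of $g(X)$ of size $N$.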
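
The continuous rule for an injective smooth $g$, density of $g(X)$ equal to $f_X(g^{-1}(x))$ times $|(g^{-1})'(x)|$, can be checked numerically. The sketch below assumes $X$ uniform on $(0, 1)$ and $g(x) = x^2$, both chosen only for illustration, and compares the formula with a transformed sample.

```python
import math
import random


def density_of_transform(f_x, g_inv, g_inv_prime):
    """Density of g(X) for injective smooth g, given f_X, the inverse of g and its derivative."""
    return lambda y: f_x(g_inv(y)) * abs(g_inv_prime(y))


def f_x(x):
    """Density of the uniform distribution on (0, 1)."""
    return 1.0 if 0.0 < x < 1.0 else 0.0


# g(x) = x^2 on (0, 1), so the inverse is sqrt(y) with derivative 1 / (2 sqrt(y))
f_y = density_of_transform(f_x, math.sqrt, lambda y: 0.5 / math.sqrt(y))
density_at_quarter = f_y(0.25)

# Sample check: {g(x_1), ..., g(x_N)} is a sample of g(X)
rng = random.Random(2)
ys = [rng.random() ** 2 for _ in range(20_000)]
frac_below = sum(y <= 0.25 for y in ys) / len(ys)   # exact value: integral of f_y on (0, 0.25) = 0.5
```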

$g(X)$, the transformation of $X$ by $g$, when $g$ is not injective
  Finite, infinite discrete, and continuous: can get complicated.
  Sample: $\{g(x_1), \dots, g(x_N)\}$ is a sample of $g(X)$ of size $N$.
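
$X|A$, conditioning the RV $X$ by the event $A$
  Finite: $X|A$ is finite, with $p_{X|A}(k) = \frac{p_X(k)}{P(A)}$ for $k \in A$, and 0 otherwise.
  Infinite discrete: $X|A$ is discrete, with $p_{X|A}(k) = \frac{p_X(k)}{P(A)}$ for $k \in A$, and 0 otherwise.
  Continuous: $X|A$ is continuous, with $f_{X|A}(x) = \frac{f_X(x)}{P(A)}$ for $x \in A$, and 0 otherwise.
  Sample: filter $\{x_1, \dots, x_N\}$, keeping only the $x_j$ that lie in $A$, to get a sample of $X|A$ of size smaller than $N$.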
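
Conditioning a sample is just filtering. The sketch below assumes a standard Gaussian $X$ and the event $A = \{X > 0\}$, both picked for illustration, and uses a fixed seed so the run is reproducible.

```python
import random

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(10_000)]   # sample of X

# Keep only the points that lie in A = {X > 0}: this is a sample of X|A
conditioned = [x for x in sample if x > 0]

p_a = len(conditioned) / len(sample)                 # estimate of P(A), exact value 0.5
mean_given_a = sum(conditioned) / len(conditioned)   # estimate of E[X|A], exact value sqrt(2/pi)
```

The conditioned sample is smaller than the original one, so estimates based on it are noisier, especially when $P(A)$ is small.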
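
$E[X]$, the expectation of $X$
  Finite: $\sum_{k \in F} k\, p_X(k)$, a finite sum.
  Infinite discrete: $\sum_{k \in I} k\, p_X(k)$, an infinite series.
  Continuous: $\int_S x\, f_X(x)\,dx$, an integral.
  Sample: $E[X] \approx \text{sample mean} = \frac{\sum_{i=1}^{N} x_i}{N}$.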

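$E[g(X)]$, the expectation of $g(X)$
  Finite: $\sum_{k \in F} g(k)\, p_X(k)$, a finite sum.
  Infinite discrete: $\sum_{k \in I} g(k)\, p_X(k)$, an infinite series.
  Continuous: $\int_S g(x)\, f_X(x)\,dx$, an integral.
  Sample: $E[g(X)] \approx \frac{\sum_{i=1}^{N} g(x_i)}{N}$.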
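
The finite sum and the sample approximation of an expectation can be compared side by side. This sketch assumes a fair six-sided die and $g(k) = k^2$ as the example; the helper name `expectation` is hypothetical.

```python
import random
from fractions import Fraction

# Finite case: mass function of a fair six-sided die
pmf = {k: Fraction(1, 6) for k in range(1, 7)}


def expectation(pmf, g=lambda k: k):
    """Finite sum of g(k) * p_X(k) over the support."""
    return sum(g(k) * p for k, p in pmf.items())


exact_mean = expectation(pmf)                      # E[X]
exact_second = expectation(pmf, lambda k: k * k)   # E[X^2]

# Sample case: average x_i and g(x_i) over simulated rolls
rng = random.Random(1)
rolls = [rng.randint(1, 6) for _ in range(20_000)]
approx_mean = sum(rolls) / len(rolls)
approx_second = sum(x * x for x in rolls) / len(rolls)
```

Using `Fraction` keeps the finite sums exact, so the only error in the comparison comes from the sample.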

$X + Y$, the sum of the RVs $X$ and $Y$
  Finite, infinite discrete, and continuous: can get complicated (involves "convolutions"), except in a few special cases.
  Sample: $\{x_1 + y_1, \dots, x_N + y_N\}$ is a sample of $X + Y$ of size $N$.
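
For finite RVs the convolution is a double loop over both supports. The sketch below assumes that $X$ and $Y$ are independent, which the formula requires, and uses two fair dice as a made-up example; `convolve` is a hypothetical helper name.

```python
from collections import defaultdict
from fractions import Fraction


def convolve(pmf_x, pmf_y):
    """Mass function of X + Y for independent finite RVs X and Y."""
    out = defaultdict(Fraction)   # missing keys start at Fraction(0)
    for a, pa in pmf_x.items():
        for b, pb in pmf_y.items():
            out[a + b] += pa * pb   # P(X = a, Y = b) = P(X = a) P(Y = b) by independence
    return dict(out)


die = {k: Fraction(1, 6) for k in range(1, 7)}
two_dice = convolve(die, die)     # mass function of the sum of two dice
```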

$X \sim D(Y, Z, \dots)$: $X$ follows a parametric distribution whose parameters $Y, Z, \dots$ are RVs
  Finite, infinite discrete, and continuous: rather complicated, except in a few special cases.
  Sample: first sample $y_j$ from $Y$, $z_j$ from $Z$, ...; then sample $x_j$ from $D(y_j, z_j, \dots)$. Then $\{x_1, \dots, x_N\}$ is a sample of $X$ of size $N$.
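
The two-stage sampling recipe can be sketched as follows. The choice of a Beta(2, 2) parameter and a binomial with 10 trials is an assumption made only for this example.

```python
import random

rng = random.Random(42)
N, n_trials = 5_000, 10

sample = []
for _ in range(N):
    p_j = rng.betavariate(2.0, 2.0)                          # first sample the parameter p_j
    x_j = sum(rng.random() < p_j for _ in range(n_trials))   # then sample x_j from Bin(p_j, n_trials)
    sample.append(x_j)

# {x_1, ..., x_N} is a sample of X; its mean should be close to n_trials * E[p] = 5
sample_mean = sum(sample) / N
```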

Maximum likelihood and Conjugate distributions

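Each case below lists the data, the likelihood, the unknown parameters, the maximum likelihood estimate, the conjugate prior, the conjugate posterior, and the MAP estimate.

A single Bernoulli trial ($x$ is 0 or 1)
  Likelihood: Bernoulli, $X \sim \text{Bern}(p)$.
  Unknown parameter: a probability $p \in [0, 1]$.
  Max likelihood: $\hat{p} = x$.
  Conjugate prior: $p \sim \text{Beta}(a, b)$.
  Conjugate posterior: $p \sim \text{Beta}(a, b + 1)$ if $x = 0$; $p \sim \text{Beta}(a + 1, b)$ if $x = 1$.
  MAP: $p = \frac{a + x - 1}{a + b - 1}$.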
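
$n$ Bernoulli trials ($x_j$ is 0 or 1, $j \in \{1, \dots, n\}$)
  Likelihood: Bernoulli, $X_j \sim \text{Bern}(p)$.
  Unknown parameter: a probability $p \in [0, 1]$.
  Max likelihood: $\hat{p} = \frac{\sum_{j=1}^{n} x_j}{n}$.
  Conjugate prior: $p \sim \text{Beta}(a, b)$.
  Conjugate posterior: $p \sim \text{Beta}(a + e, b + f)$, with $e = \sum_{j=1}^{n} x_j$ successes and $f = n - \sum_{j=1}^{n} x_j$ failures.
  MAP: $p = \frac{a + \sum_{j=1}^{n} x_j - 1}{a + b + n - 2}$.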
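
A minimal sketch of the Beta update for $n$ Bernoulli trials, with made-up data and a Beta(2, 2) prior chosen only for illustration. The helper names are hypothetical, and the mode formula assumes both Beta parameters are larger than 1.

```python
def beta_bernoulli_update(a, b, data):
    """Posterior Beta parameters after observing a list of 0/1 outcomes."""
    successes = sum(data)
    failures = len(data) - successes
    return a + successes, b + failures


def beta_mode(a, b):
    """Mode of Beta(a, b), valid when a > 1 and b > 1."""
    return (a - 1) / (a + b - 2)


data = [1, 0, 1, 1, 0, 1, 1, 1]                      # 6 successes, 2 failures
mle = sum(data) / len(data)                          # maximum likelihood estimate
a_post, b_post = beta_bernoulli_update(2, 2, data)   # posterior Beta parameters
map_estimate = beta_mode(a_post, b_post)             # (a + sum x_j - 1) / (a + b + n - 2)
```

Here the MAP estimate (0.7) sits between the prior mode (0.5) and the maximum likelihood estimate (0.75), which is the usual effect of an informative prior.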

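A binomial experiment with $n$ items ($x \in \{0, \dots, n\}$)
  Likelihood: Binomial, $X \sim \text{Bin}(p, n)$.
  Unknown parameter: a probability $p \in [0, 1]$.
  Max likelihood: $\hat{p} = \frac{x}{n}$.
  Conjugate prior: $p \sim \text{Beta}(a, b)$.
  Conjugate posterior: $p \sim \text{Beta}(a + x, b + f)$, with $x$ successes and $f = n - x$ failures.
  MAP: $p = \frac{a + x - 1}{a + b + n - 2}$.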
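
A single geometric experiment ($x \in \{1, 2, \dots\}$)
  Likelihood: Geometric, $X \sim \text{Geom}(p)$.
  Unknown parameter: a probability $p \in [0, 1]$.
  Max likelihood: $\hat{p} = \frac{1}{x}$.
  Conjugate prior: $p \sim \text{Beta}(a, b)$.
  Conjugate posterior: $p \sim \text{Beta}(a + 1, b + f)$, with 1 success and $f = x - 1$ failures.
  MAP: $p = \frac{a}{a + b + x - 2}$.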

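$n$ geometric experiments ($x_j \in \{1, 2, \dots\}$, $j \in \{1, \dots, n\}$)
  Likelihood: Geometric, $X_j \sim \text{Geom}(p)$.
  Unknown parameter: a probability $p \in [0, 1]$.
  Max likelihood: $\hat{p} = \frac{n}{\sum_{j=1}^{n} x_j}$.
  Conjugate prior: $p \sim \text{Beta}(a, b)$.
  Conjugate posterior: $p \sim \text{Beta}(a + n, b + f)$, with $n$ successes and $f = \sum_{j=1}^{n} x_j - n$ failures.
  MAP: $p = \frac{a + n - 1}{a + b + n + f - 2}$.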

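A Poisson experiment on a time interval of length $T$ ($x \in \{0, 1, 2, \dots\}$)
  Likelihood: Poisson, $X \sim \text{Pois}(T\lambda)$.
  Unknown parameter: the process rate $\lambda > 0$.
  Max likelihood: $\hat{\lambda} = \frac{x}{T}$.
  Conjugate prior: $\lambda \sim \text{Gamma}(\alpha, \beta)$.
  Conjugate posterior: $\lambda \sim \text{Gamma}(\alpha + x, \beta + T)$, with $x$ observations in time $T$.
  MAP: $\lambda = \frac{\alpha + x - 1}{\beta + T}$.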
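
The Gamma update for a Poisson count can be sketched the same way. The code assumes the shape and rate parameterization of the Gamma distribution, which is what the posterior $\text{Gamma}(\alpha + x, \beta + T)$ implies, and uses invented numbers and hypothetical helper names.

```python
def gamma_poisson_update(alpha, beta, x, T):
    """Posterior Gamma(shape, rate) parameters after x events in a time interval of length T."""
    return alpha + x, beta + T


def gamma_mode(alpha, beta):
    """Mode of Gamma(shape=alpha, rate=beta), valid when alpha >= 1."""
    return (alpha - 1) / beta


x, T = 12, 4.0                                                  # 12 events observed over 4 time units
rate_mle = x / T                                                # maximum likelihood estimate of the rate
alpha_post, beta_post = gamma_poisson_update(3.0, 1.0, x, T)    # prior Gamma(3, 1)
rate_map = gamma_mode(alpha_post, beta_post)                    # (alpha + x - 1) / (beta + T)
```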

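Times between observations of a Poisson process ($t_j \in \mathbb{R}^+$, $j \in \{1, \dots, n\}$)
  Likelihood: Exponential, $X_j \sim \text{Expo}(\lambda)$.
  Unknown parameter: a rate $\lambda > 0$.
  Max likelihood: $\hat{\lambda} = \frac{n}{\sum_{j=1}^{n} t_j}$.
  Conjugate prior: $\lambda \sim \text{Gamma}(\alpha, \beta)$.
  Conjugate posterior: $\lambda \sim \text{Gamma}(\alpha + n, \beta + T)$, with $n$ observations and total time $T = \sum_{j=1}^{n} t_j$.
  MAP: $\lambda = \frac{\alpha + n - 1}{\beta + \sum_{j=1}^{n} t_j}$.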

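A Gaussian with known mean $\mu$ ($x_j \in \mathbb{R}$, $j \in \{1, \dots, n\}$)
  Likelihood: Gaussian, $X_j \sim N(\mu, \sigma)$.
  Unknown parameter: the Gaussian variance $\sigma^2$, or the Gaussian precision $\tau = \frac{1}{\sigma^2}$.
  Max likelihood: $\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$, $\hat{\tau} = \frac{n}{\sum_{i=1}^{n} (x_i - \mu)^2}$.
  Conjugate prior: $\tau \sim \text{Gamma}(\alpha, \beta)$.
  Conjugate posterior: $\tau \sim \text{Gamma}(\tilde{\alpha}, \tilde{\beta})$, with $\tilde{\alpha} = \alpha + \frac{n}{2}$ and $\tilde{\beta} = \beta + \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{2}$.
  MAP: $\tau = \frac{\tilde{\alpha} - 1}{\tilde{\beta}}$.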
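
As a sketch of the known-mean Gaussian case, the code below updates a Gamma prior on the precision. The data, the prior Gamma(2, 1), and the function name `precision_posterior` are assumptions made for the example; the Gamma is taken in its shape and rate form.

```python
def precision_posterior(alpha, beta, xs, mu):
    """Posterior Gamma(shape, rate) for the precision of a Gaussian with known mean mu."""
    ss = sum((x - mu) ** 2 for x in xs)      # sum of squared deviations from the known mean
    return alpha + len(xs) / 2, beta + ss / 2


xs, mu = [1.0, 3.0, 2.0, 2.0], 2.0
ss = sum((x - mu) ** 2 for x in xs)
var_mle = ss / len(xs)                       # maximum likelihood estimate of the variance
tau_mle = len(xs) / ss                       # maximum likelihood estimate of the precision

alpha_post, beta_post = precision_posterior(2.0, 1.0, xs, mu)
tau_map = (alpha_post - 1) / beta_post       # mode of the posterior Gamma
```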

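A Gaussian with known precision $\tau = \frac{1}{\sigma^2}$ ($x_j \in \mathbb{R}$, $j \in \{1, \dots, n\}$)
  Likelihood: Gaussian, $X_j \sim N(\mu, \sigma)$.
  Unknown parameter: the Gaussian mean $\mu \in \mathbb{R}$.
  Max likelihood: $\hat{\mu} = \frac{\sum_{j=1}^{n} x_j}{n}$.
  Conjugate prior: $\mu \sim N(m, t = \frac{1}{s^2})$.
  Conjugate posterior: $\mu \sim N(\tilde{m}, \tilde{t})$, with $\tilde{m} = \frac{t\, m + \tau \sum_{i=1}^{n} x_i}{t + n\tau}$ and $\tilde{t} = t + n\tau$.
  MAP: $\tilde{m}$.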
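
The posterior for the Gaussian mean is a precision-weighted average of the prior mean and the data. The sketch below uses made-up data and a prior with mean 0 and precision 1; `mean_posterior` is a hypothetical name.

```python
def mean_posterior(m, t, xs, tau):
    """Posterior mean and precision of mu, given prior N(m, precision t) and known data precision tau."""
    n = len(xs)
    t_post = t + n * tau                        # precisions add up
    m_post = (t * m + tau * sum(xs)) / t_post   # precision-weighted average of prior mean and data
    return m_post, t_post


xs, tau = [1.0, 2.0, 3.0], 1.0
mu_mle = sum(xs) / len(xs)                      # maximum likelihood estimate: the sample mean
m_post, t_post = mean_posterior(0.0, 1.0, xs, tau)
mu_map = m_post                                 # a Gaussian posterior has its mode at its mean
```

With more data the term $n\tau$ dominates $t$, so the MAP estimate moves from the prior mean toward the sample mean.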
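
A Gaussian with unknown parameters ($x_j \in \mathbb{R}$)
  Likelihood: Gaussian, $X_j \sim N(\mu, \sigma)$.
  Unknown parameters: the Gaussian mean and variance, $\mu \in \mathbb{R}$, $\sigma \in \mathbb{R}$.
  Max likelihood: $\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_j$, $\hat{\sigma}^2 = \frac{1}{n} \sum_{j=1}^{n} (x_j - \bar{x})^2$.
  Conjugate prior, conjugate posterior, and MAP: ... Normal-Gamma(m, t, α, β) ...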
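
The maximum likelihood estimates for a Gaussian with unknown mean and variance map directly onto the standard library. This is a sketch with invented data; note that the maximum likelihood variance divides by $n$, which corresponds to `statistics.pvariance`, not to `statistics.variance` (which divides by $n - 1$).

```python
import statistics

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical data

mu_hat = statistics.fmean(xs)                   # sample mean
sigma2_hat = statistics.pvariance(xs)           # (1/n) * sum of (x_j - mean)^2
unbiased = statistics.variance(xs)              # (1/(n-1)) * sum, shown only for contrast
```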
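
A Gaussian vector with unknown parameters ($x_j \in \mathbb{R}^p$)
  Likelihood: Gaussian, $X_j \sim N(\mu, \Sigma)$.
  Unknown parameters: the Gaussian parameters $\mu \in \mathbb{R}^p$, $\Sigma \in \mathbb{R}^{p \times p}$.
  Max likelihood: $\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_j$, $\hat{\Sigma} = \frac{1}{n} \sum_{j=1}^{n} (x_j - \bar{x}) \cdot (x_j - \bar{x})^t$.
  Conjugate prior, conjugate posterior, and MAP: ... Normal-Wishart ...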