CS 590M
A Probability and Statistics Refresher
Peter J. Haas

Probability theory provides a mathematical framework for modelling real-world situations characterized by “randomness,” “uncertainty,” or “unpredictability.” The theory also provides a toolbox of techniques for computing probabilities of interest, as well as related quantities such as expected values. As with any mathematical tool, the usefulness of a probability model depends on how well the model mirrors the real-world situation of interest. The field of statistics is largely concerned with matching probabilistic models to data. A statistical analysis starts with a set of data and assumes that the data is generated according to a probability model. The data is then used to make inferences about this probability model. The goal can be to fit “the best” probability model to the data, to estimate the values of certain parameters of an assumed model, or to test hypotheses about the model. The results of the inference step can be used for purposes of prediction, analysis, and decisionmaking.

In the context of simulation, we often use statistical techniques to develop our simulation model of the system under study; a simulation model is essentially a complicated probability model. For such a model, the probabilities and expected values that are needed for prediction and decisionmaking typically cannot be computed analytically or numerically. We therefore simulate the model, i.e., use the model to generate data, and then once again apply statistical techniques to infer the model properties of interest based on the output of the simulation.

The following sections summarize some basic topics in probability and statistics. Our emphasis is on material that we will use extensively in the course. Our discussion glosses over some of the technical fine points; see Section 11 for some pointers to more careful discussions.

1 Probabilistic Experiments and Events

1.1 Probabilistic Experiments

We start with a set Ω, called the sample space, that represents the possible elementary outcomes of a probabilistic experiment. A classic example of a simple probabilistic experiment is the rolling of a pair of dice, one black and one white. Record the outcome of this experiment by a pair (n, m), where n (resp., m) is the number of spots showing on the black (resp., white) die. Then Ω is the set that contains the 36 possible elementary outcomes:

Ω = { (1, 1), (1, 2), ..., (1, 6), (2, 1), (2, 2), ..., (2, 6), ..., (6, 1), (6, 2), ..., (6, 6) }.

Figure 1: Definition of event for dartboard example.

As a more complicated example, consider an experiment in which we throw a dart at a square dartboard. (We are assuming here that our throw is so erratic that it can truly be considered unpredictable.) One way to describe the outcome of this experiment is to give the (x, y) coordinates of the dart’s position, where we locate the origin (0, 0) at the lower left corner. Note that Ω is an uncountably infinite set. As can be inferred from these examples, there is often some leeway in how Ω is defined when setting up a probability model.

1.2 Events

A subset A ⊆ Ω is called an event—if the outcome of the experiment is an element of A, then we say that “event A has occurred.” For example, consider the two-dice example, together with the event A = { (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1) }.

In words, A is the event in which “the sum of the spots showing on the two dice equals 7.” If, for example, the outcome of our dice experiment is (3, 4), then we say that event A has occurred since (3, 4) ∈ A. For our dart experiment, assuming that we measure coordinates in feet and the dartboard is 1′ × 1′, we say that the event “the dart hits the lower half of the dartboard” occurs if the outcome is an element of the set

A = { (x, y) : 0 ≤ x ≤ 1 and 0 ≤ y ≤ 0.5 }. (1.1)

This event is illustrated in Figure 1. In general, the event Ω is the (certain) event that “an outcome occurs” when we run the probabilistic experiment. At the other extreme, the “empty” event ∅ contains no elements, and corresponds to the (impossible) situation in which no outcome occurs.

There is a close connection between set-theoretic notions and the algebra of events. The complementary event A^c = Ω − A is the event in which A does not occur. The event A ∩ B is the event in which both A and B simultaneously occur, and the event A ∪ B is the event in which A occurs, or B occurs, or both. The event B − A = B ∩ A^c is the event in which B occurs and A does not occur. Events A and B are disjoint if they have no elements in common: A ∩ B = ∅. Intuitively, disjoint events can never occur simultaneously. For an experiment in which we roll one die and record the number of spots, so that Ω = { 1, 2, 3, 4, 5, 6 }, the events A = { 1, 3, 5 } = “the die comes up odd” and B = A^c = { 2, 4, 6 } = “the die comes up even” are disjoint. If A ⊆ B, then the occurrence of A implies the occurrence of B.

(ii) P(A^c) = 1 − P(A).

(iii) P(A) ≤ P(B) whenever A ⊆ B.

(iv) (Boole’s inequality) $P\left(\bigcup_n A_n\right) \le \sum_n P(A_n)$.

(v) (Bonferroni’s inequality) $P\left(\bigcap_n A_n\right) \ge 1 - \sum_n P(A_n^c)$.

Observe that Boole’s inequality holds for any finite or countably infinite collection of events—the inequality becomes an equality if the events are mutually disjoint.

2.3 Probability Spaces

We refer to (Ω, P), a sample space together with a probability measure on events (i.e., on subsets of Ω), as a probability space. A probability space is the mathematical formalization of a probabilistic experiment.^2 Some useful advice: whenever anybody starts talking to you about probabilities, always make sure that you can clearly identify the underlying probabilistic experiment and probability space; if you can’t, then the problem is most likely ill-defined.

2.4 Independent Events

Events A and B are independent if P(A ∩ B) = P(A)P(B). In our experiment with two dice, suppose that each elementary outcome is equally likely. Then the events A = “the black die comes up even” and B = “the white die comes up even” are independent, since A ∩ B comprises 9 outcomes and each of A and B comprises 18 outcomes, so that P(A) = P(B) = 18/36 = 1/2 and P(A ∩ B) = 9/36 = 1/4. Events A_1, A_2, ..., A_n are mutually independent if

$$P(A_{n_1} \cap A_{n_2} \cap \cdots \cap A_{n_k}) = P(A_{n_1}) P(A_{n_2}) \cdots P(A_{n_k})$$

for 2 ≤ k ≤ n and 1 ≤ n_1 < n_2 < ··· < n_k ≤ n. The events in a countably infinite collection are said to be mutually independent if the events in every finite subcollection are mutually independent.
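The two-dice independence calculation can be checked by brute-force enumeration. The following is a minimal sketch, assuming equally likely outcomes; the helper names (`OMEGA`, `prob`, `black_even`, `white_even`) are illustrative and not part of the notes.

```python
from fractions import Fraction
from itertools import product

# Sample space of the two-dice experiment: (black, white) pairs.
OMEGA = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event (a predicate on outcomes), assuming
    all 36 elementary outcomes are equally likely."""
    favorable = sum(1 for outcome in OMEGA if event(outcome))
    return Fraction(favorable, len(OMEGA))


def black_even(outcome):
    return outcome[0] % 2 == 0


def white_even(outcome):
    return outcome[1] % 2 == 0


p_a = prob(black_even)
p_b = prob(white_even)
p_ab = prob(lambda w: black_even(w) and white_even(w))

# Independence holds exactly when P(A ∩ B) = P(A) P(B).
independent = (p_ab == p_a * p_b)
print(p_a, p_b, p_ab, independent)
```

Exact rational arithmetic (`Fraction`) is used so that the product rule can be tested with `==` rather than a floating-point tolerance.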

3 Conditional Probability

3.1 Definition of Conditional Probability

Conditional probability formalizes the notion of “having information” about the outcome of a probabilistic experiment. Consider the two-dice example with equally likely outcomes. Let A = “the black die has an even number of spots” and B = “the sum of the spots showing on the two dice equals 4.” In the absence of any information about the outcome of the experiment, an observer would estimate the (unconditional) probability of event A as 0.5, since 18 of the 36 possible outcomes are elements of A. Given the additional information that event B has occurred, the observer knows that the outcome is one of (1, 3), (2, 2) or (3, 1). Of these, only the outcome (2, 2) corresponds to the occurrence of event A. Since the outcomes (1, 3), (2, 2), and (3, 1) are equally likely, the observer would estimate P(A | B), the conditional probability that A has occurred given that B has occurred, as 1/3. In general, for events A and B such that P(B) > 0, we define

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}. \tag{3.1}$$

^2 As mentioned previously, when Ω is uncountably infinite there are in general certain weird sets A ⊂ Ω on which probabilities cannot be defined. Therefore, advanced texts on probability define a probability space as a triple (Ω, P, F), where F is the “σ-field” of subsets on which P is defined.
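The conditional probability in the two-dice example can be reproduced directly from the defining ratio. This is a small sketch under the equally-likely-outcomes assumption; `cond_prob` and the event predicates are hypothetical names chosen for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space of the two-dice experiment: (black, white) pairs.
OMEGA = list(product(range(1, 7), repeat=2))


def prob(event):
    """P(event) when all elementary outcomes are equally likely."""
    return Fraction(sum(1 for w in OMEGA if event(w)), len(OMEGA))


def cond_prob(event_a, event_b):
    """P(A | B) = P(A ∩ B) / P(B), defined only when P(B) > 0."""
    p_b = prob(event_b)
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return prob(lambda w: event_a(w) and event_b(w)) / p_b


def black_even(w):
    return w[0] % 2 == 0


def sum_is_four(w):
    return w[0] + w[1] == 4


p_a = prob(black_even)
p_a_given_b = cond_prob(black_even, sum_is_four)
print(p_a, p_a_given_b)
```

Knowing that the sum is 4 lowers the assessed probability of an even black die from 1/2 to 1/3, so these two events are not independent.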

3.2 Independence, Law of Total Probability, Product Representation

Observe that our previous definition of independence for A and B is equivalent to requiring that P(A | B) = P(A); i.e., knowing that B has occurred does not change our assessment of the probability that A has occurred. Turning the definition in (3.1) around, we have the important relationship P(A ∩ B) = P(A | B)P(B). Observe that, by one of our basic ground rules, P(A) = P(A ∩ B) + P(A ∩ B^c) = P(A | B)P(B) + P(A | B^c)P(B^c). An important generalization of this result is the law of total probability: if B_1, B_2, ..., B_n are mutually disjoint events such that B_1 ∪ B_2 ∪ ··· ∪ B_n = Ω, then

$$P(A) = P(A \mid B_1)P(B_1) + P(A \mid B_2)P(B_2) + \cdots + P(A \mid B_n)P(B_n). \tag{3.2}$$

Another important result that follows from the basic definition of conditional probability asserts that, for any events A_1, A_2, ..., A_n,

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap A_2 \cap \cdots \cap A_{n-1}). \tag{3.3}$$

For example, suppose that we have n people in a room (where 2 ≤ n ≤ 365), and each person is equally likely to have been born on any of the 365 days of the year (ignore leap years and other anomalies). For 1 ≤ i < n, let A_i = “the birthday of person (i + 1) is different from those of persons 1 through i.” Then P(A_i | A_1 ∩ ··· ∩ A_{i−1}) = (365 − i)/365 for 1 ≤ i < n. It follows from (3.3) that the probability of the event B_n = “at least two people in the room share a birthday” is

$$P(B_n) = 1 - P(A_1 \cap \cdots \cap A_{n-1}) = 1 - \prod_{i=1}^{n-1} \left(1 - \frac{i}{365}\right).$$

This solution to the “birthday problem” is well known because it is somewhat counter-intuitive: for n = 23, the probability of at least one shared birthday exceeds 50%, and for n = 50 the probability exceeds 97%.
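The product formula for the birthday problem is easy to evaluate numerically. The sketch below assumes 365 equally likely birthdays, as in the text; the function name `shared_birthday_prob` is made up for this illustration.

```python
def shared_birthday_prob(n, days=365):
    """Probability that at least two of n people share a birthday,
    computed as 1 minus the product of (days - i)/days for i = 1..n-1."""
    if not 2 <= n <= days:
        raise ValueError("need 2 <= n <= days")
    p_all_distinct = 1.0
    for i in range(1, n):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


for n in (10, 22, 23, 50):
    print(n, round(shared_birthday_prob(n), 4))
```

Evaluating the function at n = 22 and n = 23 shows where the probability first crosses one half.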

3.3 Bayes’ Rule

Bayes’ Rule can be viewed as formalizing the process of learning from observations or data. Let A_1, A_2, ..., A_n be a set of mutually exclusive and exhaustive events: A_i ∩ A_j = ∅ for i ≠ j and

Figure 2: pmf for random variable X = “sum of spots” in two-dice example.

for each subset A of the real numbers. When X is discrete, i.e., takes values in a finite or countably infinite set S = { a_1, a_2, ... }, the distribution of X is usually described in terms of the probability mass function (pmf), sometimes denoted by p_X, where

$$p_X(a_i) = \mu(\{a_i\}) = P\{X = a_i\}$$

for a_i ∈ S. Of course, a pmf p_X must satisfy p_X(a_i) ≥ 0 for all i and $\sum_i p_X(a_i) = 1$. The set S is sometimes called the support of X. In the two-dice example with fair and independent dice, all of the elementary outcomes are equally likely, and we have, for example,

$$p_X(4) = P\{X = 4\} = P(A_4) = 3/36,$$

where A_4 = { ω ∈ Ω : X(ω) = 4 } = { (1, 3), (2, 2), (3, 1) }.

The complete pmf for X is displayed in Table 1 and plotted in Figure 2.

The situation is usually more complicated for a continuous random variable X, i.e., a random variable taking values in an uncountably infinite set S such as an interval [a, b] of the real line. In this case, we typically have P{ X = x } = 0 for any x ∈ S. The distribution of a continuous random variable X is often described in terms of the probability density function (pdf), sometimes denoted by f_X. Roughly speaking, for x ∈ S and a small increment ∆x > 0, we take the quantity f_X(x)∆x as the approximate probability that X takes on a value in the interval [x, x + ∆x]. More precisely, for a subset A ⊆ S, we have

$$\mu(A) = P\{X \in A\} = \int_A f_X(x)\,dx.$$

In analogy with a pmf, a pdf must satisfy f_X(x) ≥ 0 for all x and $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$.

Figure 3: The pdf and cdf for a U [0, 1] random variable U.

Figure 4: The pmf and cdf for a random variable X with pX (0.5) = 0.3, pX (1.5) = 0.5, and pX (3) = 0.2.

As an example, consider a random variable U having the “uniform distribution on [0, 1],” abbreviated U[0, 1]. Here U is equally likely to take on any value between 0 and 1. Formally, we set

$$f_U(x) = \begin{cases} 0 & \text{if } x < 0; \\ 1 & \text{if } 0 \le x < 1; \\ 0 & \text{if } x \ge 1 \end{cases}$$

for all real values of x.^3 Then, e.g., for A = [0.25, 0.75], we have

$$P\{U \in A\} = P\{0.25 \le U \le 0.75\} = \mu(A) = \int_A f_U(x)\,dx = \int_{0.25}^{0.75} 1\,dx = x\,\Big|_{0.25}^{0.75} = 0.5.$$

For either a discrete or continuous random variable X satisfying the regularity condition^4 P{ −∞ < X < ∞ } = 1, the right-continuous function F_X defined by

$$F_X(x) = P\{X \le x\} = \mu\bigl((-\infty, x]\bigr)$$

for real-valued x is the cumulative distribution function (cdf) of X. The function F_X is nondecreasing, with F_X(−∞) = 0 and F_X(∞) = 1. For a continuous random variable X, the cdf is

^3 We have defined f_U(x) at the points x = 0 and x = 1 so that f_U is right-continuous; this is a standard convention. Changing the definition at these two points has no real effect, since P{ U = x } = 0 for any single point x.

^4 Such a random variable is often called proper. We restrict attention throughout to proper random variables.

generalization of our results to three or more random variables being obvious.

5.1 Joint, Marginal, and Conditional Distributions

We can describe the joint probability distribution of X and Y by means of the joint cdf F_{X,Y}, where F_{X,Y}(x, y) = P{ X ≤ x, Y ≤ y }.^5 If X and Y are both discrete, we can also describe the joint distribution using the joint pmf p_{X,Y}(x, y) = P{ X = x, Y = y }. If X and Y are both continuous, then (assuming differentiability) we can describe the joint distribution using the joint pdf defined by the mixed partial derivative f_{X,Y} = ∂^2 F_{X,Y} / ∂x ∂y. Given a joint distribution of X and Y, we might be interested in the marginal distribution of X alone. The marginal cdf of X is defined in terms of the joint cdf by

FX (x) = P { X ≤ x } = P { X ≤ x, Y < ∞ } = FX,Y (x, ∞).

If X and Y are both discrete or both continuous, then the marginal pmf or marginal pdf of X can be computed from the corresponding joint distribution:

$$p_X(x) = P\{X = x\} = \sum_y p_{X,Y}(x, y) \quad\text{or}\quad f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy.$$

Of course, we can define quantities F_Y(y), p_Y(y), and f_Y(y) in an analogous manner to obtain the marginal distribution of Y. For discrete random variables X and Y, we can define the conditional pmf of X, given that Y = y, by adapting our general definition of conditional probability:

$$p_{X|Y}(x \mid y) = P\{X = x \mid Y = y\} = \frac{P\{X = x, Y = y\}}{P\{Y = y\}} = \frac{p_{X,Y}(x, y)}{p_Y(y)}.$$

If X and Y are both continuous, then a natural definition for a conditional pdf is

$$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}.$$

We conclude this section by giving a continuous analogue of Bayes’ Rule (3.4):

$$f_{X|Y}(x \mid y) = \frac{f_{Y|X}(y \mid x)\,f_X(x)}{\int_{-\infty}^{\infty} f_{Y|X}(y \mid x')\,f_X(x')\,dx'}.$$

This rule shows how to update the prior density fX of the random variable X to a posterior density fX|Y given the observation Y = y. The conditional density fY |X is called the likelihood of Y , given the value of X. For example, if X has a Beta(α, β) distribution (see Sec. 7.5 below) and Y has a Binom(n, X) distribution (see Sec. 7.7), then

$$f_{X|Y}(x \mid y) = \frac{\binom{n}{y}\,x^{y+\alpha-1}(1-x)^{n-y+\beta-1}/B(\alpha,\beta)}{\int_0^1 \binom{n}{y}\,z^{y+\alpha-1}(1-z)^{n-y+\beta-1}\,dz\,/\,B(\alpha,\beta)} = \frac{x^{y+\alpha-1}(1-x)^{n-y+\beta-1}}{B(y+\alpha,\,n-y+\beta)}.$$

That is, f_{X|Y} is the pdf of a Beta(α + Y, β + n − Y) distribution. Because the prior and posterior distributions belong to the same family, they are called conjugate distributions, and the beta distribution is called a conjugate prior for the binomial distribution.

^5 We use notation such as P{ X ≤ x, Y ≤ y } instead of the more cumbersome notation P({X ≤ x} ∩ {Y ≤ y}).
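The beta-binomial update can be sanity-checked numerically by evaluating the continuous Bayes' rule with a simple quadrature and comparing it with the closed-form Beta(α + y, β + n − y) density. This is only a sketch: the parameter values, the midpoint rule, and the helper names (`beta_fn`, `beta_pdf`, `posterior_by_bayes`) are assumptions made for illustration.

```python
import math


def beta_fn(a, b):
    """B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b), via log-gamma."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at a point x in (0, 1)."""
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_fn(a, b)


def posterior_by_bayes(x, y, n, alpha, beta, grid=20000):
    """Posterior density at x: likelihood times prior, divided by a
    normalizing integral approximated with the midpoint rule on (0, 1)."""
    def unnormalized(z):
        likelihood = math.comb(n, y) * z ** y * (1 - z) ** (n - y)
        return likelihood * beta_pdf(z, alpha, beta)

    h = 1.0 / grid
    normalizer = sum(unnormalized((i + 0.5) * h) for i in range(grid)) * h
    return unnormalized(x) / normalizer


# Illustrative values (not from the notes): Beta(2, 3) prior, 4 successes in 10 trials.
alpha, beta, n, y = 2.0, 3.0, 10, 4
x = 0.35
numeric = posterior_by_bayes(x, y, n, alpha, beta)
closed_form = beta_pdf(x, alpha + y, beta + n - y)
print(numeric, closed_form)
```

The two values should agree up to quadrature error, which illustrates why the conjugate form is convenient: no integration is needed to obtain the posterior.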

5.2 Independent Random Variables

The real-valued random variables X_1, X_2, ..., X_n are mutually independent if

$$P\{X_1 \in A_1, X_2 \in A_2, \ldots, X_n \in A_n\} = P\{X_1 \in A_1\}\,P\{X_2 \in A_2\} \cdots P\{X_n \in A_n\}$$

for every collection of subsets A_1, A_2, ..., A_n of the real line. To establish independence, it suffices to show that

$$P\{X_1 \le x_1, X_2 \le x_2, \ldots, X_n \le x_n\} = P\{X_1 \le x_1\}\,P\{X_2 \le x_2\} \cdots P\{X_n \le x_n\}$$

for every set of real numbers x_1, x_2, ..., x_n, i.e., it suffices to show that the joint cdf factors. If X_1, X_2, ..., X_n are all discrete or all continuous, then it suffices to show that the joint pmf or joint pdf factors:

$$p_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) = p_{X_1}(x_1)\,p_{X_2}(x_2) \cdots p_{X_n}(x_n)$$

or

$$f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) = f_{X_1}(x_1)\,f_{X_2}(x_2) \cdots f_{X_n}(x_n).$$

A countably infinite collection of random variables is said to be mutually independent if the random variables in each finite subcollection are independent. Observe that X and Y are independent if and only if the conditional pmf or pdf equals the marginal pmf or pdf for X.

6 Expectation

6.1 Definition

The expected value (aka the mean) of a random variable X, denoted E [X], is one way of roughly indicating where the “center” of the distribution of X is located. It also has the interpretation of being the average value of X over many repetitions of the probabilistic experiment. The idea is to weight each value of X by the probability of its occurrence. If X is discrete, then we define

$$E[X] = \sum_{a_i \in S} a_i\,p_X(a_i),$$

and if X is continuous, then we define

$$E[X] = \int_{-\infty}^{\infty} x\,f_X(x)\,dx.$$

If X is a mixed random variable, then we compute the expected value by combining summation and integration. For example, consider the random variables U , X, and Y whose distributions are displayed in Figures 3 through 5. The expectations for these random variables are computed as

$$E[U] = \int_{-\infty}^{\infty} x\,f_U(x)\,dx = \int_0^1 x\,dx = \frac{x^2}{2}\,\Big|_0^1$$

That is, W (u, v) is the absolute value of the determinant of the “Jacobian” matrix

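$$J(u, v) = \begin{pmatrix} \dfrac{\partial g_{\mathrm{inv}}(u,v)}{\partial u} & \dfrac{\partial g_{\mathrm{inv}}(u,v)}{\partial v} \\[2ex] \dfrac{\partial h_{\mathrm{inv}}(u,v)}{\partial u} & \dfrac{\partial h_{\mathrm{inv}}(u,v)}{\partial v} \end{pmatrix}.$$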

In the special case where u = g(x), v = h(y), and A = [a_1, a_2] × [b_1, b_2], we have x = g_inv(u) and y = h_inv(v), and the above formula specializes to

$$\int_{a_1}^{a_2} \int_{b_1}^{b_2} \varphi(x, y)\,dx\,dy = \int_{g(a_1)}^{g(a_2)} \int_{h(b_1)}^{h(b_2)} \varphi\bigl(g_{\mathrm{inv}}(u), h_{\mathrm{inv}}(v)\bigr)\,g'_{\mathrm{inv}}(u)\,h'_{\mathrm{inv}}(v)\,du\,dv.$$

6.2 Basic Properties

We now state some basic properties of the expectation operator. Here and elsewhere, a property of a random variable is said to hold almost surely (a.s.) if it holds with probability 1. If X and Y are random variables and a and b are real-valued constants, then

(i) X is integrable if and only if E[|X|] < ∞.

(ii) If X = 0 a.s., then E[X] = 0.

(iii) If X and Y are integrable with X ≤ Y a.s., then E[X] ≤ E[Y].

(iv) If X and Y are integrable, then E[aX + bY] = aE[X] + bE[Y].

(v) |E[X]| ≤ E[|X|].

(vi) If X and Y are independent, then E[XY] = E[X]E[Y].

Note that, in general, E [XY ] = E [X] E [Y ] does not imply that X and Y are independent.
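A standard counterexample makes the last remark concrete. In the sketch below (an illustrative choice, not taken from the notes), X is uniform on {−1, 0, 1} and Y = X², so Y is completely determined by X, yet E[XY] = E[X]E[Y].

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1}; Y = X**2 is a deterministic function of X.
support = [-1, 0, 1]
p = Fraction(1, 3)

e_x = sum(p * x for x in support)             # E[X]
e_y = sum(p * x ** 2 for x in support)        # E[Y] = E[X^2]
e_xy = sum(p * x * x ** 2 for x in support)   # E[XY] = E[X^3]

# Independence would require P{X=0, Y=0} = P{X=0} P{Y=0}.
p_joint = p          # X = 0 forces Y = 0, so the joint probability is 1/3
p_product = p * p    # P{X=0} = 1/3 and P{Y=0} = 1/3

print(e_xy == e_x * e_y, p_joint == p_product)
```

The expectations factor, but the joint probability does not, so X and Y are dependent.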

6.3 Moments, Variance, Standard Deviation

The rth moment of a random variable X is E[X^r] and the rth central moment is E[(X − μ)^r], where μ = E[X]; the second central moment is called the variance of X and denoted Var[X]. It is easy to show that Var[X] = E[X^2] − μ^2.

The standard deviation of X is defined as the square root of the variance: Std[X] = Var^{1/2}[X]. For real numbers c and d, we have Var[cX + d] = c^2 Var[X] and Std[cX + d] = |c| Std[X]. The variance and standard deviation measure the degree to which the probability distribution of X is concentrated around its mean μ = E[X]. When the variance or standard deviation equals 0, then X = μ with probability 1. The larger the variance or standard deviation, the greater the probability that X can take on a value far away from the mean. We frequently talk about moments of distributions, rather than random variables. E.g., the mean μ and variance σ^2 of a random variable X having cdf F and pdf f are given by

$$\mu = \int_{-\infty}^{\infty} x\,f(x)\,dx \quad\text{and}\quad \sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx.$$

Figure 6: Definition of a convex function.

We may refer to μ and σ^2 as the “mean and variance of X,” the “mean and variance of f,” or the “mean and variance of F.” Sometimes one may see the notation

$$\mu = \int_{-\infty}^{\infty} x\,dF \quad\text{and}\quad \sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2\,dF$$

used for these quantities; this notation is mostly used to refer to situations in which X is a mixed random variable, so that F has some discontinuities. In this case, moments are computed by a combination of integration (over the intervals where F is differentiable) and summation (over the discontinuity points). E.g., for the mixed random variable Y in Figure 5, we have E [Y ] = (0.5)(0.2) + (2)(0.1) +

  1. 0 0.^6 y dy^ +^
  1. 5 0.^8 y dy^ = 1.775.

6.4 Identities and Inequalities

There are many identities and inequalities for moments of random variables; a very useful inequality for our purposes is Hölder’s inequality: let X and Y be random variables, and let p and q be constants such that 1 < p < ∞ and 1/p + 1/q = 1. Then

$$E[|XY|] \le E^{1/p}[|X|^p]\,E^{1/q}[|Y|^q].$$

Take p = q = 2 to obtain the Cauchy–Schwarz inequality:

$$E[|XY|] \le E^{1/2}[X^2]\,E^{1/2}[Y^2].$$

In particular, E^2[X] ≤ E[X^2]; to see this, take Y ≡ 1 and use the fact that E[X] ≤ E[|X|]. Next, fix 0 < α ≤ β and take X = |Z|^α, Y ≡ 1, and p = β/α in Hölder’s inequality to obtain

$$E^{1/\alpha}[|Z|^\alpha] \le E^{1/\beta}[|Z|^\beta]$$

Figure 7: Samples of (X, Y) pairs from distributions with various correlation values. [Four scatter plots of Y against X, panels (a) through (d), with panel labels r_{X,Y} ≈ 1, r_{X,Y} ≈ 0 (two panels), and r_{X,Y} ≈ −1.]

between X and Y , whereas a value of ρX,Y close to 0 indicates the absence of a discernable linear relationship; see Figure 7, which plots some samples from the joint distribution of (X, Y ) under several different scenarios. Note that ρX,Y is close to 0 in Figure 7(d), even though there is a strong relationship between X and Y —the reason is that the relationship is nonlinear. It follows from basic property (vi) of expectation that if X and Y are independent, then Cov [X, Y ] = 0; the converse assertion is not true in general. Some simple algebra shows that

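$$\operatorname{Var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \operatorname{Var}[X_i] + \sum_{i \ne j} \operatorname{Cov}[X_i, X_j],$$

so that

$$\operatorname{Var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \operatorname{Var}[X_i]$$

if X_1, X_2, ..., X_n are mutually independent.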

7 Some Important Probability Distributions

In this section we discuss some probability distributions that play a particularly important role in our study of simulation; see Tables 6.3 and 6.4 in the textbook for more details. We start with several continuous distributions.

7.1 Uniform Distribution

We have already discussed the U [0, 1] distribution. In general, the distribution of a random variable that is equally likely to take on a value in a subinterval [a, b] of the real line is called the uniform distribution on [a, b], abbreviated U [a, b]. The pdf and cdf for a U [a, b] random variable U are given by

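$$f_U(x) = \begin{cases} 0 & \text{if } x < a; \\ 1/(b-a) & \text{if } a \le x < b; \\ 0 & \text{if } x \ge b \end{cases} \quad\text{and}\quad F_U(x) = \begin{cases} 0 & \text{if } x < a; \\ (x-a)/(b-a) & \text{if } a \le x \le b; \\ 1 & \text{if } x > b. \end{cases} \tag{7.1}$$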

If U is a U [0, 1] random variable, then V = a + (b − a)U is a U [a, b] random variable. The easiest way to prove this assertion (and many others like it) is to work with the cdf of each random variable:

$$F_V(x) = P\{V \le x\} = P\{a + (b-a)U \le x\} = P\{U \le (x-a)/(b-a)\} = F_U\bigl((x-a)/(b-a)\bigr).$$

By inspection, F_U((x − a)/(b − a)) coincides with the function F_U in (7.1). The pdf f_U is then obtained from F_U by differentiation. The mean and variance of the U[a, b] distribution are (a + b)/2 and (b − a)^2/12.

7.2 Exponential Distribution

The pdf and cdf of an exponential distribution with intensity λ, abbreviated Exp(λ), are

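$$f(x) = \begin{cases} 0 & \text{if } x < 0; \\ \lambda e^{-\lambda x} & \text{if } x \ge 0 \end{cases} \quad\text{and}\quad F(x) = \begin{cases} 0 & \text{if } x < 0; \\ 1 - e^{-\lambda x} & \text{if } x \ge 0. \end{cases}$$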

This distribution is also called the “negative exponential distribution.” The mean and variance are given by 1/λ and 1/λ^2. A key feature of this distribution is the “memoryless” property: if X is an Exp(λ) random variable and u, v are nonnegative constants, then

$$P\{X > u + v \mid X > u\} = \frac{P\{X > u + v\}}{P\{X > u\}} = \frac{\bar F(u + v)}{\bar F(u)} = \frac{e^{-\lambda(u+v)}}{e^{-\lambda u}} = e^{-\lambda v} = P\{X > v\},$$

where, as before, F̄ = 1 − F. E.g., suppose that X represents the waiting time until a specified event occurs. Then the probability that you have to wait at least v more time units is independent of the amount of time u that you have already waited; at time t = u, the “past” has been forgotten with respect to estimating the probability distribution of the remaining time until the event occurs. Another important property of the exponential distribution concerns the distribution of Z = min(X, Y), where X is Exp(λ_1), Y is Exp(λ_2), and X and Y are independent. The distribution of Z can be computed as follows:

$$P\{Z > z\} = P\{\min(X, Y) > z\} = P\{X > z, Y > z\} = P\{X > z\}\,P\{Y > z\} = e^{-\lambda_1 z} e^{-\lambda_2 z} = e^{-(\lambda_1 + \lambda_2) z}.$$
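The tail formula for Z = min(X, Y) can be compared against a quick simulation. This is only a sketch with arbitrary rates, threshold, seed, and sample size; it relies on the standard-library `random.expovariate`, whose argument is the rate λ.

```python
import math
import random


def empirical_tail_of_min(rate_x, rate_y, z, draws, rng):
    """Estimate P{min(X, Y) > z} for independent X ~ Exp(rate_x)
    and Y ~ Exp(rate_y) by straightforward simulation."""
    exceed = 0
    for _ in range(draws):
        x = rng.expovariate(rate_x)
        y = rng.expovariate(rate_y)
        if min(x, y) > z:
            exceed += 1
    return exceed / draws


rng = random.Random(590)            # fixed seed so the run is repeatable
rate_x, rate_y, z = 1.0, 2.0, 0.3
estimate = empirical_tail_of_min(rate_x, rate_y, z, 200_000, rng)
exact = math.exp(-(rate_x + rate_y) * z)
print(estimate, exact)
```

The estimate should be close to exp(−(λ_1 + λ_2)z), the tail probability derived above.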

Figure 8: pdf for standard normal distribution and Student t distribution with 2 degrees of freedom. [Curve labels: N(0, 1); Student t, 2 d.o.f.]

for x ≥ 0 and f(x; α, λ) = 0 for x < 0. Here Γ(α) is the gamma function, defined by $\Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x}\,dx$ and satisfying Γ(α) = (α − 1)! whenever α is a positive integer. The mean and variance of the distribution are α/λ and α/λ^2.

7.5 Beta Distribution

The pdf of the Beta distribution with parameters α and β, abbreviated Beta(α, β), is given by

$$f(x; \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}$$

if x ∈ [0, 1], and f(x; α, β) = 0 otherwise. Here B(α, β) is the beta function, defined by $B(\alpha, \beta) = \int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\,dx$ and satisfying B(α, β) = Γ(α)Γ(β)/Γ(α + β). The mean and variance of the distribution are α/(α + β) and αβ(α + β)^{−2}(α + β + 1)^{−1}.

7.6 Discrete Uniform Distribution

We now discuss some important discrete distributions. A discrete uniform random variable X with range [n, m], abbreviated DU[n, m], is equally likely to take on the values n, n + 1, ..., m, i.e., p_X(k) = 1/(m − n + 1) for k = n, n + 1, ..., m. The mean and variance are given by (n + m)/2 and [(m − n + 1)^2 − 1]/12. If U is a continuous U[0, 1] random variable, then V = ⌊n + (m − n + 1)U⌋ is DU[n, m], where ⌊x⌋ is the largest integer less than or equal to x. Here’s a proof: fix k ∈ { n, n + 1, ..., m } and observe that

$$P\{V = k\} = P\{\lfloor n + (m-n+1)U \rfloor = k\} = P\{k \le n + (m-n+1)U < k+1\} = P\left\{\frac{k-n}{m-n+1} \le U < \frac{k-n+1}{m-n+1}\right\} = \frac{1}{m-n+1}.$$
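The floor construction translates directly into a sampler. The sketch below is illustrative only: the range, seed, and sample size are arbitrary, and the `min` guard is an added precaution against floating-point rounding pushing the result up to m + 1, which the exact real-number argument above rules out.

```python
import math
import random
from collections import Counter


def sample_discrete_uniform(n, m, rng):
    """Return floor(n + (m - n + 1) * U) with U drawn from U[0, 1).
    The min() guard covers the rare rounding case where the product
    rounds up to exactly m + 1 in floating point."""
    return min(m, math.floor(n + (m - n + 1) * rng.random()))


rng = random.Random(2024)           # fixed seed so the run is repeatable
n, m = 3, 8
draws = 120_000
counts = Counter(sample_discrete_uniform(n, m, rng) for _ in range(draws))
freqs = {k: counts[k] / draws for k in range(n, m + 1)}

for k in sorted(freqs):
    print(k, round(freqs[k], 4))
```

Each relative frequency should be near 1/(m − n + 1), which is 1/6 for the range used here.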

7.7 Bernoulli and Binomial Distributions

The Bernoulli distribution with parameter p, abbreviated Bern(p), has pmf given by p(1) = 1 − p(0) = p. That is, a Bern(p) random variable X equals 1 with probability p and 0 with probability 1 − p. Often, X is interpreted as an indicator variable for a “Bernoulli trial with success probability p.” Here X = 1 if the trial is a “success” and X = 0 if the trial is a “failure.” The mean and variance are given by p and p(1 − p). The number of successes S_n in n independent Bernoulli trials, each with success probability p, can be represented as S_n = X_1 + X_2 + ··· + X_n, where X_1, X_2, ..., X_n are independent Bern(p) random variables. The random variable S_n has the binomial distribution with parameters n and p, abbreviated Binom(n, p). The pmf is given by

$$p(k) = \binom{n}{k} p^k (1-p)^{n-k}$$

for k = 0, 1, ..., n. The mean and variance are given by np and np(1 − p).
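The binomial pmf and its moments can be verified exactly for a specific case. The values n = 12 and p = 1/4 below are arbitrary illustrative choices, and `binom_pmf` is a hypothetical helper name.

```python
import math
from fractions import Fraction


def binom_pmf(k, n, p):
    """p(k) = C(n, k) p^k (1 - p)^(n - k) for k = 0, 1, ..., n."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 12, Fraction(1, 4)
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

# Compare with the closed forms np and np(1 - p).
print(total, mean, n * p, variance, n * p * (1 - p))
```

Because `Fraction` arithmetic is exact, the computed mean and variance match np and np(1 − p) with no rounding error.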

7.8 Geometric Distribution

The geometric distribution with parameter p, abbreviated Geom(p), has support on the nonnegative integers and a pmf and cdf given by

$$p(k) = p(1-p)^k \quad\text{and}\quad F(k) = 1 - (1-p)^{k+1}$$

for k = 0, 1, 2, ..., where p ∈ (0, 1]. The mean and variance are given by (1 − p)/p and (1 − p)/p^2. Observe that, if X is Geom(p), then

$$P\{X \ge m + n \mid X \ge m\} = \frac{P\{X \ge m + n\}}{P\{X \ge m\}} = \frac{\bar F(m + n - 1)}{\bar F(m - 1)} = \frac{(1-p)^{m+n}}{(1-p)^m} = (1-p)^n = P\{X \ge n\},$$

so that the Geom(p) distribution has a memoryless property analogous to that of the exponential distribution. Indeed, the geometric distribution can be viewed as the discrete analog of the exponential distribution.
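The discrete memoryless property can be confirmed exactly from the cdf. The sketch below uses p = 1/5 and a small range of m and n purely for illustration; `geom_cdf` and `geom_tail` are hypothetical helper names.

```python
from fractions import Fraction


def geom_cdf(k, p):
    """F(k) = P{X <= k} = 1 - (1 - p)^(k + 1) for integer k >= -1."""
    return 1 - (1 - p) ** (k + 1)


def geom_tail(k, p):
    """P{X >= k} = 1 - F(k - 1) for integer k >= 0."""
    return 1 - geom_cdf(k - 1, p)


p = Fraction(1, 5)

# Check P{X >= m + n | X >= m} = P{X >= n} on a grid of m and n.
memoryless = all(
    geom_tail(m + n, p) / geom_tail(m, p) == geom_tail(n, p)
    for m in range(6)
    for n in range(6)
)
print(memoryless, geom_tail(3, p))
```

Working with `Fraction` keeps every tail probability exact, so the identity is tested with equality rather than a tolerance.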

7.9 Poisson Distribution

The Poisson distribution with parameter λ, abbreviated Poisson(λ), has support on the nonnegative integers and a pmf given by

$$p(k) = \frac{e^{-\lambda} \lambda^k}{k!}$$

for k = 0, 1, 2, ..., where λ > 0. The mean and variance are both equal to λ.