





























































What is statistics?
Examples:
What is “mathematical” statistics (as opposed to “regular” statistics)? The goal is to develop mathematical theory (theorems, approximations, etc.) for why statistical procedures work (or not)... this, in turn, leads to better procedures. The material that we’ll go over this semester mostly dates back to the early part of the last century. Since then, mathematical statistics has exploded, with $10^3$ people working on developing theory, and many (many) more using these techniques.
Statistical inference corresponds to an inverse problem: given data, we want to answer questions about the “true” underlying state of the world, e.g., the true parameter indexing the distribution that gave rise to our observed data. Estimation is about choosing between a continuum of possible parameters. We’ll apply ideas from decision theory to talk about how to decide between different “estimators” (that is, rules for estimating parameters from data). We’ll also talk about the concept of “sufficiency,” or “data reduction”: that is, how to decide which aspects of the data matter for inference, and which aspects can be safely ignored. The main focus, again, will be on how to think about estimation problems asymptotically:
Testing and estimation are two sides of the same coin. Whereas estimation was about choosing one parameter from many, testing is about dividing the parameters into two sets, then deciding between these groups. So testing could be considered a special case of estimation, but it’s special enough (and comes up often enough in practice) to warrant its own discussion and techniques.
We start with data. Someone (maybe you) has done an experiment, made observations of some stochastic process, and collected some data. These data take values in some “sample space” Ω: this is the set of all things that can possibly happen, no matter how crazy. We define an “event” as some set of points in sample space, i.e. a subset of what is possible. E.g.,
The “probability” of an event is a fairly intuitive concept. We can think of probability in at least two (non-exclusive) ways:
Let’s make this more mathematically precise. We say that a scalar function $P$ on sets (i.e., you plug in a given set $A \subset \Omega$, and you get a single number out) is a probability function if it satisfies three conditions:
positivity: $P(A) \geq 0$
normalization: $P(\Omega) = 1$
(sub-)additivity: $A_i \cap A_j = \emptyset \;\; \forall i \neq j \implies P(\cup_i A_i) = \sum_i P(A_i)$
These conditions have some important implications, which you should check for yourself:
$P(A^c) = 1 - P(A)$
$P(\emptyset) = 0$
[1] See HMC 1.1-1.3. The abbreviation “HMC” refers to the text we’re using, Introduction to Mathematical Statistics, by Hogg, McKean, and Craig, 6th edition.
The simplest case of this is the “flat” probability function: all events of equal “size” have equal P. (Of course, figuring out which events have equal “size” can be a matter of some subjectivity.) The usual examples of sample spaces equipped with flat probability functions:
The key point here is that computing the probability of any event you can think of just comes down to counting:
$$P(A) = \frac{\#\{\omega \in A\}}{\#\{\omega \in \Omega\}}$$
(Of course, life is not always fair: sometimes some equally-sized events happen more often than others, as we’ll see repeatedly...)
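To make the counting rule concrete, here is a minimal Python sketch. The choice of sample space (two fair six-sided dice) and the names `omega` and `flat_probability` are assumptions made for illustration; they do not come from the notes.

```python
from fractions import Fraction
from itertools import product

# Assumed sample space: all ordered outcomes of two fair six-sided dice (36 points).
omega = list(product(range(1, 7), repeat=2))

def flat_probability(event, sample_space):
    """P(A) = #{w in A} / #{w in Omega} under the flat probability function."""
    favorable = sum(1 for w in sample_space if event(w))
    return Fraction(favorable, len(sample_space))

# Event A: the two dice sum to 7.
p_sum_seven = flat_probability(lambda w: w[0] + w[1] == 7, omega)
print(p_sum_seven)  # 1/6
```

Using exact fractions instead of floats keeps the result a literal ratio of counts, which is the point of the formula.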
Example. Imagine a disease occurs with 0.1% frequency in the population. Now let’s say there’s a blood test that comes back positive with 99% probability if the disease is present (i.e., 99% correct), but also with 2% probability if it is not (i.e., a 2% false alarm rate). What is the conditional probability that someone has the disease, given a positive test result?
Let’s translate this into mathematical language. Event A = “disease positive.” Event B = “test positive.” We want $P(A|B)$, and we’ve been given $P(A)$, $P(B|A)$, and $P(B|A^c)$. We use Bayes:
$$P(A|B) = P(A \cap B)/P(B) = P(B|A)P(A)/P(B).$$
We just need
$$P(B) = P(B|A)P(A) + P(B|A^c)P(A^c) = 0.99 \cdot 0.001 + 0.02 \cdot 0.999 \approx 0.021.$$
So
$$P(A|B) = P(B|A)P(A)/P(B) = 0.99 \cdot 0.001 / 0.021 \approx 0.047,$$
i.e., less than 5%.
This answer seems surprisingly low given that the test is fairly reliable. The explanation of this possibly counterintuitive result is that the prior probability of the disease is so small that observing the data of the blood test only perturbs this probability slightly.
We’ll see another example of a possibly counterintuitive conditional probability in the homework. (Exercise 1: Mr. Hall’s doors, exercise 1.4.30 in HMC.)
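The arithmetic in the blood-test example can be checked with a few lines of Python. This is only a sketch; the function name `posterior` and its parameter names are hypothetical labels for the quantities P(A), P(B|A), and P(B|A^c) used above.

```python
def posterior(prior, p_pos_given_disease, p_pos_given_healthy):
    """Bayes' rule for a binary hypothesis: returns P(disease | positive test)."""
    # Total probability of a positive test, P(B).
    p_pos = p_pos_given_disease * prior + p_pos_given_healthy * (1.0 - prior)
    return p_pos_given_disease * prior / p_pos

# Numbers from the example: 0.1% prevalence, 99% hit rate, 2% false alarm rate.
p = posterior(prior=0.001, p_pos_given_disease=0.99, p_pos_given_healthy=0.02)
print(round(p, 3))  # 0.047
```

Raising the `prior` argument while holding the test characteristics fixed shows how strongly the answer depends on the base rate.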
Independence may be considered the single most important concept in probability theory, demarcating the latter from measure theory and fostering an independent development. In the course of this evolution, probability theory has been fortified by its links with the real world, and indeed the definition of independence is the abstract counterpart of a highly intuitive and empirical notion.
Chow and Teicher, 1978
Now for a concept which is again very intuitive, but turns out to have some incredibly deep implications. We say two events A and B are “independent” if seeing A tells us nothing about B, and vice versa. More mathematically, assume $P(A), P(B) > 0$. Then A and B are independent events if
$$P(A|B) = P(A);$$
thus,
$$P(A \cap B) = P(A|B)P(B) = P(A)P(B),$$
so therefore
$$P(B|A) = P(B),$$
too.
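As a small illustration, independence of two events can be checked by testing whether P(A ∩ B) equals P(A)P(B). The sketch below assumes a flat probability on two fair dice; the events and helper names are chosen for illustration and are not from the notes.

```python
from fractions import Fraction
from itertools import product

# Assumed sample space: two fair six-sided dice with the flat probability.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """Flat probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def independent(event_a, event_b):
    """True when P(A and B) == P(A) * P(B), using exact arithmetic."""
    p_joint = prob(lambda w: event_a(w) and event_b(w))
    return p_joint == prob(event_a) * prob(event_b)

def first_even(w):
    return w[0] % 2 == 0

def sum_is_7(w):
    return w[0] + w[1] == 7

def sum_is_8(w):
    return w[0] + w[1] == 8

print(independent(first_even, sum_is_7))  # True
print(independent(first_even, sum_is_8))  # False
```

The two results differ even though both events concern the sum of the dice, which is a reminder that independence is a property of the probabilities and not of how the events are described.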
Continuous r.v.’s, on the other hand, can take values in a continuum, and the probability of any specific outcome is taken to be zero. Examples:
The cumulative distribution function can be defined as above:
F (u) = P (X(ω) ≤ u).
Note that this is a monotonically increasing, continuous function. Also, we have the basic formula
P (a < X(ω) < b) = F (b) − F (a).
However, the pmf makes less sense, because for continuous r.v.’s, by definition, $P(X = u) = 0 \;\; \forall u$. Instead we’ll define the “probability density function,” or pdf, as any function $f(u) \geq 0$ such that
$$\int_{-\infty}^{\infty} f(u)\,du = 1$$
and
$$P(X \in A) = \int_A f(u)\,du.$$
A function of an r.v. is another r.v. (I.e., if $X(\omega)$ is an r.v., then so is $g(X(\omega))$, for any real function $g(\cdot)$.) In fact, as we will see soon, passing an r.v. through some function is a very convenient and general way to come up with other r.v.’s. The natural question is, if we know the distribution of X, how do we get the distribution of g(X)?
In the discrete case, this is pretty straightforward. Let’s say g sends X to one of m possible discrete values. (Note that g(X) must be discrete whenever X is.) To get $p_g(i)$, the pmf of g at the i-th bin, we just look at the pmf at any values of X that g mapped to i. More mathematically,
$$p_g(i) = \sum_{j \in g^{-1}(i)} p_X(j).$$
Here the “inverse image” $g^{-1}(A)$ is the set of all points x satisfying $g(x) \in A$.
Things are only slightly more subtle in the continuous case. We’ll talk about two methods: one based on cdf’s and one based on pdf’s. The cdf of g(X) is just the probability that g(X) is less than or equal to u. Therefore we just need to look at $P(X \in g^{-1}(\{y : y \leq u\}))$. This is often straightforward to do.
In the second case we deal with densities instead. Here there’s one slight twist that’s most easily explained if we think of a very simple scaling transformation: think about the density of $g(X) = cX$, where c is a constant and $X \sim p_X(x)$. Applying the logic of the discrete approach above, we might think that the density of $Y = g(X)$ would be
$$p_Y(y) = p_X(g^{-1}(y)) = p_X(c^{-1} y).$$
But this is wrong, as we see when we try to integrate $p_Y$, since
$$\int_{-\infty}^{\infty} p_Y(y)\,dy = |c|.$$
So the correct density is
$$p_Y(y) = \frac{1}{|c|}\, p_X(c^{-1} y).$$
[4] HMC 1.6.1, 1.7.
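The scaling argument can be checked numerically. The sketch below assumes a standard normal density for $p_X$ and the constant c = -2, both picked only for illustration, and uses a simple midpoint rule (the helper `integrate` is not a library function).

```python
import math

def p_x(x):
    """Standard normal density, used here only as an example p_X."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def integrate(f, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))

c = -2.0  # any nonzero constant; the sign is why |c| appears in the formula

# Naive guess p_X(y / c): integrates to |c| instead of 1.
naive = integrate(lambda y: p_x(y / c), -40.0, 40.0)

# Corrected density p_X(y / c) / |c|: integrates to 1.
corrected = integrate(lambda y: p_x(y / c) / abs(c), -40.0, 40.0)

print(naive, corrected)  # approximately 2.0 and 1.0
```

The integration range is truncated to [-40, 40], which is harmless here because the example density is negligible that far out.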
Note that we can talk about more than one r.v. at once. Obviously we can simultaneously define as many functions on the sample space as we want. The natural question is, how are these r.v.’s related? Can one r.v. tell us anything about another?
It’s helpful to define a couple new concepts here. First, the “joint distribution” of two r.v.’s X and Y is defined as
$$F(u_1, u_2) = P(X(\omega) \leq u_1 \cap Y(\omega) \leq u_2).$$
This is like a pairwise cdf. (It’s obvious how to generalize this to more than two r.v.’s at once.) We can also define joint pmf’s and pdf’s: for example, a joint pdf is a function $f(u_1, u_2) \geq 0$ such that
$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(u_1, u_2)\,du_1\,du_2 = 1$$
and
$$P((X, Y) \in A) = \iint_A f(u_1, u_2)\,du_1\,du_2.$$
Now we can talk about how much X tells us about Y, because the independence of sets has a simple analog in the independence of r.v.’s. We say X and Y are independent if their joint pdf can be written as a product function,
$$f(u_1, u_2) = f_1(u_1) f_2(u_2).$$
More generally (since not every r.v. has a nice pdf), X and Y are independent if
$$P(X \in A, Y \in B) = P(X \in A) P(Y \in B).$$
R.v.’s which are not independent are called (naturally enough) “dependent” — but note that this is a mathematical definition, and that mathematically dependent r.v.’s may in fact not be directly coupled in any mechanistic way. We can also talk about conditional probabilities related to r.v.’s: the conditional probability distribution of X given Y is
$$F(u|Y \in A) = P(X \leq u \,|\, Y \in A).$$
[5] HMC 2.1-2.
Again, if $F(u|Y \in A) \neq F(u)$
for some set A with positive probability $P(Y \in A)$, then Y tells us something about X. Conditional densities and pmf’s may be defined in the natural way from the relevant distribution functions (although sometimes we might run into problems defining conditional densities, since the process involves a limit that does not always exist).
We can also invert the process. Let’s say we’re handed the joint distribution of X and Y, but we really only care about X. How do we get P(X)? We have to use the summation rule for probability,
$$P(X) = \sum_i P(X \cap \{Y = i\}).$$
I.e., for continuous r.v.’s, we have
$$F(u) = \int_{-\infty}^{\infty} du_2 \int_{-\infty}^{u} f(u_1, u_2)\,du_1.$$
For reasons which are not entirely clear to me, this process of integrating over the r.v. we don’t care about is called “marginalization.” So F (u) is the “marginal” (or prior) distribution of X, to distinguish it from the conditional (or posterior) distribution given Y.
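For discrete r.v.’s, marginalization and conditioning are just sums over a table. Here is a minimal sketch; the joint pmf values and the function names are hypothetical, chosen so the numbers are easy to check by hand.

```python
from fractions import Fraction

# Hypothetical joint pmf P(X = x, Y = y), keyed by (x, y); the values sum to 1.
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 4),
}

def marginal_x(joint_pmf):
    """Marginal pmf of X: sum the joint pmf over all values of Y."""
    out = {}
    for (x, _y), p in joint_pmf.items():
        out[x] = out.get(x, Fraction(0)) + p
    return out

def conditional_x_given_y(joint_pmf, y):
    """Conditional pmf of X given Y = y: joint pmf divided by P(Y = y)."""
    p_y = sum(p for (_x, yy), p in joint_pmf.items() if yy == y)
    return {x: p / p_y for (x, yy), p in joint_pmf.items() if yy == y}

p_x = marginal_x(joint)
p_x_given_y1 = conditional_x_given_y(joint, 1)
print(p_x)           # {0: Fraction(1, 2), 1: Fraction(1, 2)}
print(p_x_given_y1)  # {0: Fraction(3, 5), 1: Fraction(2, 5)}
```

In this table the conditional pmf given Y = 1 differs from the marginal, so Y tells us something about X, i.e., the two are dependent.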
names.) It’s common to think of the mean as the “center of gravity” of a pdf. The variance, on the other hand, is defined to measure the “spread” of the pdf, or how variable the underlying r.v. is. It’s defined in terms of second moments:
$$V(f) = \int_{-\infty}^{\infty} f(u)\,(u - E(u))^2\,du = E_2(f) - E_1(f)^2.$$
Note that the variance is zero only if the r.v. is constant (i.e., not at all variable), and always nonnegative. Also,
$$V(cX) = c^2 V(X),$$
for any constant c. For this reason, people often use the “standard deviation” instead, that is,
$$\sigma(X) \equiv \sqrt{V(X)},$$
since $\sigma(cX) = |c|\,\sigma(X)$, and it is therefore slightly more useful as a measure of the “scale” of X.
We’ll mention a couple other important expectations soon. For now, it’s worth noting that the probability function associated with a random variable X, $P(X \in A)$, is itself a kind of expectation, if we define the random variable
$$I_A(X) = 1(X \in A);$$
then
$$P(X \in A) = E(I_A).$$
So we can go back and forth between probability and expectation as desired. One last thing worth remembering: expectations don’t always exist, or they can be infinite. We’ll see examples of this soon, but the reason is clear: sometimes the integral in the definition of the expectation is either infinite or fails to converge. This situation actually does come up in practice: for example, the “fat-tailed” distributions that people sometimes use in financial models often have infinite moments.
We can also define conditional expectations in the obvious way: the conditional expectation of X given Y is
$$E(X|Y) = \int_{-\infty}^{\infty} x\, p(X = x|Y)\,dx.$$
Note the important formula
$$E_Y(E_X(X|Y)) = E(X),$$
which follows by an interchange of expectations (as always, under the proviso that the relevant expectations exist). If we recall the link between probabilities and expectations, we can also define independence of r.v.’s in terms of expectations. For example, X and Y are independent if
E(f (X)g(Y )) = E(f (X))E(g(Y )),
for any functions f and g s.t. the relevant expectations exist. Using the same logic, X and Y are independent if
E(f (X)|Y ) = E(f (X)),
for any function f s.t. the relevant expectations exist. (Note that E(X|Y ) = E(X) is insufficient for independence. Exercise 5: Give an example of dependent X and Y such that E(X|Y ) = E(X).)
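The iterated-expectation formula $E_Y(E_X(X|Y)) = E(X)$ can be verified directly on a small discrete example. The joint pmf below is hypothetical and the function names are illustrative only.

```python
from fractions import Fraction

# Hypothetical joint pmf P(X = x, Y = y), keyed by (x, y).
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 4),
}

def expectation_x(joint_pmf):
    """E(X), computed directly from the joint pmf."""
    return sum(x * p for (x, _y), p in joint_pmf.items())

def conditional_expectation_x(joint_pmf, y):
    """E(X | Y = y) = sum_x x * P(X = x, Y = y) / P(Y = y)."""
    p_y = sum(p for (_x, yy), p in joint_pmf.items() if yy == y)
    return sum(x * p for (x, yy), p in joint_pmf.items() if yy == y) / p_y

def iterated_expectation(joint_pmf):
    """E_Y(E_X(X | Y)): average the conditional means with weights P(Y = y)."""
    total = Fraction(0)
    for y in {yy for (_x, yy) in joint_pmf}:
        p_y = sum(p for (_x, yy), p in joint_pmf.items() if yy == y)
        total += p_y * conditional_expectation_x(joint_pmf, y)
    return total

print(expectation_x(joint))         # 1/2
print(iterated_expectation(joint))  # 1/2
```

Here the conditional means are 2/3 (given Y = 0) and 2/5 (given Y = 1), yet their weighted average recovers E(X) exactly.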
Perhaps the most important transformation we’ll encounter is multidimensional: if $X \sim p_X(x)$ and $Y \sim p_Y(y)$ are independent, what is the distribution of X + Y? The easiest way to approach this is to think about the relevant cdf’s.
$$P(X + Y \leq u) = \iint_{X+Y \leq u} P(X, Y)\,dX\,dY$$
$$= \int_{-\infty}^{\infty} dX \int_{-\infty}^{u-X} P(X, Y)\,dY = \int_{-\infty}^{\infty} dY \int_{-\infty}^{u-Y} P(X, Y)\,dX$$
$$= \int_{-\infty}^{\infty} dx \int_{-\infty}^{u-x} p_X(x)\, p_Y(y)\,dy \quad \text{(by independence)}$$
Exercise 8: Use the second formula above to prove, finally, that E(X + Y) = E(X) + E(Y), even if X and Y are dependent. (If you don’t feel like using the above formula, prove this using any method you want.)
It’s worth noting that we can generalize this approach very easily, e.g., if we want to know the distribution of X · Y, or X/Y; the only things that change in the above formula are the sets we are integrating over. Here are some examples for practice:
Exercise 9: Assuming $X \sim p_X(x)$ and $Y \sim p_Y(y)$, with X and Y independent, what is the distribution of X · Y?
Exercise 10: What is the distribution of $\max_i X_i$, if N r.v.’s $X_i$ are drawn i.i.d. from $p_X(x)$?
Exercise 11: What is the distribution of $X_{(j)}$, the j-th smallest $X_i$? (This is called the j-th “order statistic” of the sample. Order statistics are helpful summaries of data; for example, the median (the N/2-th order statistic) or the range ($X_{(N)} - X_{(1)}$).)
Exercise 12: What is the distribution of the range?
If we differentiate the cdf of X + Y above w.r.t. u, we get the pdf:
$$p_{X+Y}(u) = \int_{-\infty}^{\infty} p_X(x)\, p_Y(u - x)\,dx.$$
This expression is common enough (not just in statistics, but also in physics, engineering, etc.) that it has its own name: we say h(u) is the “convolution” of functions f(u) and g(u) if
$$h(u) = f * g(u) \equiv \int_{-\infty}^{\infty} f(x)\, g(u - x)\,dx = \int_{-\infty}^{\infty} f(u - x)\, g(x)\,dx.$$
[8] HMC 1.
Note that convolution is a smoothing operation; roughness in f or g gets averaged over in h. In fact, you can prove that h is always at least as differentiable (in the sense of having k smooth derivatives) as the most differentiable of f and g. (Exercise 13: Prove this yourself.)
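The convolution formula has an exact discrete analog, with the integral replaced by a sum over the pmf’s. The sketch below uses that discrete version (an assumption made so the answer is exact) on two fair dice; `convolve_pmfs` is a hypothetical helper, not a library function.

```python
from fractions import Fraction

def convolve_pmfs(p, q):
    """pmf of X + Y for independent discrete X ~ p and Y ~ q.

    Accumulates p(x) * q(y) into the bin x + y, which is equivalent to
    p_{X+Y}(u) = sum_x p(x) * q(u - x).
    """
    out = {}
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] = out.get(x + y, Fraction(0)) + px * qy
    return out

# Assumed example: a fair six-sided die.
die = {k: Fraction(1, 6) for k in range(1, 7)}
two_dice = convolve_pmfs(die, die)

print(two_dice[7])  # 1/6
print(two_dice[2])  # 1/36
```

The flat pmf of one die becomes a triangular pmf for the sum, a small instance of the smoothing effect described above.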
Here we’re going to introduce a trick that turns out to be more helpful than it would seem to have any right to be. Let’s look at the expectation of exponentials of a r.v. X:
$$M_X(s) = E(e^{sX}) = \int_{-\infty}^{\infty} e^{su}\, p_X(u)\,du.$$
Or for multiple r.v.’s simultaneously,
$$M_{X_1, X_2}(s_1, s_2) = E(e^{s_1 X_1 + s_2 X_2}) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{s_1 u_1 + s_2 u_2}\, p_{X_1, X_2}(u_1, u_2)\,du_1\,du_2.$$
These expectations are called “moment generating functions,” abbreviated mgf’s. Why? Because
$$\left. \frac{\partial^i}{\partial s^i} M_X(s) \right|_{s=0} = E_i(X),$$
which is a handy trick if the distribution $p_X$ is complicated but $M_X$ is simple. (Often derivatives are easier to compute than integrals.) A similar trick works in higher dimensions:
Exercise 14: How would you compute the covariance of X and Y in terms of $M_{X,Y}$?
Another nice feature of the mgf stems from the fact that 1) it’s easy to deal with products of exponentials, and 2) independent r.v.’s have product distributions. This, in turn, means that
$$M_{X_1, X_2, \ldots}(s_1, s_2, \ldots) = \prod_i M_{X_i}(s_i)$$
if and only if the r.v.’s Xi are independent. This immediately tells us something interesting about convolutions and the distribution of sums of independent r.v.’s:
$$M_{X+Y}(s) = E(e^{s(X+Y)}) = E(e^{sX + sY}) = E(e^{sX})\,E(e^{sY}) = M_X(s)\,M_Y(s).$$
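Both mgf properties can be checked numerically on a discrete example. The sketch below assumes a fair six-sided die, evaluates the product rule at an arbitrary point s = 0.3, and recovers the first moment with a central finite difference at s = 0; all names and numerical settings are illustrative choices.

```python
import math

def mgf(pmf, s):
    """M_X(s) = E(exp(s X)) for a discrete pmf given as {value: probability}."""
    return sum(p * math.exp(s * x) for x, p in pmf.items())

# Assumed example: a fair six-sided die.
die = {k: 1.0 / 6.0 for k in range(1, 7)}

# pmf of the sum of two independent dice, built by direct enumeration.
two_dice = {}
for x, px in die.items():
    for y, py in die.items():
        two_dice[x + y] = two_dice.get(x + y, 0.0) + px * py

# Product rule: M_{X+Y}(s) should equal M_X(s) * M_Y(s) for independent X and Y.
s = 0.3
lhs = mgf(two_dice, s)
rhs = mgf(die, s) * mgf(die, s)

# First moment from the derivative of the mgf at s = 0 (central difference).
h = 1e-5
mean_estimate = (mgf(die, h) - mgf(die, -h)) / (2.0 * h)

print(lhs, rhs)       # the two values agree up to rounding error
print(mean_estimate)  # approximately 3.5, the mean of a fair die
```

The finite difference stands in for the analytic derivative only to keep the example short; when a closed form for the mgf is available, differentiating it by hand is the intended use.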