Statistics Introduction - Essay - Mathematics, Essays (high school) of Mathematics

Yale University Mathematics

What is “mathematical” statistics (as opposed to “regular” statistics)? The goal is to develop mathematical theory (theorems, approximations, etc.) for why statistical procedures work (or not)... this, in turn, leads to better procedures

Typology: Essays (high school)

2011/2012

Uploaded on 03/05/2012

tarquin 🇺🇸

Paninski, Intro. Math. Stats., October 5, 2005


Introduction

What is statistics?

inference from data
building models of the world
optimal prediction

Examples:

mind-reading
beating the stock market
deciding whether a coin is fair or not

What is “mathematical” statistics (as opposed to “regular” statistics)? The goal is to develop mathematical theory (theorems, approximations, etc.) for why statistical procedures work (or not); this, in turn, leads to better procedures. The material that we’ll go over this semester mostly dates back to the early part of the last century. Since then, mathematical statistics has exploded, with $10^3$ people working on developing theory, and many (many) more using these techniques.

Outline

1. Probability basics: Fundamental idea: probability distributions, parameterized by some finite number of parameters, serve as good models for observed data. We’ll introduce a small zoo of useful probability distributions, and talk about some useful limit theorems, the jewels of probability theory, which permit us to make some extremely useful asymptotic simplifications of the theory, in the limit of lots of data.

2. Decision theory talks about how to behave optimally under uncertainty; this will provide us with a framework for deciding which statistical procedures are “best,” or at least better than others.

3. Parameter estimation: Statistical inference corresponds to an inverse problem: given data, we want to answer questions about the “true” underlying state of the world, e.g., the true parameter indexing the distribution that gave rise to our observed data. Estimation is about choosing between a continuum of possible parameters. We’ll apply ideas from decision theory to talk about how to decide between different “estimators” (that is, rules for estimating parameters from data). We’ll also talk about the concept of “sufficiency,” or “data reduction”: that is, how to decide which aspects of the data matter for inference, and which aspects can be safely ignored. The main focus, again, will be on how to think about estimation problems asymptotically:

How do we decide if an estimator is “consistent,” that is, the estimator gives you the right answer if you observe enough data?
How do we decide how “efficient” an estimator is? Is there a natural sense of optimal efficiency?

4. Hypothesis testing (aka classification): Testing and estimation are two sides of the same coin. Whereas estimation was about choosing one parameter from many, testing is about dividing the parameters into two sets, then deciding between these groups. So testing could be considered a special case of estimation, but it’s special enough (and comes up often enough in practice) to warrant its own discussion and techniques.

Basics: sample space, events, etc.^1

We start with data. Someone (maybe you) has done an experiment, made observations of some stochastic process, and collected some data. These data take values in some “sample space” Ω: this is the set of all things that can possibly happen, no matter how crazy. We define an “event” as some set of points in sample space, i.e. a subset of what is possible. E.g.,

this die comes up 6
this die comes up even
I win the lottery tomorrow
the sky turns green a year from now

The “probability” of an event is a fairly intuitive concept. We can think of probability in at least two (non-exclusive) ways:

the average frequency of the event occurring
one’s belief that the event will happen

Let’s make this more mathematically precise. We say a scalar function $P$ on sets (i.e., you plug in a given set $A \subset \Omega$, and you get a single number out) is a probability function if it satisfies three conditions:

positivity: $P(A) \ge 0$

normalization: $P(\Omega) = 1$

(sub-)additivity: $A_i \cap A_j = \emptyset \ \ \forall i \ne j \implies P(\cup_i A_i) = \sum_i P(A_i)$

These conditions have some important implications, which you should check for yourself:

$P(A^c) = 1 - P(A)$

$P(\emptyset) = 0$

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

^1 See HMC 1.1-1.3. The abbreviation “HMC” refers to the text we’re using, Introduction to Mathematical Statistics, by Hogg, McKean, and Craig, 6th edition.

The simplest case of this is the “flat” probability function: all events of equal “size” have equal P. (Of course, figuring out which events have equal “size” can be a matter of some subjectivity.) The usual examples of sample spaces equipped with flat probability functions:

fair coin flips
fair spin of the roulette wheel
etc.

The key point here is that computing the probability of any event you can think of just comes down to counting:

$$P(A) = \frac{\#\{\omega \in A\}}{\#\{\omega \in \Omega\}}$$

(Of course, life is not always fair: sometimes some equally-sized events happen more often than others, as we’ll see repeatedly...)
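The counting rule can be tried out directly on a small sample space. The sketch below is only an illustration: the choice of two fair six-sided dice and the helper name `prob` are assumptions made for this example, not part of the notes. It enumerates Ω, counts outcomes, and checks the inclusion-exclusion identity for $P(A \cup B)$.

```python
from fractions import Fraction
from itertools import product

# Sample space (illustrative assumption): ordered outcomes of two fair dice.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """Flat probability: P(A) = #{w in A} / #{w in Omega}."""
    favorable = [w for w in omega if event(w)]
    return Fraction(len(favorable), len(omega))

def sum_is_seven(w):
    return w[0] + w[1] == 7

def first_is_even(w):
    return w[0] % 2 == 0

p_a = prob(sum_is_seven)
p_b = prob(first_is_even)
p_a_or_b = prob(lambda w: sum_is_seven(w) or first_is_even(w))
p_a_and_b = prob(lambda w: sum_is_seven(w) and first_is_even(w))

print("P(A) =", p_a, " P(B) =", p_b)
print("P(A or B) =", p_a_or_b, " P(A and B) =", p_a_and_b)
print("inclusion-exclusion holds:", p_a_or_b == p_a + p_b - p_a_and_b)
```

Exact fractions are used so that the identities hold with `==` rather than up to rounding.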

Example. Imagine a disease occurs with 0.1% frequency in the population. Now let’s say there’s a blood test that comes back positive with 99% probability if the disease is present (i.e., 99% correct), but also comes back positive with 2% probability if the disease is absent (i.e., 2% false alarm rate). What is the conditional probability that someone has the disease, given a positive test result?

Let’s translate this into mathematical language. Event A = “disease positive.” Event B = “test positive.” We want $P(A|B)$, and we’ve been given $P(A)$, $P(B|A)$, and $P(B|A^c)$. We use Bayes:

$$P(A|B) = P(A \cap B)/P(B) = P(B|A)P(A)/P(B).$$

We just need

$$P(B) = P(B|A)P(A) + P(B|A^c)P(A^c) = 0.99 \cdot 0.001 + 0.02 \cdot 0.999 \approx 0.021.$$

So

$$P(A|B) = P(B|A)P(A)/P(B) \approx 0.99 \cdot 0.001 / 0.021 \approx 0.047,$$

i.e., less than 5%.

This answer seems surprisingly low given that the test is fairly reliable. The explanation of this possibly counterintuitive result is that the prior probability of the disease is so small that observing the data of the blood test only perturbs this probability slightly.

We’ll see another example of a possibly counterintuitive conditional probability in the homework. (Exercise 1: Mr. Hall’s doors, exercise 1.4.30 in HMC.)
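The arithmetic in the blood-test example is easy to reproduce. The following is a minimal sketch; the function name `bayes_posterior` and its argument names are made up for this illustration, while the three input numbers come from the example above.

```python
def bayes_posterior(prior, p_pos_given_disease, p_pos_given_healthy):
    """Return (P(B), P(A|B)) for A = disease present, B = test positive."""
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)
    p_pos = p_pos_given_disease * prior + p_pos_given_healthy * (1.0 - prior)
    # Bayes: P(A|B) = P(B|A)P(A) / P(B)
    return p_pos, p_pos_given_disease * prior / p_pos

p_pos, post = bayes_posterior(prior=0.001,
                              p_pos_given_disease=0.99,
                              p_pos_given_healthy=0.02)
print("P(B)   =", round(p_pos, 3))
print("P(A|B) =", round(post, 3))
```

Re-running with a larger prior (say 0.1) shows how strongly the posterior depends on the base rate of the disease.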

Independence

Independence may be considered the single most important concept in probability theory, demarcating the latter from measure theory and fostering an independent development. In the course of this evolution, probability theory has been fortified by its links with the real world, and indeed the definition of independence is the abstract counterpart of a highly intuitive and empirical notion.

Chow and Teicher, 1978

Now for a concept which is again very intuitive, but turns out to have some incredibly deep implications. We say two events A and B are “independent” if seeing A tells us nothing about B, and vice versa. More mathematically, assume P (A), P (B) > 0. Then A and B are independent events if

$$P(A|B) = P(A);$$

thus,

$$P(A|B)P(B) = P(A \cap B) = P(B|A)P(A) = P(A)P(B),$$

so therefore

$$P(B|A) = P(B),$$

too.
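The product form $P(A \cap B) = P(A)P(B)$ gives a direct test for independence on a finite sample space. The sketch below assumes two fair dice as the sample space (an illustrative choice, not from the notes) and compares one independent pair of events with one dependent pair; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Illustrative sample space: ordered outcomes of two fair six-sided dice.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """Flat probability by counting."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def independent(a, b):
    """Check P(A and B) == P(A) P(B)."""
    return prob(lambda w: a(w) and b(w)) == prob(a) * prob(b)

def first_even(w):
    return w[0] % 2 == 0

def sum_is_7(w):
    return w[0] + w[1] == 7

def sum_is_8(w):
    return w[0] + w[1] == 8

indep_7 = independent(first_even, sum_is_7)
indep_8 = independent(first_even, sum_is_8)
print("first die even vs. sum == 7 independent:", indep_7)
print("first die even vs. sum == 8 independent:", indep_8)

# Equivalent conditional form: P(B|A) = P(A and B) / P(A)
p_b_given_a = prob(lambda w: first_even(w) and sum_is_7(w)) / prob(first_even)
print("P(sum == 7 | first even) =", p_b_given_a, " P(sum == 7) =", prob(sum_is_7))
```

The second pair fails the test because a sum of 8 rules out a first roll of 1, so knowing the sum does change the odds that the first die is even.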

Continuous r.v.’s, on the other hand, can take values in a continuum, and the probability of any specific outcome is taken to be zero. Examples:

the angle at which a merry-go-round comes to rest
the interval of time between the arrival of two buses
the temperature at 3 pm tomorrow

The cumulative distribution function can be defined as above:

F (u) = P (X(ω) ≤ u).

Note that this is a monotonically nondecreasing, continuous function. Also, we have the basic formula

P (a < X(ω) < b) = F (b) − F (a).

However, the pmf makes less sense, because for continuous r.v.’s, by definition, $P(X = u) = 0 \ \ \forall u$. Instead we’ll define the “probability density function,” or pdf, as any function $f(u) \ge 0$ such that

$$\int_{-\infty}^{\infty} f(u)\,du = 1$$

and

$$P(X \in A) = \int_A f(u)\,du.$$
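As a numerical sanity check of how the cdf and pdf fit together, the sketch below uses an exponential density as a stand-in for the waiting time between two buses. The exponential form, the value of `RATE`, and the midpoint-rule helper `integrate` are all assumptions chosen for illustration; the notes do not specify a distribution here.

```python
import math

RATE = 0.5  # hypothetical rate parameter: mean waiting time of 2 time units

def pdf(u):
    """Exponential density f(u) = RATE * exp(-RATE * u) for u >= 0."""
    return RATE * math.exp(-RATE * u) if u >= 0 else 0.0

def cdf(u):
    """Matching cdf F(u) = 1 - exp(-RATE * u) for u >= 0."""
    return 1.0 - math.exp(-RATE * u) if u >= 0 else 0.0

def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

a, b = 1.0, 3.0
by_cdf = cdf(b) - cdf(a)             # P(a < X < b) = F(b) - F(a)
by_pdf = integrate(pdf, a, b)        # P(X in (a, b)) = integral of f over (a, b)
total_mass = integrate(pdf, 0.0, 60.0)  # should be very close to 1

print(by_cdf, by_pdf, total_mass)
```

The two routes to $P(a < X < b)$ agree up to the discretization error of the midpoint rule.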

Transformations of random variables^4

A function of an r.v. is another r.v. (I.e., if $X(\omega)$ is an r.v., then so is $g(X(\omega))$, for any real function $g(\cdot)$.) In fact, as we will see soon, passing an r.v. through some function is a very convenient and general way to come up with other r.v.’s. The natural question is, if we know the distribution of X, how do we get the distribution of g(X)?

In the discrete case, this is pretty straightforward. Let’s say g sends X to one of m possible discrete values. (Note that g(X) must be discrete whenever X is.) To get $p_g(i)$, the pmf of g(X) at the i-th bin, we just sum the pmf of X over all values that g maps to i. More mathematically,

$$p_g(i) = \sum_{j \in g^{-1}(i)} p_X(j).$$

Here the “inverse image” $g^{-1}(A)$ is the set of all points $x$ satisfying $g(x) \in A$.

Things are only slightly more subtle in the continuous case. We’ll talk about two methods: one based on cdf’s and one based on pdf’s. The cdf of g(X) is just the probability that g(X) is less than or equal to u. Therefore we just need to look at $P(X \in g^{-1}(\{y : y \le u\}))$. This is often straightforward to do.

In the second case we deal with densities instead. Here there’s one slight twist that’s most easily explained if we think of a very simple scaling transformation: think about the density of $g(X) = cX$, where $c$ is a constant and $X \sim p_X(x)$. Applying the logic of the discrete approach above, we might think that the density of $Y = g(X)$ would be

$$p_Y(Y) = p_X(g^{-1}(Y)) = p_X(c^{-1}Y).$$

But this is wrong, as we see when we try to integrate $p_Y$, since

$$\int_{-\infty}^{\infty} p_Y(y)\,dy = |c|.$$

So the correct density is

$$p_Y(Y) = \frac{1}{|c|}\, p_X(c^{-1}Y).$$

^4 HMC 1.6.1, 1.7.
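The need for the $1/|c|$ factor can be seen numerically. The sketch below takes $p_X$ to be a standard normal density and $c = -2$; both are assumptions picked for illustration, as is the midpoint-rule helper. It integrates the naive and the corrected candidate densities for $Y = cX$.

```python
import math

def p_x(x):
    """Standard normal density (an illustrative choice for p_X)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def integrate(f, a, b, n=40_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

c = -2.0  # hypothetical scaling constant; any nonzero value works

def naive_density(y):
    # Discrete-style guess: p_X(c^{-1} y), without the Jacobian factor.
    return p_x(y / c)

def correct_density(y):
    # Corrected density: (1/|c|) p_X(c^{-1} y).
    return p_x(y / c) / abs(c)

naive_mass = integrate(naive_density, -40.0, 40.0)
correct_mass = integrate(correct_density, -40.0, 40.0)
print("naive integrates to   ", naive_mass)    # close to |c|
print("correct integrates to ", correct_mass)  # close to 1
```

The integration range of ±40 is wide enough that the truncated tails are negligible for this choice of $p_X$ and $c$.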

Joint, marginal, and conditional distributions^5

Note that we can talk about more than one r.v. at once. Obviously we can simultaneously define as many functions on the sample space as we want. The natural question is, how are these r.v.’s related? Can one r.v. tell us anything about another?

It’s helpful to define a couple new concepts here. First, the “joint distribution” of two r.v.’s X and Y is defined as

$$F(u_1, u_2) = P(X(\omega) \le u_1 \cap Y(\omega) \le u_2).$$

This is like a pairwise cdf. (It’s obvious how to generalize this to more than two r.v.’s at once.) We can also define joint pmf’s and pdf’s: for example, a joint pdf is a function $f(u_1, u_2) \ge 0$ such that

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(u_1, u_2)\,du_1\,du_2 = 1$$

and

$$P(\{X, Y\} \in A) = \iint_A f(u_1, u_2)\,du_1\,du_2.$$

Now we can talk about how much X tells us about Y, because the independence of sets has a simple analog in the independence of r.v.’s. We say X and Y are independent if their joint pdf can be written as a product function,

$$f(u_1, u_2) = f_1(u_1) f_2(u_2).$$

More generally (since not every r.v. has a nice pdf), X and Y are independent if

$$P(X \in A, Y \in B) = P(X \in A)P(Y \in B).$$

R.v.’s which are not independent are called (naturally enough) “dependent” — but note that this is a mathematical definition, and that mathematically dependent r.v.’s may in fact not be directly coupled in any mechanistic way. We can also talk about conditional probabilities related to r.v.’s: the conditional probability distribution of X given Y is

$$F(u | Y \in A) = P(X \le u | Y \in A).$$

Again, if

$$F(u | Y \in A) \ne F(u)$$

for some set A with positive probability $P(Y \in A)$, then Y tells us something about X. Conditional densities and pmf’s may be defined in the natural way from the relevant distribution functions (although sometimes we might run into problems defining conditional densities, since the process involves a limit that does not always exist).

We can also invert the process. Let’s say we’re handed the joint distribution of X and Y, but we really only care about X. How do we get $P(X)$? We have to use the summation rule for probability,

$$P(X) = \sum_i P(X \cap \{Y = i\}).$$

For continuous r.v.’s, the sum becomes an integral:

$$F(u) = \int_{-\infty}^{\infty} du_2 \int_{-\infty}^{u} f(u_1, u_2)\,du_1.$$

For reasons which are not entirely clear to me, this process of integrating over the r.v. we don’t care about is called “marginalization.” So F(u) is the “marginal” (or prior) distribution of X, to distinguish it from the conditional (or posterior) distribution given Y.

^5 HMC 2.1-2.
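In the discrete case, marginalization is just a sum over the variable we don’t care about. The sketch below uses a small made-up joint pmf on {0, 1} × {0, 1} (the numbers and the helper name `marginal` are hypothetical) to compute both marginals and to test the product-form definition of independence.

```python
from fractions import Fraction as F

# Hypothetical joint pmf P(X = x, Y = y), keyed by (x, y); entries sum to 1.
joint = {
    (0, 0): F(1, 8), (0, 1): F(3, 8),
    (1, 0): F(2, 8), (1, 1): F(2, 8),
}

def marginal(joint_pmf, axis):
    """Sum the joint pmf over the other variable (axis 0 = X, axis 1 = Y)."""
    out = {}
    for outcome, p in joint_pmf.items():
        out[outcome[axis]] = out.get(outcome[axis], 0) + p
    return out

p_x = marginal(joint, 0)
p_y = marginal(joint, 1)

# Independence would require joint(x, y) == p_x(x) * p_y(y) for every pair.
is_independent = all(joint[(x, y)] == p_x[x] * p_y[y] for (x, y) in joint)

# Conditional pmf of X given Y = 1, for comparison with the marginal of X.
p_x_given_y1 = {x: joint[(x, 1)] / p_y[1] for x in p_x}

print("marginal of X:", p_x)
print("marginal of Y:", p_y)
print("independent:", is_independent)
print("P(X | Y = 1):", p_x_given_y1)
```

Here the conditional pmf of X given Y = 1 differs from the marginal of X, which is another way of seeing that these X and Y are dependent.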

names.) It’s common to think of the mean as the “center of gravity” of a pdf. The variance, on the other hand, is defined to measure the “spread” of the pdf, or how variable the underlying r.v. is. It’s defined in terms of second moments:

$$V(f) = \int_{-\infty}^{\infty} f(u)\,(u - E(u))^2\,du = E_2(f) - E_1(f)^2.$$

Note that the variance is zero only if the r.v. is constant (i.e., not at all variable), and always nonnegative. Also,

$$V(cX) = c^2 V(X),$$

for any constant c. For this reason, people often use the “standard deviation” instead, that is,

$$\sigma(X) \equiv \sqrt{V(X)},$$

since σ(cX) = |c|σ(X) and is therefore slightly more useful as a measure of the “scale” of X. We’ll mention a couple other important expectations soon. For now, it’s worth noting that the probability function associated with a random variable X, P (X ∈ A), is itself a kind of expectation, if we define the random variable

$$I_A(X) = 1(X \in A);$$

then

$$P(X \in A) = E(I_A).$$

So we can go back and forth between probability and expectation as desired. One last thing worth remembering: expectations don’t always exist, or they can be infinite. We’ll see examples of this soon, but the reason is clear: sometimes the integral in the definition of the expectation is either infinite or fails to converge. This situation actually does come up in practice: for example, the “fat-tailed” distributions that people sometimes use in financial models often have infinite moments.
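These identities can be checked exactly on a discrete example, with sums in place of the integrals. The sketch below assumes a fair six-sided die for X and $c = -3$; both choices, and the helper names `expect` and `variance`, are illustrative only.

```python
from fractions import Fraction

# Illustrative pmf: a fair six-sided die.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def expect(g, dist):
    """E(g(X)) = sum over x of g(x) * P(X = x)."""
    return sum(g(x) * p for x, p in dist.items())

def variance(dist):
    """V(X) = E(X^2) - E(X)^2, i.e. second moment minus squared first moment."""
    mean = expect(lambda x: x, dist)
    return expect(lambda x: x * x, dist) - mean ** 2

c = -3  # hypothetical scaling constant
scaled = {c * x: p for x, p in pmf.items()}  # pmf of cX

var_x = variance(pmf)
var_cx = variance(scaled)
print("V(X) =", var_x, " V(cX) =", var_cx, " c^2 V(X) =", c * c * var_x)

# Probability as an expectation: P(X in A) = E(I_A) with A = {even faces}.
p_even = expect(lambda x: 1 if x % 2 == 0 else 0, pmf)
print("P(X even) = E(I_A) =", p_even)
```

Since the variance scales by $c^2$, the standard deviation scales by $|c|$, which is the property the text singles out.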

Conditional expectations and independence

We can also define conditional expectations in the obvious way: the conditional expectation of X given Y is

$$E(X|Y) = \int_{-\infty}^{\infty} x\,p(X = x|Y)\,dx.$$

Note the important formula

$$E_Y(E_X(X|Y)) = E(X),$$

which follows by an interchange of expectations (as always, under the proviso that the relevant expectations exist). If we recall the link between probabilities and expectations, we can also define independence of r.v.’s in terms of expectations. For example, X and Y are independent if

E(f (X)g(Y )) = E(f (X))E(g(Y )),

for any functions f and g s.t. the relevant expectations exist. Using the same logic, X and Y are independent if

E(f (X)|Y ) = E(f (X)),

for any function f s.t. the relevant expectations exist. (Note that E(X|Y ) = E(X) is insufficient for independence. Exercise 5: Give an example of dependent X and Y such that E(X|Y ) = E(X).)
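The formula $E_Y(E_X(X|Y)) = E(X)$ can be verified by hand on a small discrete joint pmf. The numbers below are made up for illustration, and the helper names are hypothetical; sums replace the integrals of the continuous definition.

```python
from fractions import Fraction as F

# Hypothetical joint pmf P(X = x, Y = y), keyed by (x, y).
joint = {
    (0, 0): F(1, 8), (0, 1): F(3, 8),
    (1, 0): F(2, 8), (1, 1): F(2, 8),
}
y_values = sorted({y for (_, y) in joint})

def p_y(y):
    """Marginal P(Y = y)."""
    return sum(p for (_, yy), p in joint.items() if yy == y)

def cond_expect_x(y):
    """E(X | Y = y) = sum over x of x * P(X = x, Y = y) / P(Y = y)."""
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y(y)

e_x = sum(x * p for (x, _), p in joint.items())           # E(X) directly
tower = sum(p_y(y) * cond_expect_x(y) for y in y_values)  # E_Y(E(X|Y))

for y in y_values:
    print("E(X | Y = %d) =" % y, cond_expect_x(y))
print("E(X) =", e_x, " E_Y(E(X|Y)) =", tower)
```

In this example $E(X|Y)$ varies with Y, so X and Y are dependent, yet averaging the conditional expectations over Y still recovers $E(X)$.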

Moment-generating functions; convolution^8

Perhaps the most important transformation we’ll encounter is multidimensional: if $X \sim p_X(x)$ and $Y \sim p_Y(y)$ are independent, what is the distribution of $X + Y$? The easiest way to approach this is to think about the relevant cdf’s.

$$
\begin{aligned}
P(X + Y \le u) &= \iint_{X+Y \le u} P(X, Y)\,dX\,dY \\
&= \int_{-\infty}^{\infty} dX \int_{-\infty}^{u-X} P(X, Y)\,dY = \int_{-\infty}^{\infty} dY \int_{-\infty}^{u-Y} P(X, Y)\,dX \\
&= \int_{-\infty}^{\infty} dx \int_{-\infty}^{u-x} p_X(x)\,p_Y(y)\,dy \quad \text{(by independence)}
\end{aligned}
$$

Exercise 8: Use the second formula above to prove, finally, that $E(X + Y) = E(X) + E(Y)$, even if X and Y are dependent. (If you don’t feel like using the above formula, prove this using any method you want.)

It’s worth noting that we can generalize this approach very easily, e.g., if we want to know the distribution of $X \cdot Y$, or $X/Y$; the only things that change in the above formula are the sets we are integrating over. Here are some examples for practice:

Exercise 9: Assuming $X \sim p_X(x)$ and $Y \sim p_Y(y)$, with X and Y independent, what is the distribution of $X \cdot Y$?

Exercise 10: What is the distribution of $\max_i X_i$, if N r.v.’s $X_i$ are drawn i.i.d. from $p_X(x)$?

Exercise 11: What is the distribution of $X_{(j)}$, the j-th smallest $X_i$? (This is called the j-th “order statistic” of the sample. Order statistics are helpful summaries of data; for example, the median (the N/2-th order statistic) or the range, $X_{(N)} - X_{(1)}$.)

Exercise 12: What is the distribution of the range?

If we differentiate the cdf of $X + Y$ above w.r.t. u, we get the pdf:

$$p_{X+Y}(u) = \int_{-\infty}^{\infty} p_X(x)\,p_Y(u - x)\,dx.$$

This expression is common enough (not just in statistics, but also in physics, engineering, etc.) that it has its own name: we say h(u) is the “convolution” of functions f(u) and g(u) if

$$h(u) = f * g(u) \equiv \int_{-\infty}^{\infty} f(x)\,g(u - x)\,dx = \int_{-\infty}^{\infty} f(u - x)\,g(x)\,dx.$$

^8 HMC 1.

Note that convolution is a smoothing operation; roughness in f or g gets averaged over in h. In fact, you can prove that h is always at least as differentiable (in the sense of having k smooth derivatives) as the most differentiable of f and g. (Exercise 13: Prove this yourself.)
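The discrete analogue of the convolution formula replaces the integral with a sum over pmf values. As a hedged illustration (the fair-die pmf and the helper name `convolve` are assumptions for this example), the sketch below computes the pmf of the sum of two independent dice.

```python
from fractions import Fraction

def convolve(f, g):
    """Discrete convolution: h(u) = sum over x of f(x) * g(u - x)."""
    h = {}
    for x, px in f.items():
        for y, py in g.items():
            # Each pair (x, y) contributes f(x) g(y) to the total u = x + y.
            h[x + y] = h.get(x + y, 0) + px * py
    return h

die = {k: Fraction(1, 6) for k in range(1, 7)}  # illustrative pmf
two_dice = convolve(die, die)

for total in sorted(two_dice):
    print(total, two_dice[total])
```

The flat pmf of one die turns into a triangular pmf for the sum, a small instance of the smoothing effect described above.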

Moment-generating functions

Here we’re going to introduce a trick that turns out to be more helpful than it would seem to have any right to be. Let’s look at the expectation of exponentials of a r.v. X:

$$M_X(s) = E(e^{sX}) = \int_{-\infty}^{\infty} e^{su}\,p_X(u)\,du.$$

Or for multiple r.v.’s simultaneously,

$$M_{X_1,X_2}(s_1, s_2) = E(e^{s_1 X_1 + s_2 X_2}) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{s_1 u_1 + s_2 u_2}\,p_{X_1,X_2}(u_1, u_2)\,du_1\,du_2.$$

These expectations are called “moment generating functions,” abbreviated mgf’s. Why? Because

$$\left.\frac{\partial^i}{\partial s^i} M_X(s)\right|_{s=0} = E_i(X),$$

which is a handy trick if the distribution $p_X$ is complicated but $M_X$ is simple. (Often derivatives are easier to compute than integrals.) A similar trick works in higher dimensions:

Exercise 14: How would you compute the covariance of X and Y in terms of $M_{X,Y}$?

Another nice feature of the mgf stems from the fact that 1) it’s easy to deal with products of exponentials, and 2) independent r.v.’s have product distributions. This, in turn, means that

$$M_{X_1,X_2,\ldots}(s_1, s_2, \ldots) = \prod_i M_{X_i}(s_i)$$

if and only if the r.v.’s $X_i$ are independent. This immediately tells us something interesting about convolutions and the distribution of sums of independent r.v.’s:

$$M_{X+Y}(s) = E(e^{s(X+Y)}) = E(e^{sX + sY}) = E(e^{sX})E(e^{sY}) = M_X(s)M_Y(s).$$
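Both mgf properties can be checked numerically on a discrete example. The sketch below assumes fair dice, an arbitrary evaluation point `s = 0.3`, and a central finite difference in place of the exact derivative; the helper names are hypothetical. It compares $M_{X+Y}(s)$ with $M_X(s)M_Y(s)$ and recovers the first moment from the slope of $M_X$ at $s = 0$.

```python
import math

die = {k: 1.0 / 6.0 for k in range(1, 7)}  # illustrative pmf

def mgf(pmf, s):
    """M_X(s) = E(exp(s X)) for a discrete pmf."""
    return sum(p * math.exp(s * x) for x, p in pmf.items())

def pmf_of_sum(f, g):
    """pmf of X + Y for independent X ~ f and Y ~ g (discrete convolution)."""
    h = {}
    for x, px in f.items():
        for y, py in g.items():
            h[x + y] = h.get(x + y, 0.0) + px * py
    return h

two_dice = pmf_of_sum(die, die)

s = 0.3  # arbitrary evaluation point
lhs = mgf(two_dice, s)           # M_{X+Y}(s)
rhs = mgf(die, s) * mgf(die, s)  # M_X(s) M_Y(s)
print("M_{X+Y}(s) =", lhs, " M_X(s) M_Y(s) =", rhs)

# First moment from the mgf: central-difference estimate of dM/ds at s = 0.
h = 1e-5
first_moment = (mgf(die, h) - mgf(die, -h)) / (2.0 * h)
mean = sum(x * p for x, p in die.items())
print("dM/ds at 0 ~", first_moment, " E(X) =", mean)
```

Note also that $M_X(0) = 1$ for any distribution, since it is just the total probability.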
