Basic Probability Reference Sheet, Study notes of Linear Algebra

University of California - Berkeley Linear Algebra

Typology: Study notes

2022/2023

Uploaded on 02/28/2023

ekaashaah 🇺🇸

February 27, 2001

Basic Probability

Reference Sheet

17.846, 2001

This is intended to be used in addition to, not as a substitute for, a textbook.

X is a random variable. This means that X is a variable that takes on the value x with probability f(x). This is the density function and is written as:

$$\text{Prob}(X = x) = f(x) \quad \text{for all } x \in \text{domain}$$

The cumulative probability distribution is the probability that X takes on a value less than or equal to x. This is written as:

$$\text{Prob}(X \le x) = F(x) \quad \text{for all } x \in \text{domain}$$

A probability distribution is any function such that

$$f(x) \ge 0 \quad \text{and} \quad \int_{-\infty}^{\infty} f(x)\,dx = 1$$

or, if X is discrete,

$$\sum_{x \in \text{domain}} p(x) = 1$$

These say that the probability of any event is zero or positive, and that the sum of the probabilities of all events must equal one. (In other words, you can't have a negative probability of something happening, and something must happen.)

Two important probability distributions are the standard normal and the uniform[0,N]:

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}} \quad \text{and} \quad f(x) = \frac{1}{N} \ \text{ for } 0 \le x \le N$$

Most probability distributions have a mean and a variance.

Mean = $\mu$ and Variance = $\sigma^2$

The standard deviation (roughly, the typical distance of the random variable from the true mean) is equal to the square root of the variance.

Mean is also equal to the first moment, or expected value of X.

Mean = E[X]

The mean is the average value of X, weighted by the probability that X = x, for all values of x. For a continuous distribution, this is:

$$\mu = \int_{-\infty}^{\infty} x \cdot f(x)\,dx$$

or for discrete variables

$$\mu = \sum_{x \in \text{domain}} x \cdot f(x)$$

The expected value operator is linear. That means:
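The sheet's own formula does not appear at this point in this copy, so the block below gives the standard statement of linearity as an assumption rather than the original text. The constants a and b and the second random variable Y are symbols introduced here for illustration:

```latex
\[
E[aX + b] = a\,E[X] + b,
\qquad
E[X + Y] = E[X] + E[Y]
\]
```

Neither identity requires X and Y to be independent.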

Thus, if the variance of a coin flip (heads = 1, tails = 0) is 0.25 (0.5 * 0.5), the variance of the sum of 100 coin flips is 25, not 0.25*(100 squared). This is because the sum centers around 50... in other words, the chance of getting 100 heads is really small. Thus, if we were to flip 100 coins and take the sum, then repeat this exercise 1000 times, we would get a bell shaped curve. If we flip two coins, take the sum, and repeat 1000 times, we do not get a bell shaped curve. We get a histogram with 0 25% of the time, 1 50% of the time, and 2 25% of the time. This yields a variance of:

$$0.25\,(2 - 1)^2 + 0.5\,(1 - 1)^2 + 0.25\,(0 - 1)^2 = 0.5$$

If we flip just one coin, we get a 0 half the time and a 1 half the time, for a variance of 0.25.

This may not sound so important, but it becomes important when we combine it with the following rule:

$$\text{Var}(aX + b) = a^2\,\text{Var}(X)$$

This equation can be derived directly from the expectation formulas, and is highly intuitive. If we go back to our single coin flip, the variance of a single coin flip is:

$$0.5\,(1 - 0.5)^2 + 0.5\,(0 - 0.5)^2 = 0.25$$

But the variance of 100 * a single coin flip is:

$$0.5\,(100 - 50)^2 + 0.5\,(0 - 50)^2 = 2500 = 100^2 \cdot 0.25$$

So you can see the difference between the variance of 100 times a single coin flip, and the variance of the sum of 100 coin flips. Mathematically:

$$\text{Var}(kX) = E\big[(kX - E[kX])^2\big] = E\big[(k(X - E[X]))^2\big] = k^2\,E\big[(X - E[X])^2\big] = k^2\,\text{Var}(X)$$

Now we can calculate the variance of a sample mean from the variance of the random variable itself (the third step uses the independence of the $X_i$):

$$\text{Var}(\bar{X}) = \text{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} X_i\right) = \left(\frac{1}{N}\right)^2 \text{Var}\!\left(\sum_{i=1}^{N} X_i\right) = \frac{1}{N^2} \cdot N \cdot \text{Var}(X_i) = \frac{1}{N} \cdot \text{Var}(X_i)$$

Or, using standard deviations:

$$\text{Stdev}(\bar{X}) = \sqrt{\text{Var}(\bar{X})} = \sqrt{\frac{1}{N} \cdot \text{Var}(X)} = \frac{1}{\sqrt{N}} \cdot \text{Stdev}(X)$$

So after estimating the mean and variance of X normally, you can estimate the mean and variance of the mean of X. If you combine this with the central limit theorem, this forms the entire basis of hypothesis testing and confidence interval estimation in regression analysis.

So far we have talked about the true mean and true variance. In real life, we don't know the true values. All we see is the data generated by some mysterious natural or human process. We assume this process can be described by a mathematical relationship (such as a probability distribution). If so, and if the probability distribution which determines how the world operates isn't too weird, we can estimate the parameters of that distribution. We do so by using the data we find in the world.

The problem is that the world is noisy: it is full of error. Or perhaps our measurements are noisy. Either way, we need some way of accounting for or at least quantifying this noise. This is why we estimate the variance of our observations.

So, if X follows some distribution, and we take 100 samples of X...

{X1, X2, X3, X4, X5, ..., X99, X100}

We can use this sample to estimate the value of the mean of whatever distribution generated X.

Our best guess at this true mean is in fact $\bar{X}$, which can be defined (with N = 100 here):

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$$

It turns out that this is an unbiased estimate of the mean. In other words, the expected value of this sum is equal to the true mean. This is true because the expected value of any single Xi is equal to the mean, so 1/N times N such expected values of Xi is also equal to the mean. The only advantage of using more observations is that the estimate of the mean becomes more precise. We measure this using the variance.

First, we estimate the variance of Xi. We do this with the formula:

$$\hat{\sigma}^2 = \frac{1}{N - 1}\sum_{i=1}^{N} (X_i - \bar{X})^2$$
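The variance rules and estimators above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `variance`, the random seed, and the simulated sample are assumptions made for illustration and are not part of the sheet. It contrasts the variance of 100 times one coin flip with the variance of the sum of 100 flips, then computes the sample mean, the N - 1 variance estimate, and the estimated standard deviation of the sample mean.

```python
import math
import random
import statistics


def variance(values, probs):
    """Variance of a discrete distribution given its values and probabilities."""
    mean = sum(v * p for v, p in zip(values, probs))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, probs))


n = 100

# One fair coin flip: heads = 1, tails = 0.
var_one_flip = variance([0, 1], [0.5, 0.5])

# 100 times a single flip: the outcome is either 0 or 100.
var_scaled = variance([0, 100], [0.5, 0.5])

# Sum of 100 independent flips: exact binomial probabilities.
counts = list(range(n + 1))
pmf = [math.comb(n, k) * 0.5 ** n for k in counts]
var_sum = variance(counts, pmf)

# Variance of the mean of 100 flips: Var(sum) / N^2 = Var(X_i) / N.
var_mean = var_sum / n ** 2

# Estimation from data: a simulated sample of 100 flips (seed is arbitrary).
rng = random.Random(0)
sample = [rng.randint(0, 1) for _ in range(n)]
x_bar = sum(sample) / n
sigma2_hat = sum((x - x_bar) ** 2 for x in sample) / (n - 1)
std_error = math.sqrt(sigma2_hat / n)  # estimated Stdev of the sample mean

print("Var(one flip)      =", var_one_flip)
print("Var(100 * flip)    =", var_scaled)
print("Var(sum of 100)    =", round(var_sum, 6))
print("Var(mean of 100)   =", round(var_mean, 6))
print("sample mean        =", x_bar)
print("sample variance    =", round(sigma2_hat, 6))
print("Stdev of the mean  =", round(std_error, 6))
```

The exact binomial calculation is used for the sum so that the comparison does not depend on simulation noise; only the last part, which mimics estimating from observed data, uses random draws.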

What is a regression? Geometrically, a regression is the projection of a single vector (your dependent variable) of dimension N (equal to your number of observations) onto a K-dimensional subspace spanned by the K vectors which we call independent variables (also of dimension N, that is, with N observations). The subspace spanned by your independent variable vectors defines a hyperplane: a K-dimensional space which can be reached by some combination of your independent variable vectors. Regression then finds the point in that hyperplane which is closest to the point reached by your dependent variable vector. In fact, it constructs a whole line of points in the independent variable hyperplane which are closest to the points in your dependent variable vector. It then tells you in what combination you should combine your independent variable vectors to reach this closest-points vector (or the best-guess vector). The best-guess vector is also called the projection of Y onto the subspace spanned by the X's. Your coefficients B1, B2, etc. tell you how to create the projection line given your X1, X2, etc. vectors. The vector of error terms defines a vector perpendicular to your independent variable hyperplane. By combining your X's in the proportions defined by your Bs, and then adding the error term, you get back to your Y vector.

Back to the real(?) world. To calculate your B coefficients, you write down the projection equation:

$$\hat{Y} = X (X'X)^{-1} X'Y = XB \quad \text{where} \quad B = (X'X)^{-1} X'Y$$

Notice that if X and Y have only one variable, this reduces to:

$$\hat{B} = \frac{\sum XY}{\sum X^2}$$

Actually, even simple regressions with one independent variable usually include an extra X vector of constants, or 1's, in order to allow for an intercept. If the 1's vector is the only vector in the matrix X, then B equals the mean of Y. If there is another vector in addition to the 1's vector, then:

$$\hat{B} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{\text{Cov}(X, Y)}{\text{Var}(X)}$$

If you are actually regressing X on Y, the reverse regression yields:

$$\hat{B} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (Y - \bar{Y})^2} = \frac{\text{Cov}(X, Y)}{\text{Var}(Y)}$$

The only difference is the scaling factor in the denominator. For an unbiased scaling factor, we can use:

$$\text{Estimated Correlation}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}} = \frac{\text{Cov}(X, Y)}{\text{Stdev}(X) \cdot \text{Stdev}(Y)}$$

The correlation coefficient is always bounded between -1 and 1. To grasp the intuition behind this, imagine that X and Y are plotted, and form a perfect line. Then the correlation is 1 or -1. When you introduce noise to that line, the noise terms are summed and then multiplied in the denominator, whereas they are multiplied and then summed in the numerator. So the numerator is smaller. Imagine that we have mean-deviated the variables already, and thus the sample means are equal to zero. Then the equation reduces to:

$$\frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}}$$

In magnitude, this equals $\sqrt{\dfrac{\left(\sum XY\right)^2}{\sum X^2 \sum Y^2}}$.

Since we have mean-deviated, X and Y measure the distance (positive or negative) away from 0. Thus, their product sort of measures how they move together. If you have two long X and Y vectors, and for each observation Y is always positive when X is positive, and Y is always negative when X is negative, then you will have positive correlation. If the reverse is true, you will have negative correlation. Moreover, the degree to which X and Y move the same amount in the same direction yields the magnitude of the correlation. If X = Y always, you get 1, as you can see.
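The slope and correlation formulas can be tried on a small data set. The sketch below uses only the Python standard library; the function names `slope` and `correlation` and the data values are hypothetical, chosen for illustration. It computes the forward slope Cov(X,Y)/Var(X), the reverse slope Cov(X,Y)/Var(Y), and the correlation, which differ only in the denominator.

```python
import math


def mean(values):
    return sum(values) / len(values)


def cross_deviations(x, y):
    """Sum of (x_i - x_bar) * (y_i - y_bar), the shared numerator."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))


def slope(x, y):
    """Slope from regressing y on x with an intercept."""
    return cross_deviations(x, y) / cross_deviations(x, x)


def correlation(x, y):
    """Same numerator, scaled symmetrically by both sums of squares."""
    denominator = math.sqrt(cross_deviations(x, x) * cross_deviations(y, y))
    return cross_deviations(x, y) / denominator


# Hypothetical data: a perfect line and a noisy version of it.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_line = [2.0 * xi + 1.0 for xi in x]
y_noisy = [3.1, 4.8, 7.4, 8.9, 11.2]

b_forward = slope(x, y_noisy)   # regress Y on X
b_reverse = slope(y_noisy, x)   # regress X on Y
r_noisy = correlation(x, y_noisy)
r_line = correlation(x, y_line)
r_flipped = correlation(x, [-yi for yi in y_line])

print("forward slope:", b_forward)
print("reverse slope:", b_reverse)
print("correlation (noisy):", r_noisy)
print("correlation (perfect line):", r_line)
print("correlation (line with negative slope):", r_flipped)
```

One way to see the symmetric scaling: the product of the forward and reverse slopes equals the squared correlation, because the two denominators multiply to the squared denominator of the correlation.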
