February 27, 2001
Basic Probability
Reference Sheet
This is intended to be used in addition to, not as a substitute for, a textbook.
X is a random variable. This means that X is a variable that takes on the value x with probability f(x). This is the density function and is written as:

$$\text{Prob}(X = x) = f(x) \quad \text{for all } x \in \text{domain}$$

The cumulative probability distribution is the probability that X takes on a value less than or equal to x. This is written as:

$$\text{Prob}(X \le x) = F(x) \quad \text{for all } x \in \text{domain}$$

A probability distribution is any function such that

$$f(x) \ge 0 \quad \text{and} \quad \int_{-\infty}^{\infty} f(x)\,dx = 1$$
or, if X is discrete,

$$\sum_{x \in \text{domain}} p(x) = 1$$

These say that the probability of any event is zero or positive, and that the sum of the probabilities of all events must equal one. (In other words, you can’t have a negative probability of something happening, and something must happen.)

Two important probability distributions are the standard normal and the uniform[0,N]:

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}} \quad \text{and} \quad f(x) = \frac{1}{N} \ \text{ for } 0 \le x \le N$$

Most probability distributions have a mean and a variance.

Mean = $\mu$ and Variance = $\sigma^2$

The standard deviation (a measure of the typical distance between the random variable and the true mean) is equal to the square root of the variance.

Mean is also equal to the first moment, or expected value of X.

Mean = E[X]

The mean is the average value of X, weighted by the probability that X = x, for all values of x. For a continuous distribution, this is:

$$\mu = \int_{-\infty}^{\infty} x \cdot f(x)\,dx$$

or for discrete variables

$$\mu = \sum_{x \in \text{domain}} x \cdot f(x)$$

The expected value operator is linear. That means that for any constants a and b:

$$E[aX + b] = a\,E[X] + b$$
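As a rough numerical check of the definitions above, the following Python sketch tests the two conditions for a discrete distribution, computes the mean as a probability-weighted sum, and confirms linearity on one example. The fair die, the helper names `check_distribution` and `expected_value`, and the constants 3 and 2 are illustrative assumptions, not part of the original sheet.

```python
from fractions import Fraction


def check_distribution(pmf):
    """Raise ValueError unless pmf (a dict mapping value -> probability) is valid."""
    if any(p < 0 for p in pmf.values()):
        raise ValueError("probabilities must be zero or positive")
    if sum(pmf.values()) != 1:
        raise ValueError("probabilities must sum to one")


def expected_value(pmf, g=lambda x: x):
    """Probability-weighted average of g(x) over the domain, i.e. E[g(X)]."""
    return sum(g(x) * p for x, p in pmf.items())


# Assumed example: a fair six-sided die, with exact fractions to avoid rounding.
die = {face: Fraction(1, 6) for face in range(1, 7)}
check_distribution(die)

mean = expected_value(die)
shifted = expected_value(die, lambda x: 3 * x + 2)  # E[3X + 2]

print(mean)     # 7/2
print(shifted)  # 25/2, which equals 3 * mean + 2
```

Exact fractions are used so that the "sums to one" test is not disturbed by floating-point rounding; with floats you would compare against 1 using a small tolerance instead.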
Thus, if the variance of a coin flip (heads = 1, tails = 0) is 0.25 (0.5 * 0.5), the variance of the sum of 100 coin flips is 25, not 0.25 * (100 squared). This is because the sum centers around 50; in other words, the chance of getting 100 heads is really small. Thus, if we were to flip 100 coins and take the sum, then repeat this exercise 1000 times, we would get a bell-shaped curve. If we flip two coins, take the sum, and repeat 1000 times, we do not get a bell-shaped curve. We get a histogram with 0 about 25% of the time, 1 about 50% of the time, and 2 about 25% of the time. This yields a variance of:

$$0.25\,(2 - 1)^2 + 0.5\,(1 - 1)^2 + 0.25\,(0 - 1)^2 = 0.5$$

If we flip just one coin, we get a 0 half the time and a 1 half the time, for a variance of 0.25.

This may not sound so important, but it becomes important when we combine it with the following rule:

$$\text{Var}(aX + b) = a^2\,\text{Var}(X)$$

This equation can be derived directly from the expectation formulas, and is highly intuitive. If we go back to our single coin flip, the variance of a single coin flip is:

$$0.5\,(1 - 0.5)^2 + 0.5\,(0 - 0.5)^2 = 0.25$$

But the variance of 100 * a single coin flip is:

$$0.5\,(100 - 50)^2 + 0.5\,(0 - 50)^2 = 2500 = 100^2 \cdot 0.25$$

So you can see the difference between the variance of 100 times a single coin flip, and the variance of the sum of 100 coin flips. Mathematically:

$$\text{Var}(kX) = E\big[(kX - E[kX])^2\big] = E\big[(k(X - E[X]))^2\big] = k^2\,E\big[(X - E[X])^2\big] = k^2\,\text{Var}(X)$$

Now we can calculate the variance of a sample mean from the variance of the random variable itself (the third step uses the fact that the Xi are independent draws, so the variance of their sum is the sum of their variances):

$$\text{Var}(\bar{X}) = \text{Var}\left(\frac{1}{N}\sum_{i=1}^{N} X_i\right) = \left(\frac{1}{N}\right)^2 \text{Var}\left(\sum_{i=1}^{N} X_i\right) = \frac{1}{N^2} \cdot N \cdot \text{Var}(X_i) = \frac{1}{N} \cdot \text{Var}(X_i)$$

Or, using standard deviations, the standard deviation of the sample mean is the standard deviation of X divided by the square root of N.
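The coin-flip numbers above can be reproduced exactly rather than by simulation. The sketch below is one way to do it, assuming fair and independent flips, so that the number of heads in n flips follows the binomial distribution. The function names are hypothetical helpers introduced only for this illustration.

```python
from fractions import Fraction
from math import comb

HALF = Fraction(1, 2)


def variance(pmf):
    """Probability-weighted squared distance from the mean, for a value -> probability dict."""
    mu = sum(x * p for x, p in pmf.items())
    return sum(p * (x - mu) ** 2 for x, p in pmf.items())


def sum_of_flips(n):
    """Distribution of the number of heads in n independent fair flips (binomial)."""
    return {k: Fraction(comb(n, k), 2 ** n) for k in range(n + 1)}


def scaled_flip(k):
    """Distribution of k times one single flip: the outcome is either 0 or k."""
    return {0: HALF, k: HALF}


def mean_of_flips(n):
    """Distribution of the sample mean of n independent fair flips."""
    return {Fraction(k, n): p for k, p in sum_of_flips(n).items()}


print(variance(scaled_flip(1)))      # 1/4   one coin
print(variance(sum_of_flips(2)))     # 1/2   sum of two coins
print(variance(sum_of_flips(100)))   # 25    sum of 100 coins
print(variance(scaled_flip(100)))    # 2500  100 times one coin
print(variance(mean_of_flips(100)))  # 1/400 sample mean of 100 coins
```

The last line is the 1/N rule at work: 0.25 divided by 100.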
Written out for standard deviations, this relationship is:

$$\text{Stdev}(\bar{X}) = \sqrt{\text{Var}(\bar{X})} = \sqrt{\frac{1}{N} \cdot \text{Var}(X)} = \frac{1}{\sqrt{N}} \cdot \text{Stdev}(X)$$

So after estimating the mean and variance of X normally, you can estimate the mean and variance of the mean of X. If you combine this with the central limit theorem, this forms the entire basis of hypothesis testing and confidence interval estimation in regression analysis.

So far we have talked about the true mean and true variance. In real life, we don’t know the true values. All we see is the data generated by some mysterious natural or human process. We assume this process can be described by a mathematical relationship (such as a probability distribution). If so, and if the probability distribution which determines how the world operates isn’t too weird, we can estimate the parameters of that distribution. We do so by using the data we find in the world.

The problem is that the world is noisy: it is full of error. Or perhaps our measurements are noisy. Either way, we need some way of accounting for or at least quantifying this noise. This is why we estimate the variance of our observations.

So, if X follows some distribution, and we take 100 samples of X...

{X1, X2, X3, X4, X5, ..., X99, X100}

We can use this sample to estimate the value of the mean of whatever distribution generated X. Our best guess at this true mean is in fact $\bar{X}$, which can be defined (here N = 100):

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$$

It turns out that this is an unbiased estimator of the mean. In other words, the expected value of this average is equal to the true mean. This is true because the expected value of any single Xi is equal to the mean, so 1/N times the sum of N such expected values is also equal to the mean. The only advantage of using more observations is that the estimate of the mean becomes more precise. We measure this using the variance.

First, we estimate the variance of Xi. We do this with the formula:

$$\hat{\sigma}^2 = \frac{1}{N-1}\sum_{i=1}^{N} (X_i - \bar{X})^2$$
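A minimal Python sketch of these estimators follows. It assumes, purely for illustration, that the "mysterious process" is a normal distribution with mean 10 and standard deviation 2, and it uses a fixed seed so the run is repeatable. The function names are hypothetical; the standard library's `statistics.mean` and `statistics.variance` compute the same first two quantities.

```python
import math
import random


def sample_mean(xs):
    """Best guess at the true mean: the sum of the observations divided by N."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Estimated variance of X: squared deviations from the sample mean, divided by N - 1."""
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)


def stdev_of_mean(xs):
    """Estimated standard deviation of the sample mean: Stdev(X) / sqrt(N)."""
    return math.sqrt(sample_variance(xs) / len(xs))


# Assumed data-generating process (not from the sheet): normal, mean 10, stdev 2.
rng = random.Random(0)
sample = [rng.gauss(10.0, 2.0) for _ in range(100)]

print("estimated mean:         ", sample_mean(sample))
print("estimated variance of X:", sample_variance(sample))
print("stdev of the mean:      ", stdev_of_mean(sample))
```

With 100 observations the estimated mean should land near 10 and the standard deviation of the mean near 2 / 10 = 0.2, although the exact figures depend on the random draw.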
What is a regression? Geometrically, a regression is the projection of a single vector (your dependent variable) of dimension N (equal to your number of observations) onto a K-dimensional subspace spanned by the K vectors which we call independent variables (also of dimension N, that is, with N observations). The subspace spanned by your independent variable vectors defines a hyperplane: a K-dimensional space which can be reached by some combination of your independent variable vectors. Regression then finds the point in that hyperplane which is closest to the point reached by your dependent variable vector. In fact, it constructs a whole line of points in the independent variable hyperplane which are closest to the points in your dependent variable vector. It then tells you in what combination you should combine your independent variable vectors to reach this closest-points vector (or the best-guess vector). The best-guess vector is also called the projection of Y onto the subspace spanned by the X’s. Your coefficients B1, B2, etc. tell us how to create the projection line given our X1, X2, etc. vectors. The vector of error terms defines a vector perpendicular to your independent variable hyperplane. By combining your X’s in the proportions defined by your Bs, and then adding the error term, you get back to your Y vector.

Back to the real(?) world. To calculate your B coefficients, you write down the projection equation:

$$\hat{Y} = X (X'X)^{-1} X'Y = XB \quad \text{where} \quad B = (X'X)^{-1} X'Y$$

Notice that if X and Y have only one variable, this reduces to:

$$\hat{B} = \frac{\sum XY}{\sum X^2}$$

Actually, even simple regressions with one independent variable usually include an extra X vector of constants, or 1’s, in order to allow for an intercept. If the 1’s vector is the only vector in the matrix X, then B equals the mean of Y. If there is another vector in addition to the 1’s vector, then the coefficient on that vector is:

$$\hat{B} = \frac{\sum (X - \bar{X}) \cdot (Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{\text{Cov}(X, Y)}{\text{Var}(X)}$$

If you are actually regressing X on Y, the reverse regression yields:

$$\hat{B} = \frac{\sum (X - \bar{X}) \cdot (Y - \bar{Y})}{\sum (Y - \bar{Y})^2} = \frac{\text{Cov}(X, Y)}{\text{Var}(Y)}$$

The only difference is the scaling factor in the denominator. A scaling factor that favors neither variable is the product of the two standard deviations, which gives the correlation coefficient.
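The one-variable formulas above are small enough to check by hand in plain Python. The sketch below uses five made-up observations (an assumption for illustration only) and hypothetical function names. The intercept is computed as the mean of Y minus the slope times the mean of X, a standard result that the sheet does not spell out.

```python
def slope_through_origin(xs, ys):
    """One X vector and no intercept: B = sum(XY) / sum(X^2)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def fit_with_intercept(xs, ys):
    """A 1's vector plus one X vector: slope = Cov(X, Y) / Var(X)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = ybar - slope * xbar  # standard result, assumed here
    return intercept, slope


# Made-up observations, roughly on the line Y = 2X.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_with_intercept(xs, ys)
_, reverse_slope = fit_with_intercept(ys, xs)  # regressing X on Y instead

# The error vector should be perpendicular to every X vector, including the 1's.
errors = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
dot_with_ones = sum(errors)
dot_with_x = sum(e * x for e, x in zip(errors, xs))

# With the 1's vector as the only regressor, B is the mean of Y.
mean_only = slope_through_origin([1.0] * len(ys), ys)

print(intercept, slope, reverse_slope)
print(dot_with_ones, dot_with_x)  # both zero up to rounding error
print(mean_only)
```

The two dot products are the geometric statement in numbers: the error vector has no component along any of the independent variable vectors.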
The estimated correlation is:

$$\text{Estimated Correlation}(X, Y) = \frac{\sum (X - \bar{X}) \cdot (Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}} = \frac{\text{Cov}(X, Y)}{\text{Stdev}(X) \cdot \text{Stdev}(Y)}$$

The correlation coefficient is always bounded between -1 and 1. To grasp the intuition behind this, imagine that X and Y are plotted, and form a perfect line. Then the correlation is 1 or -1. When you introduce noise to that line, the noise terms are summed and then multiplied in the denominator, whereas they are multiplied and then summed in the numerator. So the numerator is smaller in absolute value than the denominator. Imagine that we have mean-deviated the variables already, and thus the sample means are equal to zero. Then the equation reduces to:

$$\frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}}, \quad \text{whose square is} \quad \frac{\left(\sum XY\right)^2}{\sum X^2 \sum Y^2}$$

Since we have mean-deviated, X and Y measure the distance (positive or negative) away from 0. Thus, their product sort of measures how they move together. If you have two long X and Y vectors, and for each observation Y is always positive when X is positive, and Y is always negative when X is negative, then you will have positive correlation. If the reverse is true, you will have negative correlation. Moreover, the degree to which X and Y move the same amount in the same direction yields the magnitude of the correlation. If X = Y always, you get 1, as you can see.
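To see these cases in numbers, here is a short Python sketch of the estimated correlation. The data are made up for illustration and the function name `correlation` is a hypothetical helper; the function mean-deviates both variables first and then applies the reduced formula.

```python
import math


def correlation(xs, ys):
    """Estimated correlation: sum of products of deviations over the root of the product of sums of squares."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    dx = [x - xbar for x in xs]  # mean-deviated X
    dy = [y - ybar for y in ys]  # mean-deviated Y
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
noisy = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up values near the line Y = 2X

r_same = correlation(xs, xs)                    # X = Y always
r_opposite = correlation(xs, [-x for x in xs])  # perfect line, opposite direction
r_noisy = correlation(xs, noisy)                # a line plus some noise

print(r_same, r_opposite, r_noisy)
```

The first two values come out as 1 and -1 (up to rounding), and the noisy case falls strictly between them, close to 1 because the noise in this example is small.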