




























Econometrics study guide explaining concepts





























Econ 5502, Review Note
The summation operator is a useful shorthand for manipulating expressions involving the sums of many numbers, and it plays a key role in statistics and econometric analysis. If \(\{x_i : i = 1, \ldots, n\}\) denotes a sequence of \(n\) numbers, then we write the sum of these numbers as
\[\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n\]
With this definition, the summation operator is easily shown to have the following properties:
Sum.1 For any constant \(c\), \(\sum_{i=1}^{n} c = nc\)
Sum.2 For any constant \(c\), \(\sum_{i=1}^{n} c x_i = c \sum_{i=1}^{n} x_i\)
Sum.3 For any constants \(a\) and \(b\), \(\sum_{i=1}^{n} (a x_i + b y_i) = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} y_i\)
It is also important to be aware of some things that cannot be done with the summation operator. Let \(\{(x_i, y_i) : i = 1, 2, \ldots, n\}\) again be a set of \(n\) pairs of numbers with \(y_i \neq 0\) for each \(i\). Then,
\[\sum_{i=1}^{n} \frac{x_i}{y_i} \neq \frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} y_i}\]
In other words, the sum of the ratios is not the ratio of the sums. In the \(n = 2\) case, the application of familiar elementary algebra also reveals this lack of equality: \(x_1/y_1 + x_2/y_2 \neq (x_1 + x_2)/(y_1 + y_2)\). Similarly, the sum of the squares is not the square of the sum:
\[\sum_{i=1}^{n} x_i^2 \neq \left(\sum_{i=1}^{n} x_i\right)^2\]
except in special cases. That these two quantities are not generally equal is easiest to see when \(n = 2\): \(x_1^2 + x_2^2 \neq (x_1 + x_2)^2 = x_1^2 + 2x_1x_2 + x_2^2\).
Given \(n\) numbers \(\{x_i : i = 1, \ldots, n\}\), we compute their average or mean by adding them up and dividing by \(n\):
\[\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\]
When the \(x_i\) are a sample of data on a particular variable, we often call this the sample average (or sample mean) to emphasize that it is computed from a particular set of data. The sample average is an example of a descriptive statistic; in this case, the statistic describes the central tendency of the set of points \(x_i\). There are some basic properties about averages that are important to understand. First, suppose we take each observation on \(x\) and subtract off the average: \(d_i = x_i - \bar{x}\) (the \(d\) here stands for deviation from the average). Then, the sum of these deviations is always zero:
\[\sum_{i=1}^{n} d_i = \sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0\]
We summarize this as
\[\sum_{i=1}^{n} (x_i - \bar{x}) = 0\]
A simple numerical example shows how this works. Suppose \(n = 5\) and \(x_1 = 6\), \(x_2 = 1\), \(x_3 = -2\), \(x_4 = 0\), and \(x_5 = 5\). Then, \(\bar{x} = 2\), and the demeaned sample is \(\{4, -1, -4, -2, 3\}\). Adding these gives zero, which is just what the above result suggests. In our treatment of regression analysis, we need to know some additional algebraic facts involving deviations from sample averages. An important one is that the sum of squared deviations is the sum of the squared \(x_i\) minus \(n\) times the square of \(\bar{x}\):
\[\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2\]
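This can be shown using basic properties of the summation operator:
\[\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} \left(x_i^2 - 2x_i\bar{x} + \bar{x}^2\right) = \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 = \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2\]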
Given a data set on two variables, \(\{(x_i, y_i) : i = 1, 2, \ldots, n\}\), it can also be shown that
\[\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\]
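The deviation identities above can be checked numerically. The following is a minimal sketch, not part of the original notes: it reuses the five x values from the numerical example, while the y values are made up purely for illustration.

```python
# Numerical check of the summation identities for deviations from the mean.
# x comes from the worked example in the text; y is hypothetical data.
x = [6, 1, -2, 0, 5]
y = [3, 0, -1, 2, 6]  # illustrative values only, not from the notes
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sum of deviations from the mean is zero.
sum_dev = sum(xi - x_bar for xi in x)

# Sum of squared deviations, computed directly and via the shortcut formula.
ssd_direct = sum((xi - x_bar) ** 2 for xi in x)
ssd_shortcut = sum(xi ** 2 for xi in x) - n * x_bar ** 2

# Sum of cross products of deviations, directly and via the shortcut formula.
cross_direct = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
cross_shortcut = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar

print(sum_dev, ssd_direct, ssd_shortcut, cross_direct, cross_shortcut)
```

Both routes give the same numbers, but the shortcut forms need only the raw sums, which is why they appear so often in regression algebra.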
The average propensity to consume (APC) is not constant: it is always larger than the marginal propensity to consume (MPC), and it gets closer to the MPC as income increases.
Linear functions are easily defined for more than two variables. Suppose that \(y\) is related to two variables, \(x_1\) and \(x_2\), in the general form
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2\]
It is rather difficult to envision this function because its graph is three-dimensional. Nevertheless, \(\beta_0\) is still the intercept (the value of \(y\) when \(x_1 = 0\) and \(x_2 = 0\)), and \(\beta_1\) and \(\beta_2\) measure particular slopes. From the above equation, the change in \(y\), for given changes in \(x_1\) and \(x_2\), is
\[\Delta y = \beta_1 \Delta x_1 + \beta_2 \Delta x_2\]
If \(x_2\) does not change, that is, \(\Delta x_2 = 0\), then we have
\[\Delta y = \beta_1 \Delta x_1 \quad \text{if } \Delta x_2 = 0\]
so that \(\beta_1\) is the slope of the relationship in the direction of \(x_1\):
\[\beta_1 = \frac{\Delta y}{\Delta x_1} \quad \text{if } \Delta x_2 = 0\]
Because it measures how \(y\) changes with \(x_1\), holding \(x_2\) fixed, \(\beta_1\) is often called the partial effect of \(x_1\) on \(y\). Because the partial effect involves holding other factors fixed, it is closely linked to the notion of ceteris paribus. The parameter \(\beta_2\) has a similar interpretation: \(\beta_2 = \Delta y/\Delta x_2\) if \(\Delta x_1 = 0\), so that \(\beta_2\) is the partial effect of \(x_2\) on \(y\).
Example 1.2: For college students, suppose that the monthly quantity demanded of compact discs is related to the price of compact discs and monthly discretionary income by
\[\text{quantity} = 120 - 9.8\,\text{price} + .03\,\text{income}\]
where price is dollars per disc and income is measured in dollars. The demand curve is the relationship between quantity and price, holding income (and other factors) fixed. The slope of the demand curve, \(-9.8\), is the partial effect of price on quantity: holding income fixed, if the price of compact discs increases by one dollar, then the quantity demanded falls by 9.8. (We abstract from the fact that CDs can only be purchased in discrete units.) An increase in income simply shifts the demand curve up (changes the intercept), but the slope remains the same.
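The partial effects in Example 1.2 can be illustrated with a short sketch. The coefficients come from the example; the starting price of 10 dollars and income of 900 dollars are assumptions chosen only to have something to evaluate.

```python
def quantity_demanded(price, income):
    """Monthly CD demand from Example 1.2: 120 - 9.8*price + .03*income."""
    return 120 - 9.8 * price + 0.03 * income

# Hypothetical starting point (not from the notes).
base = quantity_demanded(price=10, income=900)

# Partial effect of price: raise price by $1, hold income fixed.
price_effect = quantity_demanded(price=11, income=900) - base

# Partial effect of income: raise income by $100, hold price fixed.
income_effect = quantity_demanded(price=10, income=1000) - base

print(base, price_effect, income_effect)
```

Because the function is linear, the price effect is the same at any starting point: changing the assumed income only shifts the level of demand, not the slope.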
Proportions and percentages play such an important role in applied economics that it is necessary to become very comfortable in working with them. Many quantities reported in the popular press are in the form of percentages; a few examples are interest rates, unemployment rates, and high school graduation rates. An important skill is being able to convert proportions to percentages and vice versa. A percentage is easily obtained by multiplying a proportion by 100. For example, if the proportion of adults in a county with a high school degree is .82, then we say that 82% (82 percent) of adults have a high school degree. Another way to think of percentages and proportions is that a proportion is the decimal form of a percentage. For example, if the marginal tax rate for a family earning $30,000 per year is reported as 28%, then the proportion of the next dollar of income that is paid in income taxes is .28 of that dollar.
When using percentages, we often need to convert them to decimal form. For example, if a state sales tax is 6% and $200 is spent on a taxable item, then the sales tax paid is 200(.06) = $12. If the annual return on a certificate of deposit (CD) is 7.6% and we invest $3,000 in such a CD at the beginning of the year, then our interest income is 3,000(.076) = $228. As much as we would like it, the interest income is not obtained by multiplying 3,000 by 7.6.
We must be wary of proportions that are sometimes incorrectly reported as percentages in the popular media. If we read, "The percentage of high school students who drink alcohol is .57," we know that this really means 57% (not just over one-half of a percent, as the statement literally implies). College volleyball fans are probably familiar with press clips containing statements such as "Her hitting percentage was .372." This really means that her hitting percentage was 37.2%.
In econometrics, we are often interested in measuring the changes in various quantities. Let \(x\) denote some variable, such as an individual's income, the number of crimes committed in a community, or the profits of a firm. Let \(x_0\) and \(x_1\) denote two values for \(x\): \(x_0\) is the initial value, and \(x_1\) is the subsequent value.
For example, \(x_0\) could be the annual income of an individual in 1994 and \(x_1\) the income of the same individual in 1995. The proportionate change in \(x\) in moving from \(x_0\) to \(x_1\), sometimes called the relative change, is simply
\[\frac{x_1 - x_0}{x_0} = \frac{\Delta x}{x_0}\]
assuming, of course, that \(x_0 \neq 0\). In other words, to get the proportionate change, we simply divide the change in \(x\) by its initial value. This is a way of standardizing the change so that it is free of units. For example, if an individual's income goes from $30,000 per year to $36,000 per year, then the proportionate change is 6,000/30,000 = .20. It is more common to state changes in terms of percentages. The percentage change in \(x\) in going from \(x_0\) to \(x_1\) is simply 100 times the proportionate change:
\[\%\Delta x = 100\,\frac{\Delta x}{x_0}\]
The notation "\(\%\Delta x\)" is read as "the percentage change in \(x\)." For example, when income goes from $30,000 to $33,750, income has increased by 12.5%; to get this, we simply multiply the proportionate change, .125, by 100. Again, we must be on guard for proportionate changes that are reported as percentage changes. In the previous example, for instance, reporting the percentage change in income as .125 is incorrect and could lead to confusion.
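The conversions described in this section are simple enough to sketch in a few lines. The function names below are illustrative choices, not notation from the notes; the numbers are the ones used in the text.

```python
def proportionate_change(x0, x1):
    """Relative change (x1 - x0) / x0; undefined when x0 is zero."""
    if x0 == 0:
        raise ValueError("x0 must be nonzero")
    return (x1 - x0) / x0

def percentage_change(x0, x1):
    """Percentage change: 100 times the proportionate change."""
    return 100 * proportionate_change(x0, x1)

# Percentages must be converted to decimal form before multiplying.
sales_tax = 200 * 0.06          # 6% tax on a $200 item
cd_interest = 3000 * 0.076      # 7.6% annual return on $3,000

income_prop = proportionate_change(30_000, 36_000)
income_pct = percentage_change(30_000, 33_750)

print(sales_tax, cd_interest, income_prop, income_pct)
```

Keeping the two functions separate makes the unit explicit: one returns a proportion such as .125, the other a percentage such as 12.5.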
For example, if \(y = 6 + 8x - 2x^2\) (so \(\beta_1 = 8\) and \(\beta_2 = -2\)), then the largest value of \(y\) occurs at \(x^* = 8/4 = 2\), and this value is \(6 + 8(2) - 2(2)^2 = 14\). The above function implies a diminishing marginal effect of \(x\) on \(y\). Suppose we start at a low value of \(x\) and then increase \(x\) by some amount, say, \(c\). This has a larger effect on \(y\) than if we start at a higher value of \(x\) and increase \(x\) by the same amount \(c\). In fact, once \(x > x^*\), an increase in \(x\) actually decreases \(y\). The statement that \(x\) has a diminishing marginal effect on \(y\) is the same as saying that the slope of the function in the above equation decreases as \(x\) increases. An application of calculus gives the approximate slope of the quadratic function as
\[\text{slope} = \frac{\Delta y}{\Delta x} \approx \beta_1 + 2\beta_2 x\]
for "small" changes in \(x\). [The right-hand side of the above equation is the derivative of the function in equation (1) with respect to \(x\).] Another way to write this is
\[\Delta y \approx (\beta_1 + 2\beta_2 x)\Delta x \quad \text{for "small" } \Delta x. \qquad (2)\]
To see how well this approximation works, consider again the function \(y = 6 + 8x - 2x^2\). Then, according to equation (2), \(\Delta y \approx (8 - 4x)\Delta x\). Now, suppose we start at \(x = 1\) and change \(x\) by \(\Delta x = 0.1\). Using equation (2), \(\Delta y \approx (8 - 4)(.1) = .4\). Of course, we can compute the change exactly by finding the values of \(y\) when \(x = 1\) and \(x = 1.1\): \(y_0 = 6 + 8(1) - 2(1)^2 = 12\) and \(y_1 = 6 + 8(1.1) - 2(1.1)^2 = 12.38\), so the exact change in \(y\) is .38. The approximation is pretty close in this case.
Now, suppose we start at \(x = 1\) but change \(x\) by a larger amount: \(\Delta x = 0.5\). Then, the approximation gives \(\Delta y \approx 4(.5) = 2\). The exact change is determined by finding the difference in \(y\) when \(x = 1\) and \(x = 1.5\). The former value of \(y\) was 12, and the latter value is \(6 + 8(1.5) - 2(1.5)^2 = 13.5\), so the actual change is 1.5 (not 2). The approximation is worse in this case because the change in \(x\) is larger. For many applications, equation (2) can be used to compute the approximate marginal effect of \(x\) on \(y\) for any initial value of \(x\) and small changes. And, we can always compute the exact change if necessary.
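The comparison between the approximate and exact change can be reproduced with a small sketch. The function and variable names are illustrative; the coefficients 6, 8, and −2 are those of the example in the text.

```python
def y(x, b0=6.0, b1=8.0, b2=-2.0):
    """Quadratic function y = b0 + b1*x + b2*x^2 from the example."""
    return b0 + b1 * x + b2 * x ** 2

def approx_change(x, dx, b1=8.0, b2=-2.0):
    """Approximate change in y using the slope b1 + 2*b2*x at the start point."""
    return (b1 + 2 * b2 * x) * dx

def exact_change(x, dx):
    """Exact change in y when x moves from x to x + dx."""
    return y(x + dx) - y(x)

# Turning point of the quadratic: b1 / (-2*b2).
x_star = 8.0 / (-2 * -2.0)

small = (approx_change(1.0, 0.1), exact_change(1.0, 0.1))
large = (approx_change(1.0, 0.5), exact_change(1.0, 0.5))

print(x_star, y(x_star), small, large)
```

The gap between the two numbers in each pair is the approximation error, which grows as the step in x gets larger.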
1.4.2 The Natural Logarithm
The nonlinear function that plays the most important role in econometric analysis is the natural logarithm. In this text, we denote the natural logarithm, which we often refer to simply as the log function, as
\[y = \log(x) \qquad (3)\]
You might remember learning different symbols for the natural log; \(\ln(x)\) or \(\log_e(x)\) are the most common. These different notations are useful when logarithms with several different bases are being used. For our purposes, only the natural logarithm is important, and so \(\log(x)\) denotes the natural logarithm throughout this text. This corresponds to the notational usage in many statistical packages, although some use \(\ln(x)\) [and most calculators use \(\ln(x)\)]. Economists use both \(\log(x)\) and \(\ln(x)\), which is useful to know when you are reading papers in applied economics. The function \(y = \log(x)\) is defined only for \(x > 0\). It is not very important to know how the values of \(\log(x)\) are obtained. For our purposes, the function can be thought of as a black box: we can plug in any \(x > 0\) and obtain \(\log(x)\) from a calculator or a computer. One important difference between the log and the quadratic function is that when \(y = \log(x)\), the effect of \(x\) on \(y\) never becomes negative: the slope of the function gets closer and closer to zero as \(x\) gets large, but the slope never quite reaches zero and certainly never becomes negative. The following are useful facts for natural log
The logarithm can be used for various approximations that arise in econometric applications. First, \(\log(1 + x) \approx x\) for \(x \approx 0\). You can try this with \(x = .02\), \(.1\), and \(.5\) to see how the quality of the approximation deteriorates as \(x\) gets larger. Even more useful is the fact that the difference in logs can be used to approximate proportionate changes. Let \(x_0\) and \(x_1\) be positive values. Then, it can be shown (using calculus) that
\[\log(x_1) - \log(x_0) \approx \frac{x_1 - x_0}{x_0} = \frac{\Delta x}{x_0} \qquad (3)\]
for small changes in \(x\). If we multiply equation (3) by 100 and write \(\Delta \log(x) = \log(x_1) - \log(x_0)\), then
\[100\,\Delta \log(x) \approx \%\Delta x\]
for small changes in \(x\). The meaning of "small" depends on the context, and we will encounter several examples throughout this text. Why should we approximate the percentage change using equation (3) when the exact percentage change is so easy to compute? Momentarily, we will see why the approximation in equation (3) is useful in econometrics. First, let us see how good the approximation is in two examples. First, suppose \(x_0 = 40\) and \(x_1 = 41\). Then, the percentage change in \(x\) in moving from \(x_0\) to \(x_1\) is 2.5%, using \(100(x_1 - x_0)/x_0\). Now, \(\log(41) - \log(40) = .0247\) to four decimal places, which when multiplied by 100 is very close to 2.5. The approximation works pretty well. Now, consider a much bigger change: \(x_0 = 40\) and \(x_1 = 60\). The exact percentage change is 50%. However, \(\log(60) - \log(40) = .4055\), so the approximation gives 40.55%, which is much farther off. Why is the approximation in equation (3) useful if it is only satisfactory for small changes? To build up to the answer, we first define the elasticity of \(y\) with respect to \(x\) as
\[\frac{\Delta y}{\Delta x} \cdot \frac{x}{y} = \frac{\%\Delta y}{\%\Delta x}\]
In other words, the elasticity of \(y\) with respect to \(x\) is the percentage change in \(y\) when \(x\) increases by 1%. This notion should be familiar from introductory economics.
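The quality of the log approximations discussed above can be checked directly. This is a small sketch using Python's `math.log` (the natural log); the helper names are illustrative, and the inputs are the values suggested in the text.

```python
import math

def exact_pct_change(x0, x1):
    """Exact percentage change 100 * (x1 - x0) / x0."""
    return 100 * (x1 - x0) / x0

def log_pct_change(x0, x1):
    """Approximate percentage change: 100 times the difference in logs."""
    return 100 * (math.log(x1) - math.log(x0))

small = (exact_pct_change(40, 41), log_pct_change(40, 41))
large = (exact_pct_change(40, 60), log_pct_change(40, 60))

# Gap between log(1 + x) and x for the three values suggested in the text.
log1p_gap = {x: math.log(1 + x) - x for x in (0.02, 0.1, 0.5)}

print(small, large)
print(log1p_gap)
```

In both checks the error is negligible for small changes and becomes substantial as the change grows, which matches the 2.5% versus 50% comparison in the text.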
the exponential function as \(y = \exp(x)\). Sometimes, the exponential function is written as \(y = e^x\), but we will not use this notation. Two important values of the exponential function are \(\exp(0) = 1\) and \(\exp(1) = 2.7183\) (to four decimal places). The exponential function is the inverse of the log function in the following sense: \(\log[\exp(x)] = x\) for all \(x\), and \(\exp[\log(x)] = x\) for \(x > 0\). In other words, the log "undoes" the exponential, and vice versa. (This is why the exponential function is sometimes called the anti-log function.) In particular, note that \(\log(y) = \beta_0 + \beta_1 x\) is equivalent to
\[y = \exp(\beta_0 + \beta_1 x)\]
If \(\beta_1 > 0\), then \(x\) has an increasing marginal effect on \(y\). In Example A.6, this means that another year of education leads to a larger change in wage than the previous year of education. Two useful facts about the exponential function are \(\exp(x_1 + x_2) = \exp(x_1)\exp(x_2)\) and \(\exp[c \cdot \log(x)] = x^c\).
In the previous section, we asserted several approximations that have foundations in calculus. Let \(y = f(x)\) for some function \(f\). Then, for small changes in \(x\),
\[\Delta y \approx \frac{df}{dx}\,\Delta x \qquad (7)\]
where \(df/dx\) is the derivative of the function \(f\), evaluated at the initial point \(x_0\). We also write the derivative as \(dy/dx\). For example, if \(y = \log(x)\), then \(dy/dx = 1/x\). Using equation (7), with \(dy/dx\) evaluated at \(x_0\), we have \(\Delta y \approx (1/x_0)\Delta x\), which is the approximation given in equation (3).
In applying econometrics, it helps to recall the derivatives of a handful of functions because we use the derivative to define the slope of a function at a given point. We can then use equation (7) to find the approximate change in \(y\) for small changes in \(x\). In the linear case, the derivative is simply the slope of the line, as we would hope: if \(y = \beta_0 + \beta_1 x\), then \(dy/dx = \beta_1\). If \(y = x^c\), then \(dy/dx = cx^{c-1}\). The derivative of a sum of two functions is the sum of the derivatives: \(d[f(x) + g(x)]/dx = df(x)/dx + dg(x)/dx\). The derivative of a constant times any function is that same constant times the derivative of the function: \(d[cf(x)]/dx = c[df(x)/dx]\). These simple rules allow us to find derivatives of more complicated functions. Other rules, such as the product, quotient, and chain rules, will be familiar to those who have taken calculus, but we will not review those here. Some functions that are often used in economics, along with their derivatives, are
\(y = \beta_0 + \beta_1\sqrt{x}\); \(dy/dx = (\beta_1/2)x^{-1/2}\)
When \(y\) is a function of multiple variables, the notion of a partial derivative becomes important. Suppose that
\[y = f(x_1, x_2) \qquad (8)\]
Then, there are two partial derivatives, one with respect to \(x_1\) and one with respect to \(x_2\). The partial derivative of \(y\) with respect to \(x_1\), denoted here by \(\partial y/\partial x_1\), is just the usual derivative of equation (8) with respect to \(x_1\), where \(x_2\) is treated as a constant. Similarly, \(\partial y/\partial x_2\) is just the derivative of equation (8) with respect to \(x_2\), holding \(x_1\) fixed.
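The derivative and partial-derivative ideas can be illustrated numerically. The sketch below uses a central finite difference, which is a standard numerical technique and not something the notes themselves use; the linear function and its coefficients (1, 2, −3) are hypothetical.

```python
import math

def numerical_derivative(f, x0, h=1e-6):
    """Central-difference estimate of df/dx at x0 (an approximation)."""
    return (f(x0 + h) - f(x0 - h)) / (2 * h)

# Single-variable checks against the analytic derivatives.
d_log = numerical_derivative(math.log, 4.0)    # analytic value: 1/x = 0.25
d_sqrt = numerical_derivative(math.sqrt, 4.0)  # analytic value: (1/2) * 4**-0.5 = 0.25

def f(x1, x2, b0=1.0, b1=2.0, b2=-3.0):
    """Hypothetical linear function y = b0 + b1*x1 + b2*x2."""
    return b0 + b1 * x1 + b2 * x2

# Partial derivatives: vary one argument while holding the other fixed.
partial_x1 = numerical_derivative(lambda v: f(v, 5.0), 1.0)  # should be near b1
partial_x2 = numerical_derivative(lambda v: f(1.0, v), 5.0)  # should be near b2

print(d_log, d_sqrt, partial_x1, partial_x2)
```

Holding one argument fixed inside the lambda is the numerical counterpart of treating that variable as a constant when taking the partial derivative.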
Suppose that we flip a coin 10 times and count the number of times the coin turns up heads. This is an example of an experiment. Generally, an experiment is any procedure that can, at least in theory, be infinitely repeated and has a well-defined set of outcomes. We could, in principle, carry out the coin-flipping procedure again and again. Before we flip the coin, we know that the number of heads appearing is an integer from 0 to 10, so the outcomes of the experiment are well defined. A random variable is one that takes on numerical values and has an outcome that is determined by an experiment. In the coin-flipping example, the number of heads appearing in 10 flips of a coin is an example of a random variable. Before we flip the coin 10 times, we do not know how many times the coin will come up heads. Once we flip the coin 10 times and count the number of heads, we obtain the outcome of the random variable for this particular trial of the experiment. Another trial can produce a different outcome.
To analyze data collected in business and the social sciences, it is important to have a basic understanding of random variables and their properties. Following the usual conventions in probability and statistics, we denote random variables by uppercase letters, usually \(W\), \(X\), \(Y\), and \(Z\); particular outcomes of random variables are denoted by the corresponding lowercase letters, \(w\), \(x\), \(y\), and \(z\). For example, in the coin-flipping experiment, let \(X\) denote the number of heads appearing in 10 flips of a coin. Then, \(X\) is not associated with any particular value, but we know \(X\) will take on a value in the set \(\{0, 1, 2, \ldots, 10\}\). A particular outcome is, say, \(x = 6\). We indicate large collections of random variables by using subscripts. For example, if we record last year's income of 20 randomly chosen households in the United States, we might denote these random variables by \(X_1, X_2, \ldots, X_{20}\); the particular outcomes would be denoted \(x_1, x_2, \ldots, x_{20}\).
As stated in the definition, random variables are always defined to take on numerical values, even when they describe qualitative events. For example, consider tossing a single coin, where the two outcomes are heads and tails. We can define a random variable as follows: X = 1 if the coin turns up heads, and X = 0 if the coin turns up tails. A random variable that can only take on the values zero and one is called a Bernoulli (or binary) random variable. In basic probability, it is traditional to call the event X = 1 a
dealing with more than one random variable, it is sometimes useful to subscript the pdf in question: \(f_X\) is the pdf of \(X\), \(f_Y\) is the pdf of \(Y\), and so on. Given the pdf of any discrete random variable, it is simple to compute the probability of any event involving that random variable. For example, suppose that \(X\) is the number of free throws made by a basketball player out of two attempts, so that \(X\) can take on the three values 0, 1, 2. Assume that the pdf of \(X\) is given by
\[f(0) = 0.20, \quad f(1) = 0.44, \quad \text{and} \quad f(2) = 0.36\]
The three probabilities sum to one, as they must. Using this pdf, we can calculate the probability that the player makes at least one free throw: \(P(X \geq 1) = P(X = 1) + P(X = 2) = 0.44 + 0.36 = 0.80\).
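The free-throw calculation can be written as a small sketch. The pdf values are those given in the text; storing them as exact fractions and the helper name `prob` are illustrative choices.

```python
from fractions import Fraction

# pdf of X = number of free throws made out of two attempts (values from the text).
pmf = {0: Fraction(20, 100), 1: Fraction(44, 100), 2: Fraction(36, 100)}

def prob(event, pmf):
    """Probability of an event, given as a predicate on the outcomes of X."""
    return sum(p for x, p in pmf.items() if event(x))

total = sum(pmf.values())                      # must equal one
p_at_least_one = prob(lambda x: x >= 1, pmf)   # P(X >= 1) = f(1) + f(2)
p_exactly_two = prob(lambda x: x == 2, pmf)

print(total, float(p_at_least_one), float(p_exactly_two))
```

Any event involving X reduces to adding up the pdf over the outcomes in that event, which is all `prob` does.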
A variable \(X\) is a continuous random variable if it takes on any real value with zero probability. This definition is somewhat counterintuitive because in any application we eventually observe some outcome for a random variable. The idea is that a continuous random variable \(X\) can take on so many possible values that we cannot count them or match them up with the positive integers, so logical consistency dictates that \(X\) can take on each value with probability zero. While measurements are always discrete in practice, random variables that take on numerous values are best treated as continuous. For example, the most refined measure of the price of a good is in terms of cents. We can imagine listing all possible values of price in order (even though the list may continue indefinitely), which technically makes price a discrete random variable. However, there are so many possible values of price that using the mechanics of discrete random variables is not feasible.
We can define a probability density function for continuous random variables, and, as with discrete random variables, the pdf provides information on the likely outcomes of the random variable. However, because it makes no sense to discuss the probability that a continuous random variable takes on a particular value, we use the pdf of a continuous random variable only to compute events involving a range of values. For example, if \(a\) and \(b\) are constants where \(a < b\), the probability that \(X\) lies between the numbers \(a\) and \(b\), \(P(a \leq X \leq b)\), is the area under the pdf between points \(a\) and \(b\). The entire area under the pdf must always equal one. When computing probabilities for continuous random variables, it is easiest to work with the cumulative distribution function (cdf). If \(X\) is any random variable, then its cdf is defined for any real number \(x\) by
\[F(x) \equiv P(X \leq x) \qquad (9)\]
For discrete random variables, equation (9) is obtained by summing the pdf over all values \(x_j\) such that \(x_j \leq x\). For a continuous random variable, \(F(x)\) is the area under the pdf, \(f\), to the left of the point \(x\). Because \(F(x)\) is simply a probability, it is always between 0 and 1.
Two important properties of cdfs that are useful for computing probabilities are the following:
For any number \(c\), \(P(X > c) = 1 - F(c)\)
For any numbers \(a < b\), \(P(a < X \leq b) = F(b) - F(a)\)
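For a discrete random variable, the cdf and its two properties can be sketched as follows. The example assumes the free-throw pdf used earlier in these notes (0.20, 0.44, 0.36); the function name `cdf` is an illustrative choice.

```python
from fractions import Fraction

# Discrete pdf reused from the free-throw example (values from the notes).
pmf = {0: Fraction(20, 100), 1: Fraction(44, 100), 2: Fraction(36, 100)}

def cdf(x, pmf):
    """F(x) = P(X <= x): sum the pdf over all outcomes xj with xj <= x."""
    return sum((p for xj, p in pmf.items() if xj <= x), Fraction(0))

# Property 1: P(X > c) = 1 - F(c), here with c = 0.
p_greater_0 = 1 - cdf(0, pmf)

# Property 2: P(a < X <= b) = F(b) - F(a), here with a = 0 and b = 2.
p_between = cdf(2, pmf) - cdf(0, pmf)

print([float(cdf(x, pmf)) for x in (-1, 0, 1, 2, 5)])
print(float(p_greater_0), float(p_between))
```

The printed cdf values rise from 0 to 1 in steps, which is the typical shape of a cdf for a discrete random variable.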
Let \(X\) and \(Y\) be discrete random variables. Then, \((X, Y)\) have a joint distribution, which is fully described by the joint probability density function of \((X, Y)\):
\[f_{X,Y}(x, y) = P(X = x, Y = y)\]
where the right-hand side is the probability that \(X = x\) and \(Y = y\). When \(X\) and \(Y\) are continuous, a joint pdf can also be defined, but we will not cover such details because joint pdfs for continuous random variables are not used explicitly in this text. In one case, it is easy to obtain the joint pdf if we are given the pdfs of \(X\) and \(Y\). In particular, random variables \(X\) and \(Y\) are said to be independent if, and only if,
\[f_{X,Y}(x, y) = f_X(x) f_Y(y) \qquad (10)\]
for all \(x\) and \(y\), where \(f_X\) is the pdf of \(X\) and \(f_Y\) is the pdf of \(Y\). In the context of more than one random variable, the pdfs \(f_X\) and \(f_Y\) are often called marginal probability density functions to distinguish them from the joint pdf \(f_{X,Y}\). This definition of independence is valid for discrete and continuous random variables. To understand the meaning of equation (10), it is easiest to deal with the discrete case. If \(X\) and \(Y\) are discrete, then equation (10) is the same as \(P(X = x, Y = y) = P(X = x)P(Y = y)\).
In econometrics, we are usually interested in how one random variable, call it \(Y\), is related to one or more other variables. For now, suppose that there is only one variable whose effects we are interested in, call it \(X\). The most we can know about how \(X\) affects \(Y\) is contained in the conditional distribution of \(Y\) given \(X\). This information is summarized by the conditional probability density function, defined by
\[f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)} \qquad (11)\]
for all values of \(x\) such that \(f_X(x) > 0\). The interpretation of equation (11) is most easily seen when \(X\) and \(Y\) are discrete. Then,
\[f_{Y|X}(y|x) = P(Y = y \mid X = x)\]
where the right-hand side is read as "the probability that \(Y = y\) given that \(X = x\)." When \(Y\) is continuous, \(f_{Y|X}(y|x)\) is not interpretable directly as a probability, for the reasons discussed earlier, but conditional probabilities are found by computing areas under the conditional pdf.
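To make the joint, marginal, and conditional pdfs concrete, here is a minimal sketch for two binary random variables. The joint probabilities are invented for illustration and do not come from the notes.

```python
from fractions import Fraction as F

# Hypothetical joint pdf of (X, Y); keys are (x, y) pairs, values sum to one.
joint = {
    (0, 0): F(1, 10), (0, 1): F(3, 10),
    (1, 0): F(2, 10), (1, 1): F(4, 10),
}

def marginal(joint, axis):
    """Marginal pdf: sum the joint pdf over the other variable."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], F(0)) + p
    return out

f_x = marginal(joint, 0)
f_y = marginal(joint, 1)

def conditional_y_given_x(joint, f_x, x):
    """Conditional pdf of Y given X = x: joint divided by the marginal of X."""
    return {y: p / f_x[x] for (xx, y), p in joint.items() if xx == x}

y_given_x0 = conditional_y_given_x(joint, f_x, 0)
y_given_x1 = conditional_y_given_x(joint, f_x, 1)

# Independence holds only if the joint factors into the marginals everywhere.
independent = all(joint[(x, y)] == f_x[x] * f_y[y] for (x, y) in joint)

print(f_x, f_y, y_given_x0, y_given_x1, independent)
```

In this made-up table the conditional distribution of Y changes with the value of X, so the joint pdf does not factor into the marginals and the variables are not independent.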
Variance is sometimes denoted \(\sigma_X^2\), or simply \(\sigma^2\), when the context is clear. From the above equation, it follows that the variance is always nonnegative. Properties of variance:
Another measure that is related to variance is standard deviation, denoted as \(sd(X)\). Standard deviation is simply the positive square root of the variance: \(sd(X) = \sqrt{Var(X)}\).
2.5.3 Covariance
Let \(\mu_X = E(X)\) and \(\mu_Y = E(Y)\) and consider the random variable \((X - \mu_X)(Y - \mu_Y)\). Now, if \(X\) is above its mean and \(Y\) is above its mean, then \((X - \mu_X)(Y - \mu_Y) > 0\). This is also true if \(X < \mu_X\) and \(Y < \mu_Y\). On the other hand, if \(X > \mu_X\) and \(Y < \mu_Y\), or vice versa, then \((X - \mu_X)(Y - \mu_Y) < 0\). How, then, can this product tell us anything about the relationship between \(X\) and \(Y\)? The covariance between two random variables \(X\) and \(Y\), sometimes called the population covariance to emphasize that it concerns the relationship between two variables describing a population, is defined as the expected value of the product \((X - \mu_X)(Y - \mu_Y)\):
\[Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]\]
which is sometimes denoted \(\sigma_{XY}\). If \(\sigma_{XY} > 0\), then, on average, when \(X\) is above its mean, \(Y\) is also above its mean. If \(\sigma_{XY} < 0\), then, on average, when \(X\) is above its mean, \(Y\) is below its mean. Covariance measures the amount of linear dependence between two random variables. A positive covariance indicates that two random variables move in the same direction, while a negative covariance indicates they move in opposite directions. Interpreting the magnitude of a covariance can be a little tricky, as we will see shortly. Properties of covariance:
\[Cov(a_1 X + b_1, a_2 Y + b_2) = a_1 a_2\, Cov(X, Y)\]
An important implication of the second property is that the covariance between two random variables can be altered simply by multiplying one or both of the random variables by a constant. This is important in economics because monetary variables, inflation rates, and so on can be defined with different units of measurement without changing their meaning.
2.5.4 Correlation
Suppose we want to know the relationship between amount of education and annual earnings in the working population. We could let X denote education and Y denote earnings and then compute their covariance. But the answer we get will depend on how we choose to measure education and earnings. The second property of covariance implies that the covariance between education and earnings depends on whether earnings are measured in dollars or thousands of dollars, or whether education is measured in months or years. It is pretty clear that how we measure these variables has no bearing on how strongly they are related. But the covariance between them does depend on the units of measurement. The fact that the covariance depends on units of measurement is a deficiency that is overcome by the correlation coefficient between X and Y:
\[Corr(X, Y) = \frac{Cov(X, Y)}{sd(X)\,sd(Y)} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}\]
The correlation coefficient between \(X\) and \(Y\) is sometimes denoted \(\rho_{XY}\) (and is sometimes called the population correlation). Properties of correlation:
If \(a_1 a_2 > 0\), \(Corr(a_1 X + b_1, a_2 Y + b_2) = Corr(X, Y)\)
If \(a_1 a_2 < 0\), \(Corr(a_1 X + b_1, a_2 Y + b_2) = -Corr(X, Y)\)
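The contrast between covariance and correlation under a change of units can be seen in a short sketch. The four (education, earnings) pairs below are invented and treated as an equally likely population; only the scaling behavior matters, not the numbers themselves.

```python
import math

# Hypothetical population: (education in years, earnings in $1000s), equally likely.
points = [(10, 20.0), (12, 30.0), (14, 35.0), (16, 55.0)]
xs = [p[0] for p in points]
ys = [p[1] for p in points]

def mean(vals):
    return sum(vals) / len(vals)

def cov(a, b):
    """Population covariance E[(A - mu_A)(B - mu_B)] with equal weights."""
    ma, mb = mean(a), mean(b)
    return mean([(ai - ma) * (bi - mb) for ai, bi in zip(a, b)])

def corr(a, b):
    """Correlation: covariance divided by the product of standard deviations."""
    return cov(a, b) / math.sqrt(cov(a, a) * cov(b, b))

# Change units: education in months (a1 = 12), earnings in dollars (a2 = 1000).
months = [12 * x for x in xs]
dollars = [1000 * y for y in ys]

cov_xy, cov_rescaled = cov(xs, ys), cov(months, dollars)
corr_xy, corr_rescaled = corr(xs, ys), corr(months, dollars)

# Flipping the sign of one variable (a1 * a2 < 0) flips the sign of the correlation.
corr_flipped = corr(xs, [-y for y in ys])

print(cov_xy, cov_rescaled, corr_xy, corr_rescaled, corr_flipped)
```

The covariance is multiplied by 12 × 1000 when the units change, while the correlation is unaffected, which is why correlation is the preferred measure of the strength of a linear relationship.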
Statistical inference involves learning something about a population given the availability of a sample from that population. By population, we mean any well-defined group of subjects, which could be individuals, firms, cities, or many other possibilities. By "learning," we can mean several things, which are broadly divided into the categories of estimation and hypothesis testing. A couple of examples may help you understand these terms. In the population of all working adults in the United States, labor economists are interested in learning about the return to education, as measured by the average percentage increase in earnings given another year of education. It would be impractical and costly to obtain information on earnings and education for the entire working population in the United States, but we can obtain data on a subset of the population. Using the data collected, a labor economist may report that his or her best estimate of the return to another year of education is 7.5%. This is an example of a point estimate. Or, she or he may report a range, such as "the return to education is between 5.6% and 9.4%." This is an example of an interval estimate.
of \(\mu\) is the average of the random sample:
\[\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i\]
\(\bar{Y}\) is called the sample average but, unlike in section 1.1 where we defined the sample average of a set of numbers as a descriptive statistic, \(\bar{Y}\) is now viewed as an estimator. Given any outcome of the random variables \(Y_1, \ldots, Y_n\), we use the same rule to estimate \(\mu\): we simply average them. For actual data outcomes \(y_1, \ldots, y_n\), the estimate is just the average in the sample: \(\bar{y} = (y_1 + y_2 + \cdots + y_n)/n\).
More generally, an estimator \(W\) of a parameter \(\theta\) can be expressed as an abstract mathematical formula:
\[W = h(Y_1, Y_2, \ldots, Y_n)\]
for some known function \(h\) of the random variables \(Y_1, Y_2, \ldots, Y_n\). As with the special case of the sample average, \(W\) is a random variable because it depends on the random sample: as we obtain different random samples from the population, the value of \(W\) can change. When a particular set of numbers, say, \(\{y_1, y_2, \ldots, y_n\}\), is plugged into the function \(h\), we obtain an estimate of \(\theta\), denoted \(w = h(y_1, \ldots, y_n)\). For evaluating estimation procedures, we study various properties of the probability distribution of the random variable \(W\). The distribution of an estimator is often called its sampling distribution, because this distribution describes the likelihood of various outcomes of \(W\) across different random samples. Because there are unlimited rules for combining data to estimate parameters, we need some sensible criteria for choosing among estimators, or at least for eliminating some estimators from consideration. Therefore, we must leave the realm of descriptive statistics, where we compute things such as the sample average to simply summarize a body of data. In mathematical statistics, we study the sampling distributions of estimators.
3.2.2 Unbiasedness
In principle, the entire sampling distribution of \(W\) can be obtained given the probability distribution of \(Y_i\) and the function \(h\). It is usually easier to focus on a few features of the distribution of \(W\) in evaluating it as an estimator of \(\theta\). The first important property of an estimator involves its expected value.
Definition of unbiased estimator: An estimator, \(W\) of \(\theta\), is an unbiased estimator if \(E(W) = \theta\), for all possible values of \(\theta\).
If an estimator is unbiased, then its probability distribution has an expected value equal to the parameter it is supposed to be estimating. Unbiasedness does not mean that the estimate we get with any particular sample is equal to \(\theta\), or even very close to \(\theta\). Rather, if we could indefinitely draw random samples on \(Y\) from the population, compute an estimate each time, and then average these estimates over all random samples, we would obtain \(\theta\). This thought experiment is abstract because, in most applications, we just have one random sample to work with.
The unbiasedness of an estimator and the size of any possible bias depend on the distribution of \(Y\) and on the function \(h\). The distribution of \(Y\) is usually beyond our control (although we often choose a model for this distribution): it may be determined by nature or social forces. But the choice of the rule \(h\) is ours, and if we want an unbiased estimator, then we must choose \(h\) accordingly. Some estimators can be shown to be unbiased quite generally. We now show that the sample average \(\bar{Y}\) is an unbiased estimator of the population mean \(\mu\), regardless of the underlying population distribution.
\[E(\bar{Y}) = E\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(Y_i) = \frac{1}{n}\sum_{i=1}^{n} \mu = \frac{1}{n}(n\mu) = \mu\]
For hypothesis testing, we will need to estimate the variance \(\sigma^2\) from a population with mean \(\mu\). Letting \(Y_1, \ldots, Y_n\) denote the random sample from the population with \(E(Y) = \mu\) and \(Var(Y) = \sigma^2\), define the estimator as
\[S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2\]
which is usually called the sample variance. It can be shown that \(S^2\) is unbiased for \(\sigma^2\): \(E(S^2) = \sigma^2\). The division by \(n - 1\), rather than \(n\), accounts for the fact that the mean \(\mu\) is estimated rather than known. If \(\mu\) were known, an unbiased estimator of \(\sigma^2\) would be \(\frac{1}{n}\sum_{i=1}^{n} (Y_i - \mu)^2\), but \(\mu\) is rarely known in practice.
Although unbiasedness has a certain appeal as a property for an estimator, it has some problems. One weakness of unbiasedness is that some reasonable, and even some very good, estimators are not unbiased. We will see an example shortly. Another important weakness of unbiasedness is that unbiased estimators exist that are actually quite poor estimators. Consider estimating the mean \(\mu\) from a population. Rather than using the sample average \(\bar{Y}\) to estimate \(\mu\), suppose that, after collecting a sample of size \(n\), we discard all of the observations except the first. That is, our estimator of \(\mu\) is simply \(W = Y_1\). This estimator is unbiased because \(E(Y_1) = \mu\). Hopefully, you sense that ignoring all but the first observation is not a prudent approach to estimation: it throws out most of the information in the sample. For example, with \(n = 100\), we obtain 100 outcomes of the random variable \(Y\), but then we use only the first of these to estimate \(E(Y)\).
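Unbiasedness can be verified exactly for a small discrete population by enumerating every possible random sample, which makes the "average over all random samples" thought experiment concrete. This is only a sketch: the population distribution below and the sample size of 3 are assumptions made for illustration, and the enumeration approach is feasible only for tiny examples.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population distribution of Y (not from the notes).
pmf = {0: Fraction(1, 2), 1: Fraction(1, 4), 4: Fraction(1, 4)}
n = 3  # sample size, kept small so every sample can be listed

mu = sum(y * p for y, p in pmf.items())
sigma2 = sum((y - mu) ** 2 * p for y, p in pmf.items())

def expectation(estimator):
    """Exact E[estimator(Y_1, ..., Y_n)] over all independent samples of size n."""
    total = Fraction(0)
    for sample in product(pmf, repeat=n):
        prob = Fraction(1)
        for y in sample:
            prob *= pmf[y]  # independence: probabilities multiply
        total += prob * estimator(sample)
    return total

def sample_mean(s):
    return Fraction(sum(s), len(s))

def sample_variance(s):
    m = sample_mean(s)
    return sum((y - m) ** 2 for y in s) / (len(s) - 1)  # divide by n - 1

def variance_divided_by_n(s):
    m = sample_mean(s)
    return sum((y - m) ** 2 for y in s) / len(s)  # divide by n instead

def first_observation(s):
    return s[0]

print(mu, expectation(sample_mean), expectation(first_observation))
print(sigma2, expectation(sample_variance), expectation(variance_divided_by_n))
```

Both the sample average and the first observation have expected value equal to the population mean, so unbiasedness alone cannot distinguish them. Dividing the sum of squared deviations by n instead of n − 1 gives an expected value below the population variance in this example.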
3.2.3 The Sampling Variance of Estimators
The example at the end of the previous subsection shows that we need additional criteria to evaluate estimators. Unbiasedness only ensures that the sampling distribution of an estimator has a mean value equal to the parameter it is supposed to be estimating. This is fine, but we also need to know how spread out the distribution of an estimator is. An estimator can be equal to μ, on average, but it can also be very far away with large probability. We now obtain the variance of the sample average for estimating the mean μ from a population:
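\[Var(\bar{Y}) = Var\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} Var(Y_i) = \frac{1}{n^2}(n\sigma^2) = \frac{\sigma^2}{n}\]
The second equality uses the fact that the \(Y_i\) in a random sample are independent, so the variance of their sum is the sum of their variances.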
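The result that the sample average has variance \(\sigma^2/n\) can be checked exactly by enumeration for a small discrete population. The distribution below and the sample size of 4 are assumptions for illustration only; the comparison with the estimator that keeps just the first observation shows why the sampling variance matters.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population distribution of Y (not from the notes).
pmf = {0: Fraction(1, 2), 1: Fraction(1, 4), 4: Fraction(1, 4)}

mu = sum(y * p for y, p in pmf.items())
sigma2 = sum((y - mu) ** 2 * p for y, p in pmf.items())

def sampling_moments(n, estimator):
    """Exact mean and variance of an estimator over all independent samples of size n."""
    mean = Fraction(0)
    second_moment = Fraction(0)
    for sample in product(pmf, repeat=n):
        prob = Fraction(1)
        for y in sample:
            prob *= pmf[y]
        w = estimator(sample)
        mean += prob * w
        second_moment += prob * w * w
    return mean, second_moment - mean ** 2

def sample_mean(s):
    return Fraction(sum(s), len(s))

def first_observation(s):
    return s[0]

n = 4
mean_avg, var_avg = sampling_moments(n, sample_mean)
mean_first, var_first = sampling_moments(n, first_observation)

print(mean_avg, var_avg, sigma2 / n)
print(mean_first, var_first, sigma2)
```

Both estimators are centered on the population mean, but the sampling variance of the average shrinks with n while the variance of the first observation stays at the full population variance.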