






















































This document delves into the fundamental concepts of econometrics, focusing on random variables, their probability distributions, and their application in statistical modeling. It explores the concept of least squares estimation, a core technique for estimating parameters in linear regression models, and examines the use of projection matrices to understand the relationship between fitted values and residuals. The document also discusses measures of goodness of fit, such as the coefficient of determination, and the importance of correctly specifying the regression model to avoid over-specification and ensure unbiased estimates. It concludes with an overview of the central limit theorem and its application in hypothesis testing.
Typology: Study notes























































EXAM: 3 parts = T/F | six multiple-choice questions | six questions on a proposed empirical analysis
The test folder contains a mock of the exam.
Mid-term exam: just a simulation of the exam (mid-term on the 26th at 8.30).
Only in the December exam session will we have 4 different possibilities for the exam.
OPTION 1: the grade is determined only by the test
OPTION 2: the grade is determined 50% by the essay (due by December 10th) and 50% by the final test
OPTION 3: the grade is determined 2/5 by the midterm and 3/5 by the final exam
OPTION 4: the grade is determined 2/5 by the essay, 1/5 by the midterm and 2/5 by the final exam
Chapter one is based on the notion of models. Models give us the possibility to study what happens in the real world in an easier way. We eliminate some aspects to obtain a simplified version of the real world. A simple example is the consumption function: consumption depends on disposable income, so it is an equation. This model predicts that if disposable income goes up, consumption goes up. This is a model to explain the consumption of households. We can think of a model as a virtual reality, which is very important nowadays because we use computers.
We formulate a theory and from it we create models, which may or may not be based on mathematics. Models are what we need when we want to explain some aspects of the real world. The theory is that consumption is based on income; the model is:
$$C = \beta_1 + \beta_2 Y$$
C = consumption
$\beta_1$ = constant
$\beta_2$ = coefficient on the variable Y
Y = disposable income
Theory says that there is a positive relationship between the dependent variable C and the independent variable Y.
Not all models are accepted. For example, in this period the ECB raised interest rates by 75 basis points; in an interview the professor argued that this was a good decision, but a colleague argued that it was a bad one => models based on expectations work for the professor, but not for the colleague. There is not always a consensus.
We create models to validate phenomena observed in the real world, so we are expected to be able to estimate a model. At the end of the course we have to be able to write 10 pages explaining some financial phenomenon, such as the capital asset pricing model.
Models in economics are based on random variables. A model can be defined as a set of data-generating processes (DGPs). A DGP is a unique virtual reality, so the model is a set of virtual realities. Going back to the example of the consumption equation, if I provide a specific value for each parameter, I create a unique virtual reality; a model is a set because I haven't specified the values, so it contains infinite possibilities. We may also use a model just for description or forecasting.
An autoregressive model, for example $y_t = \beta_1 + \beta_2 y_{t-1} + u_t$, is of this kind: this type of model does not provide any economic explanation for a specific variable, but it is very useful for forecasting, for example, the CPI (consumer price index). There is no economic causality between the variables, so models do not always provide an economic explanation of the phenomenon analyzed.
What to know: Theory - Model - DGP
When to communicate the decision? Can I decide to refuse a grade of the midterm or of
the essay?
Regression models form the core of the discipline of econometrics. The standard method used to estimate a model is called ordinary least squares (OLS). The most elementary type of regression is called the simple linear regression model, "simple" because there is just one explanatory variable:
$$y_t = \beta_1 + \beta_2 x_t + u_t \qquad (2.01)$$
$\beta_1$ = intercept or constant
$\beta_2$ = slope coefficient
On the right side of this equation we have just one variable ($x_t$). Three of the four elements in the equation carry a subscript. Why? The subscript t is used to index the observations of a sample. When we want to estimate an econometric model, we need to collect data. So we have a sample size, which in this case is denoted by n, and the subscript t indicates which of the n observations we are considering. The subscript t runs from 1 to n and is always a natural number.
Each observation comprises an observation on a dependent variable, which appears on the left side ($y_t$), and an observation on a single explanatory / independent variable ($x_t$). Therefore, this relation links the observations on the dependent variable and the explanatory variable, for each observation, in terms of two unknown parameters ($\beta_1$ and $\beta_2$) and an unobserved disturbance term or error term ($u_t$). "Error term" is not the best name, because introducing u in this equation does not involve any error at all.
Of the five quantities in the equation, two are observed ($y_t$ and $x_t$) and three are not ($\beta_1$, $\beta_2$ and $u_t$). Moreover, three ($y_t$, $x_t$ and $u_t$) are specific to observation t, while $\beta_1$ and $\beta_2$ are common to all n observations, so in some sense they are constants.
Now we can move to the example of the consumption function: $C_t = \beta_1 + \beta_2 I_t + u_t$ (I = disposable income). Suppose that t indicates time, for example a year => yearly data. Then $y_t$ is household consumption measured in year t and $x_t$ is the measured disposable income of households in the same year. Read this way, the equation represents a consumption function, and it is a model.
$\beta_2$ = marginal propensity to consume
$\beta_1$ = autonomous consumption
The purpose of this model is to try to explain the observed values of the dependent variable (consumption) in terms of the explanatory variable (disposable income). According to the equation, the value of $y_t$ is given for each t by a linear function of $x_t$ plus the disturbance. The first word we need to explain is linear: linearity here is in terms of y and the coefficients.
The regression function $\beta_1 + \beta_2 x_t$ is the deterministic part of the equation, meaning that, unlike the disturbance term, it is the part whose value we are able to determine precisely.
In all cases, though, it is assumed that $u_t$ is a random variable. Most commonly, it is assumed that, whatever the value of $x_t$, the expectation of the random variable $u_t$ is zero. This assumption usually serves to identify the unknown parameters $\beta_1$ and $\beta_2$, in the sense that, under the assumption, equation (2.01) can be true only for specific values of those parameters.
The random variable, called the disturbance term, is there so that the relationship, for example between consumption and disposable income, is not deterministic. We cannot find an exact deterministic relationship because economic data are random, so we need to introduce some randomness.
values taken on by the random variable, we are interested in a subset of all the possible values; such a subset is called an EVENT. We attach a probability to each such event.
Imagine we have a discrete random variable (≠ continuous random variable), meaning that there is a finite or countably infinite set of possible values x1, x2, x3, … that the random variable can take. I attach a probability p to every element x. The sum of all these p is equal to 1.
In regression analysis we sometimes have discrete random variables, but these are generally count data.
Another possibility is that x is a continuous variable; the dependent variable in a regression model is normally continuous. With continuous random variables we still need to attach probabilities => for any random variable we attach probabilities to EVENTS defined over the range of possible values taken by the random variable. The probability distribution of a continuous random variable can be represented by a cumulative distribution function (or CDF). In general this function is denoted by F and gives the probability that the random variable X takes values smaller than or equal to a specific x:
$$F(x) = \Pr(X \le x)$$
X = generic random variable
x = a specific value taken by the random variable
=> x is a realization of the random variable X
To define / attach a probability distribution we normally need three different rules:
1. a random variable can take on a range of possible values; in general we take a subset of all those values (an event) and attach to it a probability between 0 and 1;
2. the null set has probability 0 and the set of all possible values has probability 1;
3. the probability assigned to the union of two disjoint events is the sum of the probabilities assigned to those disjoint events.
The CDF tends to zero as we move to the left => as x goes to $-\infty$. If x is very small, the CDF up to that point must be very small (tends to zero). This follows because the event (X ≤ x) tends to the null set as x → $-\infty$, and the null set has probability 0. By similar reasoning, F(x) tends to 1 when x → $+\infty$, because then the event (X ≤ x) tends to the entire real line. Further, F(x) must be a weakly increasing function of x.
For a continuous r.v., the CDF assigns probabilities to every interval on the real line. However, if
we try to assign a probability to a single point, the result is always just zero. Suppose that X is a
scalar random variable with CDF F(x).
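PDF = Probability Density Function: it is the derivative of the cumulative distribution function (which does not always exist), $f(x) = F'(x)$. Integrating the density over the whole real line gives
$$\int_{-\infty}^{+\infty} f(x)\,dx = F(+\infty) - F(-\infty) = 1,$$
since $F(+\infty) = 1$ and $F(-\infty) = 0$.
Probabilities can be computed in terms of the density as well as the CDF =>
$$\Pr(a \le X \le b) = \int_a^b f(x)\,dx = F(b) - F(a)$$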
This is the general definition, from that we try to define some important distributions. The most
common example of a continuous distribution is provided by the normal distribution. This is the
distribution that generates the famous or infamous “bell curve” sometimes thought to influence
students’ grade distributions. The fundamental member of the normal family of distributions is the
standard normal distribution (standard normal = when we set the expected value to zero and the
standard deviation to 1). It is a continuous scalar distribution, defined on the entire real line. The
density of the standard normal distribution is often denoted $\phi(\cdot)$.
Its explicit expression:
$$\phi(x) = (2\pi)^{-1/2} \exp\left(-\tfrac{1}{2}x^2\right)$$
Unlike $\phi(\cdot)$ (PDF), the CDF, usually denoted $\Phi(\cdot)$, has no elementary closed-form expression:
$$\Phi(x) = \int_{-\infty}^{x} \phi(y)\,dy$$
What is the probability of drawing a positive number from a standard normal distribution, looking at the following figure? 0.5 => we have to consider the area to the right of 0.
The expected value of this distribution is exactly zero (the normal distribution is symmetric => same area on the left-hand side and on the right-hand side).
The area under the curve is equal to 1 => the integral under the red line is one.
What is the probability of drawing a number larger than 10? Almost 0.
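The probabilities discussed above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the CDF is computed from the error function via the standard identity Φ(x) = (1 + erf(x/√2))/2.

```python
import math

def std_normal_pdf(x: float) -> float:
    """Density phi(x) of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(x: float) -> float:
    """CDF Phi(x), obtained from the error function erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Pr(X > 0): half of the area lies to the right of zero.
p_positive = 1.0 - std_normal_cdf(0.0)

# Pr(X > 10): so far in the right tail that it is numerically zero.
p_above_10 = 1.0 - std_normal_cdf(10.0)

# Pr(-1 <= X <= 1) = F(b) - F(a), roughly 0.68.
p_between = std_normal_cdf(1.0) - std_normal_cdf(-1.0)

print(p_positive, p_above_10, p_between)
```

The last line uses the rule Pr(a ≤ X ≤ b) = F(b) − F(a), so no numerical integration of the density is needed.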
ASK ABOUT THE EXAMPLE ON THE PDF GRAPH: HOW TO FIND THE PROBABILITY OF THE
The expectation of a random variable is often referred to as its first moment. The so-called higher moments, if they exist, are the expectations of the r.v. raised to a power. Thus the second moment of a random variable X is the expectation of $X^2$, the third moment is the expectation of $X^3$, and so on. In general, the $k$-th moment of a continuous random variable X is:
$$m_k(X) = \int_{-\infty}^{+\infty} x^k f(x)\,dx$$
=> the $k$-th moment of X is the integral between $-\infty$ and $+\infty$ of x to the power of k times f(x)
The moments depend on the distribution rather than a specific random variable. For this reason,
we often speak of the moments of the distribution rather than the moments of a specific random
variable. These moments are uncentered.
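Central moments are obtained by subtracting the expectation from the random variable before raising it to the power k and taking the expected value:
$$\mu_k = E\big[(X - E(X))^k\big] = \int_{-\infty}^{+\infty} (x - \mu)^k f(x)\,dx$$
where $\mu = E(X)$. For a discrete X taking m possible values, the $k$-th central moment is:
$$\mu_k = \sum_{i=1}^{m} p(x_i)(x_i - \mu)^k$$
The most important central moment is the second central moment, which is called the variance. It is frequently written as Var(X), and $\sigma^2$ is a common notation for it. The square is useful because it makes it immediately clear that the variance cannot be a negative number.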
The positive square root of the variance, σ, is called the standard deviation of the distribution.
Estimates of standard deviations are often referred to as standard errors, especially when the
random variable in question is a parameter estimator.
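As a small worked illustration of uncentered and central moments, the sketch below computes them for a discrete random variable. The fair six-sided die is an assumed example, not taken from the notes, and the function names are illustrative.

```python
import math

# Assumed example: a fair die, values 1..6 each with probability 1/6.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

def moment(k: int) -> float:
    """k-th uncentered moment: sum of p(x_i) * x_i**k."""
    return sum(p * x ** k for x, p in zip(values, probs))

def central_moment(k: int) -> float:
    """k-th central moment: sum of p(x_i) * (x_i - mu)**k."""
    mu = moment(1)
    return sum(p * (x - mu) ** k for x, p in zip(values, probs))

mean = moment(1)               # first moment (expectation)
variance = central_moment(2)   # second central moment
std_dev = math.sqrt(variance)  # positive square root of the variance

print(mean, variance, std_dev)
```

The variance can also be recovered from the uncentered moments as `moment(2) - mean**2`, which is a useful consistency check.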
A random number generator is available on every computer; the numbers are generated from a uniform distribution. A uniform draw is a number between 0 and 1 (in general). Starting from the uniform distribution, some algorithms make it possible to obtain other distributions => these are pseudo-random variables, since they are produced in a deterministic way.
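One such algorithm is the inverse-transform method: a uniform draw is passed through the inverse of the target CDF. The sketch below is an illustration only; the exponential target distribution, the rate, and the seed are assumptions chosen for the example.

```python
import math
import random

def exponential_draw(rng: random.Random, rate: float) -> float:
    """Turn one uniform draw into an exponential draw by inverting the CDF
    F(x) = 1 - exp(-rate * x)."""
    u = rng.random()  # uniform on [0, 1)
    return -math.log(1.0 - u) / rate

rng = random.Random(12345)  # fixed seed: the sequence is fully deterministic
draws = [exponential_draw(rng, 2.0) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)  # should be close to 1 / rate = 0.5

# Re-seeding reproduces exactly the same "random" numbers, which is why
# they are called pseudo-random.
rng_again = random.Random(12345)
same_first = exponential_draw(rng_again, 2.0) == draws[0]

print(sample_mean, same_first)
```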
Multivariate Distribution => rather than a single random variable, we may have a vector of random variables. It can be thought of as several scalar random variables that have a single, joint distribution. For simplicity, we will focus on the case of bivariate random variables, where the vector has two elements. A continuous, bivariate random variable ($X_1$, $X_2$) has a distribution function:
$$F(x_1, x_2) = \Pr\big((X_1 \le x_1) \cap (X_2 \le x_2)\big)$$
Thus $F(x_1, x_2)$ is the joint probability that both $X_1 \le x_1$ and $X_2 \le x_2$. For continuous variables, we can differentiate the joint distribution function with respect to $x_1$ and $x_2$, and we get the joint density:
$$f(x_1, x_2) = \frac{\partial^2 F(x_1, x_2)}{\partial x_1\,\partial x_2}$$
A very important concept is statistical independence. Two variables are statistically independent if we can factorize their joint CDF into the product of the two marginal CDFs:
$$F(x_1, x_2) = F(x_1, \infty)\,F(\infty, x_2)$$
The first factor here is the joint probability that X1 ≤ x1 and X2 ≤ ∞. Since the second inequality imposes no constraint, this factor is just the probability that X1 ≤ x1, that is, the marginal CDF of X1.
Factorization means that we can decompose the joint distribution into the two marginal distributions; the same holds for the joint density, which becomes the product of the two marginal densities. This means that the two variables are statistically independent.
Independence (no relationship at all) is stronger than zero covariance: zero covariance means no linear dependence, but it does not exclude a strong nonlinear dependence.
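A small numerical example may make this concrete. In the sketch below, X takes the values −1, 0, 1 with equal probability and Y = X²; this distribution is an assumption chosen for illustration. Y is completely determined by X, yet the covariance is exactly zero.

```python
from fractions import Fraction as F

# Assumed joint distribution of (X, Y) with Y = X**2.
joint = {(-1, 1): F(1, 3), (0, 0): F(1, 3), (1, 1): F(1, 3)}

e_x = sum(x * p for (x, _), p in joint.items())
e_y = sum(y * p for (_, y), p in joint.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
covariance = e_xy - e_x * e_y  # zero: no linear dependence

# Independence would require Pr(X=0, Y=0) = Pr(X=0) * Pr(Y=0).
p_x0 = sum(p for (x, _), p in joint.items() if x == 0)
p_y0 = sum(p for (_, y), p in joint.items() if y == 0)
p_joint_00 = joint[(0, 0)]
independent = p_joint_00 == p_x0 * p_y0  # False: the joint does not factorize

print(covariance, p_joint_00, p_x0 * p_y0, independent)
```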
Conditional probabilities: Suppose that A and B are any two events. Then the probability of event A conditional on B, or given B, is denoted as Pr(A | B) and is defined implicitly by the equation: Pr(A ∩ B) = Pr(B) Pr(A | B)
The idea behind this is that if we know that the event B has been realized, this knowledge can provide information about whether the event A has also been realized.
=> the conditional density, or conditional PDF, is defined as
$$f(x_1 \mid x_2) = \frac{f(x_1, x_2)}{f(x_2)}$$
Conditional expectations: what is your expectation for an event, and how can your expectation change if you know that some other event has happened? In this case the unconditional expectation becomes the conditional expectation.
The way we write the conditional expectation is $E(X_1 \mid x_2)$ => the expected value of $X_1$ conditional on the event $X_2 = x_2$.
In econometrics we create a linear regression model, which is a model of the conditional expectation of a variable. For example, starting from the formula $y_t = \beta_1 + \beta_2 x_t + u_t$, we can try to find the expectation of consumption given disposable income => $E[y_t \mid x_t]$. With regression analysis we are trying to model the conditional expectation of the dependent variable with respect to some explanatory variables.
We have some properties associated with conditional expectations:
1. Law of iterated expectations: $E(E(X_1 \mid X_2)) = E(X_1)$. We take the expectation of the conditional expectation and we get the unconditional expectation.
2. Any deterministic function of a conditioning variable is its own conditional expectation. In other words: $E(h(X_2) \mid X_2) = h(X_2)$. This means that if I condition on $X_2$ having been realized and then compute the expectation of $X_2$, I already know the value that $X_2$ has taken, so the result must be $X_2$. If I have a range of possible values for a random variable but I know that one of those values has been realized => it makes no sense to compute the expectation, as I know the answer.
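The first property (averaging the conditional expectations gives back the unconditional expectation) can be verified on a small discrete example. The joint distribution below is hypothetical and chosen only so that the arithmetic is easy to follow.

```python
from fractions import Fraction as F

# Hypothetical joint distribution Pr(X1 = x1, X2 = x2), for illustration only.
joint = {
    (1, 0): F(1, 4),
    (3, 0): F(1, 4),
    (2, 1): F(1, 8),
    (6, 1): F(3, 8),
}

def marginal_x2(x2):
    """Pr(X2 = x2), obtained by summing the joint over x1."""
    return sum(p for (_, b), p in joint.items() if b == x2)

def cond_expectation_x1(x2):
    """E(X1 | X2 = x2) = sum over x1 of x1 * Pr(x1, x2) / Pr(x2)."""
    return sum(a * p for (a, b), p in joint.items() if b == x2) / marginal_x2(x2)

x2_values = sorted({b for (_, b) in joint})

# Unconditional expectation computed directly from the joint distribution.
unconditional = sum(a * p for (a, _), p in joint.items())

# Expectation of the conditional expectation, weighted by Pr(X2 = x2).
iterated = sum(cond_expectation_x1(b) * marginal_x2(b) for b in x2_values)

print(cond_expectation_x1(0), cond_expectation_x1(1), unconditional, iterated)
```

Knowing the value of X2 changes the expectation of X1 (2 versus 5 in this example), while the weighted average of the two conditional expectations equals the unconditional one.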
Previously we proposed a definition of a DGP as something that can be simulated on a computer,
and that constitutes a unique recipe for simulation. This definition is fine for virtual reality, but,
despite some claims to the contrary, we do not think that we are living in a simulation!
For any random variable we have the expectation and the moments. Moments can be centered, giving central moments. To move from uncentered moments to central moments we subtract the expected value from the random variable. The most important central moment is the second one, which is the variance. The positive square root of the second central moment is denoted by σ and is called the standard deviation, while estimates of the standard deviation are called standard errors.
The specification of the regression model: the specification of the regression model is an important step when we want to estimate a model in an empirical analysis. In our example, it means putting disposable income on the right-hand side of the equation. More generally, I have to put all the relevant variables on the right side to explain the dependent variable.
We made the assumption that the disturbance term conditional on x (disposable income) has zero expected value:
$$E(u_t \mid x_t) = 0$$
If the assumption holds, then our model becomes a model for the conditional expectation of the dependent variable: we try to model the conditional expectation of consumption, given a certain value of disposable income.
When we take the conditional expectation in the regression model we have to distinguish between exogenous variables and endogenous ones. We want to take the conditional expectation with respect to exogenous variables. The difference is the following:
If the value of the variable is determined within the model, the variable is said to be endogenous.
If the value of the variable is determined outside the model, it is called exogenous.
In our consumption example disposable income is determined outside of the model, so it can be treated as an exogenous variable. This is not true for consumption, because we are trying to explain it through disposable income.
The conditioning set (information set) is denoted by $\Omega_t$.
The disturbance term: the disturbance term is a fundamental part of any econometric model; in fact, whether or not we can use OLS depends on how it is specified. Thus, everything depends on the assumptions we make about the disturbance term. The basic assumption is that the disturbances are IID (independent and identically distributed). This rules out serial correlation, which is very common in time series data.
Another phenomenon which contradicts IID is heteroskedasticity = the disturbances have different variances, so they do not all have the same distribution. Here the second "I" doesn't apply. When all the disturbances have the same variance, we have homoskedasticity.
After identifying the distribution of the disturbances, we can simulate an econometric model, meaning that we create a single virtual reality. We have a DGP (data-generating process) and we select one of the infinite virtual realities suggested by the economic model.
Steps to simulate a model:
1. Decide the type of data: … series = …; panel = n for the cross-sectional dimension and t for the time-series dimension.
2. Fix the sample size, n.
3. Choose the parameters (here β1 and β2) of the deterministic specification.
4. Obtain the n successive values x_t, t = 1, ..., n, of the explanatory variable. As explained above, these values may be real-world data or the output of another simulation. In Gretl we have to add a random variable with min 10 and max 20, name = disposable income.
5. Evaluate the n successive values of the regression function (the regression function is the deterministic part of the regression model => the part on the right of the equation without the disturbance term): β1 + β2 x_t, for t = 1, ..., n. In the example, the regression function = 1.5 + 2.5 x (x = disposable income).
6. Choose the probability distribution of the disturbances, if necessary specifying parameters such as its expectation and variance;
7. Use a random-number generator to generate the n successive and mutually independent values u_t of the disturbances;
8. Form the n successive values y_t of the dependent variable by adding the disturbances to the values of the regression function.
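The steps above can be sketched in a few lines of Python instead of Gretl. This is only an illustration under stated assumptions: the explanatory variable is drawn from a uniform distribution on [10, 20], the disturbances are taken to be normal with expectation 0 and standard deviation 1, and the sample size and seed are arbitrary. The parameter values 1.5 and 2.5 are the ones used in the notes; the function name is hypothetical.

```python
import random

def simulate_dgp(n, beta1, beta2, sigma, seed):
    """Simulate one sample from y_t = beta1 + beta2 * x_t + u_t."""
    rng = random.Random(seed)
    # Explanatory variable (assumed uniform between 10 and 20).
    x = [rng.uniform(10.0, 20.0) for _ in range(n)]
    # Regression function: the deterministic part of the model.
    regression_function = [beta1 + beta2 * xt for xt in x]
    # Disturbances: independent draws, assumed N(0, sigma^2).
    u = [rng.gauss(0.0, sigma) for _ in range(n)]
    # Dependent variable = regression function + disturbance.
    y = [f + ut for f, ut in zip(regression_function, u)]
    return x, y, u

x, y, u = simulate_dgp(n=100, beta1=1.5, beta2=2.5, sigma=1.0, seed=42)
print(x[:3], y[:3])
```

Each choice of parameters, distribution, and seed corresponds to one DGP, that is, one of the many virtual realities contained in the model.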
Linear and Nonlinear Regression Models
N.B. the following model is said to be a loglinear regression model, as it becomes linear after taking logs.
If a matrix has the same number of columns and rows, it is said to be square. A square matrix A is symmetric if $A_{ij} = A_{ji}$ for all i and j. Symmetric matrices occur very frequently in econometrics. A square matrix is said to be diagonal if $A_{ij} = 0$ for all $i \ne j$; in this case, the only nonzero entries are those on what is called the principal diagonal.
Sometimes a square matrix has all zeros above or below the principal diagonal. Such a matrix is said to be triangular. If the nonzero elements are all above the diagonal, it is said to be upper-triangular; if the nonzero elements are all below the diagonal, it is said to be lower-triangular.
The transpose of a matrix is obtained by interchanging its row and column subscripts. A matrix A is symmetric if and only if $A = A^\top$.
Arithmetic Operations on Matrices
Matrix multiplication is based on what is called the inner product, or scalar product, or sometimes dot product of two vectors. Suppose that a and b are n-vectors. Then their inner product is:
$$a^\top b = \sum_{i=1}^{n} a_i b_i$$
When two matrices are multiplied together, the ij-th element of the result is equal to the inner product of the i-th row of the first matrix with the j-th column of the second matrix. Thus, if C = AB, where A has m columns and B has m rows, $C_{ij} = \sum_{k=1}^{m} A_{ik} B_{kj}$.
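The row-by-column rule can be written out directly. The sketch below builds matrix multiplication from the inner product, using plain Python lists; the matrices A and B are arbitrary illustrative values.

```python
def inner(a, b):
    """Inner product of two vectors of the same length."""
    return sum(ai * bi for ai, bi in zip(a, b))

def matmul(A, B):
    """Element (i, j) of AB is the inner product of row i of A
    with column j of B."""
    cols_B = list(zip(*B))  # columns of B
    return [[inner(row, col) for col in cols_B] for row in A]

A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]
C = matmul(A, B)
print(C)  # [[19, 22], [43, 50]]
```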
The relationship between x and y is not linear. Despite that, these models are considered to be linear, because linearity in econometrics refers to the relationship between the dependent variable and the coefficients of the explanatory variables => all three of these models are linear regression models.
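Let y denote an n-vector with typical element $y_t$, u an n-vector with typical element $u_t$, X an n × 2 matrix that consists of a column of 1s and a column with typical element $x_t$, and β a 2-vector with typical element $\beta_i$, i = 1, 2. Thus we have:
$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}, \quad u = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}$$
When a regression model is written in the form below, the separate columns of the matrix X are called regressors, and the column vector y is called the regressand.
This entire system of n equations can be written very compactly as
$$y = X\beta + u$$
As we will see in the next section, the least-squares estimates of β depend only on the matrix $X^\top X$ and the vector $X^\top y$.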
Partitioned Matrices
There are many ways of writing an n×k matrix X that are intermediate between the straightforward notation X and the full element-by-element decomposition of X.
We might wish to separate the columns while grouping the rows:
$$X = \begin{bmatrix} x_1 & x_2 & \cdots & x_k \end{bmatrix}$$
or we might wish to separate the rows but not the columns:
$$X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}$$
To save space, we can also write this as $X = [X_1 \;\vdots\; X_2 \;\vdots\; \cdots \;\vdots\; X_n]$. There is no restriction on how a matrix can be partitioned, so long as all the sub-matrices or blocks fit together correctly.
Our aim is to estimate β = the unknown coefficients; we can do this by estimating the model.
Estimation techniques
Population mean = the mean of an entire population. The DGP is the econometric way of describing the population.
Sample mean = mean of a sample selected from a population
The first way to estimate a model is called the method of moments. Imagine for the moment that β2 is equal to zero in the DGP => no explanatory variable, just the constant. Now the model can be rewritten as
$$y_t = \beta_1 + u_t$$
In this case $E(y_t)$ is the unconditional expectation, because we have no x. This means that the expectation of y is simply $\beta_1$.
The method of moments applies directly, and we define the estimator of β by the sample mean of $y_t$.
As the name (method of moments) implies, it can be used with moments other than the
expectation. In general, the method of moments estimates population moments by the
corresponding sample moments. However, we cannot apply the method of moments directly to
regression models, because, except in one trivial case that we discuss first, the parameters we
wish to estimate are not population means.
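The expectation of the disturbance at the population level is equal to zero. We can use this result to impose the same condition in our sample.
From the DGP we can derive an infinite number of samples, but we have to choose just one of them. If the expectation of u is equal to zero, then the individual values of u within our sample are not 0, but on average they must be equal to 0:
$$\frac{1}{n}\sum_{i=1}^{n} u_i = \bar{u} = 0$$
which is equivalent to saying $\sum_{i=1}^{n} u_i = 0$, or $\sum_{i=1}^{n} (y_i - \beta_1) = 0$, where $u_1 = y_1 - \beta_1$, $u_2 = y_2 - \beta_1$, ..., $u_n = y_n - \beta_1$
=> $\sum_{i=1}^{n} y_i - n\beta_1 = 0$
=> $\hat\beta_1 = \frac{1}{n}\sum_{i=1}^{n} y_i$: it is an estimated value.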
In a regression model where we have just the constant and no explanatory variable, the estimated β ($\hat\beta$) is simply the sample average of the dependent variable. $\hat\beta$: estimate of the β used in the DGP.
Every time I compute a sample average, it is like running a regression where the constant is the only regressor*.
Model DGP: $y_t = \beta + u_t$, assuming $E(u_t) = 0$
We have one equation in one unknown, so we need at least one constraint in order to derive $\hat\beta$ correctly. Therefore, if we assume that the first moment of the distribution of the disturbance term = 0, we may think that the same applies at the sample level. We are translating an assumption on the distribution of the disturbance term into an equivalent assumption for the sample.
Thanks to this constraint: $\sum_t (y_t - \beta) = 0$ (the term $y_t - \beta$ is $u_t$), which is the same as saying $\sum_t y_t - n\beta = 0$
=> $\hat\beta = \frac{1}{n}\sum_t y_t = \bar{y}$ (sample average of the y variable)
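A tiny numerical check of this result: imposing that the residuals sum to zero pins down the estimate as the sample mean. The data below are made-up numbers used only for illustration.

```python
# Made-up observations on the dependent variable.
y = [2.0, 4.0, 9.0, 5.0]
n = len(y)

# Method-of-moments estimate: the sample average of y.
beta_hat = sum(y) / n

# Residuals implied by the estimate; by construction they sum to zero.
residuals = [yt - beta_hat for yt in y]
sum_of_residuals = sum(residuals)

print(beta_hat, residuals, sum_of_residuals)  # 5.0 [-3.0, -1.0, 4.0, 0.0] 0.0
```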
*See 2.37 book
For the multiple linear regression model, the expression $y_t - X_t\beta$ is equal to the disturbance for the t-th observation, but only if the correct value of the parameter vector β is used (true β). If the same expression is thought of as a function of β, with β allowed to vary arbitrarily, then above we called it the residual associated with the t-th observation. Similarly, the n-vector y − Xβ is called the residual vector.
Therefore, $u_t$ is the disturbance term and we don't know it, because we don't know the value of the true β. If we consider a possible range of values for β, then in such a case we call u the residual ($\hat{u}_t$) for the t-th observation.
The sum of the squares of the components of that vector is called the sum of squared residuals, or SSR. Since this sum is a scalar, the sum of squared residuals is a scalar-valued function of the k-vector β:
$$SSR(\beta) = \sum_{t=1}^{n} (y_t - X_t\beta)^2 = (y - X\beta)^\top (y - X\beta)$$
The notation here emphasizes the fact that this function can be computed for arbitrary values of the argument β purely in terms of the observed data y and X.
The idea of least-squares estimation is to minimize the sum of squared residuals associated with a regression model.
Consider briefly the simplest case of $y_t = \beta_1 + \beta_2 x_t + u_t$, in which β2 = 0 and the model contains only a constant term. The expression above becomes:
$$SSR(\beta_1) = \sum_{t=1}^{n} (y_t - \beta_1)^2$$
For this model, the matrix X consists solely of the constant vector, ι. Therefore, $X^\top X = \iota^\top\iota = n$, and $X^\top y = \iota^\top y = \sum_{t=1}^{n} y_t$. Thus, if the first-order condition is multiplied by one-half, it can be rewritten as $\iota^\top\iota\,\beta_1 = \iota^\top y$. Solving for β1 yields the sample mean of the $y_t$:
$$\hat\beta_1 = \frac{1}{n}\sum_{t=1}^{n} y_t$$
Not surprisingly, the OLS estimator is equivalent to the estimating-function estimator (2.45) for the multiple linear regression model as well.
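To see the minimization at work with one regressor plus a constant, the sketch below solves the two normal equations X⊤Xβ = X⊤y by hand and checks that moving away from the solution increases the SSR. The data are made-up numbers and the function names are illustrative, not from the notes.

```python
def ols_simple(x, y):
    """Solve the 2x2 normal equations X'X b = X'y, where X = [1, x]."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx               # determinant of X'X
    b1 = (sy * sxx - sx * sxy) / det      # intercept
    b2 = (n * sxy - sx * sy) / det        # slope
    return b1, b2

def ssr(b1, b2, x, y):
    """Sum of squared residuals for given parameter values."""
    return sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))

# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 10.0]

b1_hat, b2_hat = ols_simple(x, y)
ssr_min = ssr(b1_hat, b2_hat, x, y)

# Any other parameter values give a larger sum of squared residuals.
ssr_other = ssr(b1_hat + 0.1, b2_hat - 0.1, x, y)

print(b1_hat, b2_hat, ssr_min, ssr_other)
```

With these data the estimates are an intercept of 0.5 and a slope of 2.3, and the minimized SSR is about 0.30.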
estimate the disturbances. If we compute the variance of the residuals, and we take the square
root of that variance, we get the standard error of the regression.
The geometry of OLS estimation
Imagine a regression with two explanatory variables: x1 and x2. We can plot these two vectors starting from a common origin: x1 is a constant, while x2 contains the observations on the explanatory variable. These two vectors define a plane, the space spanned by x1 and x2.
We can also plot another vector containing the observations on the dependent variable: the vector y. When we regress y on x1 and x2, we project y onto this space. This means that I drop a perpendicular (a line at a 90 degree angle) from y to the space defined by the two regressors. Thus, OLS projects y orthogonally onto the space spanned by the regressors.
=> we can apply Pythagoras' Theorem
The dependent variable is the sale price. The result allows us to understand how the sale price changes for a small change in the number of baths or bedrooms.
If we want to sum the two vectors, starting from point A (the tip of x) we move horizontally by y1 and then vertically by y2, reaching the point C. We do the same starting from B (the tip of y), moving by x1 and x2, and again reach C. Then we connect the origin with the point C: this is the vector obtained by summing the vectors x and y. There is another way to sum vectors: shifting each vector parallel to itself so as to form a parallelogram; the diagonal of the resulting parallelogram is x + y.
The sum of the two vectors is then the diagonal through O of the resulting parallelogram. It is easy to see that this classical method also gives the result that the sum of the two vectors is represented by the arrow OC, since the figure OACB is just the parallelogram required by the construction, and OC is its diagonal through O. The parallelogram construction also shows clearly that vector addition is commutative, since y + x is represented by OB, for y, followed by BC, for x. The end result is once more OC.
Multiplying a vector by a scalar is also very easy to represent
geometrically. If a vector x with components (x 1 ,x 2 ) is multiplied by a
scalar α, then αx has components (αx 1 , αx 2 ). This is depicted in
Figure 3.5, where α = 2. The line segments OA and OB represent x
and αx, respectively. It is clear that even if we move αx so that it
starts somewhere other than O, as with CD in the figure, the vectors x
and αx are always parallel. If α were negative, then αx would simply
point in the opposite direction. Thus, for α = −2, αx would be
represented by DC, rather than CD.
Another property of multiplication by a scalar is:
$$\|\alpha x\| = \langle \alpha x, \alpha x\rangle^{1/2} = |\alpha|\,(x^\top x)^{1/2} = |\alpha|\,\|x\|$$
Now consider the inner product of two vectors, ⟨x, y⟩. The result is given by:
⟨x, y⟩ = ∥x∥ ∥y∥ cos θ
where θ is the angle between the two vectors.
If the vectors are parallel, the angle between them is 0 and thus cos θ = 1, so ⟨x, y⟩ = ∥x∥ ∥y∥.
If the vectors are perpendicular, they form a right angle, and cos(π/2) = 0. In such a case the inner product of x and y is zero: ⟨x, y⟩ = ∥x∥ ∥y∥ cos(π/2) = 0.
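The relation between the inner product, the norms, and the angle can be checked numerically. The vectors below are arbitrary illustrative values, picked so that one pair is parallel and one pair is perpendicular.

```python
import math

def inner(a, b):
    """Inner product <a, b>."""
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(a):
    """Euclidean norm ||a|| = <a, a>**0.5."""
    return math.sqrt(inner(a, a))

def cos_angle(a, b):
    """cos(theta) = <a, b> / (||a|| ||b||)."""
    return inner(a, b) / (norm(a) * norm(b))

x = [3.0, 4.0]
parallel = [6.0, 8.0]        # 2 * x, same direction
perpendicular = [4.0, -3.0]  # at a right angle to x

print(cos_angle(x, parallel))       # 1.0
print(cos_angle(x, perpendicular))  # 0.0
print(norm(parallel), 2 * norm(x))  # scaling by alpha = 2 doubles the norm
```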
Imagine a regression model with x1 and x2, vectors containing the observations on the first and on the second regressor. They span a plane. Then we have a vector y containing the observations on the dependent variable. When we regress y on x1 and x2, we project y onto the space spanned by x1 and x2, so we drop a perpendicular straight line from y to that plane. This line, the residual vector, forms a right angle with the plane created by x1 and x2. So, if we take its inner product with x1 or with x2, the result is always 0. This is a numerical property of OLS.
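This orthogonality can be verified on a small example. The sketch below fits a simple regression with the usual closed-form formulas, then checks that the residual vector is orthogonal to both regressors (the constant and x) and that Pythagoras' theorem holds for y, the fitted values, and the residuals. The data are made-up numbers.

```python
def inner(a, b):
    """Inner product of two vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

# Made-up data; x1 is the constant regressor, x2 the explanatory variable.
x2 = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 10.0]
n = len(y)
x1 = [1.0] * n

# Closed-form OLS estimates for a regression on a constant and x2.
x_bar = sum(x2) / n
y_bar = sum(y) / n
b2 = inner([xi - x_bar for xi in x2], [yi - y_bar for yi in y]) / \
     inner([xi - x_bar for xi in x2], [xi - x_bar for xi in x2])
b1 = y_bar - b2 * x_bar

fitted = [b1 + b2 * xi for xi in x2]             # projection of y onto S(x1, x2)
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# The residual vector is orthogonal to each regressor.
print(inner(residuals, x1), inner(residuals, x2))  # both zero up to rounding

# Pythagoras: ||y||^2 = ||fitted||^2 + ||residuals||^2.
print(inner(y, y), inner(fitted, fitted) + inner(residuals, residuals))
```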
Subspaces of Euclidean Space: we can identify two different spaces. The first one is the one where we represent the variability of the dependent variable, and the second is the one used to represent the residuals of the regression.
To see how this works, we need the concept of a subspace of a Euclidean space $E^n$. Normally, such a subspace has a dimension lower than n. The easiest way to define a subspace of $E^n$ is in terms of a set of basis vectors. A subspace that is of particular interest to us is the one for which the columns of X provide the basis vectors. We may denote the k columns of X as $x_1, x_2, \ldots, x_k$. Then the subspace associated with these k basis vectors is denoted by S(X) or S($x_1, \ldots, x_k$). The basis vectors are said to span this subspace, which in general is a k-dimensional subspace.
[Figure: the regressors x1 and x2, the scaled vectors β1 x1 and β2 x2, and the space spanned by X]
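When we define a matrix X, we are defining a space onto which we project the y variable. Adding regressors means adding dimensions to the space onto which y is projected.
y = β1 ι + u => y = βx + u with x = ι
In this case we have a space like this:
[Figure: y and the one-dimensional space spanned by ι]
In more complex cases, we have:
[Figure: y, the regressors x1 and x2, and the scaled vectors β1 x1 and β2 x2]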
Thus, applying OLS we will have two subspaces: S(X) and its orthogonal complement.
The subspace S(X) defined here is called the subspace spanned by the $x_i$, i = 1, ..., k, or the column space of X; less formally, it may simply be referred to as the span of X, or the span of the $x_i$. The orthogonal complement of S(X) in $E^n$, which is denoted $S^\perp(X)$, is the set of all vectors w in $E^n$ that are orthogonal to everything in S(X). This means that, for every z in S(X), ⟨w, z⟩ = $w^\top z$ = 0. Formally:
$$S^\perp(X) \equiv \{\, w \in E^n \mid w^\top z = 0 \text{ for all } z \in S(X) \,\}$$
With only a constant as regressor, OLS projects y onto a space of just 1 dimension => 1 line.